Note: This post is about a new product we're testing at my company Kosmix.
Search engines are great at finding the needle in a haystack. And that's perfect when you are looking for a needle. Often though, the main objective is not so much to find a specific needle as to explore the entire haystack.
When we're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:
- Hiking the Continental Divide Trail.
- A loved one recently diagnosed with arthritis.
- You read The Da Vinci Code and have an irresistible urge to learn more about the Priory of Sion.
- Saddened by George Carlin's death, you want to reminisce over his career.
The web contains a trove of information on all these topics. Moreover, the information of interest is not just facts (e.g., Wikipedia), but also opinion, community, multimedia, and products. What's missing is a service that organizes all the information on a topic so that you can explore it easily. The Kosmix team has been working for the past year on building just such a service, and we put out an alpha yesterday. You enter a topic, and our algorithms assemble a "topic page" for that topic. Check out the pages for Continental Divide Trail, arthritis, Priory of Sion, and George Carlin.
The problem we're solving is fundamentally different from search, and we've taken a fundamentally different approach. As I've written before, the web has evolved from a collection of documents that neatly fit in a search engine index to a collection of rich interactive applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on the fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases.
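To make the "query applications and assemble on the fly" idea concrete, here is a minimal sketch in Python. It is an illustration only, not the Kosmix implementation: the three source functions, the `SOURCES` table, and `build_topic_page` are all hypothetical names, and the sources return canned strings where a real system would call remote APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for partner applications; a real source would call a remote API.
def video_source(topic):
    return [f"video about {topic}"]

def review_source(topic):
    return [f"review of {topic}"]

def community_source(topic):
    return [f"discussion of {topic}"]

# Assumed registry of sources, keyed by the module name shown on the page.
SOURCES = {
    "videos": video_source,
    "reviews": review_source,
    "community": community_source,
}

def build_topic_page(topic, sources=SOURCES, columns=2):
    """Query every source concurrently, then lay the modules out row by row."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, topic) for name, fn in sources.items()}
        modules = [(name, future.result()) for name, future in futures.items()]
    # Chunk the flat list of modules into rows of `columns` cells each.
    return [modules[i:i + columns] for i in range(0, len(modules), columns)]

page = build_topic_page("Lake Chelan")
for row in page:
    print(row)
```

Nothing is read from a prebuilt index here: the page exists only after the sources have answered, which is the contrast with an index-backed search engine drawn above.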
Here are some of the challenging problems that we had to tackle in building this product:
- Figuring out which applications are relevant to a topic. For example, Boorah, Yelp, and Google Maps are relevant to the topic "restaurants 94041". WebMD, Mayo Clinic, and RightHealth are relevant to "arthritis". If we called each application for every query, the page would look very confusing, and our partners would get unhappy very quickly! I'll write more on how we do this in a separate post by itself, but it's very, very cool indeed.
- Figuring out the related topics shown in the "Related in the Kosmos" section on each topic page. For example, you can start from the Priory of Sion and laterally explore Rosslyn Chapel or the Madonna of the Rocks.
- Figuring out the placement and space allocation to each element in the 2-dimensional grid. Going from one dimension (linear list) to two dimensions (grid) turns out to be quite a challenge, both from an algorithmic and from a UI design point of view.
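As a rough illustration of the first and third challenges, the sketch below picks sources by category overlap and then splits grid cells in proportion to a relevance score. This is a toy under stated assumptions, not how Kosmix does it: the category labels attached to each source, the function names `select_sources` and `allocate_cells`, and the scoring rule are all made up for the example.

```python
# Assumed category labels per source; the labels are invented for illustration.
SOURCE_CATEGORIES = {
    "Yelp": {"food", "local"},
    "Boorah": {"food"},
    "Google Maps": {"local"},
    "WebMD": {"health"},
    "Mayo Clinic": {"health"},
    "RightHealth": {"health"},
}

def select_sources(topic_categories, source_categories=SOURCE_CATEGORIES, max_sources=3):
    """Rank sources by how many categories they share with the topic."""
    scored = []
    for name, categories in source_categories.items():
        overlap = len(topic_categories & categories)
        if overlap > 0:  # skip sources with nothing in common with the topic
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:max_sources]]

def allocate_cells(scores, total_cells):
    """Give each module a share of grid cells roughly proportional to its score.

    Every module gets at least one cell, so the shares may not sum exactly
    to total_cells.
    """
    total = sum(scores.values())
    return {name: max(1, (score * total_cells) // total) for name, score in scores.items()}

health = select_sources({"health"})
dining = select_sources({"food", "local"})
cells = allocate_cells({"Yelp": 2, "Boorah": 1, "Google Maps": 1}, total_cells=8)
print(health, dining, cells)
```

Even this toy shows why calling every application for every query is wasteful: a topic tagged only with "health" never touches the restaurant sources. The layout step here only counts cells; deciding where each module sits in the grid is the harder part described above.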
In this alpha, we've taken a first stab at tackling these challenges. We are still several months from having a product that we feel is ready to launch, but we decided to put this public alpha out there to gather user feedback and tune our service. Many aspects of the product will evolve between now and then: Do we have the right user interaction model for topic exploration? Do we put too much information on the topic page? Should we present it very differently? How do we combine human experts with our algorithms?
Most importantly, the Kosmix approach does not work for every query! Our goal is to organize information around topics, not answer arbitrary search queries. How do we make the distinction clear in the product itself? Can we carve out a separate niche from search engines?
We hope to gain insight into all these and more questions from this alpha. Please use it and provide your feedback!
Greetings,
This is the first time I heard about Kosmix, and two paragraphs down in your blog post I was already sold. I completely identify with the problem you're trying to solve -- it is a pain point and I'm very glad to see somebody has decided to work on it and address it. All the power to you guys! I will write about you.
Posted by: Nasser Manesh | June 26, 2008 at 10:05 PM
Just out of curiosity, how does this differ from what ask.com does (http://www.ask.com/web?q=george+carlin&search=search&qsrc=0&o=0&l=dir) or powerset (http://www.powerset.com/explore/go/george-carlin)?
Posted by: Jeremy | June 26, 2008 at 10:30 PM
I tried "Storage Deduplication", which is the industry I work in. And I have to say that kosmix results are way better than ask and powerset's. Good job!
Posted by: Anon | June 26, 2008 at 11:45 PM
It looks like Glue page from Yahoo India
Posted by: White Eagle | June 27, 2008 at 02:25 AM
Jeremy: Take a look at the pages from Ask, Powerset and Kosmix for the same topic e.g., George Carlin, or arthritis. That should explain the difference.
White Eagle: Yahoo Glue has the same idea in terms of 2-dimensional layout. The difference is in the details -- the number and variety of the kinds of information that show up in each topic page, and the algorithms that decide what should show up on each page.
Posted by: Anand Rajaraman | June 27, 2008 at 08:10 AM
Anand,
I really like your blog and follow it regularly. You've some interesting insights into search and advertising. I've played with kosmix a few times.
But the fundamental problem that I face with kosmix like sites is that there is too much noise-to-signal ratio. What I like about Google ( and eventually other search engines have followed Google's design ) is the simplicity and a lower noise-to-signal ratio.
If the user is exposed to too much of information, in this case it would be *clickable* links, IMO, the avg probability of each link getting clicked would be very low. For instance I queried for "lake chelan" as I'm interested in finding a log cabin for the 4th July weekend.
I get a zillion links that tell me various different things at kosmix.
http://www.righttrips.com/Travel/lake_chelan-s
Whereas in Google, there is an ad at the top that talks about reservations ( I will click it if I'm interested in making a reservation ) and there is a visitor center link in the algo results ( that I will click if I'm in a research mode on what to see )
http://www.google.com/search?q=lake+chelan&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
The problem that Google tries to solve is to minimize this noise-to-signal ratio as much as possible and that is fundamental aspect of ranking to show the user as much relevant info at the top so that he can click and get away from Google to carry on his business.
You can probably characterize this ratio in terms of the dwell time of the user when shown the search results and number of clicks he has to make to reach his desirable destination. IMO, Google scores low on both dwell time and the # of clicks compared to other search engines, making it a very simple user experience.
I guess with so many information sources you've, you could try to build a personalized search engine only exposing those to the user that he might be interested in. You can change his current experience for a given query based on what he did for the previous query, dynamically adapting to his online behaviour.
Posted by: Krishna | June 27, 2008 at 11:00 AM
Krishna: Thanks for your comment, it raises an important point. If your goal is purely to make reservations, then you are looking for a needle (the reservation page), and Google or Yahoo search is indeed your best bet.
On the other hand, the Kosmix topic page gives you an immersive 360 degree view of Lake Chelan: images, videos, what people are saying, and so on. Exploration always entails some noise, that goes with the territory! And it's part of the fun too.
http://kosmix.com/topic/lake_chelan?
BTW, the RightTrips link you posted still uses our old product, not the new alpha, so that's not what this article is about. We'll be migrating RightTrips to the new platform soon.
Posted by: Anand Rajaraman | June 27, 2008 at 11:54 AM
Hello - ran into a blog a week or so ago and found your articles very interesting - very interesting problems that you discuss.
I just tried a couple of searches on kosmix - both obvious, but one of them a loaded word (e.g. Java). The results were quite good and pertinent but I also think that the amount of information generated was overwhelming as the page seemed "too busy" (i.e. too much presented in one page).
Now, as an end user, I may end up being interested in a lot of that info, but if I had no idea of that topic to begin with, I may want a "gentler introduction". For example, I may want a synopsis, then be able to drill down to get more info. Here it was sort of like floodgates were opened immediately and I still had the job to sort through stuff - which I would love to have offloaded on to your app (granted all of the info is much more meaningful than what you get in a websearch).
Anyway, excellent blog!
Arun
Posted by: Arunk | June 27, 2008 at 01:05 PM
Anand,
I came here after seeing a reference on GigaOM to the needle-and-haystack analogy. I think the truth is that whether we want to zero in on something or explore depends on the situation. For instance, I feel an overload on my nervous system with too much stuff on the Net, and follow either a "need to know" basis, or some days, I go "random walking" like one does on mountain holidays -- and discover lovely little trails of knowledge.
Key point: An algorithm is a good way to organise the stuff, but can only go so far. I would settle for a fine mix between the Wiki and Kosmix models with the former using the latter...of course, as a media guy, I never believed in Loser Generated Content, which is what I call UGC. Would you like user generated algorithms?
cheers. Madhavan.
Posted by: Madhavan | June 27, 2008 at 07:04 PM
Very cool product. The idea is sound, and the execution in terms of UI is good IMO.
However, the relevance of your results isn't that good when the query isn't too unique. I think you'll have to work hard on disambiguation. Like Krishna commented above, the signal to noise ratio is important, and when the topic queried on is ambiguous, there's a lot on the page that isn't relevant to me.
I'm really interested in how you're figuring out which applications are relevant to the topic. I had a related problem in my startup (hivesight.com), and would love to compare solutions.
Posted by: Elad Kehat | June 28, 2008 at 06:37 AM
> the Kosmix approach does not work for every query ... How do we make the distinction clear in the product itself? Can we carve out a separate niche from search engines?
Get acquired by Microsoft or Google and let them worry about when it works :)
Posted by: Aneesh | June 28, 2008 at 08:41 PM
Being able to submit feedback on each part of the results would be nice. "Was this page helpful?" I have found that some parts are very helpful and others are so wrong I hesitate to say the page was helpful at all. If I could pinpoint the part that wasn't helpful and reinforce the part that was, I'd be more willing to provide feedback after searching.
Posted by: Jason Adams | June 29, 2008 at 05:19 PM
Jason: That's a great observation. We plan to allow feedback on parts of the page shortly. Agree it would be very useful.
Posted by: Anand Rajaraman | June 29, 2008 at 06:09 PM
> The problem we're solving is fundamentally different from search
No, it's not fundamentally different from search. It is fundamentally different from what search has become over the past 5 or 10 years. But prior to the rise of internet search engines, "search" itself, as a discipline or field of study (i.e. "information retrieval") did include exactly what you are doing here.
Posted by: jeremy | July 01, 2008 at 03:01 PM
Wow, this is such a cool product. It has features that are similar to Clutsy.com which I have started using (It has a two-dimensional model). I'm keen to see how this product develops and will definitely be making more use of it as I get very irritated with search engines like Google - with their poor relevancy sometimes.
Posted by: Harsha | October 16, 2008 at 02:54 PM