Powerset finally launched its search engine last night. It's been over two years in the making, and the company has generated a lot of hype around its natural language technology.
The new search engine has received mixed reviews. For example, while Mike Arrington at TechCrunch fell in love at first sight, the comments following the post tell a very different story. The most balanced review I've seen so far is from Danny Sullivan at Search Engine Land.
Disclosure: My company Kosmix also builds search technology, albeit of a very different kind. I don't consider Kosmix in any way to be a Powerset competitor, and in this post I'm wearing my blogger hat and not my Kosmix hat. Moreover, this post is not really about Powerset at all, but about data and algorithms, using Powerset as the example du jour.
To boil down both my personal experience playing with Powerset and what I'm seeing in the reviews: some of the features (like the Factz) are cool. The biggest weakness is that Powerset today searches only two datasets: Wikipedia and Freebase. And while these are both very large and very useful datasets, they're not nearly enough to answer many real-world user queries.
The TechCrunch comments and Danny Sullivan's post both contain examples that point out the strengths and weaknesses of Powerset's search. Here's an example from my own experience: in my previous post, I used the phrase "new wine in an old bottle". I was curious about the origin of this phrase (I had also heard people say "old wine in a new bottle" -- which came first?), so I typed the search "origin of expression new wine in old bottle" into both Powerset and Google. Google nailed it in the first result (from Yahoo Answers), while Powerset was lost. Ditto for "How old was Gandhi when he was killed?"
While people can argue about the applicability of natural language processing to search, about the quality of Powerset's implementation, and so on, I have a much simpler point of view. I believe Powerset has some pretty cool IP in its algorithms and has done as good a job as it possibly can with them. The problem is that it doesn't have enough data to work with. (Another way of saying this is, PowerSet's index is a SubSet of Google's index.)
The primary reason Google's search is useful is that there is lots and lots of data on the web, and Google indexes so much of it. Yes, Google's search algorithms are fantastic, but they wouldn't work anywhere near as well if they didn't have so much data underneath them. Consider my search about new wine and old bottles. The reason Google nails it is that there's a page on the web that uses the exact phrase I typed into the search box. Same for the Gandhi example. The cool thing for Google is that people who search often use phrases in the same way as people who write web pages (especially on community sites such as Yahoo Answers). Instead of doing any NLP, Google just needs to index a really huge corpus of text and look for near-exact matches.
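To make the "index a huge corpus and look for matches" idea concrete, here is a toy sketch in Python. It is not how Google or Powerset actually work. The two corpora, the document ids, and the function names are all made up for illustration, and it does strict exact-phrase matching rather than the near-exact matching a real engine would do. The only point is that the same dumb matcher goes from no answer to the right answer when the corpus grows to include a page that happens to use the searcher's phrase.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a positional inverted index: token -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            index[tok][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the phrase as consecutive tokens."""
    tokens = tokenize(phrase)
    if not tokens or any(t not in index for t in tokens):
        return []
    # Only documents containing every token can contain the phrase.
    candidates = set(index[tokens[0]])
    for t in tokens[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        # Try each occurrence of the first token as the start of the phrase.
        for start in index[tokens[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(tokens)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical corpora: a small, clean one and a larger one that adds a Q&A page.
small_corpus = {
    "encyclopedia": "Wine is an alcoholic drink made from fermented grapes.",
}
large_corpus = dict(
    small_corpus,
    answers="What is the origin of the expression new wine in old bottles?",
)

query = "new wine in old bottles"
small_hits = phrase_search(build_index(small_corpus), query)
large_hits = phrase_search(build_index(large_corpus), query)
print(small_hits, large_hits)
```

Nothing about the matcher changes between the two searches; only the data does.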
To use a phrase from an earlier post on this blog, "More data usually beats better algorithms". And nowhere is that more true than it is in search. (The natural question to ask is, why not both more data and better algorithms? The answer in this case is that many of Powerset's techniques seem to rely on the structure and general cleanliness of Wikipedia and Freebase, so it's not clear how well they will scale to the web as a whole.)
What of Powerset? I for one really love their Wikipedia search, and will use them to search Wikipedia and Freebase. As Danny Sullivan points out in his post, perhaps the right business model for Powerset lies in enterprise search rather than in web search.
Full disclosure: I'm an employee of Powerset and a former employee and shareholder of Kosmix =)
More data could certainly change lots of things within Powerset. For example, imagine when we have Factz from all over the Web instead of just Wikipedia. Frequency, authority of the source, and all sorts of other factors can play into what Factz we display.
However, unlike traditional search methodologies, Powerset can achieve very high precision, since we're trying to match the meaning of your query to the meaning of a sentence in Wikipedia. Of course, a sentence stating what you're looking for needs to occur and that doesn't happen as much in a small corpus like Wikipedia as it would on the entire Web.
I do hope that you continue to use Powerset when you're searching Wikipedia. That's success for us, at least in the short term.
Posted by: Mark Johnson | May 13, 2008 at 12:22 AM
To reappropriate Clinton: It's the product, stupid.
Powerset's enhanced Wikipedia just isn't that enhanced. Most of the time I get the right result right away on Google, but there are plenty of times when Powerset fails utterly (try, "Who was the tenth President of the US?"). Why would I switch in that scenario?
The product itself is lackluster. Couple that with the fact that Powerset has been billing itself as the "natural language search company" for the last two years, and I'm left scratching my head.
Where'd my revolutionary new search paradigm go?
Posted by: Jesse Farmer | May 13, 2008 at 01:28 PM
"More Data Usually Beats Better Algorithms." You're in good company there. Peter Norvig's presentation at Startup School this year (2008) was pretty much about that.
Posted by: tim wee | May 18, 2008 at 07:35 PM
Anand, did you check what has happened to your Google query on new wine? A quantum effect, it seems.
Posted by: Vinay | June 16, 2008 at 10:17 AM
Vinay: Great catch! Datawocky is now the first result for this query. This points to one of the weaknesses of PageRank: a reasonably high PageRank combined with an exact text match can beat out results that are actually more relevant.
Posted by: Anand Rajaraman | June 16, 2008 at 12:30 PM