Data, Algorithms, and Powerset
Powerset finally launched its search engine last night. It has been over two years in the making, and the company's natural language technology has attracted a lot of hype.
The new search engine has received mixed reviews. For example, while Mike Arrington at TechCrunch fell in love at first sight, the comments following the post tell a very different story. The most balanced review I've seen so far is from Danny Sullivan at Search Engine Land.
Disclosure: My company Kosmix also builds search technology, albeit of a very different kind. I don't consider Kosmix in any way to be a PowerSet competitor, and in this post I'm wearing my blogger hat and not my Kosmix hat. Moreover, this post is not really about PowerSet at all, but about data and algorithms, using PowerSet as the example du jour.
To boil down both my personal experience playing with PowerSet and what I'm seeing in the reviews: some of the features (like the Factz) are cool. The biggest weakness is that PowerSet today searches only two data sets: Wikipedia and Freebase. And while these are both very large and very useful datasets, they're not nearly enough to answer many real-world user queries.
The TechCrunch comments and Danny Sullivan's posts both contain examples that point out the strengths and weaknesses of PowerSet's search. Here's an example from my own personal experience: in my previous post, I used the phrase "new wine in an old bottle". I was curious about the origin of this phrase (since I had also heard people say "old wine in a new bottle" -- which came first?) So I typed the search "origin of expression new wine in old bottle" into both PowerSet and Google. Google nailed it in the first result (from Yahoo Answers), while Powerset was lost. Ditto for "How old was Gandhi when he was killed?"
While people can argue about the applicability of natural language processing to search, the quality of PowerSet's implementation, and so on, I have a much simpler point of view. I believe PowerSet has some pretty cool IP in its algorithms and has done as good a job as it possibly can with it. The problem is, they don't have enough data to work with. (Another way of saying this is, PowerSet's index is a SubSet of Google's index.)
The primary reason Google's search is useful is that there is lots and lots of data on the web, and Google indexes so much of it. Yes, Google's search algorithms are fantastic, but they wouldn't work anywhere near as well if they didn't have so much data underneath them. Consider my search about new wine and old bottles. The reason Google nails it is that there's a page on the web that uses the exact phrase I typed into the search box. Same for the Gandhi example. The cool thing for Google is, people who search often use phrases in the same way as people who write web pages (especially on community sites such as Yahoo Answers). Instead of doing any NLP, Google just needs to index a really huge corpus of text and look for near-exact matches.
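To make the "near-exact match" point concrete, here is a toy sketch in Python. It is purely illustrative and says nothing about how Google or PowerSet actually work: the scoring function (fraction of query words found in a document), the function names, and the two tiny corpora are all made up for this example. The only thing it shows is that the same dumb matcher goes from useless to useful once the corpus happens to contain a page phrased like the query.

```python
import re


def tokens(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, document):
    """Fraction of distinct query words that also appear in the document."""
    q = set(tokens(query))
    d = set(tokens(document))
    return len(q & d) / len(q) if q else 0.0


def best_match(query, corpus):
    """Return (score, document) for the best-scoring document, or (0.0, None)."""
    best = (0.0, None)
    for doc in corpus:
        score = overlap_score(query, doc)
        if score > best[0]:
            best = (score, doc)
    return best


# Hypothetical toy corpora (invented text, not real pages).
small_corpus = [
    "Wine is an alcoholic drink made from fermented grapes.",
    "A bottle is a narrow-necked container.",
]
# The "bigger" corpus adds one community-style page that happens
# to be phrased almost exactly like the query.
large_corpus = small_corpus + [
    "What is the origin of the expression new wine in old bottles?",
]

query = "origin of expression new wine in old bottle"

small_score, small_doc = best_match(query, small_corpus)
large_score, large_doc = best_match(query, large_corpus)

print(f"small corpus: {small_score:.3f} -> {small_doc}")
print(f"large corpus: {large_score:.3f} -> {large_doc}")
```

With the small corpus, the best document shares only one word with the query. Add the one extra page and seven of the eight query words match, with no change to the algorithm at all.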
To use a phrase from an earlier post on this blog, "More data usually beats better algorithms." And nowhere is that more true than in search. (The natural question to ask is, why not both more data and better algorithms? The answer in this case is that many of PowerSet's techniques seem to rely on the structure and general cleanliness of Wikipedia and Freebase, so it's not clear how well they will scale to the web as a whole.)
What of PowerSet? I for one really love their Wikipedia search, and will use them to search Wikipedia and Freebase. As Danny Sullivan points out in his post, perhaps the right business model for PowerSet lies in enterprise search rather than in web search.

