Datawocky

On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising

Kosmix Adds Rocketfuel to Power Voyage of Exploration


Today I'm delighted to share some fantastic news. My company Kosmix has raised $20 million in new financing to power our growth. Even more than the amount of financing, I'm especially proud that the lead investor in this round is Time Warner, the world's largest media company. Our existing investors Lightspeed, Accel, and DAG participated in the round as well. The Kosmix team also is greatly strengthened by the addition of Ed Zander as investor and strategic advisor. In an amazing career that spans Sun Microsystems and Motorola, Ed has repeatedly demonstrated leadership that grew good ideas into great products and businesses. His counsel will be invaluable as we take Kosmix to the next level as a business.

In these perilous economic times, the funding is a big vote of confidence in Kosmix's product and business. Kosmix web sites attract 11 million visits every month, and we have a proven revenue model with significant revenues and robust growth. RightHealth, the proof-of-concept we launched in 2007, grew with astonishing rapidity to become the #2 health web site in the US. These factors played a big role in helping us close this round of funding with a healthy uptick in valuation from our prior round. Together with the money already in the bank from our prior rounds, we now have more than enough runway to take the company to profitability and beyond.

A few months ago, we put out an alpha version of Kosmix.com. Many people used it and gave us valuable feedback; thank you! We listened, and made changes. Lots of changes. The result is the beta version of Kosmix.com, which we launched today. What's changed? More information sources (many thousands), huge improvements in our relevance algorithms, a much-improved user interface, and a completely new homepage. Give it a whirl and let us know what you think.

To those of you new to Kosmix, the easiest way to explain what Kosmix does is by analogy. Google and Yahoo are search engines; Kosmix is an explore engine. Search engines work really well if your goal is to find a specific piece of information -- a train schedule, a company website, and so on. In other words, they are great at finding needles in the haystack. When you're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:

  • Looking to bake a chocolate cake? We have recipes, nutrition information, a dessert burn rate calculator, blog posts from chow.com, even a how-to video from Martha Stewart!
  • Loved one diagnosed with diabetes? Doctor-reviewed guide, blood sugar and insulin pump slide shows, calculators and risk checkers, quizzes, alternative medications, community.
  • Traveling to San Francisco? Maps, hotels, events, sports teams, attractions, travel blogs, trip plans, guidebooks, videos.
  • Writing an article on Hillary Clinton? Bio, news, CNN videos, personal financial assets and lawmaker stats, Wonkette posts, even satire from The Onion.
  • Into Radiohead? Bio, team members, albums, tracks, music player, concert schedule, videos, similar artists, news and gossip from TMZ.
  • Follow the San Francisco 49ers? Players, news from Yahoo Sports and other sources, official NFL videos and team profiles, tickets, and the official NFL standings widget.


In the examples above, I'm especially pleased about the way Kosmix picks great niche sources for topics. For example, I hadn't heard about chow.com or known that Martha Stewart has how-to videos on her website. Other "gems" of this kind include Jambase, TMZ, The Onion, DailyPlate, MamaHerb, and Wonkette. Part of the goal of Kosmix is to bring you such gems: information sources or sites you have either not heard of, or just not thought about in the current context.

In other words: Google = Search + Find. Kosmix = Explore + Browse.  Browsing sometimes uncovers surprising connections that you might not even have thought about. The power of the model was brought home to me last week as I was traveling around in England. I'd heard a lot about Stonehenge and wanted to visit; so of course I went to the Kosmix topic page on Stonehenge. In addition to the usual comprehensive overview of Stonehenge, the topic page showed me places to stay in Bath, Somerset (which happens to be the best place to stay when you're visiting Stonehenge). It also showed me other ancient monuments in the same area I could visit while I was there. Score one for serendipity. 

Some of us remember the early days of the World Wide Web: the thrill of just browsing around, following links, and discovering new sites that surprise, entertain, and sometimes even inform. We have lost some of that joy now with our workmanlike use of search engines for precision-guided information finding. We built the new Kosmix homepage to capture some of the pleasure of aimless browsing -- exploring for pure pleasure. The homepage shows you the hot news, topics, videos, slide shows, and gossip of the moment. If you find something interesting you can dive right in and start browsing around that topic. We compile this page in the same manner as our topic pages: by aggregating information from many sources and then applying a healthy dose of algorithms. Dig in; who knows what surprises await?

How does Kosmix work its magic? As I wrote when we put out the alpha, the problem we're solving is fundamentally different from search, and we've taken a fundamentally different approach. The web has evolved from a collection of documents that neatly fit in a search engine index to a collection of rich interactive applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on-the-fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases. The secret sauce is our algorithmic categorization technology. Given a topic, categorization tells us where the topic fits in a really big taxonomy, what the related topics are, and so on. In turn, other algorithms use this information to figure out the right set of information sources for a topic from among the thousands we know about. And then other algorithms figure out how to lay the information on the page in a 2-dimensional grid.
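None of our production code is public, but here is a toy sketch of the first two steps, just to convey the flavor: a tiny hand-made taxonomy, a handful of sources tagged with the categories they serve, and a lookup that walks a topic's ancestors to pick sources. The taxonomy, the source list, and the matching rule below are all invented for illustration; they are not our actual implementation.

```python
# Toy sketch of taxonomy-driven source selection. The taxonomy, sources, and
# matching rule are invented for illustration, not Kosmix's production code.

# Each category points to its parent in a (tiny) taxonomy.
PARENT = {
    "diabetes": "health",
    "chocolate cake": "recipes",
    "recipes": "food",
    "health": None,
    "food": None,
}

# Each information source declares which categories it is good for.
SOURCES = {
    "WebMD": {"health"},
    "Mayo Clinic": {"health", "diabetes"},
    "chow.com": {"food", "recipes"},
    "Martha Stewart videos": {"recipes"},
}

def ancestors(category):
    """Yield a category and all of its ancestors in the taxonomy."""
    while category is not None:
        yield category
        category = PARENT.get(category)

def sources_for_topic(topic):
    """Pick sources whose declared categories intersect the topic's lineage."""
    lineage = set(ancestors(topic))
    return [name for name, cats in SOURCES.items() if cats & lineage]

print(sources_for_topic("diabetes"))        # ['WebMD', 'Mayo Clinic']
print(sources_for_topic("chocolate cake"))  # ['chow.com', 'Martha Stewart videos']
```

The real system works with thousands of sources and a taxonomy with millions of nodes, so the categorization step itself is the hard part; the sketch only shows how a category placement, once known, drives source selection.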

While we are proud of what we have built, we know there is still a long way to go. And we cannot do it without your feedback. So join the USS Kosmix on our maiden voyage. Our mission: to explore strange new topics; to discover surprising new connections; to boldly go where no search engine has gone before!

Update: Vijay Chittoor has posted more details on the new product features on the Kosmix blog. Coverage on TechCrunch, GigaOM, VentureBeat. I'm particularly pleased that Om Malik thinks his page on Kosmix is better than the bio on his site!

December 08, 2008 in Data Mining, kosmix, Search | Permalink | Comments (3) | TrackBack (0)

Google Chrome: A Masterstroke or a Blunder?

The internet world has been agog over Google's entry into the browser wars with Chrome. When we look back on this event several years from now with the benefit of hindsight, we might see it either as a masterstroke, or as Google's biggest strategic misstep.

The potential advantages to the internet community as a whole are considerable. The web has evolved beyond its roots as a collection of HTML documents and dumb frontends to database applications. We now expect everything from a web application that we expect from a desktop application, and then some: the added bonus of connectivity to vast computing resources in the cloud. In this context, browsers need to evolve from HTML renderers to runtime containers, much as web servers evolved from simple servers of static files and CGI scripts to modern application servers with an array of plugins that provide a variety of services. Chrome is the first browser to explicitly acknowledge this transition and make it the centerpiece of its design, and it will force other browsers to follow suit. We will all benefit.

The potential advantages to Google also are considerable. If the stars and planets align, Google can challenge Microsoft's dominance on the desktop by making the desktop irrelevant. Even if that doesn't happen, Google can hope to use its dominance in search to promote Chrome, gaining significant browser market share and ensuring that Microsoft cannot challenge Google's search dominance by building features into Internet Explorer and Windows that integrate MSN's search and other services.

Therein, however, lies the first and perhaps the biggest risk to Google. Until now, Microsoft has been unable to really use IE and Windows to funnel traffic to MSN services and choke off Google. Given their antitrust woes, they have been treading carefully on this matter. Any overt attempt would evoke cries of foul from many market participants. Google has been in a great position to lead the outcry, because it has been purely a service accessible from the browser, without any toehold in the browser market itself.

Chrome, however, eases some of the pressure on Microsoft. If Microsoft integrates MSN search or other services tightly into IE, it will be harder for Google to cry foul -- Microsoft could point to Chrome, and any steps taken by Google to integrate their services into Chrome, as counter-arguments. In addition, any outcry from Google can now be characterized as sour grapes from a loser -- Microsoft can say, we both have browsers out there, they have one too, ours is just better, and let consumers decide for themselves.

In some sense, regardless of the actual market penetration of Chrome, Google has lost the moral high ground in future arguments with Microsoft. I wonder whether Google might have achieved all their aims better not by releasing a Google-branded browser, but by working with Mozilla to improve Firefox from within.

Second, while Google has shown impressive technological wizardry in search and advertising, the desktop application game is very different from the internet service game. While users are very forgiving about beta tags that stay for years on services such as Gmail, user expectations on matters such as compatibility and security bugs are very high for desktop applications. It remains to be seen whether Google has the culture to succeed in this game, going beyond whiz-bang features that thrill developers -- such as a blazingly fast JavaScript engine -- to deliver a mainstream browser that competes on stability, security, and features.

The third problem is one of data contagion. Google has the largest "database of intentions" in the world today: our search histories, which form the basis of Google's ad targeting. The thing that keeps me from freaking out that Google knows so much about me is that I access Google using a third-party browser. If Google has access to my desktop, and can tie my search history to that, the company can learn much about me that I keep isolated from my search behavior. The cornerstone of privacy on the web today is that we can use products from different companies to create isolation: desktop from Microsoft, browser from Mozilla, search from Google. These companies have no incentive to share information. This is one instance where information silos serve us well as consumers. Any kind of vertical integration has the potential to erode privacy.

I'm not suggesting that Google would do anything evil with this data, or indeed that the thought has even crossed their minds; thus far Google has behaved with admirable restraint in their usage of the database of intentions, staying away, for example, from behavioral targeting. But we should all be cognizant of the fact that companies are in business purely to benefit their shareholders. At some point, someone at Google might realize that the contents of my desktop can be used to target advertising, and it might prove tempting in a period of slow revenue growth under a different management team.

Two striking historical parallels come to mind, one a masterstroke and the other a blunder, in both cases setting into motion events that could not be undone. In 49 BC, Julius Caesar crossed the Rubicon with his army, triggering a civil war in which he triumphed over the forces of Pompey and became the master of Rome. And in 1812, Napoleon Bonaparte had Europe at his feet when he made the fateful decision to invade Russia, greatly weakening his power and leading ultimately to his defeat at Waterloo. It will be interesting to see whether Chrome ends up being Google's Rubicon or its Moscow. Alea iacta est.

September 07, 2008 in Advertising, Search | Permalink | Comments (18) | TrackBack (0)

Why Google Doesn't Provide Earnings Forecasts

Most public companies provide forecasts of revenue and earnings in the upcoming quarters. These forecasts (sometimes called "guidance") form the basis of the work most stock analysts do to make buy and sell recommendations. Much to the consternation of these analysts, Google is among the few companies that have refused to follow this practice. As a result, estimates of Google's revenue by analysts using publicly available data, like comScore numbers, have often been spectacularly wrong. Today's earnings call may be no different.

A Google executive once explained to me why Google doesn't provide forecasts. To understand it, you have to think about the engineers at Google who work on optimizing AdWords. How do they know they're doing a good job? We know that Google is constantly bucket-testing tweaks to their AdWords algorithms. An ad optimization project is considered successful if it has one of two results:

  • Increase revenue per search (RPS), while not using additional ad real estate on the search results page (SERP).
  • Reduce the ad real estate on each SERP, while not reducing RPS.

The tricky cases are the ones that increase RPS while also using more ad real estate. It then becomes a judgment call whether they should be rolled out across the site. If Google were to make earnings forecasts, the thinking went, there would be a huge temptation to roll out tweaks in the gray area to make the numbers. As the quarters roll by, the area of the page devoted to ads would keep steadily increasing, leading to longer-term problems with customer retention.
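To make the decision rule concrete, here is a small sketch of how a bucket test might be classified against those two criteria. The function, its inputs, and the "gray area" label are my own framing of what the executive described, not Google's actual process.

```python
# Toy classification of an AdWords bucket test against the two criteria above.
# The function and labels are my own framing, not Google's actual process.

def classify_experiment(rps_delta, ad_area_delta):
    """rps_delta and ad_area_delta are fractional changes vs. the control bucket."""
    if rps_delta > 0 and ad_area_delta <= 0:
        return "success: more revenue per search, no extra ad real estate"
    if ad_area_delta < 0 and rps_delta >= 0:
        return "success: less ad real estate, no revenue loss"
    if rps_delta > 0 and ad_area_delta > 0:
        return "gray area: judgment call (more revenue, but more ads too)"
    return "failure"

print(classify_experiment(0.02, 0.0))    # clear success
print(classify_experiment(0.0, -0.05))   # clear success
print(classify_experiment(0.03, 0.04))   # the tempting gray area
```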

Of course, this doesn't mean there is no earnings pressure. In reality, whether they issue guidance or not, Google's stock price does depend on whether they continue to deliver robust revenue and earnings growth. So implicitly, there is always pressure to beat the estimates. And for the first time, as Google's stock has taken a hammering in recent months, I've heard about hiring slowdowns at Google. So there is definitely pressure to cut costs as well. It will be interesting to observe the battle between idealism and expediency play itself out, with its progress reflected in the ad real estate on Google's search results. It's easy to be idealistic with the wind behind your back; the true test is whether you retain the idealism in the face of headwinds. Time will tell.

This brings us to today's earnings call. In my experience, the best predictor of Google earnings has been Efficient Frontier's excellent Search Engine Performance Report. EF is the largest ad agency for SEM advertisers, and manages the campaigns of several large advertisers on Google, Yahoo, and Microsoft. As I had noted earlier, in Q1 an estimate based on their report handily beat other forecasts, most of which use ComScore data. (Disclosure: My fund Cambrian Ventures is an investor in EF.)

EF's report for Q2, released this morning, indicates a strong quarter for Google. Google gained more than its fair share of advertising dollars in Q2 2008. For every new dollar spent on search advertising, $1.10 was spent on Google, at the expense of Yahoo and Microsoft. In addition, Google's average cost-per-click (CPC) increased by 13.8% in Q2 2008 versus Q2 2007, while click volume and CTR increased as well. And, there was strong growth overseas as well, which should help earnings given the weak dollar.

I don't have the time right now to do the math and figure out whether the robust performance was sufficient to beat the Street's estimates. You should read the report for yourself and make that call.

Update: Google's results, although robust, were below expectations. The biggest moment in the earnings call for me was this quote from Sergey (via Silicon Alley Insider):

Sergey said the company may have overdone its quality control efforts in the quarter (reducing the number of ads), and the reversal of this could provide a modest accelerator to Q3

Quality efforts "overdone"? Apparently those pressures are telling after all, and Google is going to abandon their principles a wee bit to venture into the grey zone. Is this the start of a slippery slope?

July 17, 2008 in Advertising, Search | Permalink | Comments (1) | TrackBack (0)

Searching for a Needle or Exploring the Haystack?

Note: This post is about a new product we're testing at my company Kosmix.

Search engines are great at finding the needle in a haystack. And that's perfect when you are looking for a needle. Often though, the main objective is not so much to find a specific needle as to explore the entire haystack.

When we're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:

  • Hiking the Continental Divide Trail.
  • A loved one recently diagnosed with arthritis.
  • You read The Da Vinci Code and have an irresistible urge to learn more about the Priory of Sion.
  • Saddened by George Carlin's death, you want to reminisce over his career.

The web contains a trove of information on all these topics. Moreover, the information of interest is not just facts (e.g., Wikipedia), but also opinion, community, multimedia, and products. What's missing is a service that organizes all the information on a topic so that you can explore it easily. The Kosmix team has been working for the past year on building just such a service, and we put out an alpha yesterday. You enter a topic, and our algorithms assemble a "topic page" for that topic. Check out the pages for Continental Divide Trail, arthritis, Priory of Sion, and George Carlin.

The problem we're solving is fundamentally different from search, and we've taken a fundamentally different approach. As I've written before, the web has evolved from a collection of documents that neatly fit in a search engine index, to a collection of rich interactive applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on-the-fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases.

Here are some of the challenging problems that we had to tackle in building this product:

  1. Figuring out which applications are relevant to a topic. For example, Boorah, Yelp, and Google Maps are relevant to the topic "restaurants 94041". WebMD, Mayo Clinic, and RightHealth are relevant to "arthritis". If we called each application for every query, the page would look very confusing, and our partners would get unhappy very quickly! I'll write more on how we do this in a separate post, but it's very, very cool indeed.
  2. Figuring out related topics for the "Related in the Kosmos" section on each topic page. For example, you can start from the Priory of Sion and laterally explore Rosslyn Chapel or the Madonna of the Rocks.
  3. Figuring out the placement and space allocation for each element in the 2-dimensional grid. Going from one dimension (a linear list) to two dimensions (a grid) turns out to be quite a challenge, both from an algorithmic and from a UI design point of view. A toy sketch of the flavor of this problem follows this list.
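Here is the promised toy sketch of the layout problem: a greedy placement that drops the highest-scoring modules into the currently shortest column. The module names, scores, heights, and the greedy rule are invented for illustration; our actual layout algorithms are considerably more involved.

```python
# Toy sketch of laying out content modules on a two-column grid: place the
# highest-scoring modules first, always into the currently shortest column.
# This greedy balancing is only an illustration, not Kosmix's actual algorithm.

def layout(modules, num_columns=2):
    """modules: list of (name, relevance_score, height_in_rows)."""
    columns = [[] for _ in range(num_columns)]
    heights = [0] * num_columns
    # Most relevant modules get placed (and therefore appear higher) first.
    for name, score, height in sorted(modules, key=lambda m: -m[1]):
        target = heights.index(min(heights))   # shortest column so far
        columns[target].append(name)
        heights[target] += height
    return columns

modules = [
    ("overview", 0.95, 4),
    ("videos", 0.80, 3),
    ("news", 0.75, 2),
    ("community", 0.60, 3),
    ("related topics", 0.55, 2),
]
print(layout(modules))
```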

In this alpha, we've taken a first stab at tackling these challenges. We are still several months from having a product that we feel is ready to launch, but we decided to put this public alpha out there to gather user feedback and tune our service. Many aspects of the product will evolve between now and then: Do we have the right user interaction model for topic exploration? Do we put too much information on the topic page? Should we present it very differently? How do we combine human experts with our algorithms?

Most importantly, the Kosmix approach does not work for every query! Our goal is to organize information around topics, not answer arbitrary search queries. How do we make the distinction clear in the product itself? Can we carve out a separate niche from search engines?

We hope to gain insight into all these and more questions from this alpha. Please use it and provide your feedback!

June 26, 2008 in Search | Permalink | Comments (15) | TrackBack (0)

How Google Measures Search Quality

This post continues my prior post, Are Machine-Learned Models Prone to Catastrophic Errors? You can think of the two as a two-post series based on my conversation with Peter Norvig. As that post describes, Google has not cut over to the machine-learned model for ranking search results, preferring a hand-tuned formula. Many of you wrote insightful comments on this topic; here I'll give my take, based on some other insights I gleaned during our conversation.

The heart of the matter is this: how do you measure the quality of search results? One of the essential requirements to train any machine learning model is a set of observations (in this case, queries and results) that are tagged with "scores" measuring the goodness of the results. (Technically this requirement applies only to so-called "supervised learning" approaches, but those are the ones we are discussing here.) Where to get this data?

Given Google's massive usage, the simplest way to get this data is from real users. Try different ranking models on small percentages of searches, and collect data on how users interacted with the results. For example, how does a new ranking model affect the fraction of users who click on the first result? The second? How many users click to page 2 of results? Once a user clicks out to a result page, how long before they click the back button to come back to the search results page?
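I don't know the format of Google's logs, but the kind of aggregation involved is simple. The sketch below runs over a made-up click log and computes two such metrics; the field names and the 10-second "quick return" threshold are assumptions for illustration only.

```python
# Minimal sketch of aggregating interaction metrics from a (made-up) click log.
# Each record: which ranking model served the query, the rank clicked, and the
# seconds before the user came back to the results page (None = never came back).
from collections import defaultdict

click_log = [
    {"model": "A", "clicked_rank": 1, "seconds_to_return": None},
    {"model": "A", "clicked_rank": 1, "seconds_to_return": 12},
    {"model": "A", "clicked_rank": 3, "seconds_to_return": 4},
    {"model": "B", "clicked_rank": 1, "seconds_to_return": None},
    {"model": "B", "clicked_rank": 2, "seconds_to_return": 90},
]

def summarize(log):
    by_model = defaultdict(list)
    for rec in log:
        by_model[rec["model"]].append(rec)
    summary = {}
    for model, recs in by_model.items():
        n = len(recs)
        first_click_rate = sum(r["clicked_rank"] == 1 for r in recs) / n
        quick_returns = sum(
            1 for r in recs
            if r["seconds_to_return"] is not None and r["seconds_to_return"] < 10
        )
        summary[model] = {
            "first_result_click_rate": first_click_rate,
            "quick_back_button_rate": quick_returns / n,
        }
    return summary

print(summarize(click_log))
```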

Peter confirmed that Google does collect such data, and has scads of it stashed away on their clusters. However -- and here's the shocker -- these metrics are not very sensitive to new ranking models! When Google tries new ranking models, these metrics sometimes move, sometimes not, and never by much. In fact Google does not use such real usage data to tune their search ranking algorithm. What they really use is a blast from the past. They employ armies of "raters"  who rate search results for randomly selected "panels" of queries using different ranking algorithms. These manual ratings form the gold-standard against which ranking algorithms are measured -- and eventually released into service.

It came as a great surprise to me that Google relies on a small panel of raters rather than harnessing their massive usage data. But in retrospect, perhaps it is not so surprising. Two forces appear to be at work. The first is that we have all been trained to trust Google and click on the first result no matter what. So ranking models that make slight changes in ranking may not produce significant swings in the measured usage data. The second, more interesting, factor is that users don't know what they're missing.

Let me try to explain the latter point. There are two broad classes of queries search engines deal with:

  • Navigational queries, where the user is looking for a specific uber-authoritative website. e.g., "stanford university". In such cases, the user can very quickly tell the best result from the others -- and it's usually the first result on major search engines.
  • Informational queries, where the user has a broader topic. e.g., "diabetes pregnancy". In this case, there is no single right answer. Suppose there's a really fantastic result on page 4 that provides better information than any of the results on the first three pages. Most users will not even know this result exists! Therefore, their usage behavior does not actually provide the best feedback on the rankings.

Such queries are one reason why Google has to employ in-house raters, who have been instructed to look at a wider window than the first 10 results. But even such raters can only look at a restricted window of results. And using such raters also makes the training set much, much smaller than could be gathered from real usage data. This fact might explain Google's reluctance to fully trust a machine-learned model. Even tens of thousands of professionally rated queries might not be sufficient training data to capture the full range of queries that are thrown at a search engine in real usage. So there are probably outliers (i.e., black swans) that might throw a machine-learned model way off.

I'll close with an interesting vignette. A couple of years ago, Yahoo was making great strides in search relevance, while Google apparently was not improving as fast. Recall then that Yahoo trumpeted data showing their results were better than Google's. Well, the Google team was quite amazed, because their data showed just the opposite: their results were better than Yahoo's. They couldn't both be right -- or could they? It turns out that Yahoo's benchmark contained queries drawn from Yahoo search logs, and Google's benchmark likewise contained queries drawn from Google search logs. The Yahoo ranking algorithm performed better on the Yahoo benchmark and the Google algorithm performed better on the Google benchmark.

Two learnings from this story: one, the results depend quite strongly on the test set, which again speaks against machine-learned models. And two, Yahoo and Google users differ quite significantly in the kinds of searches they do. Of course, this was a couple of years ago, and both companies have evolved their ranking algorithms since then.

June 11, 2008 in Data Mining, Search | Permalink | Comments (15) | TrackBack (0)

Are Machine-Learned Models Prone to Catastrophic Errors?

A couple of days ago I had coffee with Peter Norvig. Peter is currently Director of Research at Google. For several years until recently, he was the Director of Search Quality -- the key man at Google responsible for the quality of their search results. Peter also is an ACM Fellow and co-author of the best-selling AI textbook Artificial Intelligence: A Modern Approach. As such, Peter's insights into search are truly extraordinary.

I have known Peter since 1996, when he joined a startup called Junglee, which I had started together with some friends from Stanford. Peter was Chief Scientist at Junglee until 1998, when Junglee was acquired by Amazon.com. I've always been a great admirer of Peter and have kept in touch with him through his short stint at NASA and then at Google. He's now taking a short leave of absence from Google to update his AI textbook. We had a fascinating discussion, and I'll be writing a couple of posts on topics we covered.

It has long been known that Google's search algorithm actually works at two levels:
  1. An offline phase that extracts "signals" from a massive web crawl and usage data. An example of such a signal is PageRank. These computations need to be done offline because they analyze massive amounts of data and are time-consuming. Because these signals are extracted offline, and not in response to user queries, they are necessarily query-independent. You can think of them as tags on the documents in the index. There are about 200 such signals.
  2. An online phase, in response to a user query. A subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines the 200 signals in-memory using a proprietary formula.
The online, query-dependent phase appears to be made-to-order for machine learning algorithms. Tons of training data (both from usage and from the armies of "raters" employed by Google), and a manageable number of signals (200) -- these fit the supervised learning paradigm well, bringing into play an array of ML algorithms from simple regression methods to Support Vector Machines. And indeed, Google has tried methods such as these. Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses.
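Peter didn't describe the specific learner they tried, so purely as an illustration of the setup (signals in, rater scores out), here is a toy least-squares regression that learns weights for a handful of made-up signals from made-up rater labels, standing in for the 200 real signals and the real training data.

```python
# Toy illustration of the supervised setup described above: learn weights that
# combine per-document signals into a relevance score. The signals, ratings,
# and plain least-squares learner are all stand-ins, not Google's method.
import numpy as np

# Each row: (pagerank-ish score, anchor-text match, query-in-title) for one
# query/document pair; in reality there would be ~200 such signals.
X = np.array([
    [0.9, 0.8, 1.0],
    [0.7, 0.1, 0.0],
    [0.2, 0.9, 1.0],
    [0.1, 0.2, 0.0],
    [0.5, 0.5, 1.0],
])
# Relevance scores assigned by human raters (the training labels).
y = np.array([4.0, 2.0, 3.0, 1.0, 3.0])

weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print("learned signal weights:", weights)

# Rank two candidate documents for a new query by their combined score.
candidates = np.array([[0.6, 0.4, 1.0],
                       [0.8, 0.3, 0.0]])
print("scores:", candidates @ weights)
```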

The big surprise is that Google still uses the manually-crafted formula for its search results. They haven't cut over to the machine learned model yet. Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.

This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known problem of induction in logic.

The classic example is the Black Swan, popularized by Nassim Taleb's eponymous book. Before the 17th century, the only swans encountered in the Western world were white. Thus, it was reasonable to conclude that "all swans are white." Of course, when Australia was discovered, so were the black swans living there. Thus, a black swan is a shorthand for something unexpected that is outside the model.

Taleb argues that black swans are more common than commonly assumed in the modern world. He divides phenomena into two classes:
  1. Mediocristan, consisting of phenomena that fit the bell curve model, such as games of chance, height and weight in humans, and so on. Here future observations can be predicted by extrapolating from statistics computed on past observations (for example, sample means and standard deviations).
  2. Extremistan, consisting of phenomena that don't fit the bell curve model, such as search queries, the stock market, the length of wars, and so on. Such phenomena can sometimes be modeled using power laws or fractal distributions, and sometimes not. In many cases, the very notion of a standard deviation is meaningless.
Taleb makes a convincing case that most real-world phenomena we care about actually inhabit Extremistan rather than Mediocristan. In these cases, you can make quite a fool of yourself by assuming that the future looks like the past.

The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan. The very metrics these algorithms use, such as precision, recall, and root-mean-square error (RMSE), make sense only in Mediocristan. It's easy to fit the observed data and fail catastrophically on unseen data. My hunch is that humans have evolved to use decision-making methods that are less likely to blow up on unforeseen events (although not always, as the mortgage crisis shows).
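To make the point concrete, here is a toy comparison of my own (not Taleb's): a single black-swan observation dominates RMSE, while a median-based error metric barely moves. The error values are made up.

```python
# Toy illustration: one extreme observation dominates RMSE, while a
# median-based error metric barely notices it. All numbers are invented.
from statistics import median

def rmse(errors):
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def median_abs_error(errors):
    return median(abs(e) for e in errors)

mediocristan = [1.2, -0.8, 0.5, -1.1, 0.9] * 20     # well-behaved errors
extremistan = mediocristan[:-1] + [1000.0]          # one black swan

for name, errs in [("mediocristan", mediocristan), ("extremistan", extremistan)]:
    print(name, "rmse=%.1f" % rmse(errs), "median_abs=%.1f" % median_abs_error(errs))
```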

I'll leave it as an exercise to the interested graduate student to figure out whether new machine learning algorithms can be devised that work well in Extremistan, or prove that it cannot be done.

May 24, 2008 in Data Mining, Search | Permalink | Comments (28) | TrackBack (0)

Data, Algorithms, and Powerset

Powerset finally launched their search engine last night. It's been over two years in the making, and there has been much hype surrounding the company's natural language technology.

The new search engine has received mixed reviews. For example, while Mike Arrington at TechCrunch fell in love at first sight, the comments following the post tell a very different story. The most balanced review I've seen so far is from Danny Sullivan at Search Engine Land.

Disclosure:
My company Kosmix also builds search technology, albeit of a very different kind. I don't consider Kosmix in any way to be a PowerSet competitor, and in this post I'm wearing my blogger hat and not my Kosmix hat. Moreover, this post is not really about PowerSet at all, but about data and algorithms, using PowerSet as the example du jour.

To boil down both my personal experience playing with PowerSet and what I'm seeing in the reviews: some of the features (like the Factz) are cool. The biggest weakness is that PowerSet today searches only two data sets: Wikipedia and Freebase. And while these are both very large and very useful datasets, they're not nearly enough to answer many real-world user queries.

The TechCrunch comments and Danny Sullivan's posts both contain examples that point out the strengths and weaknesses of PowerSet's search. Here's an example from my own personal experience: in my previous post, I used the phrase "new wine in an old bottle". I was curious about the origin of this phrase (since I had also heard people say "old wine in a new bottle" -- which came first?). So I typed the search "origin of expression new wine in old bottle" into both PowerSet and Google. Google nailed it in the first result (from Yahoo Answers), while PowerSet was lost. Ditto for "How old was Gandhi when he was killed?"

While people can argue about the applicability of natural language processing to search, the quality of PowerSet's implementation, and so on, I have a much simpler point of view. I believe PowerSet has some pretty cool IP in its algorithms and has done as good a job with it as it possibly can. The problem is, they don't have enough data to work with. (Another way of saying this is, PowerSet's index is a SubSet of Google's index.)

The primary reason Google's search is useful is that there is lots and lots of data on the web, and Google indexes so much of it. Yes, Google's search algorithms are fantastic, but they wouldn't work anywhere near as well if they didn't have so much data underneath them. Consider my search about new wine and old bottles. The reason Google nails it is that there's a page on the web that uses the exact phrase I typed into the search box. Same for the Gandhi example. The cool thing for Google is, people who search often use phrases in the same way as people who write web pages (especially on community sites such as Yahoo Answers). Instead of doing any NLP, they just need to index a really huge corpus of text and look for near-exact matches.
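Here is a toy sketch of that idea. The three-document corpus and the scoring function are made up; the point is only that ranking by the longest run of query words shared with a document, with no linguistic analysis at all, already surfaces the Yahoo Answers style page.

```python
# Toy sketch of why a big corpus plus near-exact phrase matching goes a long
# way: no parsing, just look for documents sharing a long run of query words.
# The mini-corpus and scoring are invented for illustration.

corpus = {
    "yahoo-answers-123": "the expression new wine in old bottle comes from a parable",
    "wine-blog-7":       "we tasted an old wine from a dusty bottle in the cellar",
    "gandhi-bio":        "gandhi was 78 years old when he was assassinated in 1948",
}

def longest_shared_run(query, doc):
    """Length of the longest run of consecutive query words appearing in doc."""
    q, d = query.lower().split(), doc.lower().split()
    best = 0
    for i in range(len(q)):
        for j in range(len(d)):
            k = 0
            while i + k < len(q) and j + k < len(d) and q[i + k] == d[j + k]:
                k += 1
            best = max(best, k)
    return best

query = "origin of expression new wine in old bottle"
ranked = sorted(corpus, key=lambda doc_id: -longest_shared_run(query, corpus[doc_id]))
print(ranked[0])   # yahoo-answers-123: it shares the run "new wine in old bottle"
```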

To use a phrase from an earlier post on this blog, "More data usually beats better algorithms". And nowhere is that more true than in search. (The natural question to ask is, why not both more data and better algorithms? The answer in this case is that many of PowerSet's techniques seem to rely on the structure and general cleanliness of Wikipedia and Freebase, so it's not clear how well they will scale to the web as a whole.)

What of PowerSet? I for one really love their Wikipedia search, and will use them to search Wikipedia and Freebase. As Danny Sullivan points out in his post, perhaps the right business model for PowerSet lies in enterprise search rather than in web search.

May 12, 2008 in Data Mining, Search | Permalink | Comments (5) | TrackBack (0)

Why Yahoo Glue is a Bigger Deal than You Think

Yahoo's India team quietly launched Yahoo Glue, away from the glare of the media circus around Yahoo these days. Glue has been noticed by a few commentators (e.g., TechCrunch), who mostly see it as a response to Google's Universal Search (or Ask's search interface). They might be missing the point. Glue is fundamentally different from Universal Search and represents a whole new way of thinking about search.

The fundamental distinction is: where do the results come from? Google's Universal Search searches across Google's properties: web search, image search, YouTube, Scholar, Google News, and so on.  Glue, on the other hand, includes not just Yahoo properties (web search, Yahoo! Answers, Flickr), but also pulls in results from WebMD, HowStuffWorks, and even (ironically) Google Blog Search and YouTube. For example, compare Universal Search (diabetes) with Glue (diabetes). In this respect, Glue bears more similarity to mashups such  as Addict-o-Matic than to Universal Search.

The Web today is a far different place from what it was when Google's search paradigm was invented. The web was then a collection of documents; it is now a collection of applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Each application has its own deep collection of data, and we tend to think of them as being different information types rather than just "web pages". Yet the search model flattens each of these rich interactive services into a collection of web pages that can be indexed -- that's really putting very new wine in a very old bottle.

The one-index-fits-all model forces a linear ranking of incomparable types, such as images, videos, how-to content, facts, and opinion. That's really comparing apples to oranges. The correct way is to treat each of these information sources as a first-class citizen, with its own kind of data and interaction paradigm. Let them compete for real estate on a 2-dimensional search results page (as opposed to a 1-dimensional list), and may the best ones win.

Of course, there are many technical challenges to be solved in getting there. Sending each query to every web application is a recipe for disaster, and dealing with many different APIs is not conducive to scaling. To keep themselves relevant, however, search engines must evolve from indexes into intelligent routers of searches to third-party applications.

May 09, 2008 in Search | Permalink | Comments (11) | TrackBack (0)

More data beats better algorithm at predicting Google earnings

Readers of this blog will be familiar with my belief that more data usually beats better algorithms. Here's another proof point.

Google announced earnings today, and it was a shocker -- for most of Wall Street, which was in a tizzy based on ComScore's report that paid clicks grew by a mere 1.8% year-over-year. In the event, paid clicks grew by a healthy 20% from last year and revenue grew by 30%.

In comparison, SEM optimizer Efficient Frontier released their Search Performance Report on their blog a few hours ahead of Google's earnings call. EF manages the SEM campaigns of some of the largest direct marketers, handling more SEM spend than anyone in the world outside of the search engines themselves. Their huge volumes of data give them more insight into Google's marketplace than anyone outside of Google.

EF reported a 19.2% increase in paid clicks and 11.2% increase in CPCs at Google Y-O-Y. Do the math (1.192*1.112 = 1.325), that's a 32.5% Y-O-Y revenue increase. That's the closest anyone got to the real numbers!  And this quarter is not a flash in the pan: in January, EF reported a 29% Y-O-Y increase in SEM spend, with 97% of the increased spend going to Google: that is, about a 28% Y-O-Y revenue increase for Google. That compares very favorably with the actual reported increase of 30%.

As Paul Kedrosky points out, this is a huge indictment of ComScore's methodology (ComScore's shares are trading down 8% after-hours following the Google earnings call). ComScore sets a lot of store by their "panel-based" approach, which collects data from a panel of users, similar to Nielsen's method of measuring TV viewing using data from the few households that have its set-top boxes installed. ComScore has been in this business longer than anyone else, and has arguably the best methodology (i.e., algorithm) in town to analyze the data. They're just not looking at the right data, or enough of it. Some simple math using the mountain of data from EF handily beats the analysis methodology developed over several years using data from a not-so-large panel.

To my mind, this also puts in doubt the validity of ComScore's traffic measurement numbers. For websites where I personally know the numbers (based on server logs), both Quantcast and Hitwise come far closer to reality than ComScore. The latter two don't rely as heavily on a small panel. ComScore's value today is largely driven by the fact that advertisers and ad agencies trust their numbers more than the upstarts. Advertiser inertia will carry them for a while; but a few more high-profile misses could change that quickly.

Disclosure: Cambrian Ventures is an investor in EF. However, I don't have access to any information beyond that published in their public report.

April 18, 2008 in Advertising, Data Mining, Search | Permalink | Comments (6) | TrackBack (0)

The story behind Google's crawler upgrade

Alon Halevy and Jayant Madhavan have a post on Google's Webmaster blog disclosing that Google is now harvesting data that is hidden behind HTML forms. I'm very pleased about this, because it's something I had a hand in making happen.

I've known Alon since 1995, when he was a researcher and I was his summer intern at Bell Labs in New Jersey (those were the days when Bell Labs was still relevant to Computer Science research). Until that summer, Alon and I had been working independently of each other on the topic of integrating information from across different data sources (such as websites) -- Alon at Bell Labs and I at Stanford, for my PhD dissertation. We put together our ideas and created the Information Manifold, an information integration system that introduced a key idea: you could describe each data source (declaratively) by a local data schema; map that schema to a global schema that unified all the sources; and process queries across the global schema.
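For flavor, here is a cartoon of that source-description idea, not the Information Manifold itself: each source declares, against an invented global Book schema, which attributes it can supply and which it can be queried on, and a trivial planner picks sources accordingly. The schema, sources, and planner are all made up for illustration.

```python
# Hedged sketch of describing data sources against a global schema. The Book
# schema, the sources, and the one-step planner are invented; this conveys only
# the flavor of the approach, not the Information Manifold itself.

GLOBAL_SCHEMA = {"title", "author", "price", "isbn"}

SOURCES = {
    "price-site": {"provides": {"title", "price", "isbn"}, "queryable_on": {"isbn", "title"}},
    "review-site": {"provides": {"title", "author"}, "queryable_on": {"author"}},
}

def plan(query_attrs, bound_attr):
    """Pick sources that accept a lookup on bound_attr, and report which of the
    requested attributes each one would contribute."""
    steps = []
    for name, desc in SOURCES.items():
        if bound_attr in desc["queryable_on"]:
            steps.append((name, query_attrs & desc["provides"]))
    return steps

# "Find price and author for a given title": only price-site accepts a title
# lookup directly; getting the author would need a further join via review-site.
print(plan({"price", "author"}, bound_attr="title"))
```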

This task is complicated by the fact that each data source exposes different query processing capabilities, by virtue of the different HTML forms it uses. We came up with some simple solutions to this problem, but several difficulties remained. In the meantime, Alon and I went our separate ways with our careers; Alon joined the CS faculty at the University of Washington, while I started a company, Junglee, that applied information integration ideas to build the first comparison shopping engine and online job bank. Junglee was acquired by Amazon.com in 1998; in 2000, I started a venture capital firm called Cambrian Ventures and, in late 2004, a startup, Kosmix.

Alon and I kept in touch through all this, and by 2004, he and his student Jayant Madhavan had made great progress on "schema matching", a research area that has application to the HTML forms problem we had first encountered in the Information Manifold. I was very happy to egg them on to commercialize their breakthrough, and to provide the funding for the resulting company, Transformic, through Cambrian Ventures.

Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. Called the Invisible Web, this data is often estimated to be even larger in size and usefulness than the "visible web" that web crawlers usually index. The key problems in indexing the Invisible Web are:

  1. Determining which web forms are worth penetrating.
  2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with checkboxes, radio buttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging -- we need to understand the semantics of the input box to guess possible valid inputs. A toy sketch of the two cases follows this list.
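Here is the promised toy sketch. The form, the candidate values, and the fake submit function are all made up, and this is not Transformic's or Google's actual method; the point is only that finite inputs can simply be enumerated, while free-text inputs have to be probed with guessed values and only the productive ones kept.

```python
# Toy sketch of crawling behind a form: enumerate the finite inputs directly,
# and probe free-text inputs with candidate values, keeping the submissions
# that yield result pages. All names and values are invented for illustration.
from itertools import product

form = {
    "make": {"type": "select", "options": ["honda", "toyota"]},
    "year": {"type": "select", "options": ["2007", "2008"]},
    "city": {"type": "text"},  # free text: we have to guess plausible values
}

CANDIDATE_VALUES = {"city": ["san francisco", "palo alto", "xyzzy"]}

def fake_submit(values):
    """Stand-in for submitting the form; pretend only real cities return results."""
    return values["city"] != "xyzzy"

def enumerate_submissions(form):
    names = list(form)
    choices = [
        form[n]["options"] if form[n]["type"] == "select" else CANDIDATE_VALUES[n]
        for n in names
    ]
    for combo in product(*choices):
        values = dict(zip(names, combo))
        if fake_submit(values):
            yield values  # these result pages would be fetched and indexed

print(sum(1 for _ in enumerate_submissions(form)))  # 2 * 2 * 2 = 8 useful submissions
```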

Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic -- a nice return for Cambrian, but also a great place for the Transformic team to make a real difference with their ideas. The Transformic team have been working hard for the past two years perfecting the technology and integrating it into the Google crawler. I'm very happy to have played a small role in their success story.

An aside on the world of academic research. Alon and I had some difficulty getting our Information Manifold research published: it was rejected the first time we submitted it to a leading academic conference; we had to address a lot of criticism and skepticism from the establishment, and it was finally published at the VLDB conference in 1996. Remarkably, this paper has since become one of the most cited and influential papers in its field (see this survey, page 52). In 2006, the paper received the 10-year Best Paper Award at VLDB, given retrospectively to the publication from 10 years ago that made the most impact. The moral of the story is, sometimes it pays to swim against the current in academic research; what is fashionable today is rarely what will ultimately make the most impact.

April 12, 2008 in Search | Permalink | Comments (1) | TrackBack (0)
