A couple of days ago I had coffee with
Peter Norvig. Peter is currently Director of Research at Google. For several years until recently, he was the Director of Search Quality -- the key man at Google responsible for the quality of their search results. Peter also is an ACM Fellow and co-author of the best-selling AI textbook
Artificial Intelligence: A Modern Approach. As such, Peter's insights into search are truly extraordinary.
I have known Peter since 1996, when he joined a startup called
Junglee, which I had started together with some friends from Stanford. Peter was Chief Scientist at Junglee until 1998, when Junglee was acquired by Amazon.com. I've always been a great admirer of Peter and have kept in touch with him through his short stint at NASA and then at Google. He's now taking a short leave of absence from Google to update his AI textbook. We had a fascinating discussion, and I'll be writing a couple of posts on topics we covered.
It has long been known that Google's search algorithm actually works at 2 levels:
- An offline phase that extracts "signals" from a massive web crawl and usage data. An example of such a signal is page rank. These computations need to be done offline because they analyze massive amounts of data and are time-consuming. Because these signals are extracted offline, and not in response to user queries, these signals are necessarily query-independent. You can think of them tags on the documents in the index. There are about 200 such signals.
- An online phase, in response to a user query. A subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines the 200 signals in-memory using a proprietary formula.
The online, query-dependent phase appears to be made-to-order for machine learning algorithms. Tons of training data (both from usage and from the armies of "raters" employed by Google), and a manageable number of signals (200) -- these fit the supervised learning paradigm well, bringing into play an array of ML algorithms from simple
regression methods to
Support Vector Machines. And indeed, Google has tried methods such as these. Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses.
The big surprise is that Google still uses the manually-crafted formula for its search results. They haven't cut over to the machine learned model yet. Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.
This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known
problem of induction in logic.
The classic example is the
Black Swan, popularized by Nassim Taleb's
eponymous book. Before the 17th century, the only swans encountered in the Western world were white. Thus, it was reasonable to conclude that "all swans are white." Of course, when Australia was discovered, so were the black swans living there. Thus, a black swan is a shorthand for something unexpected that is outside the model.
Taleb argues that black swans are more common than commonly assumed in the modern world. He divides phenomena into two classes:
- Mediocristan, consisting of phenomena that fit the bell curve model, such as games of chance, height and weight in humans, and so on. Here future observations can be predicted by extrapolating from variations in statistics based on past observation (for example, sample means and standard deviations).
- Extremistan, consisting of phenomena that don't fit the bell curve model, such as the search queries, the stock market, the length of wars, and so on. Sometimes such phenomena can sometimes be modeled using power laws or fractal distributions, and sometimes not. In many cases, the very notion of a standard deviation is meaningless.
Taleb makes a convincing case that most real-world phenomena we care about actually inhabit Extremistan rather than Mediocristan. In these cases, you can make quite a fool of yourself by assuming that the future looks like the past.
The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan. The very metrics these algorithms use, such as precision, recall, and root-mean square error (RMSE), make sense only in Mediocristan. It's easy to fit the observed data and fail catastrophically on unseen data. My hunch is that humans have evolved to use decision-making methods that are less likely blow up on unforeseen events (although not always, as the mortgage crisis shows).
I'll leave it as an exercise to the interested graduate student to figure out whether new machine learning algorithms can be devised that work well in Extremistan, or prove that it cannot be done.
Recent Comments