A couple of days ago I had coffee with Peter Norvig. Peter is currently Director of Research at Google. For several years until recently, he was the Director of Search Quality -- the key man at Google responsible for the quality of their search results. Peter is also an ACM Fellow and co-author of the best-selling AI textbook Artificial Intelligence: A Modern Approach. As such, Peter's insights into search are truly extraordinary.
I have known Peter since 1996, when he joined a startup called Junglee, which I had started together with some friends from Stanford. Peter was Chief Scientist at Junglee until 1998, when Junglee was acquired by Amazon.com. I've always been a great admirer of Peter and have kept in touch with him through his short stint at NASA and then at Google. He's now taking a short leave of absence from Google to update his AI textbook. We had a fascinating discussion, and I'll be writing a couple of posts on topics we covered.
It has long been known that Google's search algorithm actually works at 2 levels:
- An offline phase that extracts "signals" from a massive web crawl and usage data. An example of such a signal is PageRank. These computations need to be done offline because they analyze massive amounts of data and are time-consuming. Because these signals are extracted offline, and not in response to user queries, they are necessarily query-independent. You can think of them as tags on the documents in the index. There are about 200 such signals.
- An online phase, in response to a user query. A subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines the 200 signals in memory using a proprietary formula. (A rough sketch of this two-phase flow appears below.)
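To make this two-phase design concrete, here is a minimal sketch in Python. The signal names, values, and weights are invented for illustration only; Google's actual ~200 signals and its proprietary ranking formula are, of course, not public.

```python
# A minimal, hypothetical sketch of the two-phase flow described above.
# Signal names, values, and weights are invented for illustration.

# Offline phase: query-independent signals precomputed per document and
# stored alongside the index.
index = {
    "doc1": {"pagerank": 0.82, "anchor_text_match": 0.4, "freshness": 0.1},
    "doc2": {"pagerank": 0.35, "anchor_text_match": 0.9, "freshness": 0.7},
    "doc3": {"pagerank": 0.60, "anchor_text_match": 0.2, "freshness": 0.9},
}

# Hand-tuned weights standing in for the manually crafted formula.
WEIGHTS = {"pagerank": 2.0, "anchor_text_match": 1.5, "freshness": 0.5}

def score(signals):
    """Combine the precomputed signals with a simple weighted sum."""
    return sum(WEIGHTS[name] * value for name, value in signals.items())

def rank(candidate_doc_ids):
    """Online phase: rank the keyword-matched candidates by score."""
    return sorted(candidate_doc_ids, key=lambda d: score(index[d]), reverse=True)

print(rank(["doc1", "doc2", "doc3"]))
```

The point is simply that the expensive work (signal extraction) happens offline, while the per-query work reduces to a cheap in-memory combination of those signals.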
The big surprise is that Google still uses the manually crafted formula for its search results. They haven't cut over to the machine-learned model yet. Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.
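For contrast, here is a toy sketch of what "cutting over to a machine-learned model" could mean in this setting: learn the signal weights from labeled relevance judgments instead of hand-tuning them. The training examples and the plain gradient-descent fit below are my own stand-ins, not anything Google has described.

```python
# A toy illustration (not Google's method): learn the signal weights from
# labeled (signals, relevance) examples instead of hand-tuning them.
# The training data here is fabricated purely for the sketch.

training_data = [
    # (pagerank, anchor_text_match, freshness) -> human relevance judgment
    ((0.9, 0.8, 0.2), 1.0),
    ((0.2, 0.1, 0.9), 0.1),
    ((0.6, 0.7, 0.5), 0.8),
    ((0.1, 0.9, 0.1), 0.4),
]

def learn_weights(data, lr=0.1, epochs=2000):
    """Fit linear weights by stochastic gradient descent on squared error."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

print(learn_weights(training_data))
```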
This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known problem of induction in logic.
The classic example is the Black Swan, popularized by Nassim Taleb's eponymous book. Before the 17th century, the only swans encountered in the Western world were white. Thus, it was reasonable to conclude that "all swans are white." Of course, when Australia was discovered, so were the black swans living there. Thus, a black swan is a shorthand for something unexpected that is outside the model.
Taleb argues that black swans are more common than commonly assumed in the modern world. He divides phenomena into two classes:
- Mediocristan, consisting of phenomena that fit the bell curve model, such as games of chance, height and weight in humans, and so on. Here, future observations can be predicted by extrapolating from statistics computed on past observations (for example, sample means and standard deviations).
- Extremistan, consisting of phenomena that don't fit the bell curve model, such as search queries, the stock market, the length of wars, and so on. Such phenomena can sometimes be modeled using power laws or fractal distributions, and sometimes not. In many cases, the very notion of a standard deviation is meaningless. (A small numerical illustration of the contrast follows this list.)
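Here is a small numerical illustration of the contrast, with distributions chosen purely for the sketch: for a thin-tailed Gaussian (Mediocristan-like), the sample standard deviation settles down as the sample grows; for a heavy-tailed Pareto with shape 1.5 (Extremistan-like, with infinite variance), it never does.

```python
# Summary statistics settle down for a thin-tailed (Mediocristan-like)
# distribution but keep jumping around for a heavy-tailed
# (Extremistan-like) one. Distributions chosen purely for illustration.
import random
import statistics

random.seed(0)

def sample_stdevs(draw, sizes=(1_000, 10_000, 100_000)):
    """Sample standard deviation at increasing sample sizes."""
    return [round(statistics.stdev(draw() for _ in range(n)), 2) for n in sizes]

# Thin tails: a human-height-like Gaussian. The estimate converges quickly.
print("Gaussian:", sample_stdevs(lambda: random.gauss(170, 10)))

# Heavy tails: a Pareto with shape 1.5 has infinite variance, so the
# "standard deviation" never stabilizes -- one extreme draw can dominate it.
print("Pareto:  ", sample_stdevs(lambda: random.paretovariate(1.5)))
```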
The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan. The very metrics these algorithms use, such as precision, recall, and root-mean-square error (RMSE), make sense only in Mediocristan. It's easy to fit the observed data and fail catastrophically on unseen data. My hunch is that humans have evolved to use decision-making methods that are less likely to blow up on unforeseen events (although not always, as the mortgage crisis shows).
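As a made-up example of that failure mode: fit a straight line to exp(x) sampled only on [0, 1], where the curve looks nearly linear. The training RMSE is tiny, yet at x = 10 -- far outside anything the model has seen -- the prediction is off by three orders of magnitude.

```python
# A toy illustration: a model with excellent RMSE on its training data
# can still be wildly wrong on inputs unlike anything it has seen.
# The data and model here are invented purely for the sketch.
import math

# Training data: y = exp(x) sampled only on [0, 1], where it looks almost linear.
xs = [i / 10 for i in range(11)]
ys = [math.exp(x) for x in xs]

# Ordinary least-squares fit of a straight line y = a*x + b.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
a = num / den
b = mean_y - a * mean_x

def predict(x):
    return a * x + b

# In-sample RMSE looks reassuringly small...
rmse = math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / n)
print(f"training RMSE: {rmse:.3f}")

# ...but far outside the training range, the error is enormous.
print(f"prediction at x=10: {predict(10):.1f}, truth: {math.exp(10):.1f}")
```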
I'll leave it as an exercise to the interested graduate student to figure out whether new machine learning algorithms can be devised that work well in Extremistan, or prove that it cannot be done.
I've addressed some of the questions here in a followup post:
http://anand.typepad.com/datawocky/2008/06/how-google-measures-search-quality.html
Posted by: Anand Rajaraman | June 12, 2008 at 07:20 PM
Excellent post!
It occurs to me that most of the data generated on the internet that I've had to analyze and predict (such as user value and behavior) falls under the Extremistan group, simply because the web and websites change so rapidly and extremely.
Posted by: Yaniv Bar-Lev | June 19, 2008 at 09:16 AM
In this context there are a few special cases that the commenters (and the original author) are overlooking:
1) This is not a normal machine learning problem. It is machine learning in an adversarial environment. The other classic example of this is fraud detection. In both cases the adversaries (spammers or fraudsters) are always changing their methods according to what works for them (i.e., what fails for the white hats). It takes special care and lots of human oversight and QA to avoid getting nailed by the bastards.
2) the "human" version of the algorithm is, no doubt, being crafted based on extensive examination of the output of machine learned models. In most large-data situations, human experts working on their own are pretty much guaranteed to not do as well on average as reasonably well built machine models. The augmented human working with machine models for reference, on the other hand, can be a fearsomely effective creature.
Posted by: Ted Dunning | June 23, 2008 at 10:45 AM