A couple of days ago I had coffee with Peter Norvig. Peter is currently Director of Research at Google. For several years until recently, he was the Director of Search Quality -- the key man at Google responsible for the quality of their search results. Peter is also an ACM Fellow and co-author of the best-selling AI textbook Artificial Intelligence: A Modern Approach. As such, Peter's insights into search are truly extraordinary.
I have known Peter since 1996, when he joined a startup called Junglee, which I had started together with some friends from Stanford. Peter was Chief Scientist at Junglee until 1998, when Junglee was acquired by Amazon.com. I've always been a great admirer of Peter and have kept in touch with him through his short stint at NASA and then at Google. He's now taking a short leave of absence from Google to update his AI textbook. We had a fascinating discussion, and I'll be writing a couple of posts on topics we covered.
It has long been known that Google's search algorithm actually works at 2 levels:
- An offline phase that extracts "signals" from a massive web crawl and usage data. An example of such a signal is PageRank. These computations need to be done offline because they analyze massive amounts of data and are time-consuming. Because these signals are extracted offline, and not in response to user queries, they are necessarily query-independent. You can think of them as tags on the documents in the index. There are about 200 such signals.
- An online phase, in response to a user query. A subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines the 200 signals in memory using a proprietary formula (a rough sketch of this combination step follows below).
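The actual signals and formula are proprietary, so here is only an illustrative sketch, in Python, of what "combining query-independent signals with a hand-tuned formula" might look like. The signal names and weights are invented for the example; the point is simply that the online step is a cheap weighted combination of precomputed numbers.

```python
# Illustrative sketch of the online ranking step: combine precomputed,
# query-independent signals with a hand-tuned weighted formula.
# The signal names and weights below are invented; the real signals
# and formula are proprietary.

# Hand-tuned weights, one per signal (in reality there are ~200 signals).
WEIGHTS = {
    "pagerank": 3.0,
    "anchor_text_match": 2.0,
    "title_match": 1.5,
    "url_depth_penalty": -0.5,
}

def score(doc_signals):
    """Combine a document's precomputed signals into one relevance score."""
    return sum(WEIGHTS[name] * doc_signals.get(name, 0.0) for name in WEIGHTS)

def rank(candidate_docs):
    """Rank the keyword-matched candidate documents by their combined score."""
    return sorted(candidate_docs, key=score, reverse=True)

if __name__ == "__main__":
    docs = [
        {"pagerank": 0.8, "anchor_text_match": 0.2, "title_match": 1.0, "url_depth_penalty": 2.0},
        {"pagerank": 0.4, "anchor_text_match": 0.9, "title_match": 0.5, "url_depth_penalty": 1.0},
    ]
    print(rank(docs))
```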
The big surprise is that Google still uses the manually crafted formula for its search results. They haven't cut over to the machine-learned model yet. Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.
This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known problem of induction in logic.
The classic example is the Black Swan, popularized by Nassim Taleb's eponymous book. Until the late 17th century, the only swans encountered in the Western world were white. Thus, it was reasonable to conclude that "all swans are white." Of course, when Australia was discovered, so were the black swans living there. Thus, a black swan is shorthand for something unexpected that lies outside the model.
Taleb argues that black swans are more common in the modern world than generally assumed. He divides phenomena into two classes:
- Mediocristan, consisting of phenomena that fit the bell curve model, such as games of chance, height and weight in humans, and so on. Here, future observations can be predicted by extrapolating from statistics computed on past observations (for example, sample means and standard deviations).
- Extremistan, consisting of phenomena that don't fit the bell curve model, such as search queries, the stock market, the length of wars, and so on. Such phenomena can sometimes be modeled using power laws or fractal distributions, and sometimes not. In many cases, the very notion of a standard deviation is meaningless (a small numerical illustration follows this list).
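To make the distinction concrete, here is a small numerical sketch (using NumPy; the distributions and parameters are my own choices, not Taleb's). For Gaussian data the sample standard deviation settles down quickly as you see more data; for a heavy-tailed Pareto distribution with infinite variance it never does, because a single extreme draw can dominate everything seen so far.

```python
# Sketch: sample statistics are informative in Mediocristan (Gaussian data)
# but can be dominated by a single extreme draw in Extremistan (heavy tails).
import numpy as np

rng = np.random.default_rng(0)

def running_std(samples):
    """Standard deviation of the first n samples, for growing n."""
    return [np.std(samples[:n]) for n in (100, 1_000, 10_000, 100_000)]

gaussian = rng.normal(loc=0.0, scale=1.0, size=100_000)
# A Pareto draw with shape alpha=1.1 has infinite variance: no amount of
# past data pins down the "typical" spread.
heavy_tailed = rng.pareto(1.1, size=100_000)

print("Gaussian std as n grows:    ", [round(s, 2) for s in running_std(gaussian)])
print("Heavy-tailed std as n grows:", [round(s, 2) for s in running_std(heavy_tailed)])
```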
The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan. The very metrics these algorithms use, such as precision, recall, and root-mean-square error (RMSE), make sense only in Mediocristan. It's easy to fit the observed data and fail catastrophically on unseen data. My hunch is that humans have evolved to use decision-making methods that are less likely to blow up on unforeseen events (although not always, as the mortgage crisis shows).
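A tiny illustration of that last point: the sketch below (an assumed toy setup, nothing more) fits a flexible polynomial to noisy data observed on a narrow range. The training RMSE looks excellent, yet the prediction at a point well outside the training range is wildly wrong -- a miniature version of a model meeting a query type it has never seen.

```python
# Sketch: a model can have excellent RMSE on the data it was fit to and still
# fail badly on inputs unlike anything in the training set (extrapolation).
import numpy as np

rng = np.random.default_rng(1)

# Training data: y = sin(x) plus noise, observed only on [0, 3].
x_train = np.linspace(0, 3, 30)
y_train = np.sin(x_train) + rng.normal(scale=0.05, size=x_train.size)

# Fit a degree-7 polynomial -- flexible enough to nail the training data.
coeffs = np.polyfit(x_train, y_train, deg=7)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

train_rmse = rmse(y_train, np.polyval(coeffs, x_train))

# An "unforeseen" input, well outside the training range.
x_new = 6.0
print(f"training RMSE:        {train_rmse:.3f}")                      # small
print(f"prediction at x=6.0:  {np.polyval(coeffs, x_new):.1f}")       # typically far off
print(f"true value sin(6.0):  {np.sin(x_new):.3f}")
```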
I'll leave it as an exercise to the interested graduate student to figure out whether new machine learning algorithms can be devised that work well in Extremistan, or prove that it cannot be done.
Is it the case that human-crafted formulae are better at handling unseen searches? Is this supported by current data? If so, it is natural to wonder what "algorithm" humans are using.
Posted by: Chandra | May 24, 2008 at 07:46 PM
Once again a great post! I wonder if a hybrid approach can be used. When the input data is very different from the training data, the model could switch to the manually crafted one, while all such data continues to either improve or create machine-learned models. As confidence in the machine-learned model improves, the switch can be reverted. One takeaway for me is that models should come in pairs, a machine-learned one along with a human-crafted one, so that if one starts to fail on unexpected data the other can take over.
Posted by: abhishek | May 25, 2008 at 12:21 AM
What about neural networks? Aren't they supposed to work as well in Extremistan as in Mediocristan when trained right?
Posted by: anshul | May 25, 2008 at 12:22 AM
I agree with the comment:
""Taleb makes a convincing case that most real-world phenomena we care about actually inhabit Extremistan rather than Mediocristan.""
There is also the reverse "Hiding in plain sight" problem that causes humans to miss unusual events.
to anshul: I have read that feedback in neural networks can create the same chaotic phenomena as fractal or cellular models do - maybe someone knows more about this.
Good post - look forward to the sequels
Posted by: bcarpent1228 | May 25, 2008 at 06:05 AM
I liked Taleb's books (though I thought Fooled by Randomness was better written than The Black Swan), but I think there are two separate issues here.
The first is the extent to which we can extrapolate from the past (i.e., the training data) to the future. The second is whether the variable of interest looks more or less like a Gaussian.
Both algorithms and humans are susceptible to modeling failures on both accounts. Indeed, an algorithm is nothing more than a codification of a human formulation of the problem, put on automatic. At least the machine is not itself subject to cognitive biases. But conversely our algorithms aren't good at expanding their world views. Here is where I see the most value in having human overseers around to help stave off the "black swans" of catastrophic failure.
Posted by: Daniel Tunkelang | May 25, 2008 at 06:33 AM
The human mind is computable, so is any ranking algorithm invented by puny humans... One day we will take over ha ha ha
Posted by: robocop | May 25, 2008 at 07:40 AM
I thought the mortgage crisis was the side effect of computer models using only a couple of decades (at most) worth of market data; a classic case of a machine model breaking down when the market behaved unexpectedly (interest rate resets).
The continued use of these models, even after they broke down in early 2007, was simple fraud-for-profit on the part of the humans.
Posted by: Kyle | May 25, 2008 at 10:07 AM
This raises the interesting question of whether there can be *any* machine learning that works well in Extremistan. As far as I can tell from my limited knowledge, decision-making under unforeseen circumstances is not exactly well understood.
Oh, and the mortgage crisis was not exactly "unforeseen," except by those who profited from it. I've read articles from at least as far back as 2002 warning that the current lending/leveraging scheme was unsustainable.
Posted by: Robert 'Groby' Blum | May 25, 2008 at 10:58 AM
"Machine-learned" models are also hand-crafted, just crafted on a different higher level.
Machine learning always includes "regularization" and cross validation and other technologies to reduce harm on out-of-sample error.
It is true that ML, done poorly, might result in weirder search results for some "out of sample" searches.
But then, continuously retrained ML can also respond to subtle drifts, not generally apparent to humans, that hand crafted algorithms might not.
Humans craft the ML representations such that the regularization procedures automatically bring you closer to a good "default" for out of sample examples.
ML can also try to self-validate---"Am I close to the sample space which I am trained on?", and if not, go to some alternative method.
And in practice, people can and should employ "champion/challenger" strategies to test, in real time out of sample data, various methods.
I am very surprised---I had assumed Google would be using such ML methods for a very long time now.
Posted by: DrChaos | May 25, 2008 at 11:00 AM
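abhishek's and DrChaos's suggestions amount to the same mechanism: check whether an input looks like the training data and, if not, fall back to the hand-crafted formula. Here is a minimal sketch of that idea; the distance check, threshold, and stand-in scoring functions are illustrative assumptions, not anything Google is known to use.

```python
# Sketch of the fallback idea from abhishek's and DrChaos's comments: score an
# input with the learned model only when it looks like the training data, and
# otherwise fall back to a hand-crafted formula. The distance measure and
# threshold here are illustrative stand-ins.
import numpy as np

class HybridRanker:
    def __init__(self, train_features, learned_model, handcrafted_model, threshold=3.0):
        self.mean = train_features.mean(axis=0)
        self.std = train_features.std(axis=0) + 1e-9
        self.learned = learned_model
        self.handcrafted = handcrafted_model
        self.threshold = threshold

    def in_distribution(self, features):
        # Crude check: is every (standardized) feature within `threshold`
        # standard deviations of the training mean?
        z = np.abs((features - self.mean) / self.std)
        return bool(np.all(z < self.threshold))

    def score(self, features):
        if self.in_distribution(features):
            return self.learned(features)
        return self.handcrafted(features)

if __name__ == "__main__":
    train = np.random.default_rng(0).normal(size=(1000, 5))
    ranker = HybridRanker(
        train,
        learned_model=lambda f: float(f.sum()),   # stand-in learned scorer
        handcrafted_model=lambda f: float(f[0]),  # stand-in hand-tuned scorer
    )
    print(ranker.score(np.zeros(5)))     # in-distribution -> learned model
    print(ranker.score(np.full(5, 10)))  # far from training data -> fallback
```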
abhishek said: "One takeaway for me is that models should come in pairs, a machine-learned one along with a human-crafted one, so that if one starts to fail on unexpected data the other can take over."
When do you define failure? If it's after the engine has returned bad results for a search query, it's already too late.
If there is a trend of systematic failure, how do you detect it? Maybe if a larger proportion of users stop clicking on the search results... but you'd have to leave the algorithm in place for a while to see if it's a real trend or just a coincidence.
In other words, there's no way to swap algorithms on the fly because you suspect the results might be bad. At best, you can study the data when the human model and the computer model are widely divergent. But again, that's offline, not online.
Posted by: Dan Lewis | May 25, 2008 at 11:09 AM
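The monitoring question Dan Lewis raises -- is a drop in click-through a real trend or a coincidence? -- is essentially a statistical test on two windows of traffic. A minimal sketch (the window sizes, counts, and threshold are made up):

```python
# Sketch: is a drop in click-through rate a real trend or just noise?
# A simple two-proportion z-test on a baseline window vs. a recent window.
from math import sqrt

def ctr_dropped(clicks_base, shows_base, clicks_recent, shows_recent, z_crit=2.33):
    p1 = clicks_base / shows_base
    p2 = clicks_recent / shows_recent
    p = (clicks_base + clicks_recent) / (shows_base + shows_recent)
    se = sqrt(p * (1 - p) * (1 / shows_base + 1 / shows_recent))
    z = (p1 - p2) / se
    return z > z_crit  # True only if the drop is unlikely to be noise

print(ctr_dropped(52_000, 100_000, 5_150, 10_000))  # small dip: likely noise -> False
print(ctr_dropped(52_000, 100_000, 4_500, 10_000))  # big drop: likely real  -> True
```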
Great post... I think the most important thing here is to be open to the possibility that machine models can be wrong. Just look at the ratings fiasco at Moody's with sub-prime mortgage-backed securities. If one is open to the possibility that machine models can be wrong, then corrective actions can be taken before catastrophic errors occur.
Thanks, Jitendra
Posted by: Jitendra | May 25, 2008 at 12:33 PM
Both the post and comments are fantastic.
(As I understand it) chaotic results arise from the computational limits of computers - it should be fairly easy for a human (who is doing the search) to detect chaotic results. Basically, if I determine the search results are in error, then I will modify my search parameters.
My concern in any search is not the results found but the results not found. I also get confused when formulating a complicated search, assuming that "if I ask the right question I will get the right answers."
Possibly a greater effort on search parameters and methods (opensearch.org) would assist the server algorithms.
Posted by: bcarpent1228 | May 25, 2008 at 02:07 PM
Most neural networks, such as perceptrons, assume a linear relation between their input and output, so they work in Mediocristan.
I don't know if other types of networks work differently. Boltzmann machines have a "factorial model" (whatever that means exactly), so their model is not linear, and also generally impossible to learn efficiently, but of course any model is a hypothesis about the data.
Posted by: Tobu | May 25, 2008 at 02:19 PM
Are the black swans completely and truly novel?
Even if we haven't encountered a black swan before, we know that a swan is an animal and that animals come in various shades, including black. So the posterior probability of encountering a black swan should be low but not zero. Maybe the solution is to use methods that learn models at multiple levels of generality. Of course, that requires some ontology for dealing with the training/testing samples.
Posted by: abhik | May 25, 2008 at 07:36 PM
With reference to the last commenter, even if we have learned a model at different levels of generality, it is quite hard to figure out which level to apply. I think it's inherently hard for an automated system to know when it is wrong; perhaps the best it can do is to know when it is not confident of its answer.
Taking a more data-intensive approach as outlined in previous posts, I wonder if the injection of random outliers may allow ML algorithms to better handle Extremistan, or at least know not to be confident when faced with it.
With respect to Google, I wonder how often these hand-crafted algorithms have been tuned and what mechanisms those researchers use to do the tuning. Perhaps we can have a program replicate that process. Then the ML algorithm would have a tutor/examiner who routinely subjects it to testing and validation.
Posted by: leotor | May 25, 2008 at 08:24 PM
So what would catastrophic failure look like in a top-ten search results list?
Spam?
Results that do not contain my keyword?
Results in different languages?
404 not found? 500 error?
Google's hard disk gets reinitialized?
Just what are we talking about here?
Posted by: Mark | May 25, 2008 at 09:12 PM
Well, I believe human-crafted models are as susceptible to catastrophic errors as machine-learned models are. Fundamentally, it should not make much difference *who* made the model. After all, a model is a model, irrespective of how and by whom it was created.
In fact, I would argue that machine learning is better than human crafting because the former removes the biases and gaps in knowledge that are inevitable in human-crafted models. Moreover, machine-learned models are trainable by definition. So even if a model doesn't perform as well as it should on some inputs, it can always be retrained. Can we do this with hand-crafted models?
By the way, on a different note, on many searches I am observing that Google is throwing up more irrelevant results than it did in the past. Are human-crafted models at play?
Posted by: Paras Chopra | May 25, 2008 at 10:46 PM
Dan Lewis said: "When do you define failure? ..."
It is often possible to derive a level of confidence in the results of an ML algorithm. Suppose you have a neural network doing a digit-recognition task. If the '5' node is getting full activation, you can be fairly confident that the digit was a 5. Whereas if it did not get as much activation, or the '6' node has only slightly less, you can't be as sure, and you might want to fall back on something else (i.e., a hand-tuned formula).
Also, I think it is possible to detect confidence in other ways (possibly even by teaching the algorithm to generate an extra output for the confidence level).
Tobu said: "Most neural networks, such as perceptrons, assume a linear relation between their input and output"
I'm not really sure, but I think this is only the case if you don't use a hidden layer in your networks. I hate to quote Wikipedia, but here goes:
"[N]eural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data."
Posted by: Jordi | May 25, 2008 at 11:36 PM
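Jordi's point about reading confidence off the output layer can be sketched generically: turn the output activations into probabilities with a softmax and trust the prediction only when the top class clears a confidence and margin threshold. The thresholds and example activations below are arbitrary assumptions, not tied to any particular network.

```python
# Sketch of Jordi's suggestion: treat the output activations of a classifier
# as a confidence signal and fall back to another method when the top class
# is not clearly ahead. The activations below are made-up examples.
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def classify_with_fallback(output_activations, min_confidence=0.9, min_margin=0.2):
    probs = softmax(np.asarray(output_activations, dtype=float))
    order = np.argsort(probs)[::-1]
    top, runner_up = probs[order[0]], probs[order[1]]
    if top >= min_confidence and (top - runner_up) >= min_margin:
        return int(order[0])  # confident prediction
    return None               # signal the caller to use the fallback method

# A confident '5' (one activation dominates) vs. an ambiguous 5-or-6 case.
print(classify_with_fallback([0, 0, 0, 0, 0, 9.0, 1.0, 0, 0, 0]))  # -> 5
print(classify_with_fallback([0, 0, 0, 0, 0, 4.0, 3.9, 0, 0, 0]))  # -> None
```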
As the saying goes, "All models (whether machine-learned or manually developed with domain expertise) are wrong. But some are useful." On the utility of models in different scenarios, my approach is to question them as often and as thoroughly as possible and correct them if necessary. This brings to the fore a lot of questions about when to rebuild the model, etc., which should be answered based on the needs and the objectives rather than on philosophical issues. Nice post.
Posted by: Bharatheesh Jaysimha | May 26, 2008 at 02:33 AM
I don't think humans make better decisions around these kinds of rare or unforeseen events. We're just more forgiving when those decisions turn out to be wrong. The decision problems where machine learning methods can fail 'catastrophically' are the same ones where people do: decisions on complex systems with incomplete information or intractable objectives.
In the case of Google's ranking formula, hand tuning a linear model in 200 features can be superior to machine learning since good objectives can be hard to formulate. For example, they surely want to minimize the percentage of abandoned searches -- i.e., p(abandon) -- but that involves knowing p(click) for each result given the *other* results shown. Of course, those conditional parameters are not known, and even if they were, that optimization would still be intractable.
Instead, as described, they have used a 'generate and test' approach: let smart people come up with model features and hand tune parameters for seven or eight years. The result is a broadly successful model unlikely to be beaten by machine learning approaches hampered by both insufficient information and reduced target objectives (such as per-result click-through probability or editorial grade predictions, etc.).
Posted by: Gordon Rios | May 29, 2008 at 04:08 PM
You people are going down!
Posted by: steveballmer | May 31, 2008 at 12:44 AM
Thank you to everyone who posted insightful comments. I'm on vacation until mid-week, when I'll write a follow-up post that addresses some of the points raised in the comments.
Posted by: Anand Rajaraman | June 08, 2008 at 08:16 PM
I think that there is NO difference. At the scale we are talking about, machines do exactly what humans do (since humans are the ones telling them what to do). They just do it faster. The human judges the result based upon some criteria. The same criteria can be supplied to the machine. If the criterion is "learn from experience", then again the machine learns faster since it has more "experience" (i.e., past behavior).
Posted by: Shiraz Kanga | June 12, 2008 at 02:36 PM