I teach a class on Data Mining at Stanford. Students in my class are expected to do a project that does some non-trivial data mining. Many students opted to try their hand at the Netflix Challenge: to design a movie recommendations algorithm that does better than the one developed by Netflix.
Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!
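For concreteness: the contest measures accuracy by root-mean-square error (RMSE) between predicted and actual ratings, so "beating Netflix's algorithm by a certain margin" means reducing RMSE by the required percentage. A minimal scorer looks like this:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings.

    Lower is better; the Netflix Prize leaderboard ranks teams by
    this metric on a held-out set of (user, movie) pairs.
    """
    assert len(predicted) == len(actual) and len(actual) > 0
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(actual))
```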
Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best results on the Netflix leaderboard!! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
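The post doesn't spell out Team B's method, but to make the idea concrete, here is a hypothetical sketch of how external genre data could augment a very simple predictor: estimate a user's rating for an unseen movie from that user's average rating within each of the movie's genres. All names and the algorithm itself are illustrative assumptions, not Team B's actual code.

```python
from collections import defaultdict

def predict_rating(user_ratings, movie_genres, target_movie):
    """Predict one user's rating for target_movie from genre averages.

    user_ratings: dict movie -> rating, for a single user
    movie_genres: dict movie -> set of genre strings (e.g. from IMDB)
    """
    # Average rating this user has given within each genre.
    genre_sums = defaultdict(float)
    genre_counts = defaultdict(int)
    for movie, rating in user_ratings.items():
        for genre in movie_genres.get(movie, ()):
            genre_sums[genre] += rating
            genre_counts[genre] += 1

    # Combine the per-genre averages for the target movie's genres.
    relevant = [genre_sums[g] / genre_counts[g]
                for g in movie_genres.get(target_movie, ())
                if genre_counts[g] > 0]
    if relevant:
        return sum(relevant) / len(relevant)
    # No genre overlap: fall back to the user's overall mean rating.
    return sum(user_ratings.values()) / len(user_ratings)
```

Even a crude rule like this exploits a signal (genre affinity) that is simply absent from the ratings matrix alone, which is the point of the anecdote.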
Another fine illustration of this principle comes from Google. Most people think Google's success is due to their brilliant algorithms, especially PageRank. In reality, the two big innovations that Larry and Sergey introduced, that really took search to the next level in 1998, were:
- The recognition that hyperlinks were an important measure of popularity -- a link to a webpage counts as a vote for it.
- The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.
First generation search engines had used only the text of the web pages themselves. The addition of these two additional data sets -- hyperlinks and anchortext -- took Google's search to the next level. The PageRank algorithm itself is a minor detail -- any halfway decent algorithm that exploited this additional data would have produced roughly comparable results.
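For illustration, the "links as votes" idea fits in a few lines of power iteration. This is a textbook-style sketch of PageRank, not Google's production algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Minimal PageRank by power iteration.

    links: dict page -> list of pages it links to.
    Returns dict page -> score. A link counts as a vote,
    weighted by the linking page's own score.
    """
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its score uniformly.
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank
```

The scores sum to 1 and converge after a few dozen iterations; the heavy lifting is done by the link data itself, which supports the point above that the particular algorithm is a minor detail.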
The same principle also holds true for another area of great success for Google: the AdWords keyword auction model. Overture had previously proved that the model of having advertisers bid for keywords could work. Overture ranked advertisers for a given keyword based purely on their bids. Google added some additional data: the clickthrough rate (CTR) on each advertiser's ad. Thus, to a first approximation, Google ranks advertisers by the product of their bid and their CTR (this was true in the first version of AdWords; they now use more considerations). This simple change made Google's ad marketplace much more efficient than Overture's. Notice that the algorithm itself is quite simple; it is the addition of the new data that made the difference.
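The first-approximation ranking rule described above is simple enough to write down directly. This sketch assumes each ad is represented as an (advertiser, bid-per-click, CTR) tuple; the representation is an illustrative assumption, not the AdWords API:

```python
def rank_ads(ads):
    """Rank ads by expected revenue per impression: bid * CTR.

    ads: list of (advertiser, bid_per_click, ctr) tuples.
    Mirrors the first-approximation AdWords ranking described above;
    the real system uses many more signals.
    """
    return sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
```

Under bid-only (Overture-style) ranking, a $2.00 bid with a 0.5% CTR would outrank a $1.00 bid with a 5% CTR; ranking by bid times CTR reverses the order, since the second ad earns five times as much per impression.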
To sum up, if you have limited resources, add more data rather than fine-tuning the weights on your fancy machine-learning algorithm. Of course, you have to be judicious in your choice of the data to add to your data set.
Update: Thanks for all the comments. Some of them raise interesting issues, which I'd like to address in a follow-up post. I'm traveling without good internet access until the weekend, so it will have to wait until I get back. Stay tuned!
Update 2: There's now a follow-up post that addresses some of the issues raised in the comments.
Update 3: Here's another illustration of the same point: Simple math using lots of data is way more accurate than comScore's analysis in predicting Google's earnings.
"Team B got much better results, close to the best results on the Netflix leaderboard!!"
I am deeply skeptical of this claim. If this is indeed true, why is this team not on the leaderboard? If they got close to 9% improvement, what is stopping them from blending with results from other published algorithms and claiming the $$$?
Posted by: Skeptic | April 01, 2008 at 09:10 PM
I have to say I'm not amazed. We did the same kind of ratings a long time ago. Guess where? Right, in insurance, as part of risk management. Is a redhead under 30 less of a risk than a blonde of the same age? Better yet: what will their next accident be, and how costly? Trying to predict human behavior, its causes and its results, has been going on for a long time. The same was done, for example, for the ships we insured worldwide: we didn't rely on just the information from the shipping company, we did background checks and collected information on the captain, the crew, the companies using and shipping the goods, etc. The more information, the more accurate the risk estimates.
To the topic: the more information you get, external or internal, the better the estimate. Proven in my mind (and in your insurance rates).
Now, this creates an interesting dilemma: how much information can you get, and how much information collection will the targets tolerate? Another subject!
Posted by: Tuomo Stauffer | April 02, 2008 at 12:11 AM
This article pinpoints something that has been true for a long time: more data usually beats better algorithms. Therefore, assuming the data mining algorithms themselves are not the issue (assuming good science behind them, which I have found in all the major software vendors), the issue becomes the quality of the interactive visualization tool that allows end users to make better decisions. Fed Chairman Bernanke, when at Princeton, published a paper that is complementary to this issue.
Posted by: Alberto | April 02, 2008 at 08:35 AM
Regarding "More data usually beats better algorithms"
I would say rather that "more data and better algorithms are two ways to seek better performance". Which (by itself) will provide the greater improvement can only be decided on a case-by-case basis. The experience directly described in this article is, after all, only a single observation.
Posted by: Will Dwinnell | April 02, 2008 at 04:05 PM
How is A better than B without benchmarking, and without criteria for comparison?
Posted by: Anonimo | April 09, 2008 at 04:30 PM
Is X quantity of data better than the best algorithm?
I'd still want better algorithms!
More DATA = more required BANDWIDTH.
Is it really better to have trillions of terabytes of data with a bandwidth of only gigabytes per second?
The computation never terminates! It gets worse!
Posted by: Anonimo | April 09, 2008 at 04:36 PM