I teach a class on Data Mining at Stanford. Students in my class are expected to do a project that does some non-trivial data mining. Many students opted to try their hand at the Netflix Challenge: to design a movie recommendations algorithm that does better than the one developed by Netflix.
Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!
Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best results on the Netflix leaderboard!! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often suprised that many people in the business, and even in academia, don't realize this.
Another fine illustration of this principle comes from Google. Most people think Google's success is due to their brilliant algorithms, especially PageRank. In reality, the two big innovations that Larry and Sergey introduced, that really took search to the next level in 1998, were:
- The recognition that hyperlinks were an important measure of popularity -- a link to a webpage counts as a vote for it.
- The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.
First generation search engines had used only the text of the web pages themselves. The addition of these two additional data sets -- hyperlinks and anchortext -- took Google's search to the next level. The PageRank algorithm itself is a minor detail -- any halfway decent algorithm that exploited this additional data would have produced roughly comparable results.
The same principle also holds true for another area of great success for Google: the AdWords keyword auction model. Overture had previously proved that the model of having advertisers bid for keywords could work. Overture ranked advertisers for a given keyword based purely on their bids. Google added some additional data: the clickthrough rate (CTR) on each advertiser's ad. Thus, to a first approximation, Google ranks advertisers by the product of their bid and their CTR (this was true in the first version of AdWords; they now use more considerations). This simple change made Google's ad marketplace much more efficient than Overture's. Notice that the algorithm itself is quite simple; it is the addition of the new data that made the difference.
To sum up, if you have limited resources, add more data rather than fine-tuning the weights on your fancy machine-learning algorithm. Of course, you have to be judicious in your choice of the data to add to your data set.
Update: Thanks for all the comments. Some of them raise interesting issues, which I'd like to address in a follow-up post. I'm traveling without good internet access until the weekend, so it will have to wait until I get back. Stay tuned!
Update 2: There's now a follow-up post that addresses some of the issues raised in the comments.
Update 3: Here's another illustration of the same point: Simple math using lots of data is way more accurate than comScore's analysis in predicting Google's earnings.