I teach a class on Data Mining at Stanford. Students in my class are expected to do a project that does some non-trivial data mining. Many students opted to try their hand at the Netflix Challenge: to design a movie recommendations algorithm that does better than the one developed by Netflix.
Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!
Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?
Team B got much better results, close to the best on the Netflix leaderboard!! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
Another fine illustration of this principle comes from Google. Most people think Google's success is due to their brilliant algorithms, especially PageRank. In reality, the two big innovations that Larry and Sergey introduced, which really took search to the next level in 1998, were:
- The recognition that hyperlinks were an important measure of popularity -- a link to a webpage counts as a vote for it.
- The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.
First-generation search engines had used only the text of the web pages themselves. The addition of these two data sets -- hyperlinks and anchortext -- took Google's search to the next level. The PageRank algorithm itself is a minor detail -- any halfway decent algorithm that exploited this additional data would have produced roughly comparable results.
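To make this concrete, here is a toy sketch of the kind of scoring any such algorithm could do once it has those two extra data sets. The field names and weights are invented for illustration -- this is not Google's actual formula, just a hand-wavy baseline that uses the data.

```python
import math
from collections import Counter

def score(query_terms, page):
    """Toy relevance score: page text plus the two extra data sets,
    anchortext and inbound links. Field names and weights are made up."""
    body = Counter(page["body_terms"])
    title = Counter(page["title_terms"])
    anchors = Counter(page["anchor_terms"])  # text of the links pointing at this page
    text_score = sum(
        body[t] + 3.0 * title[t] + 3.0 * anchors[t]  # anchortext weighted like the title
        for t in query_terms
    )
    popularity = 1.0 + math.log1p(page["inlink_count"])  # each inbound link is a vote
    return text_score * popularity
```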
The same principle also holds true for another area of great success for Google: the AdWords keyword auction model. Overture had previously proved that the model of having advertisers bid for keywords could work. Overture ranked advertisers for a given keyword based purely on their bids. Google added some additional data: the clickthrough rate (CTR) on each advertiser's ad. Thus, to a first approximation, Google ranks advertisers by the product of their bid and their CTR (this was true in the first version of AdWords; they now use more considerations). This simple change made Google's ad marketplace much more efficient than Overture's. Notice that the algorithm itself is quite simple; it is the addition of the new data that made the difference.
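As a sketch of that ranking rule (the field names are mine, and the real system has long since added more factors):

```python
def rank_ads(ads):
    # Rank ads by expected value per impression: bid * CTR.
    return sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)

# A $2.00 bid with a 1% CTR is worth 2 cents per impression; a $1.00 bid
# with a 3% CTR is worth 3 cents, so the lower bidder ranks first.
ads = [
    {"advertiser": "A", "bid": 2.00, "ctr": 0.01},
    {"advertiser": "B", "bid": 1.00, "ctr": 0.03},
]
print([ad["advertiser"] for ad in rank_ads(ads)])  # B before A
```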
To sum up, if you have limited resources, add more data rather than fine-tuning the weights on your fancy machine-learning algorithm. Of course, you have to be judicious in your choice of the data to add to your data set.
Update: Thanks for all the comments. Some of them raise interesting issues, which I'd like to address in a follow-up post. I'm traveling without good internet access until the weekend, so it will have to wait until I get back. Stay tuned!
Update 2: There's now a follow-up post that addresses some of the issues raised in the comments.
Update 3: Here's another illustration of the same point: Simple math using lots of data is way more accurate than comScore's analysis in predicting Google's earnings.
Insightful post. Couple of points:
1) Fine-tuning an algorithm to work with existing data also runs the risk that new data won't be accepted even if it's available (the issue of scale).
2) More data also means less noise, if the noise is random (averaging) -- see the quick sketch after this list.
3) Of course, data that is not selected judiciously can also introduce bias. E.g., if the Netflix and IMDB data represent a US demographic, they won't really apply well to an Asian demographic. Data selection almost deserves a post of its own.
4) Algorithms can discover what they don't know and then actively seek data to bridge the gap (learn actively, not passively)
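A quick sketch of point 2 -- averaging more independent noisy ratings shrinks the error roughly like 1/sqrt(n) (simulated numbers, purely for illustration):

```python
import random
import statistics

random.seed(0)
true_rating = 3.7  # hypothetical "true" average rating for some movie

def estimate(n):
    # Average n independent ratings, each corrupted by unit Gaussian noise.
    return statistics.mean(true_rating + random.gauss(0, 1.0) for _ in range(n))

for n in (10, 100, 10000):
    print(n, round(abs(estimate(n) - true_rating), 3))  # error typically shrinks ~1/sqrt(n)
```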
Nice read
Posted by: Abhishek Gattani | March 24, 2008 at 07:38 PM
Great post!! I've subscribed to your blog :)
It's the most powerful technique one can use to improve the performance of a machine learning algorithm, and it's also the most overlooked.
A caution is necessary, though. As much as links and anchor text have helped Google gain an edge, one can argue that another piece of data, meta tags, hasn't helped search relevance. Not all data is useful, but independent data usually is. In the cited example, one can argue that IMDB and Netflix are independent sources of data, and that helped the cause more than more of the same Netflix data would have.
Posted by: mailman | March 25, 2008 at 11:43 AM
Totally agree; learning over webscale data generally improves with more independent signals -- the key is to find as many high-coverage data sources as possible.
Posted by: gaurav | March 25, 2008 at 12:52 PM
Data is the new defensibility. Many previous forms of defensibility have melted away: patents are completely disrespected and too costly to enforce, technology and algorithms can be reverse engineered by 2 students with a case of Ramen and Red Bull in a month, many companies are using open source technology that's completely commodity, and developing a web service is so capital efficient that out-fundraising your competition is nothing more than a distraction.
But, data is still a defensible advantage. Get more data than your competition, get proprietary data that no one else can access, generate your own implicit usage data... that's the remaining defensibility in the web services market today.
Posted by: Andrew Parker | March 28, 2008 at 09:47 AM
Recently ran into the exact same experience. Brainstormed a half-dozen approaches for a problem, set high expectations for the most complex, but thankfully had enough smarts to try the simplest possible approach as a baseline first. To my own surprise (happens almost every time), the marginal gains of the complex algorithm were definitely not worth the effort!
Posted by: Ilya Grigorik | March 31, 2008 at 05:32 AM
Great post, and I've subscribed and linked to your blog.
I have been trying to find the careful balance between 'additional data' and 'curse of dimensionality' in my work, and I find it's more of a razor-wire than a tightrope.
Do you have any papers (that you authored or have read) on the subject that you recommend?
Cheers.
Posted by: Glen Coppersmith | March 31, 2008 at 08:05 AM
When I took machine learning last fall, a group of us worked on the Netflix data for our class project. We compared various learning algorithms: three or four that we'd covered in the course, plus a naïve one I made up for comparison. The sophisticated algorithms just used the Netflix data directly; the naïve one joined it with the IMDB genre list, and said "film F is genre G; your average preference for genre G is P, so I'm going to predict your preference for F is P". The naïve algorithm beat the sophisticated ones hands down.
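In rough Python, the naïve predictor amounted to something like this (a sketch -- the data layout and the 3-star fallback are just for illustration, not our actual project code):

```python
from collections import defaultdict

def train_genre_averages(ratings, genre_of):
    """ratings: iterable of (user, movie, stars); genre_of: dict movie -> genre.
    Returns each user's average rating per genre."""
    totals = defaultdict(lambda: [0.0, 0])
    for user, movie, stars in ratings:
        entry = totals[(user, genre_of[movie])]
        entry[0] += stars
        entry[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

def predict(averages, user, movie, genre_of, fallback=3.0):
    # Predict the user's rating as their average for the movie's genre;
    # fall back to a neutral 3 stars if we've never seen that pair.
    return averages.get((user, genre_of[movie]), fallback)
```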
Posted by: John Stracke | March 31, 2008 at 08:37 AM
Hello Anand ---
I'm not a Netflix contest competitor, but I've been following it fairly closely since its start. To my knowledge, none of the current leaders are using external genre information. Thus I'm surprised by your conclusion that "adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set."
To the contrary, one of the current leaders concludes the opposite: "Of course, weak collaborative-filtering (CF) models can benefit from such IMDB information, but our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset." (Yehuda Koren, http://glinden.blogspot.com/2008/03/using-imdb-data-for-netflix-prize.html?showComment=1206557280000#c4525777289161540480)
Perhaps offering more specifics about the approaches and scores would make this clearer? And if you happen to have an IMDB-Netflix mapping, I personally would love to have a copy of it to play around with. I've been meaning to make one myself, but haven't yet found the time.
Thanks!
Nathan Kurz
[email protected]
Posted by: Nathan Kurz | March 31, 2008 at 09:35 AM
Nice post. Just a minor correction for the record. The first release of AdWords charged a fixed rate cost-per-impression, not cost-per-click. There was no bidding. CTR was the only factor that affected the selection of which ads were displayed for a given query.
Posted by: Ron Garret | March 31, 2008 at 10:15 AM
Interesting claim about Google's algorithm not being particularly special. Do you have data to back up the following claim:
"The PageRank algorithm itself is a minor detail -- any halfway decent algorithm that exploited this additional data would have produced roughly comparable results."
Posted by: travis | March 31, 2008 at 12:52 PM
Meta data in web pages has been ignored by search engines for many years. I am still surprised at how many incompetent web developers still put that useless garbage in all their documents. Bad web developers are everywhere; I just surveyed some new statistics that show a 42% failure rate for Java-based web applications. It is not surprising that, according to Sun Microsystems, 60% of Java developers cannot pass their certification exams. The world wide web is really a very ugly place.
Posted by: ATP Search Experts | March 31, 2008 at 01:06 PM
Yes, many web developers still today incorrectly use metadata tags in their documents to try to "influence" search engine results in their favor. The web is 18 years old and still littered with bad web developers. What a shame. At least companies like Goooogle are making it easier to bypass the "junk" being produced by so many bad developers.
Posted by: MetaData Inc. | March 31, 2008 at 01:17 PM
Cool post and great points. Any thoughts on the combination and federation of these data sources? This seems to be the direction folks like Freebase are taking, combining IMDB, Wikipedia, etc. into a common database.
Posted by: Sean Gorman | March 31, 2008 at 01:29 PM
great post + i agree wholeheartedly with the point. adding more data is effectively a way to run part of your algorithm on human brain tissue :)
one point however: I believe the use of anchor text in search engines predates Google. IIRC from "The Search" (book about Google), Lycos or Excite had this as part of their algorithm pre-'98. Lots of interesting info in that book actually!
Posted by: chris marstall | March 31, 2008 at 02:13 PM
Kind of meta to the discussion, but on my first reading, the comment by Andrew Parker ( http://anand.typepad.com/datawocky/2008/03/more-data-usual.html#comment-108606272 ) read like this:
"Many previous forms of defensibility have melted away: **parents** are completely disrespected and too costly to enforce"
So true!
Who knew claims about patents and parenthood could be so parallel.
Anyway, getting back to your post, Anand, enjoyed the insight! (itself an example of injecting good data into the blogospheric equation).
Posted by: M Wittman | March 31, 2008 at 02:27 PM
The PageRank algorithm is based on these two 'innovations', as you call them:
1. The recognition that hyperlinks were an important measure of popularity -- a link to a webpage counts as a vote for it.
2. The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.
So PageRank is the algorithm that utilizes these properties... so you're partially wrong.
Another thing: increasing the dataset does not increase the quality / correctness of your algorithm. Algorithms do not know their input set beforehand. So your argument stops at a certain point.
Posted by: Tjerk Wolterink | March 31, 2008 at 02:52 PM
I wonder if one should make a distinction between data and information, in the sense that more data carrying the same information would probably only reduce noise, as mentioned in previous comments. Adding data with different information content would improve the results tremendously, assuming one can establish context/relationships.
Posted by: Christian | March 31, 2008 at 05:16 PM
To first order, I think the IMDB information is helping to beat out the fancy algorithm because it's the product of a fancier human algorithm to categorize movies. But of course this raises the question of the quality of the information. Anyone who's tried to choose a sushi place based on online reviews knows what I'm talking about. There is a complex feedback loop between perceived quality and popularity, and I think this applies to many human activities (including currently popular medical and scientific ideas). Simply throwing more data at a problem is not always going to help -- it has to be "good" data.
But all that said, I agree with the nub of the gist: the mystique of algorithms composed by egg-headed geniuses can overpower reality.
Posted by: Dan Gunter | March 31, 2008 at 10:57 PM
Figuring out which data will add insight is often the biggest challenge. In the post
http://blog.planeteye.com/2008/03/31/wikipedia-geohack
I provide an account of how traffic from Wikipedia gave us insight into what future travel trends may be. The traditional approach of looking at top destinations based on bookings may soon be dead.
Posted by: Juan Gonzalez | April 01, 2008 at 06:51 AM
If your experiment needs statistics, you ought to have done a better experiment - Ernest Rutherford
Posted by: Warren Focke | April 01, 2008 at 01:00 PM
FYI, you cannot use IMDB data for the Netflix competition. I've read that in a few places.
Posted by: Some Guy | April 01, 2008 at 01:08 PM
These findings reinforce the Rule of Representation from the Art of Unix Programming.
"Rule of Representation: Fold knowledge into data, so program logic can be stupid and robust."
http://www.faqs.org/docs/artu/ch01s06.html#id2878263
Posted by: Darcy Parker | April 01, 2008 at 01:28 PM
Doesn't the Google example prove just the opposite? Search engines were all working with the same data set (the web), but Google's algorithm (or I guess Lycos's, really) homed in on that small subsection of the data (the link structure) to produce the best results. If anything, in that case less data was better.
If we take a step back, machine learning became popular in A.I. partly because of the belief that to manually write all of the various facets of human knowledge/intelligence/common-sense/whatever is too large an undertaking. That hasn't stopped some from trying (see the Cyc project: http://www.kurzweilai.net/brain/frame.html?startThought=Cyc) but in general that assumption has gone unquestioned. Machine learning's promise was that we could seed computers with some basic knowledge and they could learn the rest themselves. Or even better, we could skip the hand crafting of knowledge altogether and point computers towards some experimental data source and have them learn away.
Perhaps what is going on now is that the pendulum is swinging the other way, and people are saying, hey, hand-crafting some of this knowledge is not *that* hard (as IMDB found out) and it provides really good results (as the Stanford students who used that knowledge found out).
Posted by: Guray | April 01, 2008 at 03:05 PM
Yes, that is called "connectionism," to use the current name. The idea has been around, on the fringes of Artificial Intelligence, since the 1960s. Connectionism, taken to its purest form, is the idea that an algorithm good enough to account for human intelligence can be described in ten pages or so, and is essentially an undergraduate project, or possibly, at this date, even a high school science fair project. The large implication is that advanced research in Artificial Intelligence is essentially pointless. If something is complicated enough to be a reasonable dissertation project, then it is too complicated to have evolved in biological terms. Well-understood biological mechanisms (e.g., osteoclasts and osteoblasts) have a "beautiful simplicity," and in terms of evolutionary time, there simply wasn't very much time for human intelligence to evolve, a mere million years or so. The hard AI people have long been deluding themselves about the human brain's actual performance, data storage, etc., which is a lot higher than they think. Stanley L. Jaki pointed out this awkward fact in his _Brain, Mind, and Computers_, back in 1969, but the hard AI types naturally didn't want to hear it. They insisted that if they just wrote the right code, everything would work.
There was an interesting project some years ago in natural language translation, which involved using a large corpus, the bilingual proceedings of the Canadian parliament. The results with a comparatively simple algorithm and a comparatively large body of data were about as good as those with complex algorithms.
This has further implications for Computer Science. Nowadays, kids grow up with computers, and the standard curricula are effectively geared to someone with very little intellectual curiosity relative to his opportunities. One is continually meeting twelve- and fourteen-year-olds who are working with computers at an advanced level, and, from what I can tell, they seem to be majoring in something else. What do you teach someone who has written a compiler before he is old enough to go to college?
Posted by: Andrew D. Todd | April 01, 2008 at 07:36 PM
The field of algorithmic information theory has been familiar with this result, in its broadest general form, in the mathematical literature since the early to mid-1990s (they even have a special name for the non-trivial relationship). That said, it is not casual math, and many aspects of it have the property of non-computability or at least intractability, but it can be dumbed down to a few key elements that at least capture a flavor of the simple cases and allow one to discuss things like the limits of prediction and the relationship of data set characteristics to those limits.
Nonetheless, even though that math is atypically familiar to me I am a bit surprised that ten or fifteen years later almost everyone else is still surprised. I would have thought this would have at least filtered into the computer science consciousness by now...
Posted by: Andrew | April 01, 2008 at 07:48 PM