


Abhishek Gattani

Insightful post. Couple of points:

1) Fine-tuning an algorithm to work with existing data also runs the risk that new data won't be accepted even if it's available (the issue of scale).

2) More data also leads to less noise if it’s random (averaging).

3) Of course, data that is not selected judiciously can also introduce bias. E.g. if Netflix and IMDB data represent the US demographic, they won't really apply well to an Asian demographic. Data selection almost deserves a post of its own.

4) Algorithms can discover what they don’t know and then actively seek data to bridge the gap (learn actively not passively)
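Point 2 can be illustrated with a small simulation. This is only a sketch, assuming independent zero-mean Gaussian noise around a made-up true value; the function name `mean_abs_error` and all the numbers are hypothetical and not taken from the post.

```python
import random
import statistics


def mean_abs_error(n_samples, true_value=3.5, noise_sd=1.0, trials=300, seed=0):
    """Average absolute error of the sample mean across simulated trials.

    Assumption: every observation is true_value plus independent
    zero-mean Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        samples = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_samples)]
        errors.append(abs(statistics.fmean(samples) - true_value))
    return statistics.fmean(errors)


if __name__ == "__main__":
    # More samples per estimate should give a smaller average error.
    for n in (10, 100, 1000):
        print(n, round(mean_abs_error(n), 4))
```

Under these assumptions the error shrinks roughly in proportion to 1/sqrt(n). Averaging only removes random noise; a systematic bias of the kind described in point 3 would not shrink this way.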

Nice read


Great post!! I've subscribed to your blog :)

Adding more data is the most powerful technique one can use to improve the performance of machine learning algorithms, and it's also the most overlooked.

A caution is necessary, though. As much as links and anchor text have helped Google gain an edge, one can argue that another piece of data, meta tags, hasn't helped search relevance. Not all data is useful, but independent data usually is. In the cited example, one can argue that IMDB and Netflix are independent sources of data, and that this helped the cause more than additional Netflix data would have.


Totally agree; learning over webscale data generally improves with more independent signals --- key is to find as many high coverage data sources as available.

Andrew Parker

Data is the new defensibility. Many previous forms of defensibility have melted away: patents are completely disrespected and too costly to enforce, technology and algorithms can be reverse engineered by 2 students with a case of Ramen and Red Bull in a month, many companies are using open source technology that's completely commodity, and developing a web service is so capital efficient that out-fundraising your competition is nothing more than a distraction.

But, data is still a defensible advantage. Get more data than your competition, get proprietary data that no one else can access, generate your own implicit usage data... that's the remaining defensibility in the web services market today.

Ilya Grigorik

Recently ran into the exact same experience. Brainstormed a half-dozen approaches for a problem, set high expectations for the most complex, but thankfully had enough smarts to try the simplest possible approach as a baseline first. To my own surprise (happens almost every time), the marginal gains of the complex algorithm were definitely not worth the effort!

Glen Coppersmith

Great post, and I've subscribed and linked to your blog.

I have been trying to find the careful balance between 'additional data' and 'curse of dimensionality' in my work, and I find it's more of a razor-wire than a tightrope.

Do you have any papers (that you authored or have read) on the subject that you recommend?


John Stracke

When I took machine learning last fall, a group of us worked on the Netflix data for our class project. We compared various learning algorithms: three or four that we'd covered in the course, plus a naïve one I made up for comparison. The sophisticated algorithms just used the Netflix data directly; the naïve one joined with the IMDB genre list, and said "film F is genre G; your average preference for genre G is P, so I'm going to predict your preference for F is P". The naïve algorithm beat the sophisticated ones hands down.
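A minimal sketch of the naïve genre-average predictor described above, assuming one genre per film and a fallback to the global mean rating when a user has no history in that genre. Both of those choices, the function names, and the sample data are hypothetical; this is not the code from the class project.

```python
from collections import defaultdict


def fit_genre_means(ratings, movie_genre):
    """Compute each user's mean rating per genre.

    ratings: iterable of (user, movie, rating) tuples.
    movie_genre: dict mapping movie -> genre (assumed: one genre per film).
    Returns (means, global_mean), where means maps (user, genre) -> mean rating.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total, n = 0.0, 0
    for user, movie, rating in ratings:
        total += rating
        n += 1
        genre = movie_genre.get(movie)
        if genre is None:
            continue  # no external genre data for this film
        sums[(user, genre)] += rating
        counts[(user, genre)] += 1
    global_mean = total / n if n else 0.0
    means = {key: sums[key] / counts[key] for key in sums}
    return means, global_mean


def predict(user, movie, means, global_mean, movie_genre):
    """Film F is genre G; the user's average for G is P; predict P."""
    genre = movie_genre.get(movie)
    return means.get((user, genre), global_mean)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    ratings = [("ann", "Alien", 5), ("ann", "Aliens", 4), ("ann", "Amelie", 2)]
    genres = {"Alien": "scifi", "Aliens": "scifi", "Amelie": "romance",
              "Blade Runner": "scifi"}
    means, global_mean = fit_genre_means(ratings, genres)
    print(predict("ann", "Blade Runner", means, global_mean, genres))
```

The only learning here is a per-user, per-genre average; everything else comes from the join with the external genre list.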

Nathan Kurz

Hello Anand ---

I'm not a Netflix contest competitor, but I've been following it fairly closely since its start. To my knowledge, none of the current leaders are using external genre information. Thus I'm surprised by your conclusion that "adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set."

To the contrary, one of the current leaders concludes the opposite: "Of course, weak collaborative-filtering (CF) models can benefit from such IMDB information, but our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset." (Yehuda Koren, http://glinden.blogspot.com/2008/03/using-imdb-data-for-netflix-prize.html?showComment=1206557280000#c4525777289161540480)

Perhaps offering more specifics about the approaches and scores would make this clearer? And if you happen to have an IMDB-Netflix mapping, I personally would love to have a copy of it to play around with. I've been meaning to make one myself, but haven't yet found the time.


Nathan Kurz

Ron Garret

Nice post. Just a minor correction for the record. The first release of AdWords charged a fixed rate cost-per-impression, not cost-per-click. There was no bidding. CTR was the only factor that affected the selection of which ads were displayed for a given query.


Interesting claim about Google's algorithm not being particularly special. Do you have data to back up the following claim:

"The PageRank algorithm itself is a minor detail -- any halfway decent algorithm that exploited this additional data would have produced roughly comparable results."

ATP Search Experts

Meta Data in web pages has been ignored by search engines for many years. I am still surprised at how many incompetent web developers still put that useless garbage in all their documents. Bad web developers are everywhere; I just surveyed some new statistics that show a 42% failure rate of Java-based web applications. It is not surprising that according to Sun Microsystems, 60% of Java developers cannot pass their certification exams. The world wide web is really a very ugly place.

MetaData Inc.

Yes, many web developers still today incorrectly use metadata tags in their documents to try to "influence" search engine results in their favor. The web is 18 years old and still littered with bad web developers. What a shame. At least companies like Goooogle are making it easier to bypass the "junk" being produced by so many bad developers.

Sean Gorman

Cool post and great points. Any thoughts on the combination and federation of these data sources? This seems to be the direction folks like Freebase are taking, combining IMDB, Wikipedia, etc. into a common database.

chris marstall

great post + i agree wholeheartedly with the point. adding more data is effectively a way to run part of your algorithm on human brain tissue :)

one point however: I believe the use of anchor text in search engine predates google. IIRC from "The Search" (book about google), lycos or excite had this as part of their algorithm pre-'98. Lots of interesting info in that book actually!

M Wittman

Kind of meta to the discussion, but my first reading of the comment by Andrew Parker ( http://anand.typepad.com/datawocky/2008/03/more-data-usual.html#comment-108606272 ) would be written like this:

"Many previous forms of defensibility have melted away: **parents** are completely disrespected and too costly to enforce"

So true!

Who knew claims about patents and parenthood could be so parallel.

Anyway, getting back to your post, Anand, enjoyed the insight! (itself an example of injecting good data into the blogospheric equation).

Tjerk Wolterink

The PageRank algorithm is based on these two 'innovations', as you call them.

1. The recognition that hyperlinks were an important measure of popularity -- a link to a webpage counts as a vote for it.
2. The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.

So PageRank is the algorithm that utilizes these properties... so you're partially wrong.
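To make the "a link counts as a vote" idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. The damping factor of 0.85, the iteration count, and the handling of pages with no outgoing links are common textbook assumptions, not details confirmed by the post or by Google; anchor text (point 2) is not modelled at all.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: dict mapping page -> list of pages it links to.
    Returns a dict mapping page -> rank; the ranks sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link passes on an equal share of the vote.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page web: a and b both link to c, c links back to a.
    ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
    for page, score in sorted(ranks.items()):
        print(page, round(score, 3))
```

In this toy graph, page c collects votes from both a and b and ends up ranked highest, while b, which nobody links to, keeps only the baseline share.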

Another thing: increasing the dataset does not increase the quality / correctness of your algorithm. Algorithms do not know their input set beforehand. So your argument stops at a certain point.


I wonder if one should make the distinction between data and information in the sense that more data to the same information would probably only reduce noise as mentioned in previous posts. Adding data with different information content would improve the results tremendously assuming one can establish context/relationships.

Dan Gunter

To first order, I think the IMDB information is helping to beat out the fancy algorithm because it's the product of a fancier human algorithm to categorize movies. But of course this begs the question of the quality of the information. Anyone who's tried to choose a sushi place based on online reviews knows what I'm talking about. There is a complex feedback loop between perceived quality and popularity, and I think this applies to many human activities (including currently popular medical and scientific ideas). Simply throwing more data at a problem is not always going to help -- it has to be "good" data.

But all that said I agree with the nib of the gist that the mystique of algorithms composed by egg-headed geniuses can overpower reality.

Juan Gonzalez

Figuring out which data will add insight is often the biggest challenge. In the post


I provide an account of how traffic from Wikipedia gave us insight into what future travel trends may be. The traditional approach of looking at top destinations based on bookings may soon be dead.

Warren Focke

If your experiment needs statistics, you ought to have done a better experiment - Ernest Rutherford

Some Guy

FYI, you cannot use IMDB for the Netflix competition. I've read that in a few places.

Darcy Parker

These findings reinforce the Rule of Representation from the Art of Unix Programming.

"Rule of Representation: Fold knowledge into data, so program logic can be stupid and robust."


Doesn't the Google example prove just the opposite? Search engines were all working with the same data set (the web) but Google's algorithm (or I guess Lycos's really) homed in on that small subsection of the data (the link structure) to produce the best results. If anything, in that case less data was better.

If we take a step back, machine learning became popular in A.I. partly because of the belief that to manually write all of the various facets of human knowledge/intelligence/common-sense/whatever is too large an undertaking. That hasn't stopped some from trying (see the Cyc project: http://www.kurzweilai.net/brain/frame.html?startThought=Cyc) but in general that assumption has gone unquestioned. Machine learning's promise was that we could seed computers with some basic knowledge and they could learn the rest themselves. Or even better, we could skip the hand crafting of knowledge altogether and point computers towards some experimental data source and have them learn away.

Perhaps what is going on now is that the pendulum is swinging the other way, and people are saying, hey, hand crafting some of this knowledge is not *that* hard (as IMDB found out) and it provides really good results (as the Stanford students who used that knowledge found out).

Andrew D. Todd

Yes, that is called "connectionism," to use the current name. The idea has been around, on the fringes of Artificial Intelligence, since the 1960s. Connectionism, taken to its purest form, is the idea that an algorithm good enough to account for human intelligence can be described in ten pages or so, and is essentially an undergraduate project, or possibly, at this date, even a high school science fair project. The large implication is that advanced research in Artificial Intelligence is essentially pointless. If something is complicated enough to be a reasonable dissertation project, then it is too complicated to have evolved in biological terms. Well-understood biological mechanisms (e.g. osteoclasts and osteoblasts) have a "beautiful simplicity," and in terms of evolutionary time, there simply wasn't very much time for human intelligence to evolve, a mere million years or so. The hard AI people have long been deluding themselves about the human brain's actual performance, data storage, etc., which is a lot higher than they think. Stanley L. Jaki pointed out this awkward fact, in his _Brain, Mind, and Computers_, back in 1969, but the hard AI types naturally didn't want to hear it. They insisted that if they just wrote the right code, everything would work.

There was an interesting project some years ago in natural language translation, which involved using a large corpus, the bilingual proceedings of the Canadian parliament. The results with a comparatively simple algorithm and a comparatively large body of data were about as good as those with complex algorithms.

This has further implications for Computer Science. Nowadays, kids grow up with computers, and the standard curricula are effectively geared to someone with very little intellectual curiosity in terms of his opportunities. One is continually meeting these twelve and fourteen-year-olds, who are working with computers at an advanced level, and, from what I can tell, they seem to be majoring in something else. What do you teach someone who has written a compiler before he is old enough to go to college?


The field of algorithmic information theory has been familiar with this result in the broadest general case in mathematical literature since the early to mid-1990s (they even have a special name for the non-trivial relationship). That said, it is not casual math and many aspects of it have the property of non-computability or at least intractability, but it can be dumbed down to a few key elements that at least capture a flavor of the simple cases and allow one to discuss things like the limits of prediction and the relationship of data set characteristics to those limits.

Nonetheless, even though that math is atypically familiar to me I am a bit surprised that ten or fifteen years later almost everyone else is still surprised. I would have thought this would have at least filtered into the computer science consciousness by now...
