The post More Data Beats Better Algorithms generated a lot of interest and comments. Since there are too many comments to address individually, I'm addressing some of them in this post.
1. Why should we have to choose between data and algorithms? Why not have more data and better algorithms?
A. There are two parts to the answer. The first is resource limits: if, for example, you are a startup that has to choose where to invest resources -- in gathering or licensing data, or in hiring the next PhD -- then I think that beyond a certain level you are better off getting more data. If you are a startup that doesn't need to make that choice, more power to you.
The second reason is more subtle, and has to do with the structure and computational complexity of algorithms. To a first approximation, simple data mining algorithms (e.g., PageRank; market-basket analysis; clustering) take time roughly linear in the size of their input data, while more complicated machine learning algorithms (e.g., Support Vector Machines; Singular Value Decomposition) take time that is quadratic or even cubic in the size of their input data. As a rule of thumb, linear algorithms scale to large data sets (hundreds of gigabytes to terabytes), while quadratic and cubic algorithms do not, and therefore need some data size reduction as a preprocessing step.
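To make the rule of thumb concrete, here is a minimal sketch of how the work grows when the input grows by a factor of 10. It assumes running time is simply proportional to N, N^2, or N^3 and ignores constant factors; the function name `growth_factor` is my own, purely for illustration.

```python
def growth_factor(scale, exponent):
    """Extra work when the input grows by `scale`, assuming cost ~ N**exponent."""
    return scale ** exponent


# Compare linear, quadratic, and cubic algorithms on 10x more data.
for name, exponent in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
    factor = growth_factor(10, exponent)
    print(f"{name:9s}: 10x more data -> {factor:,}x more work")
```

Under these assumptions, a linear algorithm does 10 times the work on 10 times the data, while a cubic one does 1,000 times the work.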
For those familiar with Computer Science basics, scalable algorithms involve only a fixed number of sequential scans and sorts of data (since large data sets must necessarily reside on disk and not RAM). Most algorithms that require random access to data or take time greater than O(N log N) are not scalable to large data sets. For example, they cannot easily be implemented using methods such as MapReduce. Thus, choosing a more complex algorithm can close the door to using large data sets, at least at reasonable budgets that don't involve terabytes of RAM.
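As an illustration of the "sequential scans and sorts" pattern, here is a rough sketch of market-basket pair counting written in a map/sort/reduce style. It is a toy, in-memory stand-in and not a real MapReduce job: the built-in `sorted()` plays the role of the on-disk sort, and the function names and sample baskets are made up for this example.

```python
from itertools import combinations, groupby


def map_phase(baskets):
    """One sequential scan: emit (pair, 1) for every item pair in each basket."""
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            yield pair, 1


def reduce_phase(sorted_records):
    """One scan over sorted records: sum the counts in each run of equal keys."""
    for key, group in groupby(sorted_records, key=lambda record: record[0]):
        yield key, sum(count for _, count in group)


def count_pairs(baskets):
    # sorted() stands in for the external sort (the "shuffle" step).
    return dict(reduce_phase(sorted(map_phase(baskets))))


# Hypothetical sample data, for illustration only.
baskets = [["milk", "bread"], ["bread", "milk", "eggs"], ["eggs", "bread"]]
pair_counts = count_pairs(baskets)
print(pair_counts)
```

Neither phase needs random access to the data: each one reads its input once, in order, which is what lets this style of computation run on data sets much larger than RAM.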
2. Does your argument hold only for the addition of an additional independent data set (as in the IMDB/Netflix case), or does it hold also when we add more of the same kind of data (e.g., more user ratings in the case of Netflix)?
Adding independent data usually makes a huge difference. For example, in the case of web search, Google made a big leap by adding links and anchor text, which are independent data sets from the text of web pages. In the case of AdWords, the CTR data was an independent data set from the bid data. And Google went a step further: they became a domain registrar so they could add even more data, about domain ownership and transfers, into their ranking scheme. Google has consistently believed in and bet on more data, while trumpeting the power of their algorithms.
You might think that it won't help much to add more of the same data, because diminishing returns would set in. This is true in some cases; however, in many important cases, adding more of the same data makes a bigger difference than you'd think. These are cases where the application sees the data embedded in a very-high-dimensional space. For example, in the case of Netflix, we can think of the dimensions as users (or movies); and in the case of web search, the dimensions are k-grams of terms.
In such cases, the available data is usually very sparse and has a long-tailed distribution. For example, in the Netflix example, we have ratings available for less than 1% of all movie-user pairs. We can add lots of ratings data -- say, 10 times as much -- before diminishing returns become a real issue. In such cases, adding lots more of the same kind of data makes a big difference.
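A quick back-of-the-envelope sketch shows how much room a sparse ratings matrix leaves. The user, movie, and rating counts below are round, hypothetical numbers chosen for illustration; they are not the actual Netflix figures.

```python
def density(num_ratings, num_users, num_movies):
    """Fraction of all user-movie pairs that have a rating."""
    return num_ratings / (num_users * num_movies)


# Hypothetical round numbers, NOT the real Netflix data set sizes.
users, movies, ratings = 500_000, 20_000, 50_000_000

before = density(ratings, users, movies)
after = density(10 * ratings, users, movies)

print(f"density now:           {before:.2%}")
print(f"density with 10x data: {after:.2%}")
```

With these assumed numbers, even 10 times as many ratings would still leave the vast majority of user-movie pairs unrated.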
3. Can you provide more specifics about the algorithms used in the student projects?
I don't plan to do so at this time, because I don't have permission from the student teams and don't wish to jeopardize their chances in the Netflix competition.