The post More Data Beats Better Algorithms generated a lot of interest and comments. Since there are too many comments to address individually, I'm addressing some of them in this post.
1. Why should we have to choose between data and algorithms? Why not have more data and better algorithms?
A. There are two parts to the answer. The first is resource limits: if, for example, you are a startup that has to choose where to invest resources -- in gathering or licensing data, or in hiring the next PhD -- then I think beyond a certain point you are better off getting more data. If you are a startup that doesn't need to make that choice, more power to you.
The second reason is more subtle, and has to do with the structure and computational complexity of algorithms. To a first approximation, simple data mining algorithms (e.g., PageRank, market-basket analysis, clustering) take time roughly linear in the size of their input data, while more complicated machine learning algorithms (e.g., Support Vector Machines, Singular Value Decomposition) take time that is quadratic or even cubic in the size of their input data. As a rule of thumb, linear algorithms scale to large data sets (hundreds of gigabytes to terabytes), while quadratic and cubic algorithms do not, and therefore need some data-size reduction as a preprocessing step.
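To make that rule of thumb concrete, here is a back-of-the-envelope sketch; the per-machine throughput figure is a made-up assumption, chosen only to show the orders of magnitude.

```python
# Back-of-the-envelope scaling arithmetic (throughput figure is hypothetical).
ops_per_second = 1e9        # assume one machine doing a billion operations/sec
n = 1e9                     # a billion input records

for name, ops in [("O(n)", n),
                  ("O(n log n)", n * 30),   # log2(1e9) is about 30
                  ("O(n^2)", n ** 2),
                  ("O(n^3)", n ** 3)]:
    seconds = ops / ops_per_second
    print(f"{name:>10}: ~{seconds:.3g} seconds (~{seconds / 3.15e7:.3g} years)")
```

Even with these generous assumptions, a quadratic pass over a billion records takes decades, while the linear and O(n log n) passes finish in seconds.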
For those familiar with Computer Science basics: scalable algorithms involve only a fixed number of sequential scans and sorts of the data (since large data sets must necessarily reside on disk rather than in RAM). Algorithms that require random access to the data, or that take time greater than O(N log N), generally do not scale to large data sets; for example, they cannot easily be implemented using frameworks such as MapReduce. Thus, choosing a more complex algorithm can close the door to using large data sets, at least at reasonable budgets that don't involve terabytes of RAM.
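As a minimal illustration of the "fixed number of sequential scans" pattern, here is a one-pass streaming aggregation; the file name and tab-separated format are hypothetical, not from any particular system.

```python
# One sequential pass over a (possibly disk-resident) ratings file.
# File name and format are hypothetical: user \t movie \t rating per line.
# Memory grows with the number of distinct rating values, not with input size.
from collections import Counter

rating_counts = Counter()
with open("ratings.tsv") as f:
    for line in f:                      # sequential scan; never seeks backward
        user, movie, rating = line.rstrip("\n").split("\t")
        rating_counts[rating] += 1

print(rating_counts.most_common())
```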
2. Does your argument hold only for the addition of an additional independent data set (as in the IMDB/Netflix case), or does it hold also when we add more of the same kind of data (e.g., more user ratings in the case of Netflix)?
Adding independent data usually makes a huge difference. For example, in the case of web search, Google made a big leap by adding links and anchor text, which are data sets independent of the text of the web pages themselves. In the case of AdWords, the click-through rate (CTR) data was independent of the bid data. And Google went a step further: they became a domain registrar so they could fold even more data, about domain ownership and transfers, into their ranking scheme. Google has consistently believed in and bet on more data, even while trumpeting the power of its algorithms.
You might think that adding more of the same data won't help much, because diminishing returns would set in. This is true in some cases; however, in many important cases, adding more of the same data makes a bigger difference than you'd think. These are cases where the application sees the data embedded in a very high-dimensional space. For example, in the case of Netflix, we can think of the dimensions as users (or movies); in the case of web search, the dimensions are k-grams of terms.
In such cases, the available data is usually very sparse and has a long-tailed distribution. In the Netflix example, we have ratings for less than 1% of all movie-user pairs. We can add a lot of ratings data -- say, 10 times more -- before diminishing returns become a real issue. In such cases, adding lots more of the same kind of data makes a big difference.
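For a feel of just how sparse this setting is, here is a quick sketch; the counts below are round, hypothetical numbers, not the actual Netflix figures.

```python
# Round hypothetical counts, not the actual Netflix numbers.
n_users, n_movies = 480_000, 18_000
n_ratings = 80_000_000

density = n_ratings / (n_users * n_movies)
print(f"user-movie pairs with a rating: {density:.2%}")       # under 1%
print(f"with 10x more ratings:          {10 * density:.2%}")  # still ~90% empty
```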
3. Can you provide more specifics about the algorithms used in the student projects?
I don't plan to do so at this time, because I don't have permission from the student teams and don't wish to jeopardize their chances in the Netflix competition.
The boys at Carnegie Mellon University thought of this last year. Check out the Google Tech Talk:
http://video.google.com/videoplay?docid=-8639996003880499413
The underlying theory is that if you have enough data points, the method you use to interpolate the missing ones becomes less important, because in the limit all reasonable similarity metrics become equivalent.
Posted by: gfyffe | April 03, 2008 at 02:45 PM
Re: "More data AND better algorithms: The second reason is more subtle, and has to do with the structure and computational complexity of algorithms."
I think you're making an incorrect assumption here. You're assuming that "better" algorithms means more "(computationally) complex" algorithms.
As one who asks this "more data and better algorithms" question, I am not referring to computational complexity. I am referring to more intelligence built into the algorithm itself, so that it takes advantage of the domain to which it is being applied.
For example, let's take a domain that I have done a lot of work in: Music Information Retrieval (music search). In this domain, many people work with pure machine-learning-driven algorithms, where you have no idea what the underlying data really is or means, and just apply SVMs or GMMs to low-level features of the raw PCM/waveform data (zero crossings, spectral flux, etc.) in order to do things like music genre classification.
A more intelligent algorithm, on the other hand, would be one that makes use of the musical nature of the signal itself. That is, it does the following steps (a rough sketch of such a pipeline follows the list):
(1) It segments the raw audio data into beats/onsets.
(2) It looks for regularities in the beat/onset data to infer tempo.
(3) It looks for harmonic changes at various beat/onset points to infer metric structure (i.e., whether a song is in 3/4, 4/4, or 11/16 time).
(4) It then feeds this intelligence into an SVM.
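A rough sketch of this kind of structure-aware pipeline is below; the choice of librosa and scikit-learn, the chroma-based features, and the placeholder training data are illustrative assumptions, not the actual system described above.

```python
# Sketch of a structure-aware genre classifier (library choice and exact
# features are illustrative assumptions).
import numpy as np
import librosa
from sklearn.svm import SVC

def musical_features(path):
    y, sr = librosa.load(path)
    # (1) + (2): find beats/onsets and infer tempo from their regularity
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    # (3): look at harmonic (chroma) content at beat positions as a crude
    #      proxy for metric structure
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    beat_chroma = librosa.util.sync(chroma, beat_frames, aggregate=np.median)
    return np.concatenate([np.atleast_1d(tempo),
                           beat_chroma.mean(axis=1),
                           beat_chroma.std(axis=1)])

# (4): feed the structure-aware features into an SVM
# (training_paths and training_labels are hypothetical placeholders)
X = np.array([musical_features(p) for p in training_paths])
clf = SVC(kernel="rbf").fit(X, training_labels)
```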
Now, with machine learning algorithms specifically tuned to the underlying data, i.e. made more intelligent, it is much easier to tell a waltz (3/4 time) apart from a Bulgarian Kopanitsa (11/16).
I am not actually talking about having more data here. More data would be knowing, for sure, the time signature of a piece of music. A better algorithm, on the other hand, is all about making better use of the underlying structure of the data in a more intelligent manner. A better algorithm recognizes that the data is musical data, that musical data has certain properties (such as tempos and rhythms), and that it can (and should) be processed in certain ways; it then builds that intelligence into the processing by inferring those tempos and rhythms.
This processing does not necessarily add computational complexity to the overall algorithm. Yes, it might add a constant-factor increase. But O(n^2) for the beat and rhythm processing plus O(n^2) for the SVM is still O(n^2).
So again, I am interested in more discussion around this more data AND better algorithms topic. Other than limited resources, why not do both? Especially when you can build better algorithms without increasing the computational complexity.
Asked another way: Might not a 20% increase in data plus a 20% increase in algorithmic intelligence yield better results than a 40% increase in data?
Posted by: jeremy | April 03, 2008 at 04:11 PM
Hi Jeremy, your music search example shows that domain knowledge plays a large role in applying ML/DM algorithms. In some sense, one can view "domain knowledge" as a kind of "data".
Posted by: anon | April 04, 2008 at 12:31 AM
@jeremy
As much as I think there is some validity to your point in theory, in practice it does not work out.
I went to a seminar hosted by a UMD professor dealing with the redistricting of Howard County schools. The problem is NP-hard, so it was absolutely out of the question to exhaustively search every way of redistricting all 263 neighborhoods among the county's 12 high schools, and its even more numerous middle and elementary schools. Needless to say, she required a heuristic to derive a set of plans to propose.
Now, the easy thing to do would be to aggregate feedback from a selection of parents, teachers, bus drivers, etc. and use this information in her heuristic. Did she? No. Instead, she kept messing with the data she *already had*. The result? A lot of wasted effort attempting to create the "right guess".
I look at more data as being a means of utilizing human intellect. And to assume that resource is unavailable to you is... unfair to us humans :-P
With that previous example, a cheap polling site, a good sample of participants, and an email list would have made her heuristics more accurately reflect the desires and true goals of those who live in the county. Instead, she has a lot of shallow statistics about busing capacity and low-income student meals... I mean, c'mon.
That's why you never hire a PhD. They're only good at producing the most complex answer, because that's what they base their income on: coming up with new complex answers to problems that are often easily solved by using a little human intellect.
Posted by: Collin Cusce | April 04, 2008 at 02:06 AM
anon writes: "In some sense, one can view 'domain knowledge' as a kind of 'data'."
I suppose you could look at it this way. But I also suppose you can look at domain knowledge as a way of crafting better algorithms.
Take a Markov Random Field, for example. In the absence of actual domain knowledge, a naive machine learning approach would be based on a fully connected graph, with every hidden/inferred state connected to every other hidden/inferred state.
If that is what you've got, then of course more data is going to improve upon your naivete. You're going to need more data simply to deal with the fact that you have so many parameters to fit.
On the other hand, if you know that there are certain states between which a transition or dependency will never take place, then you can remove those edges from the graph. It will then take less data to properly train this simpler graph.
So that process of removing an edge... you would call that "having more data"? I would call that "having a better algorithm".
And note that the computational complexity is actually lower for this smarter algorithm, than for the original, naive algorithm.
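To put toy numbers on this point (the graph sizes and the pairwise-potential parameterization below are illustrative assumptions): pruning edges with domain knowledge directly shrinks the number of parameters that must be estimated from data.

```python
# Toy illustration: a fully connected pairwise model vs. a domain-pruned chain.
from itertools import combinations

n_nodes = 20    # hypothetical number of hidden variables
n_states = 5    # hypothetical number of states per variable

# Naive model: fully connected, one pairwise potential table per edge.
full_edges = list(combinations(range(n_nodes), 2))
full_params = len(full_edges) * n_states ** 2        # 190 edges * 25 = 4750

# Domain knowledge: only adjacent variables ever interact (a chain).
chain_edges = [(i, i + 1) for i in range(n_nodes - 1)]
chain_params = len(chain_edges) * n_states ** 2      # 19 edges * 25 = 475

print(full_params, chain_params)
```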
Now, Collin makes a good point in that not every single problem lends itself to smarter algorithms. Or, even if the algorithms could be improved, certain UMD Professors might not be capable of doing so. Sometimes it's just better to go out and get more data.
So I do not categorically believe that better algorithms will beat more data. Depending on the domain, sometimes they might, sometimes they might not. I am just reacting against the belief that, given the choice between better algorithms and more data, more data will always win. I just don't see how this is true.
Posted by: jeremy | April 04, 2008 at 11:30 AM
Great posts Anand! I've subscribed to your blog.
Quick question for you: can you recommend a couple of books to get me started with data mining? I've got a degree in computer engineering and I'm an experienced programmer. I know some of the ideas involved in data mining, but I have no formal education in it. Are there any books that assume computer science knowledge, start with basics, but cover advanced topics as well?
Posted by: ethan | April 05, 2008 at 11:02 AM
Ethan: There's a recent book, "Programming Collective Intelligence" by Toby Segaran, that is one of the better basic intros to applied machine learning that I've seen. It assumes basic CS and not much else, and does a good job explaining various algorithms and problems. Not sure if it's the best fit for "data mining" but might be worth a look...
http://www.oreilly.com/catalog/9780596529321/
Posted by: Brendan O'Connor | April 12, 2008 at 01:17 AM
@Ethan: I don't personally know of a great book on data mining -- I don't assign one to the students in my Stanford class. Instead, we use lecture notes together with additional readings, listed on the course web site.
http://www.stanford.edu/class/cs345a
That said, I did a quick check of the book Brendan O'Connor recommends and it looks promising. Thanks Brendan!
Posted by: Anand Rajaraman | April 14, 2008 at 10:27 PM
Thanks Brendan and Anand, I'll start there.
Posted by: ethan | April 18, 2008 at 02:19 PM
You should check out Peter Norvig's talk on "Theorizing from Data".
He also underlines the importance of data over better algorithms, and illustrates his point with a number of measurements from Google projects.
http://www.youtube.com/watch?v=nU8DcBF-qo4
Posted by: Julien | April 25, 2008 at 09:50 PM
Comments about bad scaling should be taken with a grain of salt. In practice, sparsity and diminishing returns make apparently infeasible algorithms much more tractable.
SVD was cited as an example with poor scaling properties. Indeed, a naive implementation has O(n^3) complexity or worse because of the matrix multiplications involved. In practice, the multiplications can be approximated by random walks or by sparse matrix multiplication (at a cost of roughly O(n)), and only a few singular values/vectors are useful (the rest are basically noise). The result is a roughly O(n) algorithm.
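As a small sketch of the "only a few singular values matter" point, here is a truncated SVD on a large, very sparse matrix; the use of scipy's svds and the matrix sizes are assumptions made for illustration, not the exact method described above.

```python
# Truncated SVD on a sparse, ratings-like matrix: work scales with the number
# of nonzeros (about 200k here), not with rows * columns.
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

A = sparse_random(100_000, 20_000, density=1e-4, format="csr", random_state=0)

# Ask only for the top 20 singular triplets; the rest are treated as noise.
U, s, Vt = svds(A, k=20)
print(sorted(s, reverse=True))   # singular values, largest first
```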
Another example is clustering along the lines of k-means. Again, the complexity looks daunting, but in practice it is much better, because you can find the large clusters in a quick pass over a small fraction of the data, use a larger fraction to work out the smaller clusters, and then make a single pass over most of the data. This is effectively O(kn), which isn't bad at all. You can even use that last pass to kick out the few points that do not appear to be well modeled by the known clusters (sampling!) and then use them to update the clustering. This has essentially no impact on scaling, because the number of poorly modeled points quickly becomes minute, and you can do anything you like with them without affecting the overall cost.
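Here is a hedged sketch of that sample-then-assign idea; MiniBatchKMeans, the synthetic data, and the specific sample sizes are illustrative assumptions, not the exact recipe described above.

```python
# Fit clusters on a small sample, then make one pass to assign the rest,
# setting aside the few poorly modeled points for a refinement step.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(1_000_000, 16))        # stand-in for a large data set

# Find the big clusters on a small sample of the data...
sample = data[rng.choice(len(data), size=50_000, replace=False)]
km = MiniBatchKMeans(n_clusters=20, random_state=0).fit(sample)

# ...then a single pass over everything just to assign labels, keeping the
# worst-fit points aside for a later update of the clustering.
labels = km.predict(data)
dists = np.linalg.norm(data - km.cluster_centers_[labels], axis=1)
poorly_modeled = data[dists > np.quantile(dists, 0.999)]
```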
So the summary should be that "better algorithms with good implementations work even better than silly implementations of naive algorithms".
Posted by: Ted Dunning | June 23, 2008 at 12:00 PM