

The boys at Carnegie Mellon University thought of this last year. Check out the Google Tech Talk.


The underlying theory is that if you have enough data points, the method you use to interpolate the missing ones becomes less important, because in the limiting case all reasonable similarity metrics become equivalent.


Re: "More data AND better algorithms: The second reason is more subtle, and has to do with the structure and computational complexity of algorithms."

I think you're making an incorrect assumption here. You're assuming that "better" algorithms mean more "(computationally) complex" algorithms.

As one who asks this "more data and better algorithms" question, I am not referring to computational complexity. I am referring to more intelligence built into the algorithm itself, to take advantage of the domain to which the algorithm is being applied.

For example, let's take a domain that I have done a lot of work in: Music Information Retrieval (music search). In this domain, many people work with pure machine-learning-driven algorithms, where you have no idea what the underlying data really is or means, and just apply SVMs or GMMs to low-level features of the raw PCM/waveform data (zero crossings, spectral flux, etc.) in order to do things like music genre classification.

A more intelligent algorithm, on the other hand, would be one that makes use of the musical nature of the signal itself. That is, it does the following steps:

(1) It segments the raw audio data into beats/onsets
(2) It looks for regularities in the beat/onset data to infer tempo
(3) It looks for harmonic changes at various beat/onset points to infer metric structure (i.e., whether a song is in 3/4, 4/4, or 11/16)
(4) It then feeds this intelligence into an SVM.

Now, with machine learning algorithms specifically tuned to the underlying data, i.e. made more intelligent, it is much easier to tell a waltz (3/4 time) apart from a Bulgarian Kopanitsa (11/16).
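For concreteness, here is a minimal sketch of steps (1) through (4), assuming librosa for the signal analysis and scikit-learn for the SVM (both are my choices for illustration, and the feature summaries are illustrative, not a reference implementation):

    import librosa
    import numpy as np
    from sklearn.svm import SVC

    def beat_aware_features(path):
        y, sr = librosa.load(path)
        # (1) segment the raw audio into onsets
        onsets = librosa.onset.onset_detect(y=y, sr=sr)
        # (2) infer tempo from regularities in the beat/onset data
        tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
        # (3) summarize harmonic change at beat boundaries as a crude
        #     proxy for metric structure (3/4 vs. 4/4 vs. 11/16)
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)
        beat_chroma = librosa.util.sync(chroma, beats)
        harmonic_change = np.abs(np.diff(beat_chroma, axis=1)).mean(axis=1)
        return np.concatenate([np.atleast_1d(tempo),
                               [len(onsets)], harmonic_change])

    # (4) feed the beat-aware features into an SVM
    # X = np.vstack([beat_aware_features(p) for p in paths])
    # clf = SVC().fit(X, genre_labels)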

I am not actually talking about having more data here. More data would be knowing, for sure, the time signature of a piece of music. A better algorithm, on the other hand, is all about making better use of the underlying structure of the data in a more intelligent manner. A better algorithm recognizes that the data itself is musical data, that musical data has certain properties (such as tempos and rhythms), and that it can (and should) be processed in certain ways; it then builds that intelligence into the algorithms that use the data, by inferring those tempos and rhythms.

This processing does not necessarily add computational complexity to the overall algorithm. Yes, it might add a constant-factor increase. But O(n^2) for the musical beat and rhythm processing plus O(n^2) for the SVM is still O(n^2).

So again, I am interested in more discussion around this more data AND better algorithms topic. Other than limited resources, why not do both, especially when you can build better algorithms without increasing the computational complexity?

Asked another way: Might not a 20% increase in data plus a 20% increase in algorithmic intelligence yield better results than a 40% increase in data?


Hi Jeremy, your music search example shows that domain knowledge plays a large role in applications of ML/DM algorithms. In some sense, one can view "domain knowledge" as a kind of "data".

Collin Cusce


As much as I think in theory there is some validity to your point, in practice it does not work out.

I went to a seminar hosted by a UMD professor dealing with the redistricting of Howard County schools. The problem is NP-hard, so it was absolutely out of the question to exhaustively redistrict all 263 neighborhoods among 12 high schools, and even more middle and elementary schools, in the county. Needless to say, she required a heuristic to derive a set of plans to propose.

Now, the easy thing to do would be to aggregate feedback from a selection of parents, teachers, bus drivers, etc., and use this information in her heuristic. Did she? No. Instead, she kept messing with the data she *already had*. The result? A lot of wasted effort attempting to create the "right guess".

I look at more data as a means of utilizing human intellect. And to assume that resource is unavailable to you is... unfair to us humans :-P

With that previous example, a cheap polling site, a good sample of participants, and an email list would have made her heuristics reflect the desires and true goals of those who live in the county far more accurately. Instead, she has a lot of shallow statistics about busing capacity and low-income student meals... I mean... c'mon.

That's why you never hire a PhD. They're only good at producing the most complex answer, because that's what they base their income on: coming up with new, complex answers to problems that are often easily solved with a little human intellect.


anon writes: "In some sense, one can view 'domain knowledge' as a kind of 'data'."

I suppose you could look at it this way. But I also suppose you can look at domain knowledge as a way of crafting better algorithms.

Take a Markov Random Field, for example. In the absence of any actual domain knowledge, a naive machine learning approach would be based on a fully connected graph, with every hidden/inferred state connected to every other hidden/inferred state.

If that is what you've got, then of course more data is going to improve upon your naivete. You're going to need more data simply to deal with the fact that you have so many parameters to fit.

On the other hand, if you know that there are certain states between which a transition or dependency will never take place, then you can remove those edges from the graph. It will then take less data to properly train this simpler graph.

So that process of removing an edge... you would call that "having more data"? I would call that "having a better algorithm".

And note that the computational complexity is actually lower for this smarter algorithm than for the original, naive one.
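A toy illustration of the savings (the state labels, domain size, and forbidden edges are all made up for the example):

    from itertools import combinations

    states = ["A", "B", "C", "D", "E"]  # hypothetical hidden states
    k = 4                               # values per state, illustrative

    # Naive model: a pairwise potential between every pair of states.
    full_edges = list(combinations(states, 2))

    # Domain knowledge: some dependencies never occur, so drop those edges.
    forbidden = {("A", "E"), ("B", "D"), ("C", "E")}
    pruned_edges = [e for e in full_edges if e not in forbidden]

    # Each remaining edge carries a k x k table of potentials to fit.
    print(len(full_edges) * k * k, "parameters (fully connected)")  # 160
    print(len(pruned_edges) * k * k, "parameters (pruned)")         # 112

Fewer potentials to fit means less data needed to fit them.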

Now, Collin makes a good point in that not every problem lends itself to smarter algorithms. Or, even if the algorithms could be improved, certain UMD professors might not be capable of doing so. Sometimes it's just better to go out and get more data.

So I do not categorically believe that better algorithms will beat more data. Depending on the domain, sometimes they might, sometimes they might not. I am just reacting against the belief that, given the choice between better algorithms and more data, more data will always win. I just don't see how this is true.


Great posts, Anand! I've subscribed to your blog.

Quick question for you: can you recommend a couple of books to get me started with data mining? I've got a degree in computer engineering and I'm an experienced programmer. I know some of the ideas involved in data mining, but I have no formal education in it. Are there any books that assume computer science knowledge, start with basics, but cover advanced topics as well?

Brendan O'Connor

Ethan: There's a recent book, "Programming Collective Intelligence" by Toby Segaran, that is one of the better basic intros to applied machine learning that I've seen. It assumes basic CS and not much else, and does a good job explaining various algorithms and problems. Not sure if it's the best fit for "data mining" but might be worth a look...


Anand Rajaraman

@Ethan: I don't personally know of a great book on data mining -- I don't recommend one for the students in my Stanford class. Instead, we use lecture notes from the classes, with additional readings listed on the course web site.


That said, I did a quick check of the book Brendan O'Connor recommends and it looks promising. Thanks Brendan!


Thanks Brendan and Anand, I'll start there.


You should check out Peter Norvig's talk on "Theorizing from Data". He also underlines the importance of data over better algorithms, and illustrates his point with a number of measurements from Google projects.


Ted Dunning

Comments about bad scaling should be taken with a grain of salt. In practice, sparsity and diminishing returns make apparently infeasible algorithms far more tractable.

SVD was cited as an example with poor scaling properties. Indeed, a naive implementation has O(n^3) complexity or worse due to the matrix multiplications involved. In practice, those multiplications can be approximated by random walks or by sparse matrix multiplication (at a cost of about O(n)), and only a few singular values/vectors are useful (the rest are basically noise). The result is a roughly O(n) algorithm.
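As a sketch of what that looks like in practice, scipy's iterative sparse SVD computes only the top-k singular triplets, touching the matrix only through sparse matrix-vector products (the matrix here is random filler):

    import scipy.sparse
    from scipy.sparse.linalg import svds

    # Large but very sparse: the cost scales with the nonzeros, not n^3.
    A = scipy.sparse.random(100000, 50000, density=1e-5, format="csr",
                            random_state=0)
    U, s, Vt = svds(A, k=20)  # top 20 singular values/vectors only
    print(s[::-1])            # svds returns them in ascending order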

Another example is clustering along the lines of k-means. Again, the complexity looks daunting, but in practice it is much better: you can find the large clusters in a quick pass over a small fraction of the data, use a larger fraction to work out the smaller clusters, and then make a single pass over the rest of the data. This is effectively O(kn), which isn't all that bad. You can even use that last pass to kick out the few points that appear not to be well enough modeled by the known clusters (sampling!) and then use those to update the clustering. This has essentially no impact on scaling, because the number of poorly modeled points quickly becomes minute, and you can do anything you like to them without affecting scaling.
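A rough sketch of that sample-then-sweep idea, using scikit-learn's k-means (the sizes and the outlier threshold are arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000000, 10))  # stand-in for the full dataset

    # Fit on a small sample, then label everything in one O(kn) pass.
    sample = X[rng.choice(len(X), size=10000, replace=False)]
    km = KMeans(n_clusters=8, n_init=5, random_state=0).fit(sample)
    labels = km.predict(X)

    # The few points far from every centroid are the poorly modeled
    # residue; reprocessing just those does not change the scaling.
    dists = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)
    stragglers = X[dists > np.quantile(dists, 0.999)]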

So the summary should be that "better algorithms with good implementations work even better than silly implementations of naive algorithms".
