Mayank Bawa over at the Aster Data blog has posted a great riff on one of my favorite themes: using simple algorithms to analyze large volumes of data rather than more sophisticated algorithms that cannot scale to large datasets.
Often, we have a really cool algorithm (say, support-vector machines or singular value methods) that works only on main-memory datasets. In such cases, the usual recourse is to reduce the dataset to a manageable size through sampling. Mayank's post illustrates the dangers of such sampling: in businesses such as advertising, sampling can make the pattern you're trying to extract so weak that even the more powerful algorithm cannot pick it up.
For example, say 0.1% of users exhibit a certain kind of behavior. If you start with 100 million users and then take a 1% sample, you might think you are OK because you still have 1 million users in your sample. But now just 1000 users in the sample exhibit the desired behavior, which may be too small for any algorithm to distinguish from the noise. In fact, it's likely to be below the support thresholds of most algorithms. The problem is, these 0.1% of users might represent a big unknown revenue opportunity.
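A quick back-of-the-envelope sketch of that arithmetic (purely illustrative):

```python
# The dilution effect from the example above: a 1% sample of 100 million users
# leaves only about 1,000 users exhibiting the 0.1% behavior.
population = 100_000_000   # total users
behavior_rate = 0.001      # 0.1% of users exhibit the behavior
sample_rate = 0.01         # take a 1% sample

sample_size = int(population * sample_rate)            # 1,000,000 users in the sample
expected_positives = int(sample_size * behavior_rate)  # only ~1,000 of them show the behavior

print(sample_size, expected_positives)  # 1000000 1000
```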
Moral of the story: use the entire dataset, even if it is many terabytes. If your algorithm cannot handle a dataset that large, then change the algorithm, not the dataset.
I agree with you. Data sampling is evil in most cases.
I had a similar experience working on automatic geo-tagging for Russian web sites. My algorithms worked well on various samples, but only after running them on a huge dataset of 100 thousand websites did it become possible to detect some recurring classification errors and get the real picture.
At the same time, I would like to add that for some data processing tasks, like classification, slicing the dataset can still be useful.
Posted by: Ivan Begtin | June 13, 2008 at 12:33 AM
Interesting point of view, Anand.
I guess the decision depends on the business problem you need to resolve.
Many times you need to update a certain model or deploy a quick result (churn, forecast). In these scenarios speed is an important factor, and even with a simple model, you really need to use a sample and take advantage of the law of large numbers (LLN). Believe me, it's a real pain to deal with terabytes, and it's a real blessing when you can work with representative samples.
I guess for exploratory analysis your point is absolutely true. You should prefer simple models instead of something more complex, just to explore what you can find inside the data (understanding rather than high precision).
Congrats for your blog.
Luis
Posted by: Luis Aburto | June 13, 2008 at 12:28 PM
Luis: Yes, it's a pain to work with terabytes. But often this is what you have to do in order to go beyond the obvious and get real insights that provide a competitive advantage. Often sampling throws the baby out with the bathwater.
MapReduce (e.g., Hadoop) and new database systems like Aster Data hold the promise of making it simpler to deal with terabytes.
Posted by: Anand Rajaraman | June 14, 2008 at 11:27 AM
Sampling is more effective than you make it sound. Blind sampling has a good chance of giving poor results, but there are many ways to avoid that trap.
In particular, simple stratification works very well and has a very long track record in the fraud detection world starting back when computers were much, much smaller than now. Text classification is another area where stratification works wonders.
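(For readers unfamiliar with the technique: a minimal, purely illustrative Python sketch of the stratified sampling Ted describes. The strata names and rates are hypothetical, not from any particular fraud-detection system.)

```python
import random

def stratified_sample(records, rate_by_stratum, key=lambda r: r["label"]):
    """Keep each record with a probability chosen per stratum.

    rate_by_stratum maps a stratum name (e.g. "rare" / "common") to a
    sampling rate in [0, 1]; rare strata can be kept at rate 1.0 while
    the bulk of the data is downsampled aggressively.
    """
    return [r for r in records if random.random() < rate_by_stratum[key(r)]]

# Hypothetical usage: keep every rare record, but only 1% of the common ones.
data = [{"label": "rare", "x": i} for i in range(100)] + \
       [{"label": "common", "x": i} for i in range(100_000)]
sample = stratified_sample(data, {"rare": 1.0, "common": 0.01})
```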
I would say that doctrinaire application of any prejudice (even yours) is what is really harmful. You have to be flexible to build real systems. That may mean scaling with simple algorithms, or it may mean sampling. Or it may mean both. Or neither.
Posted by: Ted Dunning | June 23, 2008 at 08:53 AM
Both singular value decomposition (SVD) and support vector machines (SVM) can be solved online with stochastic gradient descent (SGD). The beauty of SGD is that it only needs a single data instance in memory at a time.
Good starting points are Leon Bottou's SGD package and his talk on large scale learning, and John Langford's Vowpal Wabbit SGD package.
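(A minimal, self-contained sketch of the one-instance-at-a-time update described above: a hinge-loss linear SVM trained by SGD. It is illustrative only, not the implementation in Bottou's package or Vowpal Wabbit, and uses a fixed step size for simplicity.)

```python
import random

def sgd_linear_svm(stream, dim, lam=1e-4, eta=0.01):
    """Hinge-loss linear SVM trained by streaming one (x, y) pair at a time.

    `stream` yields (features, label) with label in {-1, +1}; only the
    current instance and the weight vector ever need to be in memory.
    """
    w = [0.0] * dim
    for x, y in stream:
        margin = y * sum(w[i] * x[i] for i in range(dim))
        for i in range(dim):
            w[i] -= eta * lam * w[i]          # L2 regularization step
            if margin < 1.0:                  # hinge loss is active: move toward the correct side
                w[i] += eta * y * x[i]
    return w

# Toy usage on synthetic 2-D data (purely illustrative).
random.seed(0)
labels = [random.choice([-1, 1]) for _ in range(1000)]
data = [([random.gauss(y, 1.0), random.gauss(-y, 1.0)], y) for y in labels]
weights = sgd_linear_svm(data, dim=2)
```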
Posted by: Bob Carpenter | October 13, 2008 at 03:20 PM