


Ivan Begtin

I agree with you. Data sampling is evil in most cases.

I had a similar experience working on automatic geo-tagging for Russian websites. My algorithms worked well on various samples, but only after running them on a huge dataset of 100 thousand websites did it become possible to detect some recurring classification errors and get the real picture.

At the same time, I would like to add that for some data processing tasks, like classification, slicing of the dataset can still be useful.

Luis Aburto

Interesting point of view, Anand.
I guess the decision depends on the business problem you need to solve.

Many times you need to update a certain model or deploy a quick result (churn, forecast). In these scenarios speed is an important factor, and even with a simple model you really need to use a sample and take advantage of the Law of Large Numbers (LLN). Believe me, it's a real pain to deal with terabytes, and a real blessing when you can work with representative samples.
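Luis's point can be illustrated with a toy example (the 20% churn rate and the dataset and sample sizes here are invented for illustration): a modest random sample estimates a population statistic closely, as the LLN suggests.

```python
import random

random.seed(0)

# Hypothetical full dataset: 1,000,000 customer churn flags (1 = churned).
population = [1 if random.random() < 0.2 else 0 for _ in range(1_000_000)]
true_rate = sum(population) / len(population)

# A 1% sample already estimates the churn rate to within a fraction
# of a percentage point, with high probability.
sample = random.sample(population, 10_000)
sample_rate = sum(sample) / len(sample)

print(f"true rate:   {true_rate:.4f}")
print(f"sample rate: {sample_rate:.4f}")
```

Of course, this only works for statistics the sample actually preserves; rare patterns can vanish entirely, which is Anand's point.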

I guess for exploratory analysis your point is absolutely true. You should prefer simple models over something more complex, just to explore what you can find inside the data (understanding rather than raw precision).

Congrats on your blog.

Anand Rajaraman

Luis: Yes, it's a pain to work with terabytes. But often this is what you have to do in order to go beyond the obvious and get real insights that provide a competitive advantage. Often sampling throws the baby out with the bathwater.

MapReduce (e.g., Hadoop) and new database systems like Aster Data hold the promise of making it simpler to deal with terabytes.
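The MapReduce model Anand mentions can be shown in miniature with a word count; the explicit map, shuffle, and reduce phases below are illustrative, not Hadoop's actual API (which distributes these steps across machines).

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit a (key, value) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this across the network).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

docs = ["Sampling is evil", "sampling throws the baby out"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["sampling"])  # → 2
```

Because each phase is embarrassingly parallel, the same program scales from one machine to thousands.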

Ted Dunning

Sampling is more effective than you make it sound. Blind sampling has a good chance of giving poor results, but there are many ways to avoid that trap.

In particular, simple stratification works very well and has a very long track record in the fraud detection world starting back when computers were much, much smaller than now. Text classification is another area where stratification works wonders.

I would say that doctrinaire application of any prejudice (even yours) is what is really harmful. You have to be flexible to build real systems. That may mean scaling with simple algorithms, or it may mean sampling. Or it may mean both. Or neither.

Bob Carpenter

Both singular value decomposition (SVD) and support vector machines (SVM) can be solved online with stochastic gradient descent (SGD). The beauty of SGD is that it only needs a single data instance in memory at a time.
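A bare-bones sketch of the idea Bob describes, for the linear-SVM case (the toy data, learning rate, and regularization constant are invented for illustration): SGD on the hinge loss updates the weights from one instance at a time, so the full dataset never needs to fit in memory.

```python
import random

random.seed(2)

def sgd_svm(stream, dim, lr=0.1, lam=0.01):
    """Train a linear SVM by SGD on the regularized hinge loss,
    touching one (x, y) instance at a time; y must be in {-1, +1}."""
    w = [0.0] * dim
    for x, y in stream:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        for i in range(dim):
            # Gradient of lam/2 * |w|^2 + max(0, 1 - y * w.x) at this instance.
            grad = lam * w[i] - (y * x[i] if margin < 1 else 0.0)
            w[i] -= lr * grad
    return w

# Toy linearly separable stream: label = sign(x0 - x1).
def stream(n):
    for _ in range(n):
        x = [random.uniform(-1, 1), random.uniform(-1, 1)]
        yield x, (1 if x[0] - x[1] > 0 else -1)

w = sgd_svm(stream(20_000), dim=2)
errors = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y
             for x, y in stream(2_000))
print("test errors:", errors, "of 2000")
```

Production implementations like Bottou's SGD package and Vowpal Wabbit add learning-rate schedules, sparse features, and hashing, but the per-instance update is the same shape.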

Good starting points are Leon Bottou's SGD package and his talk on large scale learning, and John Langford's Vowpal Wabbit SGD package.
