Mayank Bawa over at the Aster Data blog has posted a great riff on one of my favorite themes: using simple algorithms to analyze large volumes of data rather than more sophisticated algorithms that cannot scale to large datasets.
Often, we have a really cool algorithm (say, support-vector machines or singular value methods) that works only on main-memory datasets. In such cases, the only recourse is to reduce the dataset to a manageable size through sampling. Mayank's post illustrates the dangers of such sampling: in businesses such as advertising, sampling can make the pattern you're trying to extract so weak that even the more powerful algorithm cannot pick it up.
For example, say 0.1% of users exhibit a certain kind of behavior. If you start with 100 million users and then take a 1% sample, you might think you are OK because you still have 1 million users in your sample. But now just 1,000 users in the sample exhibit the desired behavior, which may be too few for any algorithm to separate from the noise. In fact, that count is likely to fall below the support thresholds of most algorithms. The problem is that those 0.1% of users might represent a big unknown revenue opportunity.
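To make the dilution concrete, here is a minimal back-of-the-envelope sketch in Python. The 0.1% behavior rate, 100 million users, and 1% sample rate come from the example above; the absolute minimum-support threshold of 5,000 is an illustrative assumption, not a figure from Mayank's post.

```python
# Illustration of how uniform sampling dilutes a rare pattern.
# Rates are from the example above; MIN_SUPPORT is a made-up threshold.

TOTAL_USERS = 100_000_000   # full population
BEHAVIOR_RATE = 0.001       # 0.1% of users exhibit the behavior
SAMPLE_RATE = 0.01          # 1% uniform sample
MIN_SUPPORT = 5_000         # hypothetical absolute support threshold

rare_in_full = TOTAL_USERS * BEHAVIOR_RATE      # 100,000 users
sample_size = TOTAL_USERS * SAMPLE_RATE         # 1,000,000 users
rare_in_sample = rare_in_full * SAMPLE_RATE     # ~1,000 users (in expectation)

print(f"sample size:             {sample_size:,.0f}")
print(f"rare users in full data: {rare_in_full:,.0f} "
      f"(clears support threshold: {rare_in_full >= MIN_SUPPORT})")
print(f"rare users in 1% sample: {rare_in_sample:,.0f} "
      f"(clears support threshold: {rare_in_sample >= MIN_SUPPORT})")
```

Note that the rare group is still 0.1% of the sample; what changes is the absolute count, which drops from 100,000 to roughly 1,000 and falls under the (assumed) support threshold.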
Moral of the story: use the entire dataset, even if it is many terabytes. If your algorithm cannot handle a dataset that large, then change the algorithm, not the dataset.