Over the past few years, a hardware platform that can tackle very high data volumes has emerged: the cluster of commodity Linux nodes interconnected by commodity gigabit Ethernet. Google popularized the platform and introduced MapReduce, which today is the best tool available for serious data crunching at massive scale. The Hadoop project makes MapReduce available to the rest of us.
Yet the MapReduce paradigm has its limitations. The biggest problem is that every analysis requires writing code, which limits the number of companies and people who can use the paradigm. The second problem is that joins of different data sets are hard; a sketch of what even a simple join looks like in this style follows below. The third problem is that MapReduce consumes files and produces files; after a while the files multiply and it becomes difficult to keep track of them. What's lacking is a metadata layer, like the catalog in a database system. Don't get me wrong; I love MapReduce, and there are applications that don't need these things, but increasingly there are applications that do.
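To make the join pain concrete, here is a minimal sketch of a reduce-side join in the MapReduce style, written in plain Python rather than against Hadoop, with hypothetical `users` and `clicks` relations. Even this toy version has to hand-tag records by source, shuffle by key, and recombine everything in the reducer:

```python
# A minimal sketch (plain Python, not Hadoop) of a reduce-side join in the
# MapReduce style. The relations and fields are hypothetical.
from itertools import groupby
from operator import itemgetter

users = [(1, "alice"), (2, "bob")]                   # (user_id, name)
clicks = [(1, "/home"), (1, "/buy"), (2, "/home")]   # (user_id, url)

def map_phase():
    # Tag each record with its source relation so the reducer can tell them apart.
    for user_id, name in users:
        yield (user_id, ("U", name))
    for user_id, url in clicks:
        yield (user_id, ("C", url))

def reduce_phase(pairs):
    # Group by key (the "shuffle"), then cross the two tagged record sets.
    for user_id, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        names, urls = [], []
        for _, (tag, value) in group:
            (names if tag == "U" else urls).append(value)
        for name in names:
            for url in urls:
                yield (user_id, name, url)

print(list(reduce_phase(map_phase())))
# [(1, 'alice', '/home'), (1, 'alice', '/buy'), (2, 'bob', '/home')]
```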
Database systems have been the workhorses of conventional (small- to medium-scale) data analysis applications for a couple of decades now. They have some real advantages over MapReduce. SQL is a standard for ad-hoc data analysis that doesn't require writing code. There's a big ecosystem around SQL: people who know it, tools that work with it, application frameworks, and so on. Joins are straightforward in SQL, and databases provide implementations such as hash joins that scale well to big data. There is a large body of knowledge on query optimization that allows declaratively specified SQL queries to be optimized for execution in nontrivial ways. There is support for concurrency, updates, and transactions. And there is a metadata catalog that lets people and programs understand the contents of the database.
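For contrast with the hand-rolled join above, the same join is one declarative statement in SQL, and the database, not the programmer, picks the join strategy. A minimal sketch, using Python's built-in sqlite3 module as a stand-in for a real analytic database:

```python
# The same join, expressed declaratively. sqlite3 stands in for a real
# analytic database; the planner, not the programmer, chooses the join strategy.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (user_id INTEGER, name TEXT);
    CREATE TABLE clicks (user_id INTEGER, url  TEXT);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO clicks VALUES (1, '/home'), (1, '/buy'), (2, '/home');
""")
rows = con.execute("""
    SELECT u.user_id, u.name, c.url
    FROM users u JOIN clicks c ON u.user_id = c.user_id
""").fetchall()
print(rows)  # same rows as the hand-rolled version (order may differ)
```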
No one, however, has been able to make database systems scale to the data volumes we are seeing today at an affordable price point. Sure, there is Teradata, with its custom-designed hardware/software solution; it is the only serious game in town for sophisticated analysis of data in the terabytes. But the entry price is several million dollars, and as your data grows, the price per incremental terabyte stays at nosebleed levels. Proprietary hardware also ignores the inexorable logic of Moore's law.
What the world needs is a database system natively designed and architected from the ground up for the new hardware platform: commodity clusters. Such a system would greatly expand the number of companies and people that can do data analysis on terabytes of data. I'm delighted that Aster Data launched exactly such a database system this week. I led the seed-round investment in Aster Data and sit on their board, so I've been waiting for this launch with great anticipation.
Aster comes with impressive testimonials from early customers. MySpace has been using Aster for several months now to analyze their usage data. Their data set has grown from 1 terabyte to 100 terabytes in just 100 days, adding roughly a terabyte a day. I know of no other database installation that has grown at this rate. The growth is enabled by the commodity hardware platform underneath: just add a server or a rack as the data grows. Another early adopter is Aggregate Knowledge, whose data set has also grown at an impressive rate.
There's an interesting backstory to my involvement with Aster. When I taught the Distributed Databases course at Stanford in 2004, my TA was a bright young PhD student named Mayank Bawa. I was very impressed with Mayank's intellectual abilities and his can-do attitude. While working on the course, we got to discussing how a shift was happening in hardware platforms, yet no database systems vendor had built a system targeting the new platform. A few months later, Mayank told me he intended to start a company to build just such a database, along with two other Stanford graduate students: Tassos Argyros and George Candea. It was a brain-dead simple investment decision for me.
Mayank, Tassos, and George are among the best thinkers I know on the topic of large-scale data analytics. They have been exceedingly thoughtful in the architecture of their new database, and there is some really nifty stuff in there, including incremental scalability, intelligent data partitioning, and query optimization algorithms. One of the coolest demos I've seen: they run a large query that touches terabytes of data, then kill several of the nodes actively processing it. The system transparently migrates the query processing, and out pop the results. This kind of behavior is essential for queries that take minutes to hours to execute, where the long running time raises the likelihood of a failure mid-query, especially on commodity nodes.
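Aster's actual recovery machinery is proprietary, so the following is only a toy illustration of the general idea, with an invented replica map and node names: keep each partition of the data on more than one node, and when a node dies mid-query, re-run just its partitions on a surviving replica instead of restarting the whole query.

```python
# A toy illustration of the general idea (not Aster's implementation):
# each data partition lives on more than one node, so when a node dies,
# its partitions are re-scanned on a surviving replica instead of
# restarting the whole multi-hour query.
import random

# Hypothetical layout: partition id -> nodes holding a replica of it.
replicas = {0: ["node-a", "node-b"], 1: ["node-b", "node-c"], 2: ["node-c", "node-a"]}
alive = {"node-a", "node-b", "node-c"}

def scan_partition(node, part):
    # Stand-in for real work: each partition contributes a partial result.
    if node not in alive:
        raise ConnectionError(f"{node} is down")
    return 1000 + part

def run_query():
    total = 0
    for part, nodes in replicas.items():
        for node in nodes:           # try each replica until one succeeds
            try:
                total += scan_partition(node, part)
                break
            except ConnectionError:  # node died: migrate to the next replica
                continue
        else:
            raise RuntimeError(f"all replicas of partition {part} are down")
    return total

alive.discard(random.choice(sorted(alive)))  # kill a random node before the scan
print(run_query())                           # prints 3003 regardless of which node died
```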
The LAMP stack, with MySQL as the base, has transformed and democratized web application development. In a similar vein, I expect that we will see the emergence of a stack that democratizes large-scale data analytics applications. Aster Data could well be the foundation of that stack.