The internet and its usage are generating vast amounts of data. The volume of new data being generated today is unprecedented in human history. Much insight might be gleaned by mining this trove intelligently; yet most companies archive most of their data and process only a tiny sliver of it. For example, most internet companies I know with significant traffic archive all usage data older than a couple of months. Who knows what gems of insight they're missing?
Over the past few years, the hardware platform that can tackle very high data volumes has been identified: the cluster of commodity Linux nodes interconnected by commodity gigabit ethernet. Google popularized the platform and introduced Map Reduce, which today is the best tool available for serious data crunching at massive scale. The Hadoop project makes Map Reduce available to the rest of us.
Yet the Map Reduce paradigm has its limitations. The biggest problem is that it involves writing code for each analysis, which limits the number of companies and people that can use the paradigm. The second problem is that joining different data sets is hard. The third problem is that Map Reduce consumes files and produces files; after a while the files multiply and it becomes difficult to keep track of things. What's lacking is a metadata layer, like the catalog in database systems. Don't get me wrong; I love Map Reduce, and there are applications that don't need these things, but increasingly there are applications that do.
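To make the first point concrete, here is a minimal sketch of the kind of code you end up writing for even a trivial analysis -- counting page views per URL -- in the style of a Hadoop Streaming job. The log format and field positions are hypothetical, and the "shuffle" is faked locally; the point is simply how much machinery a one-off question requires.

    # Hypothetical streaming-style job: count page views per URL.
    # Assumes tab-separated log lines of the form: timestamp<TAB>user<TAB>url
    def map_views(lines):
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                yield fields[2], 1                     # emit (url, 1)

    def reduce_views(pairs):
        counts = {}
        for url, n in pairs:                           # the framework would group by key;
            counts[url] = counts.get(url, 0) + n       # here we just accumulate locally
        return counts

    logs = ["t1\talice\t/home", "t2\tbob\t/home", "t3\talice\t/about"]
    print(reduce_views(map_views(logs)))               # {'/home': 2, '/about': 1}

A SQL user would express the same analysis as a one-line GROUP BY (SELECT url, COUNT(*) FROM logs GROUP BY url), which is exactly the gap this paragraph is pointing at.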
Database systems have been the workhorses of conventional (small to medium scale) data analysis applications for a couple of decades now. They have some advantages over Map Reduce. SQL is a standard for ad-hoc data analysis without actually writing code. There's a big ecosystem around SQL: people familiar with SQL, tools that work with SQL, application frameworks, and so on. Joins are straightforward in SQL, and databases provide implementations such as hash joins that scale well to big data. There is a large body of knowledge around query optimization that allows declaratively specified SQL queries to be optimized for execution in nontrivial ways. There is support for concurrency, updates, and transactions. There is a metadata catalog that allows people and programs to understand the contents of the database.
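As a rough illustration of why joins are a database strength, here is a toy sketch of the hash join idea: build a hash table on the smaller input, then stream the larger one past it. Real engines add partitioning, spilling to disk, and parallelism across nodes, none of which is shown here, and the table and column names are made up.

    from collections import defaultdict

    def hash_join(small, large, key):
        # Build phase: index the smaller relation by its join key.
        index = defaultdict(list)
        for row in small:
            index[row[key]].append(row)
        # Probe phase: stream the larger relation and look up matches.
        for row in large:
            for match in index.get(row[key], []):
                yield {**match, **row}

    users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
    clicks = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"},
              {"user_id": 1, "url": "/c"}]
    for joined in hash_join(users, clicks, "user_id"):
        print(joined)          # each click row enriched with the user's name

Because the build side fits in memory (or is partitioned so that each piece does), the large side is read only once, which is what lets this approach scale to big data.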
No one has been able to make database systems scale to the data volumes we are seeing today at an affordable price point. Sure, there is Teradata, with their custom designed hardware/software solution. They are the only serious game in town for sophisticated analysis of data in the terabytes. But the entry price point is in the several millions of dollars, and if your data size increases, the price per incremental terabyte is also at nosebleed levels. Also, proprietary hardware ignores the inexorable logic of Moore's law.
What the world needs is a database system natively designed and architected from the ground up for the new hardware platform: commodity clusters. Such a system greatly expands the number of companies and people that can do data analysis on terabytes of data. I'm delighted that Aster Data launched exactly such a database system this week. I led the seed round investment in Aster Data and sit on their board, so I've been waiting for this launch with great anticipation.
Aster comes with impressive testimonials from early customers. MySpace has been using Aster for several months now to analyze their usage data. Their data set has grown from 1 terabyte to 100 terabytes in just 100 days -- adding 1 terabyte a day. No other database installation in the world has grown at this rate. The growth is enabled by the commodity hardware platform underneath -- just add a server or a rack as the data grows. Another early adopter is Aggregate Knowledge, whose dataset has also grown at an impressive rate.
There's an interesting backstory to my involvement with Aster. When I taught the Distributed Databases course at Stanford in 2004, my TA was a bright young PhD student named Mayank Bawa. I was very impressed with Mayank's intellectual abilities and his can-do attitude. While working on the course, we got to discussing how a shift was happening in hardware platforms, yet no database systems vendor had created a system targeting the new platform. A few months later, Mayank told me he intended to start a company to build just such a database, along with two other Stanford graduate students: Tassos Argyros and George Candea. It was a brain-dead simple investment decision for me.
Mayank, Tassos and George are among the best thinkers I know on the topic of large scale data analytics. They have been exceedingly thoughtful in the architecture of their new database. They have some really nifty stuff in there, including incremental scalability, intelligent data partitioning, and query optimization algorithms. One of the coolest demos I've seen is when they run a large query that touches terabytes of data, and then kill several of the nodes in the system that are actively processing the query. The system transparently migrates query processing, and out pop the results. This kind of behavior is essential when you're dealing with queries that can take several minutes to hours to execute, increasing the likelihood of failures during query processing, especially when using commodity nodes.
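Aster's actual machinery is theirs to describe, and it is certainly more sophisticated than anything I can sketch here. But the general pattern behind that demo is easy to illustrate: a coordinator treats each data partition as a retryable unit of work and reassigns it to a surviving node when the original one dies. Everything below (node names, the random failure simulation, the toy "partial aggregate") is invented for illustration and is not Aster's code.

    import random

    class NodeFailure(Exception):
        pass

    def scan_partition(node, partition):
        # Stand-in for "node scans its slice of the data"; fails randomly
        # to simulate a crashed machine.
        if random.random() < 0.2:
            raise NodeFailure(f"{node} died on partition {partition}")
        return sum(partition)      # pretend the result is a partial aggregate

    def run_query(partitions, nodes):
        total = 0
        for partition in partitions:
            for node in nodes:
                try:
                    total += scan_partition(node, partition)
                    break          # this partition is done
                except NodeFailure:
                    continue       # migrate the work to the next live node
            else:
                raise RuntimeError(f"no node could process partition {partition}")
        return total

    print(run_query([[1, 2], [3, 4], [5, 6]], ["node-a", "node-b", "node-c"]))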
The LAMP stack, with MySQL as the base, has transformed and democratized web application development. In a similar vein, I expect that we will see the emergence of a stack that democratizes large-scale data analytics applications. Aster Data could well be the foundation of that stack.
Amazing - the developments in recent years are really very interesting. Best of luck to Aster Data!
Additionally, I read today that Yahoo has a homegrown petabyte-scale database based on Postgres. Here's a link to the article.
http://www.informationweek.com/news/software/database/showArticle.jhtml?articleID=207801436&subSection=
On a sidenote, I remember you mentioned Yahoo Glue as a promising new approach to displaying search results. You may want to check out Yahoo SearchMonkey and some of the existing search result upgrades.
http://gallery.search.yahoo.com/
Best wishes,
Jack
Posted by: jack | May 22, 2008 at 04:27 PM
Google introduced map/reduce?
I seem to recall learning about it in college over 10 years ago, and it was old then.
Who said "Most research in computer science describes how the author discovered what somebody else already knew"?
Posted by: tc | May 22, 2008 at 08:56 PM
HBase is part of Hadoop and was contributed by Powerset.
Posted by: yawl | May 22, 2008 at 09:24 PM
Check out CouchDB: http://couchdb.org/
Cheers
Jan
--
Posted by: Jan | May 22, 2008 at 11:46 PM
Another excellent alternative that you didn't mention is Mark Logic Server ( http://marklogic.com/ ). Of course, I'm biased, but check out MarkMail ( http://markmail.org/ ) for a slick, public example.
Posted by: John D. Mitchell | May 22, 2008 at 11:47 PM
One correction: MapReduce, and the Hadoop implementation in particular, does not work only on files.
It might seem that way, but it is actually quite easy to implement a different data provider that produces the data from some other source, divides it, and hands it off to the map tasks.
In fact, I wrote about this very thing on my blog; you might be interested in checking it out. It is a simple MapReduce application that does primality testing on generated numbers.
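As a toy sketch of that idea (deliberately not using Hadoop's actual InputFormat API), the input to the map step can be any record source -- here a generator of candidate numbers rather than a file:

    def number_source(start, count):
        # A "data provider" that generates records instead of reading a file.
        for n in range(start, start + count):
            yield n

    def mapper(n):
        # Emit (n, is_prime) for each generated number.
        is_prime = n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
        yield n, is_prime

    def run(source):
        # Stand-in for the framework: hand each record from the source to a map task.
        for record in source:
            for key, value in mapper(record):
                print(key, value)

    run(number_source(2, 20))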
Posted by: Alaa Salman | May 23, 2008 at 02:36 AM
Nice post! There is definitely a need for a simple, SQL-like query language to process large amounts of data distributed over commodity hardware.
I feel Hypertable (http://hypertable.org/about.html), an open-source implementation modeled on Bigtable (http://labs.google.com/papers/bigtable.html), may be a good way to address this problem. Hypertable uses a query language (HQL - Hypertable Query Language) that is similar to SQL. Hypertable can be installed over a distributed file system such as HDFS (http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html) or KFS (http://kosmosfs.sourceforge.net/). Since these file systems are fault-tolerant and scale well with the number of nodes in the system, Hypertable is also fault-tolerant and scalable.
I wonder how the Aster Data solution fares with respect to Hypertable.
Posted by: Mitul | May 23, 2008 at 03:49 PM
Seeing the comments here, it is clear that there will be many, many commodity cluster solutions in the coming years. A SQL-driven database system is one of them. But surely databases are just one form factor for utilizing commodity clusters? The bigger news here is the utilization of commodity clusters itself.
From 2005 to 2007, a lot of data processing junkies - for different reasons and from different perspectives - came to the same conclusion about distributed computing. It reminds one of a recent article (http://www.newyorker.com/reporting/2008/05/12/080512fa_fact_gladwell) on how 'the history of science is full of ideas that several people had at the same time.' The common idea here is 'commodity cluster utilization'.
Posted by: Anupam | May 26, 2008 at 12:43 PM
The reason LAMP and MySQL have transformed and democratized web development has almost everything to do with the words "almost free".
Data warehousing is still largely an expensive proposition that only large corporations can afford. If Map Reduce can provide data crunching and analysis at price points most of the web development community can afford, then it (in its open-source incarnation, aka Hadoop) will likely be the one chosen to democratize data warehousing for the masses. I can easily foresee more and more programming frameworks and tools developing around Map Reduce to compensate for what it lacks (Yahoo's Pig, Sawzall, etc. are already solving some of these problems). Of course, one will not be able to get around the cost of the "commodity hardware" itself that Map Reduce based solutions require -- unless one is willing to send the data off to the cloud (Amazon EC2, etc.) and rent the hardware there (renting Hadoop off of EC2 is clearly where the masses will start).
Now, from Aster Data's initial list of customers, it appears that they are also targeting corporate customers at this point - MySpace and Aggregate Knowledge, who clearly are going to be able to afford such solutions. In order for Aster Data (or any other such new database system) to truly democratize data warehousing for the rest of us, they'll either have to "give it away" or at least partner with EC2 or App Engine and then "give it away".
Posted by: Rajeev Tipnis | May 29, 2008 at 12:37 PM