

Amazing - the developments in recent years are really very interesting. Best of luck to Asterdata!

Additionally, I read today that Yahoo has a homegrown Petabyte database based on Postgres. Here's a link to the article.

On a sidenote, I remember you mentioned Yahoo Glue as a promising new approach to displaying search results. You may want to check out Yahoo SearchMonkey and some of the existing search result upgrades.

Best wishes,


Google introduced map/reduce?

I seem to recall learning about it in college over 10 years ago, and it was old then.

Who said "Most research in computer science describes how the author discovered what somebody else already knew"?


HBase is part of Hadoop and was contributed by Powerset.


Check out CouchDB: http://couchdb.org/


John D. Mitchell

Another excellent alternative that you didn't mention is Mark Logic Server ( http://marklogic.com/ ). Of course, I'm biased, but check out MarkMail ( http://markmail.org/ ) for a slick, public example.

Alaa Salman

One correction: MapReduce, and specifically the Hadoop implementation, does not work only on files.

It might seem that way, but it is actually very easy to implement a different data provider that produces the data from some other source, divides it into splits, and hands those splits off to the map tasks.

In fact, I wrote about this very thing on my blog. You might be interested in checking it out. It is a simple MapReduce application that does primality testing on generated numbers.
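To make the idea concrete, here is a minimal sketch in plain Python rather than Hadoop code. In Hadoop itself the extension point for this is a custom InputFormat; the function names below (`generate_numbers`, `make_splits`, `map_task`, `reduce_task`, `run_job`) are hypothetical and only mimic the flow described above: a provider generates the data, divides it into splits, and hands each split to a map task.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def generate_numbers(start: int, stop: int) -> Iterator[int]:
    """Hypothetical data provider: produces records without reading any file."""
    yield from range(start, stop)


def make_splits(source: Iterable[int], split_size: int) -> Iterator[List[int]]:
    """Divide the generated stream into fixed-size splits, one per map task."""
    split: List[int] = []
    for item in source:
        split.append(item)
        if len(split) == split_size:
            yield split
            split = []
    if split:  # hand off the final, possibly shorter, split
        yield split


def is_prime(n: int) -> bool:
    """Trial division; adequate for a small illustration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def map_task(split: List[int]) -> List[Tuple[str, int]]:
    """Map step: emit a (key, value) pair for every number in the split."""
    return [("prime" if is_prime(n) else "not_prime", n) for n in split]


def reduce_task(key: str, values: List[int]) -> Tuple[str, List[int]]:
    """Reduce step: collect all numbers that share a key."""
    return key, sorted(values)


def run_job(start: int, stop: int, split_size: int) -> Dict[str, List[int]]:
    """Run the map tasks one after another, group by key, then reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for split in make_splits(generate_numbers(start, stop), split_size):
        for key, value in map_task(split):
            grouped[key].append(value)
    return dict(reduce_task(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(run_job(1, 20, split_size=4))
```

The map tasks here run sequentially in one process; a real framework would schedule each split on a different node, but the provider's job of generating and dividing the input stays the same.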


Nice post! There is definitely a need for a simple, SQL-like query language to process large amounts of data distributed over commodity hardware.

I feel Hypertable (http://hypertable.org/about.html), an open source version of Bigtable (http://labs.google.com/papers/bigtable.html), may be a good way to address this problem. Hypertable uses a query language (HQL, the Hypertable Query Language) that is similar to SQL. Hypertable can be installed on top of a distributed file system such as HDFS (http://hadoop.apache.org/core/docs/r0.16.4/hdfs_design.html) or KFS (http://kosmosfs.sourceforge.net/). Since these file systems are fault-tolerant and scale well with the number of nodes in the system, Hypertable is also fault-tolerant and scalable.

I wonder how Aster Data's solution fares with respect to Hypertable.


Seeing the comments here, it is clear that there will be many, many commodity cluster solutions in the coming years. A SQL-driven database system is one of them. But surely databases are just one way of utilizing commodity clusters? Isn't the bigger news here the utilization of commodity clusters itself?

From 2005 to 2007, a lot of data processing junkies, for different reasons and from different perspectives, came to the same conclusion about distributed computing. It reminds one of a recent article (http://www.newyorker.com/reporting/2008/05/12/080512fa_fact_gladwell) on how 'The history of science is full of ideas that several people had at the same time.' Is the common idea here 'commodity cluster utilization'?

Rajeev Tipnis

The reason LAMP and MySQL have transformed and democratized web development has almost everything to do with the words "almost free".

Datawarehousing is still largely an expensive proposition that only large corporations can afford. If Map Reduce can provide data crunching and analysis at price points that most of the web development community can afford, then it (in its open-source incarnation, aka Hadoop) will likely be the one chosen to democratize datawarehousing for the masses. I can easily foresee more and more programming frameworks and tools developing around Map Reduce to compensate for what it lacks (Yahoo's Pig, Sawzall, etc. are already solving some of these problems). Of course, one will not be able to get around the cost of the "commodity hardware" itself that Map Reduce based solutions require, unless one is willing to send the data off to the cloud (Amazon EC2 etc.) and rent the hardware there (renting Hadoop on EC2 is clearly where the masses will start).

Now, from AsterData's initial list of customers, it appears that they are also targeting corporate customers at this point (MySpace, AggregateKnowledge), who are clearly going to be able to afford such solutions. In order for AsterData (or any other such new database system) to truly democratize datawarehousing for the rest of us, they'll either have to "give it away" or at least partner with EC2 or appengine and then "give it away".


Great post, I really found this useful! As a sidenote there's this new site on the net which is looking at bringing together like-minded database users & professionals. It's mainly a forum site but looks like it's got potential to grow. Here is a blurb from the site: ---- SQLSet.com the place to go for all things database. The Forums are grouped for easy navigation, registration is quick and simple. Once registered you can begin posting threads and exchanging ideas. ---- I'd recommend visiting if you've got an interest in databases: http://www.sqlset.com Cheers
