Comments

Shalin Shekhar Mangar

You should also look at Hive, which is being contributed by Facebook to the Hadoop project:

1. http://issues.apache.org/jira/browse/HADOOP-3601
2. http://wiki.apache.org/hadoop/Hive

Brendan O'Connor

Hah, someone else already pointed out Hive -- beat me to it. It has an SQL-like language implemented on top of Hadoop.

Chris Olston

Hi,

Nice article, thanks!

One clarification: Pig Latin does not preclude query optimization. You can optimize Pig Latin in much the same way one optimizes relational algebra trees. In fact we are building an optimizer for Pig now, and have a prototype running that does some basic System-R-style optimizations.

The point is that Pig Latin does not *require* a query optimizer -- a Pig Latin program implies a particular order of evaluation of the operators. This property is one that Pig Latin shares with Map-Reduce. And it is one of the important distinguishing characteristics between map-reduce and SQL database systems, which map-reduce advocates often point to. On many occasions I have heard/read map-reduce advocates voice their pleasure in the ability to simply "plug together pipes" and have the data flow through the pipes in the programmed order, rather than trust that the system will come up with a good order.
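
To make the "plug together pipes" style concrete, here is a toy sketch -- in Python rather than Pig Latin, with invented stage names and a made-up file layout -- in which the program itself fixes the order of evaluation:

    # Each stage is an explicit pipe; the program, not an optimizer,
    # decides that filtering happens before grouping.
    def load(path):
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n").split("\t")

    def keep_clicks(records):
        for rec in records:
            if rec[1] == "click":  # hypothetical event-type column
                yield rec

    def group_by_user(records):
        groups = {}
        for rec in records:
            groups.setdefault(rec[0], []).append(rec)  # key on user id
        return groups

    # Runs load -> filter -> group in exactly this order, because that
    # is how the pipes were connected.
    clicks_by_user = group_by_user(keep_clicks(load("events.tsv")))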

Even a person with a database background such as myself must admit that query optimization is a double-edged sword. The benefits of query optimization (and data independence) are touted in any database textbook, and certainly are not to be scoffed at. But on the other hand it's a hard problem, and even systems that have been around for a long time often get it wrong. That's why commercial database systems support "optimizer hints", which give you a roundabout way to control the query plan.

Some programmers prefer to specify a query plan directly, rather than try to coerce the optimizer into generating a good plan. That's the niche Pig Latin (and other "dataflow" languages like Tioga and Aurora) aims to fill.

-Chris

Anand Rajaraman

Shalin and Brendan: Thanks for the pointer to Hive. I'll check it out.

Chris: Thanks for your comment on Pig Latin -- straight from the pig's mouth, as it were (sorry, couldn't resist -- meant as a compliment :-). I do agree there is some appeal in the simplicity of "connecting the pipes".

The one concern I have with this approach is that as data sizes change, or as cluster sizes and topology change, a plan that was once optimal may no longer be right. Also, you must admit the bar is higher than for writing SQL. That said, both the Pig Latin approach and the SQL approach have their place. Great work on Pig Latin, and please do keep pushing the frontier!

Joydeep Sen Sarma

Hi Anand,

Happy to see a couple of people chime in about Hive already. We have been working on it for a few months now (a precursor version has been operational at Facebook since the beginning of 2008) and just released it to open source via HADOOP-3601.

While the post talks a lot about query language design, Hadoop and Aster/GP also differ fundamentally in system design. Hadoop is far more loosely coupled and fault tolerant than the commercial systems. It also makes fewer guarantees (yes, the file system can be up without all the blocks) and does not promise database-like transactional consistency (or durability). In our experience, this kind of system design is just right for data warehousing scenarios. Partly as a result, Hadoop scales to an extremely large number of nodes (while potentially being less efficient). The Hadoop file system treats data placement hints as an optimization, not a requirement. Building optimizations as a best-effort layer, rather than as a fundamental feature, is a key architectural advantage, IMHO.

Regarding M/R + DB integration, I would actually say that Greenplum's approach is more practical. A lot of engineering-minded data people now know and love Hadoop, and GP allows them to keep using the software they love. That is far better than trying to make them all switch to Aster's custom map-reduce container. Hadoop is highly extensible, and there is a community of people writing extensions and such. Strategically, I would rather be on the side of that community than against it. Regarding isolation, the Hadoop community has been working on numerous features; most prominently, there are a couple of fair schedulers around now (one contributed by Facebook -- check out HADOOP-3746).

While Hive is based on SQL, we have also taken Hadoop's approach to extensibility. We can ingest object data of any type (with an appropriate, hopefully community-written, adapter) and run SQL over such data. We use the enormously popular Hadoop streaming model to plug user scripts and functions into different parts of a SQL query, and of course we support UDFs and such. Regarding optimizations, we will probably support hint-based optimization to begin with. As you mention, it's hard to get things right all the time, but there aren't that many different ways to skin this stuff, and a few hints go a long way toward configuring the right plan for a given data set.
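
To give a flavor of the streaming model: a user script is just a program that reads records on stdin and writes tab-separated key/value lines on stdout, and it is handed to the streaming jar via the -mapper or -reducer options. Here is a minimal sketch of such a mapper in Python (the record layout -- a user id in column 0 and a URL in column 2 -- is invented purely for illustration):

    #!/usr/bin/env python
    # Minimal Hadoop-streaming-style mapper: read tab-separated records
    # on stdin, emit "key<TAB>value" lines on stdout. The assumed record
    # layout (user id, event type, url) is hypothetical.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records
        user_id, url = fields[0], fields[2]
        sys.stdout.write("%s\t%s\n" % (user_id, url))

A reducer is the same kind of script, except that its input arrives sorted by key.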

We also have indexing working in prototype mode, although its integration into the execution plan is some ways off.


Luke Lonergan

Hi Anand,

Greenplum's implementation is tightly coupled and has nothing to do with Hadoop, except that you can easily port source code from Hadoop to Greenplum.

Our tight coupling to the Greenplum parallel dataflow engine scales to thousands of processors and typically runs 50-100 times faster than Hadoop's executor.

We also offer a number of key features and extensions to the MapReduce paradigm, including:
- No SQL required to write an MR program
- Named, multi-column MAP/REDUCE input/output
- Simple join syntax (one line SQL-like joins)
- Native parallel network file access
- Mixed languages (Perl / Python / SQL / etc) in a single MR program
- Fully pipelined task execution - no need to land intermediate data on disk

Many of our customers have solved real business problems with Greenplum MapReduce. Check out the tutorial on our MapReduce landing page:
http://www.greenplum.com/resources/MapReduce/

Luke Lonergan

Disclosure: I am an investor in Greenplum ;-)

Anand Rajaraman

Joydeep and Luke, thanks for the detailed comments! This is an exciting time for large-scale data analysis with many approaches being tried.

Steve Wooledge

Just to chime in on why Aster has taken the approach we have for In-Database MapReduce: the reason we didn't take a Hadoop-like path is precisely because we have a lot of appreciation for Hadoop and its development community -- we do not believe the world needs a second Hadoop.

What we aim to do is completely different: we want to bring the power of MapReduce to the database community. We humbly believe that one needs to figure out how to tightly integrate SQL with MapReduce to make this happen, which is what we have achieved for the first time. More here: http://www.asterdata.com/blog/index.php/2008/09/06/differences-between-aster-and-hadoop/

Joydeep Sen Sarma

@Steve - there's a slight irony in the document above. Analysts at traditional database companies are unlikely to ever write map-reduce code. We find that they are not even willing to write SQL -- they want to use standard reporting tools (and therefore the point about standard SQL and supporting tools is well taken). But it does leave the question open: if M/R in commercial databases is not targeted at companies with engineering talent, who is it targeted at?

Tasso Argyros

Joydeep, you bring up a great point. In fact, some of our customers may indeed never write SQL/MR functions -- but they can still use SQL/MR functions developed by others, whether Aster or third parties, since our interface is so non-disruptive. We are seeing this today in actual customer situations, and it's one way to see the power of what we are doing.

But I also believe there is an increasing need for enterprises, traditional or not, to do more with their data and compete on analytics. Amazon, Capital One, and Wal-Mart are all data points of the same trend, and they do an amazing amount of analytics; as a result, they are very competitive and successful. This trend will force enterprises to build more analytical expertise, e.g. by employing more people who can develop innovative analytical primitives. On our platform, these individuals (smart engineers, smart analysts, or whoever) would write In-Database MapReduce functions that can then be reused by the rest of the enterprise through standard interfaces (SQL, JDBC/ODBC, etc.). We also expect the database ecosystem (e.g. BI tools) to gradually catch up with such innovations, further simplifying the use of more analytics. This, again, is something we have seen working in the real world.
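
To give a flavor of that reuse, here is a sketch (everything in it -- the function name, table, and connection details -- is purely illustrative, and the ON/PARTITION BY shape is only meant to suggest the SQL/MR style). Any JDBC/ODBC-based tool, or a few lines of ordinary client code, can invoke a MapReduce function that an engineer has published:

    # Illustrative only: calling a hypothetical SQL/MR function named
    # "sessionize" over a standard ODBC connection from Python.
    import pyodbc  # any DB-API/ODBC client would do

    conn = pyodbc.connect("DSN=warehouse")  # assumed data source name
    cur = conn.cursor()
    cur.execute("""
        SELECT user_id, session_id, COUNT(*) AS clicks
        FROM sessionize(ON clickstream
                        PARTITION BY user_id
                        ORDER BY ts)
        GROUP BY user_id, session_id
    """)
    for row in cur.fetchall():
        print(row)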

Finally, I believe there are people in the database community who will be willing to do data transformations and analytics by writing SQL/MR scripts, just as they would use UDFs or stored procedures in a traditional database. These are talented technical people who live and breathe databases, and by developing expertise in SQL/MR they may become the next database elite.

To sum up, I think we have a clear vision of how we can open up analytics to the database community. And we believe it requires both innovations like MapReduce and a disciplined belief in standard interfaces like SQL.

Ronnie Chaiken

In the following paper, presented at VLDB, we describe an MR/SQL language we have developed and implemented. Both the MR and the SQL constructs are composable and allow users to create applications without thinking about any of the distributed and fault-tolerance machinery underneath. Users write programs as if they were targeting a single box. All of the infrastructure described in the paper is running hundreds of jobs a day on production clusters of many, many thousands of servers.

"SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets," Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou, in Proc. of the 2008 VLDB Conference (VLDB'08).

Tadej

Cascading is another approach to map-reduce on top of Hadoop, with a somewhat more intuitive "filters-pipes-taps" programming interface:

http://www.cascading.org/

Disclosure: not an investor :)

Grunge Designer

Hats off! Great post -- it helped me with my college assignment :)
