Popularized by Google, the MapReduce paradigm has proven to be a powerful way to analyze large datasets by harnessing clusters of commodity machines. While it provides a straightforward computational model, the approach suffers from some key limitations, as discussed in a prior post:
- The restriction to a rigid data flow model (Map followed by Reduce). Sometimes you need other flows, e.g., map-reduce-map, union-map-reduce, or join-reduce.
- Common data analysis operations that database systems provide as primitives -- e.g., join, filter, common aggregates, group by, union, distinct -- must be recoded by hand each time in Java or C/C++ (see the sketch after this list).
- The programmer has to hand-optimize the execution plan, for example by deciding how many map and reduce nodes are needed. For complex chained flows, this can become a nightmare. Databases provide query optimizers for exactly this purpose -- the precise sequence of operations is decided by the optimizer rather than by the programmer.
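To make the second limitation concrete, here is a minimal Python sketch in the Hadoop Streaming style (mapper and reducer are separate programs reading stdin and writing stdout) of a plain group-by count -- something a SQL engine expresses in a single GROUP BY clause. The tab-separated input layout and the choice of the first field as the grouping key are assumptions made purely for illustration.

```python
# A hand-written group-by/count in the Hadoop Streaming style: the mapper
# and reducer are separate programs that read stdin and write stdout.
# A SQL engine would express the same computation as one GROUP BY clause.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (key, 1) for every record; the key is assumed to be the first
    tab-separated field of each input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            yield fields[0], 1

def reducer(pairs):
    """Sum the counts per key. Hadoop delivers the pairs sorted by key,
    so itertools.groupby is sufficient."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    if role == "map":
        for key, count in mapper(sys.stdin):
            print(f"{key}\t{count}")
    else:  # "reduce": stdin holds sorted "key<TAB>count" lines
        sorted_pairs = ((k, int(v)) for k, v in
                        (line.rstrip("\n").split("\t") for line in sys.stdin))
        for key, total in reducer(sorted_pairs):
            print(f"{key}\t{total}")
```

Piping the mapper's output through a sort and then into the reducer mimics the shuffle phase locally; the point is simply how much plumbing a one-line SQL aggregate turns into.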
Three approaches have emerged to bridge the gap between relational databases and MapReduce. Let's examine each approach in turn and then discuss their pros and cons.
The first approach is to create a new higher-level scripting language that uses Map and Reduce as primitive operations. Using such a language, one can express operations that require multiple MapReduce steps, together with joins and other set-oriented data processing operations. This approach is exemplified by Pig Latin, which is being developed by a team at Yahoo. Pig Latin provides primitive operations commonly found in database systems, such as Group By, Join, Filter, Union, ForEach, and Distinct. Each Pig Latin operator can take a User Defined Function (UDF) as a parameter.
The programmer creates a script that chains these operators to achieve the desired effect. In effect, the programmer hand-codes the query execution plan that a SQL engine might have generated. The effect of a single MapReduce can be simulated by a Filter step followed by a Group By step. In many common cases, UDFs aren't even needed, as long as the filtering and grouping criteria are straightforward ones supported natively by Pig Latin. The Pig Latin engine translates each script into a sequence of MapReduce jobs on a Hadoop cluster. The Pig Latin team reports that 25% of Hadoop jobs at Yahoo today originate as Pig Latin scripts. That's impressive adoption.
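For intuition, here is a tiny Python sketch -- ordinary Python, not Pig Latin syntax and not how Pig executes anything -- of the filter-then-group-by-then-aggregate pipeline such a script would express, written as chained operations over an in-memory collection. The log-record schema is hypothetical.

```python
# Conceptual sketch only: the filter -> group by -> aggregate pipeline that
# a Pig Latin script would express as chained operators
# (FILTER, GROUP BY, FOREACH ... GENERATE), shown here in plain Python.
from collections import defaultdict

def filter_group_aggregate(records):
    """records: dicts like {"user": ..., "url": ..., "bytes": ...},
    a hypothetical log schema used only for illustration."""
    # FILTER: keep records satisfying a simple predicate
    filtered = (r for r in records if r["bytes"] > 0)

    # GROUP BY: bucket the surviving records by a key
    groups = defaultdict(list)
    for r in filtered:
        groups[r["user"]].append(r)

    # FOREACH ... GENERATE: compute one aggregate per group
    return {user: sum(r["bytes"] for r in rows) for user, rows in groups.items()}

if __name__ == "__main__":
    logs = [
        {"user": "alice", "url": "/a", "bytes": 120},
        {"user": "bob",   "url": "/b", "bytes": 0},
        {"user": "alice", "url": "/c", "bytes": 30},
    ]
    print(filter_group_aggregate(logs))  # {'alice': 150}
```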
Another interesting solution in this category is Sawzall, a scripting language developed at Google. Sawzall allows MapReduce operations to be coded in a language reminiscent of awk. If your computation fits the Sawzall model, the code is much shorter and more elegant than the equivalent C/C++/Java Map and Reduce functions. Sawzall, however, suffers from two drawbacks: it limits the programmer to a predefined set of aggregations in the Reduce phase (although it supplies a large library of these), and it offers no support for analyses that go beyond a single MapReduce step, which Pig Latin does. Most important, Sawzall is not available outside of Google, while Pig Latin has been open-sourced by Yahoo.
The second approach is to integrate MapReduce with a SQL database. Two database companies have recently announced support for MapReduce: Greenplum and Aster Data. Interestingly, they have taken two very different approaches. I will call Greenplum's approach "loose coupling" and Aster Data's approach "tight coupling". Let's examine each in turn.
Greenplum's loose-coupling approach ties together Greenplum's database with Hadoop's implementation of MapReduce. A Hadoop MapReduce operation is visible as a database view inside Greenplum's SQL interpreter. Conversely, Hadoop map and reduce functions can access data in the database by iterating over the results of database queries. Issuing a SQL query that uses a MapReduce view launches the corresponding MapReduce operation, whose results can then be processed by the rest of the SQL query.
Aster Data's tight-coupling approach is more interesting: the database natively supports MapReduce, with no need for Hadoop. Map and reduce functions can be written in a variety of programming languages (C/C++, Java, Python). Aster has extended the SQL language itself to specify how these functions are invoked, creating a new SQL dialect called SQL/MR. One of the cool features is that map and reduce functions are automatically polymorphic, just like native SQL functions such as SUM and COUNT: the programmer writes them once, and the database engine can invoke them on rows with different numbers of columns and with columns of different types. This is a huge convenience over the Hadoop approach.
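To illustrate what that polymorphism buys you, here is a conceptual Python sketch -- emphatically not Aster's SQL/MR API -- of a single row function that adapts to whatever columns it is handed at invocation time. The two toy "tables" and their schemas are made up for the example.

```python
# Conceptual sketch of the polymorphism idea only: plain Python, not
# Aster's SQL/MR API. One row function adapts to whatever columns it is
# handed -- it sums every numeric column, whatever the columns are called.
from typing import Any, Dict, Iterable, Iterator

def numeric_total(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """For each input row, emit the row plus a computed column holding the
    sum of all numeric columns present."""
    for row in rows:
        total = sum(v for v in row.values() if isinstance(v, (int, float)))
        yield {**row, "numeric_total": total}

if __name__ == "__main__":
    # Two hypothetical "tables" with different schemas, handled by one function.
    clicks = [{"user": "alice", "clicks": 3, "dwell_secs": 42.5}]
    sales  = [{"region": "west", "q1": 10, "q2": 12, "q3": 9, "q4": 11}]
    print(list(numeric_total(clicks)))  # ... 'numeric_total': 45.5
    print(list(numeric_total(sales)))   # ... 'numeric_total': 42
```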
What are the pros and cons of these three approaches? The advantage of the Pig Latin approach is that it works directly at the file level, and therefore it can express MapReduce computations that don't fit the relational data model. An example of such an operation is building an inverted index over a collection of text documents. Databases in general are bad at handling large text and image data, which they treat as "blobs."
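As a sketch of that inverted-index example, here are map and reduce functions in Python: the map phase emits (word, doc_id) pairs and the reduce phase collects the postings list for each word. The (doc_id, text) input layout is an assumption made for illustration.

```python
# Sketch of the inverted-index computation: map emits (word, doc_id) pairs,
# reduce collects the sorted postings list per word.
from itertools import groupby

def index_map(doc_id, text):
    # set() so each document contributes a given word at most once
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reduce(pairs):
    """pairs must arrive sorted by word, as a MapReduce shuffle delivers them."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sorted(doc_id for _, doc_id in group)

if __name__ == "__main__":
    docs = {"d1": "pigs can fly", "d2": "pigs eat a lot"}
    shuffled = sorted(pair for doc_id, text in docs.items()
                      for pair in index_map(doc_id, text))
    print(dict(index_reduce(shuffled)))
    # {'a': ['d2'], 'can': ['d1'], ..., 'pigs': ['d1', 'd2']}
```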
The biggest disadvantage of the Pig Latin approach is the need to learn an entirely new programming language. There is a large base of developers and DBAs familiar with SQL, and Pig Latin does not have this support base. The second disadvantage is that the developer has to code query execution plans by hand, while the SQL programmer can rely on two decades of work on SQL query optimizers, which automatically decide the order of operations, the degree of parallelism, and when to use indexes.
The advantages and disadvantages of the SQL-integration approach largely mirror those of the Pig Latin approach. Greenplum's loose-coupling approach allows the use of files as well as relations, and therefore in principle supports file-based computations. The burden is on the application programmer, however, to decide on the scheduling and optimization of the Hadoop portion of the computation, without much help from the database.
Aster's tight-coupling approach, on the other hand, allows a much greater degree of automatic query optimization. The database system is intimately involved in the way map and reduce operations are scheduled across the cluster; it can decide the degree of parallelism as well as use strategies such as pipelining across MapReduce and relational operators. In addition, since the database system is solely in charge of overall resource allocation and usage, it can also sandbox user-defined code, preventing it from consuming too many resources and slowing down other tasks. For computations that use only data in the relational database, Aster has by far the most elegant solution; the weakness, of course, is that data stored outside the database is off-limits.
Update: Tassos Argyros from Aster Data points out that Aster's implementation does in fact allow access to data stored outside the database. The developer needs to write a UDF that exposes the data to the database engine.
All three approaches thus have their strengths and weaknesses. It's exciting to see the emergence of fresh thinking on data analytics, going beyond the initial file-oriented MapReduce model. Over time, these approaches will evolve, borrowing lessons from one another. Eventually, one or more of them will become the dominant paradigm for data analytics; I will be watching this space with great interest.
Disclosure: I'm an investor in Aster Data and sit on their Board of Directors.
You should also look at Hive, which is being contributed by Facebook to the Hadoop project:
1. http://issues.apache.org/jira/browse/HADOOP-3601
2. http://wiki.apache.org/hadoop/Hive
Posted by: Shalin Shekhar Mangar | September 05, 2008 at 10:29 PM
Hah, someone else already pointed out Hive -- beat me to it. It has an SQL-like language implemented on top of Hadoop.
Posted by: Brendan O'Connor | September 08, 2008 at 10:39 AM
Hi,
Nice article, thanks!
One clarification: Pig Latin does not preclude query optimization. You can optimize Pig Latin in much the same way one optimizes relational algebra trees. In fact we are building an optimizer for Pig now, and have a prototype running that does some basic System-R-style optimizations.
The point is that Pig Latin does not *require* a query optimizer -- a Pig Latin program implies a particular order of evaluation of the operators. This property is one that Pig Latin shares with Map-Reduce. And it is one of the important distinguishing characteristics between map-reduce and SQL database systems, which map-reduce advocates often point to. On many occasions I have heard/read map-reduce advocates voice their pleasure in the ability to simply "plug together pipes" and have the data flow through the pipes in the programmed order, rather than trust that the system will come up with a good order.
Even a person with a database background such as myself must admit that query optimization is a double-edged sword. The benefits of query optimization (and data independence) are touted in any database textbook, and certainly are not to be scoffed at. But on the other hand it's a hard problem, and even systems that have been around for a long time often get it wrong. That's why commercial database systems support "optimizer hints", which give you a round-about way to control the query plan.
Some programmers prefer to specify a query plan directly, rather than try to coerce the optimizer to generate a good plan. That's the niche Pig Latin (and other "dataflow" languages like Tioga and Aurora) aim to fill.
-Chris
Posted by: Chris Olston | September 08, 2008 at 10:42 AM
Shalin and Brendan: Thanks for the pointer to Hive. I'll check it out.
Chris: Thanks for your comment on Pig Latin -- straight from the pig's mouth, as it were (sorry, couldn't resist -- meant as a compliment :-) I do agree there is some appeal in the simplicity of "connecting the pipes".
The one concern I have with this approach is that as data sizes change or cluster sizes and topology change, a plan that was once optimal may no longer be right. Also, you must admit the bar is higher than writing SQL. That said, both the Pig Latin approach and the SQL approach have their place. Great work on Pig Latin, and please do keep pushing the frontier!
Posted by: Anand Rajaraman | September 08, 2008 at 02:39 PM
Hi Anand,
Happy to see a couple of people chime in about Hive already. We have been working on it for a few months now (a precursor version has been operational at Facebook since the beginning of 2008), and we just released it to open source via HADOOP-3601.
While the post talks a lot about query language design, Hadoop and Aster/GP differ fundamentally in system design. Hadoop is way more loosely coupled and fault tolerant than commercial systems. It also makes fewer guarantees (yes, the file system can be up without all the blocks) and does not promise database-like transactional consistency (or durability). In our experience, this kind of system design is just right for data warehousing scenarios. Partly as a result of this, Hadoop scales to an extremely large number of nodes (while potentially being less efficient). The Hadoop File System treats data placement hints as an optimization and not as a requirement. Building optimizations as a best-effort layer, and not as a fundamental feature, is a key architectural advantage imho.
wrt m/r + db integration, I would actually say that Greenplum's approach is more practical. A lot of engineering and data types now know and love Hadoop. GP allows people to use the software they love. This is way better than trying to make all of them switch to Aster's custom map-reduce container. Hadoop is way extensible and there is a community of people writing extensions and such. Strategically, I would rather be on the side of this community than against it. wrt isolation, the Hadoop community has been working on numerous features; most prominently, there are a couple of fair-scheduling schedulers around now (one contributed by Facebook -- check out HADOOP-3746).
While Hive is based on SQL, we have also taken Hadoop's approach to extensibility. We can ingest object data of any type (with an appropriate, hopefully community-written, adapter) and do SQL over such data. We use the enormously popular hadoop-streaming model to plug user scripts/functions into different parts of a SQL query. And of course we use UDFs and such. wrt optimizations, we will probably support hint-based optimizations to begin with; as you mention, it's hard to get things right all the time, but there aren't that many different ways to skin this stuff. A few hints go a long way in configuring the right plan for a given data set.
We also have indexing working in prototype mode, although integration into the execution plan is some ways off.
Posted by: Joydeep Sen Sarma | September 08, 2008 at 05:23 PM
Hi Anand,
Greenplum's implementation is tightly coupled and has nothing to do with Hadoop, with the exception that you can easily port source from Hadoop to Greenplum.
Our tight coupling to the Greenplum parallel dataflow engine scales to thousands of processors and typically runs 50-100 times faster than Hadoop's executor.
We also offer a number of key features and extensions to the MapReduce paradigm, including:
- No SQL required to write an MR program
- Named, multi-column MAP/REDUCE input/output
- Simple join syntax (one line SQL-like joins)
- Native parallel network file access
- Mixed languages (Perl / Python / SQL / etc) in a single MR program
- Fully pipelined task execution - no need to land intermediate data on disk
Many of our customers have solved real business problems with Greenplum MapReduce. Check out the tutorial on our MapReduce landing page:
http://www.greenplum.com/resources/MapReduce/
Posted by: Luke Lonergan | September 08, 2008 at 06:29 PM
Disclosure: I am an investor in Greenplum ;-)
Posted by: Luke Lonergan | September 08, 2008 at 06:31 PM
Joydeep and Luke, thanks for the detailed comments! This is an exciting time for large-scale data analysis with many approaches being tried.
Posted by: Anand Rajaraman | September 08, 2008 at 07:10 PM
Just to chime in on why Aster has taken the approach we have for In-Database MapReduce: the reason we didn't take a Hadoop-like path is precisely because we have a lot of appreciation for Hadoop and its development community -- we do not believe the world needs a second Hadoop.
What we aim to do is completely different: we want to bring the power of MapReduce to the database community. We humbly believe that to make this happen, one needs to figure out how to tightly integrate SQL with MapReduce, which is what we have achieved for the first time. More here: http://www.asterdata.com/blog/index.php/2008/09/06/differences-between-aster-and-hadoop/
Posted by: Steve Wooledge | September 08, 2008 at 08:08 PM
@Steve - there's a slight irony in the document above. Analysts at traditional database companies are unlikely to ever write map-reduce code. We find that they are not even willing to write SQL - they want to use standard reporting tools (and therefore the point about standard SQL and supporting tools is well taken). But it does leave an open question: if M/R in commercial databases is not targeted at companies with engineering talent, who is it targeted at?
Posted by: Joydeep Sen Sarma | September 09, 2008 at 09:12 PM
Joydeep, you bring up a great point. In fact, some of our customers may indeed never write SQL/MR functions – but they can still use SQL/MR functions developed by others, Aster or third parties, since our interface is so non-disruptive. We are seeing this today in actual customer situations, and it's a way to see the power of what we are doing.
But I also believe that there is an increasing need for enterprises, traditional or not, to do more with their data and compete on analytics. Amazon, Capital One and Wal-Mart are all data points of the same trend, and they do an amazing amount of analytics. As a result, they are very competitive and successful. This trend will force enterprises to build more analytical expertise, e.g., by employing more people who can develop innovative analytical primitives. In our platform, these individuals (smart engineers, smart analysts or whatever) would write In-Database MapReduce functions that can then be reused by the rest of the enterprise through standard interfaces (SQL, JDBC/ODBC, etc.). We also expect the database ecosystem (e.g., BI tools) to gradually catch up with such innovations, further simplifying the use of more analytics. This is something that we have again seen working in the real world.
Finally, I believe there are people in the database community who will be willing to do data transformations and analytics by writing SQL/MR scripts, just as they would use UDFs or "stored procedures" in a traditional database. These are talented technical people who live and breathe databases, and by developing expertise in SQL/MR they may become the next database elite.
To sum up, I think we have a clear vision on how we can open up analytics to the database community. And we believe it requires both innovations like MapReduce and a disciplined belief in standard interfaces like SQL.
Posted by: Tasso Argyros | September 10, 2008 at 09:51 AM
In the following paper, which we presented at VLDB, we describe a MR/SQL language we have developed and implemented. Both kinds of language constructs are composable and allow users to create applications without thinking about any of the distributed and fault-tolerance mechanisms underneath. Users write programs as if they were targeting a single box. All of the infrastructure described in the paper is running hundreds of jobs a day on production clusters of many, many thousands of servers.
“SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets,” Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou, in Proc. of the 2008 VLDB Conference (VLDB'08).
Posted by: Ronnie Chaiken | September 22, 2008 at 07:10 PM
Another approach to map-reduce on top of Hadoop, with a bit more intuitive 'filters-pipes-taps' programming interface.
http://www.cascading.org/
Disclosure: not an investor :)
Posted by: Tadej | October 03, 2008 at 03:25 AM