Datawocky

On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising

India's SMS GupShup Has 3x The Usage Of Twitter And No Downtime

I recently started using Twitter and have become a big fan of the service. I've been appalled by the downtime the service has endured, but sympathetic because I assumed the growth in usage was so fast that much might be excused. Then I read this TechCrunch post on Twitter's usage numbers, and sympathy turned to bafflement - because I'm intimately familiar with SMS GupShup, a startup in India that boasts usage numbers much, much higher than Twitter's, but has scaled without a glitch.

I'll let the numbers speak for themselves:

  • Users: Twitter (1+ million); SMS GupShup (7 million)
  • Messages per day: Twitter (3 million); SMS GupShup (10+ million)

Actually, these numbers don't even tell the whole story. India is a land of few PCs and many mobile phones. Thus, almost all GupShup messages are posted from mobile phones via SMS, and almost every message is delivered simultaneously to the website and, via SMS, to the mobile phones of followers. That's why SMS is in the name of the service. Contrast this with Twitter, where the majority of posting and reading happens on the web. Twitter has said in the past that sending messages through the SMS gateway is one of its most expensive operations, so the fact that only a small fraction of its users choose the SMS option makes its task a lot easier than GupShup's.

So I sat down with Beerud Sheth, co-founder of Webaroo, the company behind GupShup (the other founder, Rakesh Mathur, is my co-founder from a prior company, Junglee). I wanted to understand why GupShup has scaled without a hitch while Twitter is having fits. Beerud tells me that GupShup runs on commodity Linux hardware and uses MySQL, just like Twitter. The big difference is in the architecture: right from day one, they built a three-tier system, with JBoss app servers sitting between the webservers and the database.

GupShup also uses an object architecture (called the "objectpool") that allows each task to be componentized and run separately - this helps immensely with reliability (it automatically handles machine failures) and scalability (it scales dynamically to handle increased load). The objectpool model allows each module to run as multiple parallel instances, each doing a part of the work. Instances can run on different machines and be started or stopped independently, without affecting one another. So the "receiver", the "sender", and the "ad server" all run as multiple instances. As traffic grows, they just add more hardware -- no re-architecting. If a machine fails, its instances are restarted on a different machine.
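
To make the objectpool idea concrete, here is a minimal sketch of the general pattern as I understand it from Beerud's description: identical worker instances pulling from a shared queue, so capacity grows by starting more instances, and a dead instance's work is simply picked up by the others. All names and details here are illustrative, not GupShup's actual code.

```python
# Minimal sketch of the objectpool pattern (illustrative, not
# GupShup's code): a module such as the "sender" runs as several
# identical instances pulling tasks from a shared queue. To scale,
# start more instances; if one dies, the others keep draining the queue.
import queue
import threading

task_queue = queue.Queue()

def sender_instance(instance_id):
    """One 'sender' instance; several of these run in parallel."""
    while True:
        msg = task_queue.get()
        if msg is None:            # poison pill: shut this instance down
            task_queue.task_done()
            return
        print(f"sender-{instance_id} delivering: {msg}")
        task_queue.task_done()

# Start three parallel sender instances; scaling up is just a bigger range().
workers = [threading.Thread(target=sender_instance, args=(i,)) for i in range(3)]
for w in workers:
    w.start()

for msg in ["hello", "namaste", "gupshup"]:
    task_queue.put(msg)
task_queue.join()                  # wait until every message is delivered

for _ in workers:                  # tell each instance to shut down
    task_queue.put(None)
for w in workers:
    w.join()
```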

In read/write applications, the database is often the bottleneck. To avoid this problem, the GupShup database is sharded: the tables are broken into parts, with, for example, users A-F in one instance, users G-K in another, and so on. The shards are periodically rebalanced as the database grows. The JBoss middle tier contains the logic that hides this detail from the webserver tier.
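
In code, the routing the middle tier performs is roughly the following. This is a sketch: the shard boundaries and host names are made up, and GupShup's real logic lives in their JBoss tier.

```python
# Sketch of middle-tier shard routing by username range.
# Shard ranges and host names are made up for illustration.
SHARDS = [
    ("A", "F", "mysql-shard-1.internal"),
    ("G", "K", "mysql-shard-2.internal"),
    ("L", "R", "mysql-shard-3.internal"),
    ("S", "Z", "mysql-shard-4.internal"),
]

def shard_for_user(username):
    """Return the database host holding this user's rows."""
    first = username[0].upper()
    for lo, hi, host in SHARDS:
        if lo <= first <= hi:
            return host
    raise ValueError(f"no shard covers user {username!r}")

# Rebalancing means updating this table and migrating the affected rows.
# The web tier never sees any of it: it calls the middle tier, which
# routes each query to shard_for_user(...) and returns the result.
print(shard_for_user("anand"))   # -> mysql-shard-1.internal
```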

I'm not familiar with the details of Twitter's architecture, beyond knowing they use Ruby on Rails with MySQL. It appears that the biggest difference between Twitter and GupShup is three-tier versus two-tier. RoR is fantastic for turning out applications quickly, but the way Rails works out of the box leads to a two-tier architecture (webserver talking directly to database). We all learned back in the '90s that this is an unscalable model, yet it is the model for most Rails applications. No amount of caching can make a two-tier read/write application scale. The middle tier is what enables the database to be sharded, and that's what gets you the scalability. I believe Twitter has recently started using message queues as a middle tier to accomplish the same thing, but they haven't partitioned the database yet -- which is the key step here.

I don't intend this as a knock on RoR, but rather on the way it is used by default. At my company Kosmix, we use an RoR frontend for a website that serves millions of page views every day; behind it is a three-tier model where the bulk of the application logic resides in a middle tier coded in C++. Three-tier is the way to go to build scalable web applications, regardless of the programming language(s) you use.

Update: VentureBeat has a follow-up guest post by me, with some more details on SMS GupShup. Also my theory on why SMS GupShup is growing faster than Twitter: Microblogging is a nice-to-have in places with high PC penetration, like the US, but a must-have in places with very low PC penetration, like India.

Disclosure: My fund Cambrian Ventures is an investor in Webaroo, the company behind SMS GupShup. But these are my opinions as a database geek, not as an investor.

June 14, 2008 in India, Internet Infrastructure, Mobile

Why the World Needs a New Database System

The internet and its usage are generating vast amounts of data. The volume of new data being generated today is unprecedented in human history. Much intelligence might be gleaned by mining this trove; yet most companies today archive most of their data and process only a tiny sliver of it. For example, most internet companies I know with significant traffic archive all usage data older than a couple of months. Who knows what gems of insight they're missing?

Over the past few years, the hardware platform that can tackle very high data volumes has been identified: a cluster of commodity Linux nodes interconnected by commodity gigabit ethernet. Google popularized the platform and introduced MapReduce, which today is the best tool available for serious data crunching on a massive scale. The Hadoop project makes MapReduce available to the rest of us.

Yet the MapReduce paradigm has its limitations. The biggest problem is that it involves writing code for each analysis, which limits the number of companies and people who can use the paradigm. The second problem is that joins of different data sets are hard. The third problem is that MapReduce consumes files and produces files; after a while the files multiply and it becomes difficult to keep track of things. What's lacking is a metadata layer, such as the catalog in database systems. Don't get me wrong; I love MapReduce, and there are applications that don't need these things, but increasingly there are applications that do.
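
To see what "writing code for each analysis" means in practice, here is a Hadoop-streaming-style map and reduce pair for about the simplest analysis imaginable: counting log records per user. It's a generic sketch (the log format is made up), but even this trivial question requires custom code:

```python
# Hadoop-streaming-style MapReduce sketch: count log records per user.
# The 'user<TAB>action' log format is invented for illustration.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (user, 1) for each log line of the form 'user<TAB>action'."""
    for line in lines:
        user, _, _action = line.rstrip("\n").partition("\t")
        yield user, 1

def reducer(pairs):
    """Sum the counts for each user; input must be sorted by key."""
    for user, group in groupby(pairs, key=lambda kv: kv[0]):
        yield user, sum(count for _, count in group)

if __name__ == "__main__":
    # Simulate the shuffle phase locally: map, sort by key, reduce.
    mapped = sorted(mapper(sys.stdin), key=lambda kv: kv[0])
    for user, total in reducer(mapped):
        print(f"{user}\t{total}")
```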

Database systems have been the workhorses of conventional (small- to medium-scale) data analysis applications for a couple of decades now, and they have some advantages over MapReduce. SQL is a standard for ad-hoc data analysis that doesn't require writing code. There's a big ecosystem around SQL: people who know it, tools that work with it, application frameworks, and so on. Joins are straightforward in SQL, and databases provide implementations, such as hash joins, that scale well to big data. There is a large body of knowledge on query optimization that allows declaratively specified SQL queries to be optimized for execution in nontrivial ways. There is support for concurrency, updates, and transactions. And there is a metadata catalog that lets people and programs understand the contents of the database.
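
For contrast, the same per-user count as above, joined against a user table, is a single declarative statement in SQL. A sketch using SQLite, with made-up table names:

```python
# The per-user count, plus a join against a users table, as one
# declarative SQL statement. Table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE events (user_id INTEGER, action TEXT);
    INSERT INTO users VALUES (1, 'anand'), (2, 'beerud');
    INSERT INTO events VALUES (1, 'post'), (1, 'post'), (2, 'follow');
""")

# Join plus aggregation: no custom map or reduce code, and the
# database's optimizer picks the join strategy (hash join, etc.).
for name, n in conn.execute("""
    SELECT u.name, COUNT(*) AS n
    FROM users u JOIN events e ON u.user_id = e.user_id
    GROUP BY u.name
    ORDER BY n DESC
"""):
    print(name, n)
```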

But no one has been able to make database systems scale to the data volumes we are seeing today at an affordable price point. Sure, there is Teradata, with its custom-designed hardware/software solution; it is the only serious game in town for sophisticated analysis of data in the terabytes. But the entry price is in the several millions of dollars, and as your data grows, the price per incremental terabyte stays at nosebleed levels. Also, proprietary hardware forgoes the inexorable logic of Moore's law.

What the world needs is a database system designed and architected from the ground up for the new hardware platform: commodity clusters. Such a system would greatly expand the number of companies and people that can do data analysis on terabytes of data. I'm delighted that Aster Data launched exactly such a database system this week. I led the seed round investment in Aster Data and sit on their board, so I've been waiting for this launch with great anticipation.

Aster comes with impressive testimonials from early customers. MySpace has been using Aster for several months now to analyze its usage data. Their data set has grown from 1 terabyte to 100 terabytes in just 100 days -- roughly a terabyte a day. No other database installation in the world has grown at this rate. The growth is enabled by the commodity hardware platform underneath: just add a server or a rack as the data grows. Another early adopter is Aggregate Knowledge, whose dataset has also grown at an impressive rate.

There's an interesting backstory to my involvement with Aster. When I taught the Distributed Databases course at Stanford in 2004, my TA was a bright young PhD student named Mayank Bawa. I was very impressed with Mayank's intellectual abilities and his can-do attitude. While working on the course, we got to discussing how a shift was happening in hardware platforms, yet no database vendor had created a system targeting the new platform. A few months later, Mayank told me he intended to start a company to build just such a database, along with two other Stanford graduate students: Tassos Argyros and George Candea. It was a brain-dead-simple investment decision for me.

Mayank, Tassos, and George are among the best thinkers I know on the topic of large-scale data analytics. They have been exceedingly thoughtful in the architecture of their new database, and there is some really nifty stuff in there, including incremental scalability, intelligent data partitioning, and query optimization algorithms. One of the coolest demos I've seen is when they run a large query that touches terabytes of data and then kill several of the nodes actively processing it. The system transparently migrates the query processing, and out pop the results. This kind of behavior is essential when queries take minutes to hours to execute, which increases the likelihood of failures during processing, especially on commodity nodes.
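
I don't know the internals of Aster's recovery mechanism, but the behavior in that demo amounts to something like the following toy sketch: the system tracks which worker owns each data partition, and when a worker dies mid-query, its unfinished partitions are reassigned to the survivors, so the query still completes.

```python
# Toy sketch of mid-query fault tolerance (not Aster's actual
# algorithm): partitions of a large query are farmed out to worker
# nodes; when a node dies mid-query, its unfinished partitions are
# reassigned to the surviving nodes and the query still completes.
import random

partitions = {f"part-{i}": list(range(i, i + 5)) for i in range(8)}
workers = {"node-1", "node-2", "node-3"}

def run_query(partitions, workers):
    pending = dict(partitions)
    results = []
    while pending:
        name, rows = pending.popitem()
        node = random.choice(sorted(workers))
        if node == "node-2":
            # Simulate killing node-2 while it processes this partition:
            # remove it from the pool and put the partition back.
            workers.discard("node-2")
            pending[name] = rows
            continue
        results.append((name, node, sum(rows)))   # "process" the partition
    return results

for name, node, total in sorted(run_query(partitions, workers)):
    print(f"{name} on {node}: sum={total}")
```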

The LAMP stack, with MySQL as the base, has transformed and democratized web application development. In a similar vein, I expect that we will see the emergence of a stack that democratizes large-scale data analytics applications. Aster Data could well be the foundation of that stack.

May 22, 2008 in Data Mining, Internet Infrastructure

Network effects propel cloud computing

Erik Schonfeld over at TechCrunch has a very interesting post revealing that big enterprises are adopting Amazon's EC2/S3 services at a far faster pace than previously imagined:

A high-ranking Amazon executive told me there are 60,000 different customers across the various Amazon Web Services, and most of them are not the startups that are normally associated with on-demand computing. Rather the biggest customers in both number and amount of computing resources consumed are divisions of banks, pharmaceuticals companies and other large corporations who try AWS once for a temporary project, and then get hooked.

This is epochal stuff -- banks and pharma are notoriously late adopters of early-stage technology, so to see them in the vanguard of cloud computing (or perhaps I should say utility computing, but everyone says cloud) is astonishing. And it illustrates an important point that's been overlooked: there are significant network effects in the cloud computing business.

There are two basic forces behind these network effects:

  1. Code that works with large amounts of data needs to be close to the data (in the network topology sense).
  2. Any processing that consumes data generates data.

So, once one enterprising group within a company decides to place some data in S3 and the code to process it on EC2, it becomes a whole lot easier for someone else within the company who needs to run other code on that data to move their processing to EC2 as well. And since all this processing generates even more data, a virtuous cycle builds up. The stable state is for all of a company's data processing tasks to move into the same utility computing cloud, taking advantage of co-location to minimize data transfer latency and costs.
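
A back-of-the-envelope calculation shows why. The numbers below are hypothetical, not Amazon's actual prices, but the shape of the argument holds: every pass over the data from outside the cloud pays an egress toll, while in-cloud processing pays little or nothing.

```python
# Hypothetical prices for illustration only (NOT actual AWS rates):
# the point is the asymmetry, not the specific numbers.
DATA_GB = 5_000              # a modest 5 TB data set
EGRESS_PER_GB = 0.15         # hypothetical $/GB to move data out of the cloud
INTRA_CLOUD_PER_GB = 0.0     # transfers within the cloud: free or nearly so

print(f"process outside the cloud: ${DATA_GB * EGRESS_PER_GB:,.0f} per pass")
print(f"process inside the cloud:  ${DATA_GB * INTRA_CLOUD_PER_GB:,.0f} per pass")
```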

The network effect extends across companies as well. Often data created by company A is consumed by company B. When this "data interface" is voluminous, it makes economic sense for company B to move into the same utility cloud as company A. Some utility computing players are already exploiting this trend; for example, AppNexus is building a utility cloud optimized for ad networks and their associated ecosystem: analytics for publishers and advertisers. So much data is shared here (on ad campaigns and their performance) that there is a significant advantage to being in the same cloud.

The network effects argument leads to the interesting possibility that cloud computing becomes a winner-take-all game, like auctions: we might end up with one winner (Amazon, maybe?). A more likely outcome is that we end up with a couple of big general-purpose clouds (Amazon and Google, perhaps?) and a few niche clouds optimized for particular ecosystems (such as ad networks and social networks).

April 21, 2008 in Internet Infrastructure
