Datawocky

On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising

Oscar Halo: Academy Awards and the Matthew Effect

Slumdog Millionaire is one of my favorite movies of all time. I have followed the career of A. R. Rahman, who composed the movie's music, ever since his debut in 1992. So I was quite thrilled when Slumdog was nominated for 10 Academy Awards -- and Rahman in two categories, Original Score and Original Song. Thrilled, and a little surprised: while I like Rahman's work in Slumdog, I don't think it's his best. There is of course nothing wrong with that, as long as his work is better than that of this year's competitors.

But it got me thinking: if Rahman had composed the same music for an obscure film this year, rather than for Slumdog Millionaire, would he have been nominated? And even if he had been, what would his chances of winning be? In other words, is there a Matthew Effect in Oscar nominations -- to them that have, more shall be given? And, once nominated, is there a halo surrounding movies with many nominations that improves their odds of winning across many award categories? I thought it might be fun to run the numbers on past years' nominees and winners to see if I could find answers to these questions; it turned out to be instructive as well, since it required an extension of the standard Market Basket analysis from the world of data mining.

To get the data, I went straight to the source: the official Academy Awards database, which lists all the nominations and winners for the past 80 years. Unfortunately there is no single page that lists all this information, but it was fairly straightforward to write Python scripts that queried the website a few times and collated the data in tabular form. The result: a table that lists every nomination and winner in every category between 1927 and 2007. There were 8616 nominations in this period, representing 4215 distinct movies; so each movie was nominated, on average, in about two award categories.

Let's start with the nominations, to see if there is any evidence of the Matthew Effect. Let N(k) be the number of movies with exactly k nominations; the table below shows N(k) for k from 1 to 10. If we ignore two outliers (k=1 and k=7), N(k+1)/N(k) stays close to 0.6 for k between 2 and 10; the decay is certainly much slower than exponential. This suggests that the number of nominations roughly follows a power law, and a power law is the classic embodiment of the Matthew Effect, arising in contexts such as income and wealth distribution.

Nominations    Movies
          1      2796
          2       513
          3       260
          4       195
          5       128
          6        81
          7        87
          8        50
          9        31
         10        29
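
As a quick sanity check on the decay claim, here is a small Python snippet that computes the ratio N(k+1)/N(k) directly from the numbers in the table above:

```python
# Number of movies N(k) with exactly k nominations, from the table above.
n = {1: 2796, 2: 513, 3: 260, 4: 195, 5: 128,
     6: 81, 7: 87, 8: 50, 9: 31, 10: 29}

for k in range(1, 10):
    print("N(%d)/N(%d) = %.2f" % (k + 1, k, n[k + 1] / float(n[k])))
```

Apart from the two outliers called out above, the ratios cluster roughly around 0.6, and the tail decays even more slowly.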


The next step is to ask whether there are Oscar categories for which the effect is much stronger than for others. To study this, we divide the nominated movies into two groups: movies with 4 or fewer nominations (the "poor" group) and movies with 5 or more nominations (the "rich" group). Overall, 5382 nominations, or 62.5%, went to movies in the poor group and 3234 nominations, or 37.5%, went to movies in the rich group. Now let's look at the major Oscar categories. The biggest outliers are Best Picture and Best Director: nominations in both categories went overwhelmingly to movies in the rich group (70% and 73%, respectively, compared to the overall average of 37.5%). This is not surprising, because the best picture is typically one that is strong in many disciplines. There is some bias in the acting categories as well, but the big surprise is Film Editing: 68% of the nominations in this category went to "rich" movies. At the other extreme are Music and Special Effects, where approximately 70% of the nominated movies are in the "poor" group. So it appears that in these categories at least, talent gets its due without help from Matthew.

Moving from nominations to actual winners, the obvious question is: does being nominated in many categories boost the chances of winning in a disproportionate manner? To study this, I used the Market Baskets approach from Data Mining. In a classic Market Baskets scenario, we ask which items are often purchased together: such as milk and eggs. In this case, we model each movie as a basket: the contents of a movie's basket are its nominations and wins. Do movies with many nominations in their baskets have a disproportionate number of wins?

We must first deal with a technicality. In a normal market basket scenario, the contents of each basket are independent of every other basket; here, though, there are dependencies. Consider the market baskets of the movies nominated in a single award category in a particular year: exactly one of them must win that category, so exactly one of those baskets will also contain a win in that category.

It's easy to extend the Market Baskets model to capture this idea. I'll call the new model Constrained Market Baskets. Consider a subset S of market baskets; say, the set of market baskets corresponding to the "rich" movies with 5 or more nominations. Suppose movie M is in this set, and has been nominated in award category C. If there are (say) a total of 5 nominees in this category, then the prior probability of movie M's basket containing a win is 1/5 or 0.2. We can repeat this for all the categories M is nominated in, and add up the priors; this gives the "prior expected value" of the number of wins in M's basket. We add up the expected wins for all the movies in set S to get the total number of wins we expect the set S of movies to have; call this EW. Now, if OW is the actual number of "Observed Wins" across the movies in set S, we want to see if there is a discrepancy between EW and OW. In particular, we define the "win boost" of set S to be OW/EW. If the win boost is higher than 1, then the set S of market baskets has a disproportionate number of wins, and if it's much less than 1, then it has fewer wins than expected.
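
For concreteness, here is a minimal Python sketch of this computation, assuming the scraped data has been flattened into records of the form (movie, category, number of nominees in that category that year, won). The record layout and helper names are my own; the logic is just the EW/OW calculation described above.

```python
from collections import namedtuple

Nomination = namedtuple("Nomination", "movie category num_nominees won")

def win_boost(nominations, movies_in_set):
    """Return OW/EW for the subset S of movies given by movies_in_set."""
    expected_wins = 0.0   # EW: sum of prior win probabilities (1 / #nominees)
    observed_wins = 0     # OW: wins actually present in the baskets
    for nom in nominations:
        if nom.movie not in movies_in_set:
            continue
        expected_wins += 1.0 / nom.num_nominees
        observed_wins += 1 if nom.won else 0
    return observed_wins / expected_wins

def split_rich_and_poor(nominations, threshold=5):
    """Split movies into the 'rich' (>= threshold nominations) and 'poor' groups."""
    counts = {}
    for nom in nominations:
        counts[nom.movie] = counts.get(nom.movie, 0) + 1
    rich = {m for m, c in counts.items() if c >= threshold}
    poor = set(counts) - rich
    return rich, poor

# Usage: rich, poor = split_rich_and_poor(all_nominations)
#        print(win_boost(all_nominations, rich), win_boost(all_nominations, poor))
```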

When we do the analysis, the set of "poor" movies, with 4 or fewer nominations, had a total of 5382 nominations, with 1143 expected wins but only 840 observed wins: a win boost of 0.73. The "rich" movies, by contrast, with 3234 nominations, were expected to win 657 Oscars but actually won 958, a win boost of 1.46. In other words: the rich movies, which account for only 37.5% of all nominations, won more than half of all the Oscars actually awarded! Matthew!

Once again, we can break up the results by award category and look at the win boosts for specific categories. For most major award categories, the win boosts for the rich and poor groups are in line with the overall averages. As with nominations, the effect is very pronounced in the Best Picture and Best Director categories: there, the "poor" movies have a win boost of just 0.30! We noted that the Music category seemed resilient to Matthew in the case of nominations; but in the case of wins, this category has a win boost of 1.7 for the rich movies, in line with the overall average. The surprising and significant outlier is the Best Supporting Actor category, with win boosts very close to 1.0 for both the rich and the poor movies. It appears that the Best Supporting Actor award shows no evidence of Matthew; the other acting categories, however, are in line with the overall averages.

I don't have a deep enough understanding of the movie industry and the Academy Awards process to speculate on the reasons for these effects. Perhaps great talent attracts other great talent, and the Awards reflect that reality. And perhaps the difference between the behavior of wins and of nominations has to do with the fact that winners are chosen by simple plurality voting while nominees are chosen by a preferential voting scheme. In any case, I'm happy on two counts. The statistics on the Music category suggest that the Matthew Effect likely did not help Mr. Rahman secure his nominations; but now that he has been nominated, his chances of winning are greatly boosted by his association with Slumdog's 10 nominations. Jai Ho!

Update: A big night for Slumdog, winning 8 awards, including both the music and song awards for A. R. Rahman. While 8 awards is not the best Oscar performance ever, it is the most awards ever won by a movie with 10 nominations (the movies that won more also had more nominations). Matthew must be pleased.

February 21, 2009 in Data Mining | Permalink | Comments (13) | TrackBack (0)

Kosmix Adds Rocketfuel to Power Voyage of Exploration


Today I'm delighted to share some fantastic news. My company Kosmix has raised $20 million in new financing to power our growth. Even more than the amount of financing, I'm especially proud that the lead investor in this round is Time Warner, the world's largest media company. Our existing investors Lightspeed, Accel, and DAG participated in the round as well. The Kosmix team also is greatly strengthened by the addition of Ed Zander as investor and strategic advisor. In an amazing career that spans Sun Microsystems and Motorola, Ed has repeatedly demonstrated leadership that grew good ideas into great products and businesses. His counsel will be invaluable as we take Kosmix to the next level as a business.

In these perilous economic times, the funding is a big vote of confidence in Kosmix's product and business. Kosmix web sites attract 11 million visits every month, and we have a proven revenue model with significant revenues and robust growth. RightHealth, the proof-of-concept we launched in 2007, grew with astonishing rapidity to become the #2 health web site in the US. These factors played a big role in helping us close this round of funding with a healthy uptick in valuation from our prior round. Together with the money already in the bank from our prior rounds, we now have more than enough runway to take the company to profitability and beyond.

A few months ago, we put out an alpha version of Kosmix.com. Many people used it and gave us valuable feedback; thank you! We listened, and made changes. Lots of changes. The result is the beta version of Kosmix.com, which we launched today. What's changed? More information sources (many thousands), huge improvements in our relevance algorithms, a much-improved user interface, and a completely new homepage. Give it a whirl and let us know what you think.

To those of you new to Kosmix, the easiest way to explain what Kosmix does is by analogy. Google and Yahoo are search engines; Kosmix is an explore engine. Search engines work really well if your goal is to find a specific piece of information -- a train schedule, a company website, and so on. In other words, they are great at finding needles in the haystack. When you're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:

  • Looking to bake a chocolate cake? We have recipes, nutrition information, a dessert burn rate calculator, blog posts from chow.com, even a how-to video from Martha Stewart!
  • Loved one diagnosed with diabetes? Doctor-reviewed guide, blood sugar and insulin pump slide shows, calculators and risk checkers, quizzes, alternative medications, community.
  • Traveling to San Francisco? Maps, hotels, events, sports teams, attractions, travel blogs, trip plans, guidebooks, videos.
  • Writing an article on Hillary Clinton? Bio, news, CNN videos, personal financial assets and lawmaker stats, Wonkette posts, even satire from The Onion.
  • Into Radiohead? Bio, team members, albums, tracks, music player, concert schedule, videos, similar artists, news and gossip from TMZ.
  • Follow the San Francisco 49ers? Players, news from Yahoo Sports and other sources, official NFL videos and team profiles, tickets, and the official NFL standings widget.


In the examples above, I'm especially pleased about the way Kosmix picks great niche sources for topics. For example, I hadn't heard about chow.com or known that Martha Stewart has how-to videos on her website. Other "gems" of this kind include Jambase, TMZ, The Onion, DailyPlate, MamaHerb, and Wonkette. Part of the goal of Kosmix is to bring you such gems: information sources or sites you have either not heard of, or just not thought about in the current context.

In other words: Google = Search + Find. Kosmix = Explore + Browse.  Browsing sometimes uncovers surprising connections that you might not even have thought about. The power of the model was brought home to me last week as I was traveling around in England. I'd heard a lot about Stonehenge and wanted to visit; so of course I went to the Kosmix topic page on Stonehenge. In addition to the usual comprehensive overview of Stonehenge, the topic page showed me places to stay in Bath, Somerset (which happens to be the best place to stay when you're visiting Stonehenge). It also showed me other ancient monuments in the same area I could visit while I was there. Score one for serendipity. 

Some of us remember the early days of the World Wide Web: the thrill of just browsing around, following links, and discovering new sites that surprise, entertain, and sometimes even inform. We have lost some of that joy now with our workmanlike use of search engines for precision-guided information finding. We built the new Kosmix homepage to capture some of the pleasure of aimless browsing -- exploring for pure pleasure. The homepage shows you the hot news, topics, videos, slide shows, and gossip of the moment. If you find something interesting you can dive right in and start browsing around that topic. We compile this page in the same manner as our topic pages: by aggregating information for many other sources and then applying a healthy dose of algorithms. Dig in; who knows what surprises await?

How does Kosmix work its magic? As I wrote when we put out the alpha, the problem we're solving is fundamentally different from search, and we've taken a fundamentally different approach. The web has evolved from a collection of documents that neatly fit in a search engine index to a collection of rich interactive applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on-the-fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases. The secret sauce is our algorithmic categorization technology. Given a topic, categorization tells us where the topic fits in a really big taxonomy, what the related topics are, and so on. In turn, other algorithms use this information to figure out the right set of information sources for a topic from among the thousands we know about. And then other algorithms figure out how to lay the information on the page in a 2-dimensional grid.
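
To make the flow a little more concrete, here is a toy Python sketch of the general idea: categorize a topic, pick a handful of sources based on the category, query them, and arrange the resulting modules in a grid. It is purely illustrative; the categorizer, the source registry, and the layout rule below are all made up and are not the actual Kosmix implementation.

```python
# Toy illustration only: the categorizer and fetchers are stand-ins, not real APIs.
def categorize(topic):
    # A real system places the topic in a very large taxonomy; we fake it here.
    if "cake" in topic or "recipe" in topic:
        return "Food"
    if "diabetes" in topic:
        return "Health"
    return "General"

SOURCES_BY_CATEGORY = {   # hypothetical source names per category
    "Food":    ["recipes", "nutrition", "howto_videos"],
    "Health":  ["doctor_guides", "risk_calculators", "community"],
    "General": ["news", "reference", "videos"],
}

def fetch(source, topic):
    # Stand-in for querying a partner API; returns one content module.
    return {"source": source, "topic": topic, "items": ["..."]}

def build_topic_page(topic, columns=3):
    modules = [fetch(s, topic) for s in SOURCES_BY_CATEGORY[categorize(topic)]]
    # Lay the modules out on a simple 2-dimensional grid, row by row.
    return [modules[i:i + columns] for i in range(0, len(modules), columns)]

page = build_topic_page("chocolate cake recipe")
```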

While we are proud of what we have built, we know there is still a long way to go. And we cannot do it without your feedback. So join the USS Kosmix on our maiden voyage. Our mission: to explore strange new topics; to discover surprising new connections; to boldly go where no search engine has gone before!

Update: Vijay Chittoor has posted more details on the new product features on the Kosmix blog. Coverage on TechCrunch, GigaOM, VentureBeat. I'm particularly pleased that Om Malik thinks his page on Kosmix is better than the bio on his site!

December 08, 2008 in Data Mining, kosmix, Search | Permalink | Comments (3) | TrackBack (0)

Bridging the Gap between Relational Databases and MapReduce: Three New Approaches

Popularized by Google, the MapReduce paradigm has proven to be a powerful way to analyze large datasets by harnessing the power of commodity clusters. While it provides a straightforward computational model, the approach suffers from certain key limitations, as discussed in a prior post:

  • The restriction to a rigid data flow model (Map followed by Reduce). Sometimes you need other flows, e.g., map-reduce-map, union-map-reduce, join-reduce.
  • Common data analysis operations, which database systems provide as primitives, need to be recoded by hand each time in Java or C/C++: e.g., join, filter, common aggregates, group by, union, distinct (for a taste of what hand-coding a join involves, see the sketch just after this list).
  • The programmer has to hand-optimize the execution plan, for example by deciding how many map and reduce nodes are needed. For complex chained flows, this can become a nightmare. Databases provide query optimizers for this purpose -- the precise sequence of operations is decided by the optimizer rather than by a programmer.
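
To give a flavor of the second point, here is what a simple equi-join looks like when coded by hand in the MapReduce style. This is a pure-Python simulation of the map, shuffle, and reduce phases, not actual Hadoop code; in SQL or Pig Latin the same join is a one-liner.

```python
from collections import defaultdict
from itertools import product

# Two datasets to join on user_id: (user_id, name) and (user_id, purchase).
users = [(1, "alice"), (2, "bob")]
purchases = [(1, "book"), (1, "lamp"), (2, "mug")]

def map_phase(records, tag):
    # Emit (join_key, (tag, value)) so the reducer can tell the two sides apart.
    for key, value in records:
        yield key, (tag, value)

def shuffle(mapped):
    # Group all values by join key, as the MapReduce framework would.
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups

def reduce_join(key, tagged_values):
    # Cross-product the two sides for this key to produce joined rows.
    left = [v for tag, v in tagged_values if tag == "users"]
    right = [v for tag, v in tagged_values if tag == "purchases"]
    for name, item in product(left, right):
        yield key, name, item

mapped = list(map_phase(users, "users")) + list(map_phase(purchases, "purchases"))
for key, values in shuffle(mapped).items():
    for row in reduce_join(key, values):
        print(row)
```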

Three approaches have emerged to bridge the gap between relational databases and Map Reduce. Let's examine each approach in turn and then discuss their pros and cons.

The first approach is to create a new higher-level scripting language that uses Map and Reduce as primitive operations. Using such a scripting language, one can express operations that require multiple map reduce steps, together with joins and other set-oriented data processing operations. This approach is exemplified by Pig Latin, being developed by a team at Yahoo. PigLatin provides primitive operations that are commonly found in database systems, such as Group By, Join, Filter, Union, ForEach, and Distinct. Each PigLatin operator can take a User Defined Function (UDF) as a parameter.

The programmer creates a script that chains these operators to achieve the desired effect. In effect, the programmer codes by hand the query execution plan that might have been generated by a SQL engine. The effect of a single Map Reduce can be simulated by a Filter step followed by a Group By step. In many common cases, we don't even need to use UDFs, if the filtering and grouping criteria are straightforward ones that are supported in PigLatin. The PigLatin engine translates each script into a sequence of jobs on a Hadoop cluster. The PigLatin team reports that 25% of Hadoop jobs on Yahoo today originate as PigLatin scripts. That's impressive adoption.

Another interesting solution in this category is Sawzall, a scripting language developed at Google. Sawzall allows map reduce operations to be coded using a language that is reminiscent of awk. If your computation fits the Sawzall model, the code is much shorter and more elegant than C/C++/Java Map and Reduce functions. Sawzall, however, suffers from two drawbacks: it limits the programmer to a predefined set of aggregations in the Reduce phase (although it supplies a big library of these); and, unlike PigLatin, it offers no support for data analysis that goes beyond a single Map Reduce step. Most important, Sawzall is not available outside of Google, while PigLatin has been open-sourced by Yahoo.

The second approach is to integrate Map Reduce with a SQL database. Two database companies have recently announced support for MapReduce: Greenplum and Aster Data. Interestingly, they have taken two very different approaches. I will call Greenplum's approach "loose coupling" and Aster Data's approach "tight coupling". Let's examine each in turn.

Greenplum's loose-coupling approach ties together Greenplum's database with Hadoop's implementation of Map Reduce. A Hadoop Map Reduce operation is visible as a database view within Greenplum's SQL interpreter. Conversely, Hadoop map and reduce functions can access data in the database by iterating over the results of database queries. Issuing a SQL query that uses a map-reduce view will launch the corresponding map-reduce operation, whose results can then be processed by the rest of the SQL query.

Aster Data's tight-coupling approach is more interesting: the database natively supports map reduce, with no need for Hadoop. Map and reduce functions can be written in a variety of programming languages (C/C++, Java, Python). Aster has extended SQL itself to specify how these functions are invoked, creating a new SQL dialect called SQL/MR. One of the cool features is that map and reduce functions are automatically polymorphic, just like native SQL functions such as SUM and COUNT: the programmer can write them once, and the database engine can invoke them on rows with different numbers of columns and with columns of different types. This is a huge convenience over the Hadoop approach.

What are the pros and cons of these three different approaches? The advantage of the Pig Latin approach is that it works directly at the file level, and therefore it can express MapReduce computations that don't fit the relational data model. An example of such an operation is building an inverted index on a collection of text documents. Databases in general are bad at handling large text and image data, which are treated as "blobs."

The biggest disadvantage of the PigLatin approach is the need to learn an entirely new programming language. There is a large group of developers and DBAs familiar with SQL, and PigLatin does not have this support base. The second disadvantage is that the developer has to code query execution plans by hand, while SQL programmers can rely on two decades of work on query optimizers, which automatically decide the order of operations, the degree of parallelism, and when to use indexes.

The advantages and disadvantages of the SQL integration approach in general mirror those of the Pig Latin approach. The loose coupling approach of Greenplum allows the use of files as well as relations, and therefore in principle supports file-based computations. The burden is on the application programmer, however, to decide on the scheduling and optimization of the Hadoop portion of the computation, without much help from the database.

Aster's tight-coupling approach, on the other hand, allows a much greater degree of automatic query optimization. The database system is intimately involved in the way map and reduce operations are scheduled across the cluster, and can decide on the degree of parallelism as well as use strategies such as pipelining across map reduce and relational operators. In addition, since the database system is solely in charge of overall resource allocation and usage, it can sandbox user-defined code, preventing it from consuming too many resources and slowing down other tasks. For computations that use only data in the relational database, Aster has by far the most elegant solution; the weakness, of course, is that data stored outside the database is off-limits.

Update: Tassos Argyros from Aster Data points out that Aster's implementation does in fact allow access to data stored outside the database. The developer needs to write a UDF that exposes the data to the database engine.

All three approaches thus have their strengths and weaknesses. It's exciting to see the emergence of fresh thinking on data analytics, going beyond the initial file-oriented Map Reduce model. Over time, these approaches will evolve, borrowing learnings from one another. Eventually one of them will become the dominant paradigm for data analytics; I will be watching this space with great interest.

Disclosure: I'm an investor in Aster Data and sit on their Board of Directors. 

September 05, 2008 in Data Mining | Permalink | Comments (14) | TrackBack (0)

Change the algorithm, not the dataset

Mayank Bawa over at the Aster Data blog has posted a great riff on one of my favorite themes: using simple algorithms to analyze large volumes of data rather than more sophisticated algorithms that cannot scale to large datasets.

Often, we have a really cool algorithm (say, support-vector machines or singular value methods) that works only on main-memory datasets. In such cases, the only possibility is to reduce the data set to a manageable size through sampling. Mayank's post illustrates the dangers of such sampling: in businesses such as advertising, sampling can make the pattern you're trying to extract so weak that even the more powerful algorithm cannot pick it up.

For example, say 0.1% of users exhibit a certain kind of behavior. If you start with 100 million users, and then take a 1% sample, you might think you are OK because you still have 1 million users in your sample. But now just 1000 users in the sample exhibit the desired behavior, which may be too small for any algorithm to pick apart from the noise. In fact, it's likely to be below the support thresholds of most algorithms. The problem is, these 0.1% of users might represent a big unknown revenue opportunity.
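
The arithmetic fits in a few lines of Python; the population and rates below are the ones from the example, and the point is simply how small the absolute signal becomes after sampling.

```python
population = 100 * 1000 * 1000   # users in the full dataset
behavior_rate = 0.001            # 0.1% of users exhibit the behavior
sample_rate = 0.01               # keep a 1% sample

in_full_data = int(population * behavior_rate)               # 100,000 users
in_sample = int(population * sample_rate * behavior_rate)    # ~1,000 users, in expectation

print("users with the behavior in the full data: %d" % in_full_data)
print("users with the behavior in the 1%% sample: %d" % in_sample)
```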

Moral of the story: use the entire data set, even if it is many terabytes. If your algorithm cannot handle a dataset that large, then change the algorithm, not the dataset.

June 12, 2008 in Data Mining | Permalink | Comments (5) | TrackBack (0)

How Google Measures Search Quality

This post continues my prior post, Are Machine-Learned Models Prone to Catastrophic Errors? You can think of the two as a series based on my conversation with Peter Norvig. As that post describes, Google has not cut over to a machine-learned model for ranking search results, preferring a hand-tuned formula. Many of you wrote insightful comments on the topic; here I'll give my take, based on some other insights I gleaned during our conversation.

The heart of the matter is this: how do you measure the quality of search results? One of the essential requirements for training any machine learning model is a set of observations (in this case, queries and results) tagged with "scores" that measure the goodness of the results. (Technically this requirement applies only to so-called "supervised learning" approaches, but those are the ones we are discussing here.) Where does this data come from?

Given Google's massive usage, the simplest way to get this data is from real users. Try different ranking models on small percentages of searches, and collect data on how users interact with the results. For example, how does a new ranking model affect the fraction of users who click on the first result? The second? How many users click through to page 2 of results? Once a user clicks out to a result page, how long before they hit the back button to return to the search results page?
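
Here is a rough sketch of how metrics like these might be computed from a click log. The log format below (a list of search sessions with clicked positions, paging behavior, and dwell times) is entirely made up for illustration; it is not Google's.

```python
# Each session records which result positions were clicked, whether the user
# paged to page 2, and how long (seconds) they stayed before hitting "back".
sessions = [
    {"clicked": [1],    "went_to_page2": False, "dwell_seconds": [45]},
    {"clicked": [3, 1], "went_to_page2": False, "dwell_seconds": [8, 120]},
    {"clicked": [],     "went_to_page2": True,  "dwell_seconds": []},
]

n = float(len(sessions))
first_result_ctr = sum(1 for s in sessions if 1 in s["clicked"]) / n
page2_rate = sum(1 for s in sessions if s["went_to_page2"]) / n
dwells = [d for s in sessions for d in s["dwell_seconds"]]
avg_dwell = sum(dwells) / len(dwells)

print(first_result_ctr, page2_rate, avg_dwell)
```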

Peter confirmed that Google does collect such data, and has scads of it stashed away on their clusters. However -- and here's the shocker -- these metrics are not very sensitive to new ranking models! When Google tries new ranking models, these metrics sometimes move, sometimes not, and never by much. In fact Google does not use such real usage data to tune their search ranking algorithm. What they really use is a blast from the past. They employ armies of "raters"  who rate search results for randomly selected "panels" of queries using different ranking algorithms. These manual ratings form the gold-standard against which ranking algorithms are measured -- and eventually released into service.

It came as a great surprise to me that Google relies on a small panel of raters rather than harness their massive usage data. But in retrospect, perhaps it is not so surprising. Two forces appear to be at work. The first is that we have all been trained to trust Google and click on the first result no matter what. So ranking models that make slight changes in ranking may not produce significant swings in the measured usage data. The second, more interesting, factor is that users don't know what they're missing.

Let me try to explain the latter point. There are two broad classes of queries search engines deal with:

  • Navigational queries, where the user is looking for a specific uber-authoritative website. e.g., "stanford university". In such cases, the user can very quickly tell the best result from the others -- and it's usually the first result on major search engines.
  • Informational queries, where the user has a broader topic in mind, e.g., "diabetes pregnancy". In this case, there is no single right answer. Suppose there's a really fantastic result on page 4 that provides better information than any of the results on the first three pages. Most users will never even know this result exists! Their usage behavior therefore does not provide the best feedback on the rankings.

Such queries are one reason why Google has to employ in-house raters, who have been instructed to look at a wider window than the first 10 results. But even such raters can only look at a restricted window of results. And using such raters also makes the training set much, much smaller than could be gathered from real usage data. This fact might explain Google's reluctance to fully trust a machine-learned model. Even tens of thousands of professionally rated queries might not be sufficient training data to capture the full range of queries that are thrown at a search engine in real usage. So there are probably outliers (i.e., black swans) that might throw a machine-learned model way off.

I'll close with an interesting vignette. A couple of years ago, Yahoo was making great strides in search relevance, while Google apparently was not improving as fast. Recall that at the time, Yahoo trumpeted data showing their results were better than Google's. Well, the Google team was quite amazed, because their data showed just the opposite: their results were better than Yahoo's. They couldn't both be right -- or could they? It turns out that Yahoo's benchmark contained queries drawn from Yahoo search logs, and Google's benchmark likewise contained queries drawn from Google search logs. The Yahoo ranking algorithm performed better on the Yahoo benchmark, and the Google algorithm performed better on the Google benchmark.

Two learnings from this story: one, the results depend quite strongly on the test set, which again speaks against machine-learned models. And two, Yahoo and Google users differ quite significantly in the kinds of searches they do. Of course, this was a couple of years ago, and both companies have evolved their ranking algorithms since then.

June 11, 2008 in Data Mining, Search | Permalink | Comments (15) | TrackBack (0)

Are Machine-Learned Models Prone to Catastrophic Errors?

A couple of days ago I had coffee with Peter Norvig. Peter is currently Director of Research at Google. For several years until recently, he was the Director of Search Quality -- the key man at Google responsible for the quality of their search results. Peter also is an ACM Fellow and co-author of the best-selling AI textbook Artificial Intelligence: A Modern Approach. As such, Peter's insights into search are truly extraordinary.

I have known Peter since 1996, when he joined a startup called Junglee, which I had started together with some friends from Stanford. Peter was Chief Scientist at Junglee until 1998, when Junglee was acquired by Amazon.com. I've always been a great admirer of Peter and have kept in touch with him through his short stint at NASA and then at Google. He's now taking a short leave of absence from Google to update his AI textbook. We had a fascinating discussion, and I'll be writing a couple of posts on topics we covered.

It has long been known that Google's search algorithm actually works at 2 levels:
  1. An offline phase that extracts "signals" from a massive web crawl and from usage data. An example of such a signal is page rank. These computations need to be done offline because they analyze massive amounts of data and are time-consuming. Because these signals are extracted offline, and not in response to user queries, they are necessarily query-independent. You can think of them as tags on the documents in the index. There are about 200 such signals.
  2. An online phase, in response to a user query. A subset of documents is identified based on the presence of the user's keywords. Then, these documents are ranked by a very fast algorithm that combines the 200 signals in-memory using a proprietary formula.
The online, query-dependent phase appears to be made-to-order for machine learning algorithms. Tons of training data (both from usage and from the armies of "raters" employed by Google), and a manageable number of signals (200) -- these fit the supervised learning paradigm well, bringing into play an array of ML algorithms from simple regression methods to Support Vector Machines. And indeed, Google has tried methods such as these. Peter tells me that their best machine-learned model is now as good as, and sometimes better than, the hand-tuned formula on the results quality metrics that Google uses.
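
To make the supervised setup concrete, here is a minimal sketch with made-up signal values and rater scores; a plain least-squares fit stands in for the far more sophisticated models a search engine would actually use, and none of the numbers or weights below are Google's.

```python
import numpy as np

# Each row holds a few signal values for one (query, document) pair; in reality
# there would be ~200 signals and a vast number of rows. All numbers are made up.
X = np.array([
    [0.9, 0.2, 0.7],
    [0.4, 0.8, 0.1],
    [0.6, 0.5, 0.5],
    [0.1, 0.9, 0.3],
])
rater_scores = np.array([4.5, 2.0, 3.5, 1.5])   # gold-standard ratings

# Hand-tuned formula: an expert picks the weights directly.
hand_weights = np.array([3.0, -1.0, 2.0])
hand_scores = X @ hand_weights

# Machine-learned formula: fit the weights to the rater scores.
learned_weights, *_ = np.linalg.lstsq(X, rater_scores, rcond=None)
learned_scores = X @ learned_weights

print(hand_weights, learned_weights)
```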

The big surprise is that Google still uses the manually-crafted formula for its search results. They haven't cut over to the machine learned model yet. Peter suggests two reasons for this. The first is hubris: the human experts who created the algorithm believe they can do better than a machine-learned model. The second reason is more interesting. Google's search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.

This raises a fundamental philosophical question. If Google is unwilling to trust machine-learned models for ranking search results, can we ever trust such models for more critical things, such as flying an airplane, driving a car, or algorithmic stock market trading? All machine learning models assume that the situations they encounter in use will be similar to their training data. This, however, exposes them to the well-known problem of induction in logic.

The classic example is the Black Swan, popularized by Nassim Taleb's eponymous book. Before the 17th century, the only swans encountered in the Western world were white. Thus, it was reasonable to conclude that "all swans are white." Of course, when Australia was discovered, so were the black swans living there. Thus, a black swan is a shorthand for something unexpected that is outside the model.

Taleb argues that black swans are more common than commonly assumed in the modern world. He divides phenomena into two classes:
  1. Mediocristan, consisting of phenomena that fit the bell curve model, such as games of chance, height and weight in humans, and so on. Here future observations can be predicted by extrapolating from variations in statistics based on past observation (for example, sample means and standard deviations).
  2. Extremistan, consisting of phenomena that don't fit the bell curve model, such as search queries, the stock market, the length of wars, and so on. Such phenomena can sometimes be modeled using power laws or fractal distributions, and sometimes not. In many cases, the very notion of a standard deviation is meaningless.
Taleb makes a convincing case that most real-world phenomena we care about actually inhabit Extremistan rather than Mediocristan. In these cases, you can make quite a fool of yourself by assuming that the future looks like the past.
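
A quick simulation illustrates the difference, with a Gaussian standing in for Mediocristan and a heavy-tailed Pareto standing in for Extremistan (the parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000000

gaussian = rng.normal(loc=170, scale=10, size=n)   # e.g., heights in cm
pareto = (rng.pareto(1.1, size=n) + 1) * 1000      # heavy-tailed, e.g., wealth

for name, data in [("Mediocristan", gaussian), ("Extremistan", pareto)]:
    print(name,
          "max/mean = %.1f" % (data.max() / data.mean()),
          "largest observation's share of total = %.4f" % (data.max() / data.sum()))
```

In the Gaussian case the largest observation sits barely above the mean and contributes a negligible share of the total; in the Pareto case a single observation can dominate the sum, so statistics extrapolated from past samples say very little about the next observation.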

The current generation of machine learning algorithms can work well in Mediocristan but not in Extremistan. The very metrics these algorithms use, such as precision, recall, and root-mean-square error (RMSE), make sense only in Mediocristan. It's easy to fit the observed data and still fail catastrophically on unseen data. My hunch is that humans have evolved to use decision-making methods that are less likely to blow up on unforeseen events (although not always, as the mortgage crisis shows).

I'll leave it as an exercise to the interested graduate student to figure out whether new machine learning algorithms can be devised that work well in Extremistan, or prove that it cannot be done.

May 24, 2008 in Data Mining, Search | Permalink | Comments (28) | TrackBack (0)

Why the World Needs a New Database System

The internet and its usage are generating vast amounts of data. The volume of new data being generated today is unprecedented in human history. Much insight might be gleaned by intelligently mining this trove; yet most companies today archive most of their data and process only a tiny sliver of it. For example, most internet companies I know with significant traffic archive all usage data older than a couple of months. Who knows what gems of insight they're missing?

Over the past few years, the hardware platform that can tackle very high data volumes has emerged: a cluster of commodity Linux nodes interconnected by commodity gigabit Ethernet. Google popularized the platform and introduced Map Reduce, which today is the best tool available for serious data crunching on a massive scale. The Hadoop project makes Map Reduce available to the rest of us.

Yet the Map Reduce paradigm has its limitations. The biggest problem is that it involves writing code for each analysis, which limits the number of companies and people that can use the paradigm. The second problem is that joins of different data sets are hard. The third problem is that Map Reduce works on files and produces files; after a while the number of files multiplies and it becomes difficult to keep track of things. What's lacking is a metadata layer, such as the catalog in database systems. Don't get me wrong; I love Map Reduce, and there are applications that don't need these things, but increasingly there are applications that do.

Database systems have been the workhorses of conventional (small to medium scale) data analysis applications for a couple of decades now. They have some advantages over Map Reduce. SQL is a standard for ad-hoc data analysis without actually writing code. There's a big ecosystem around SQL: people familiar with SQL, tools that work with SQL, application frameworks, and so on. Joins are straightforward in SQL, and databases provide implementations such as hash joins that scale well to big data. There is a large body of knowledge around query optimization that allows declaratively specified SQL queries to be optimized for execution in nontrivial ways. There is support for concurrency, updates, and transactions. There is a metadata catalog that allows people and programs to understand the contents of the database.

No one has been able to make database systems scale to the data volumes we are seeing today at an affordable price point. Sure, there is Teradata, with their custom designed hardware/software solution. They are the only serious game in town for sophisticated analysis of data in the terabytes. But the entry price point is in the several millions of dollars, and if your data size increases, the price per incremental terabyte is also at nosebleed levels. Also, proprietary hardware ignores the inexorable logic of Moore's law.

What the world needs is a database system natively designed and architected from the ground up for the new hardware platform: commodity clusters. Such a system greatly expands the number of companies and people that can do data analysis on terabytes of data. I'm delighted that Aster Data launched exactly such a database system this week. I led the seed round investment in Aster Data and sit on their board, so I've been waiting for this launch with great anticipation.

Aster comes with impressive testimonials from early customers. MySpace has been using Aster for several months now to analyze their usage data. Their data set has grown from 1 terabyte to 100 terabytes in just 100 days -- adding 1 terabyte a day. No other database installation in the world has grown at this rate. The growth is enabled by the commodity hardware platform underneath -- just add a server or a rack as the data grows. Another early adopter is Aggregate Knowledge, whose dataset has also grown at an impressive rate.

There's an interesting backstory to my involvement with Aster. When I taught the Distributed Databases course at Stanford in 2004, my TA was a bright young PhD student named Mayank Bawa. I was very impressed with Mayank's intellectual abilities and his can-do attitude. While working on the course we got to discussing how there is a shift happening in hardware platforms, but no database systems vendor has created a system targeting the new hardware platform. A few months later, Mayank told me he intended to start a company to build just such a database, along with two other Stanford graduate students: Tassos Argyros and George Candea. It was a brain dead simple investment decision for me.

Mayank, Tassos and George are among the best thinkers I know on the topic of large scale data analytics. They have been exceedingly thoughtful in the architecture of their new database. They have some really nifty stuff in there, including incremental scalability, intelligent data partitioning, and query optimization algorithms. One of the coolest demos I've seen is when they run a large query that touches terabytes of data, and then kill several of the nodes in the system that are actively processing the query. The system transparently migrates query processing, and out pop the results. This kind of behavior is essential when you're dealing with queries that can take several minutes to hours to execute, increasing the likelihood of failures during query processing, especially when using commodity nodes.

The LAMP stack, with MySQL as the base, has transformed and democratized web application development. In a similar vein, I expect that we will see the emergence of a stack that democratizes large-scale data analytics applications. Aster Data could well be the foundation of that stack.

May 22, 2008 in Data Mining, Internet Infrastructure | Permalink | Comments (10) | TrackBack (0)

Data, Algorithms, and Powerset

Powerset finally launched their search engine last night. It has been over two years in the making, and there has been much hype surrounding the company's natural language technology.

The new search engine has received mixed reviews. For example, while Mike Arrington at TechCrunch fell in love at first sight, the comments following the post tell a very different story. The most balanced review I've seen so far is from Danny Sullivan at Search Engine Land.

Disclosure:
My company Kosmix also builds search technology, albeit of a very different kind. I don't consider Kosmix in any way to be a PowerSet competitor, and in this post I'm wearing my blogger hat and not my Kosmix hat. Moreover, this post is not really about PowerSet at all, but about data and algorithms, using PowerSet as the example du jour.

To boil down both my personal experience playing with PowerSet and what I'm seeing in the reviews: some of the features (like the Factz) are cool. The biggest weakness is that PowerSet today searches only two data sets: Wikipedia and Freebase. And while these are both very large and very useful datasets, they're not nearly enough to answer many real-world user queries.

The TechCrunch comments and Danny Sullivan's posts both contain examples that point out the strengths and weaknesses of PowerSet's search. Here's an example from my own personal experience: in my previous post, I used the phrase "new wine in an old bottle". I was curious about the origin of this phrase (since I had also heard people say "old wine in a new bottle" -- which came first?) So I typed the search "origin of expression new wine in old bottle" into both PowerSet and Google. Google nailed it in the first result (from Yahoo Answers), while Powerset was lost. Ditto for "How old was Gandhi when he was killed?"

While people can argue about the applicability of natural language processing to search, to the quality of PowerSet's implementation, and so on, I have a much simpler point of view. I believe PowerSet has some pretty cool IP in its algorithms and has done as good a job as they possibly can with it. The problem is, they don't have enough data to work with. (Another way of saying this is, PowerSet's index is a SubSet of Google's index.)

The primary reason Google's search is useful is that there is lots and lots of data on the web, and Google indexes so much of it. Yes, Google's search algorithms are fantastic, but they wouldn't work anywhere near as well if they didn't have so much data underneath them. Consider my search about new wine and old bottles. The reason Google nails it is that there's a page on the web that uses the exact phrase I typed into the search box. Same for the Gandhi example. The cool thing for Google is that people who search often use phrases in the same way as people who write web pages (especially on community sites such as Yahoo Answers). Instead of doing any NLP, Google just needs to index a really huge corpus of text and look for near-exact matches.

To use a phrase from an earlier post on this blog, "More data usually beats better algorithms". And nowhere is that more true than it is in search. (The natural question to ask is, why not both more data and better algorithms? The answer in this case is that many of the techniques seem to rely on the structure and general cleanliness of Wikipedia and Freebase, so it's not clear how well they will scale to the web as a whole.)

What of PowerSet? I for one really love their Wikipedia search, and will use them to search Wikipedia and Freebase. As Danny Sullivan points out in his post, perhaps the right business model for PowerSet lies in enterprise search rather than in web search.

May 12, 2008 in Data Mining, Search | Permalink | Comments (5) | TrackBack (0)

More data beats better algorithm at predicting Google earnings

Readers of this blog will be familiar with my belief that more data usually beats better algorithms. Here's another proof point.

Google announced earnings today, and it was a shocker -- for most of Wall Street, which was in a tizzy based on ComScore's report that paid clicks grew by a mere 1.8% year-over-year. In the event, paid clicks grew by a healthy 20% from last year and revenue grew by 30%.

In comparison, SEM optimizer Efficient Frontier released their Search Performance Report on their blog a few hours ahead of Google's earnings call. EF manages the SEM campaigns of some of the largest direct marketers, handling more SEM spend than anyone in the world outside of the search engines themselves. Their huge volumes of data give them more insight into Google's marketplace than anyone outside of Google.

EF reported a 19.2% increase in paid clicks and 11.2% increase in CPCs at Google Y-O-Y. Do the math (1.192*1.112 = 1.325), that's a 32.5% Y-O-Y revenue increase. That's the closest anyone got to the real numbers!  And this quarter is not a flash in the pan: in January, EF reported a 29% Y-O-Y increase in SEM spend, with 97% of the increased spend going to Google: that is, about a 28% Y-O-Y revenue increase for Google. That compares very favorably with the actual reported increase of 30%.

As Paul Kedrosky points out, this is a huge indictment of ComScore's methodology (ComScore's shares are trading down 8% after hours following the Google earnings call). ComScore sets great store by its "panel-based" approach, which collects data from a panel of users, much as Nielsen measures TV viewing using data from a few households that have its set-top boxes installed. ComScore has been in this business longer than anyone else, and has arguably the best methodology (i.e., algorithm) in town to analyze the data. They're just not looking at the right data, or enough of it. Some simple math using the mountain of data from EF handily beats an analysis methodology refined over several years on data from a not-so-large panel.

To my mind, this also puts in doubt the validity of ComScore's traffic measurement numbers. For websites where I personally know the numbers (based on server logs), both Quantcast and Hitwise come far closer to reality than ComScore. The latter two don't rely as heavily on a small panel. ComScore's value today is largely driven by the fact that advertisers and ad agencies trust their numbers more than the upstarts. Advertiser inertia will carry them for a while; but a few more high-profile misses could change that quickly.

Disclosure: Cambrian Ventures is an investor in EF. However, I don't have access to any information beyond that published in their public report.

April 18, 2008 in Advertising, Data Mining, Search | Permalink | Comments (6) | TrackBack (0)

More data usually beats better algorithms, Part 2

The post More Data Beats Better Algorithms generated a lot of interest and comments. Since there are too many comments to address individually, I'm addressing some of them in this post.

1. Why should we have to choose between data and algorithms? Why not have more data and better algorithms?

A. There are two parts to the answer. The first is resource limits: if, for example, you are a startup that has to choose where to invest resources -- in gathering or licensing data, or in hiring the next PhD -- then I think beyond a certain level you are better off getting more data. If you are a startup that doesn't need to make that choice, more power to you.

The second reason is more subtle, and has to do with the structure and computational complexity of algorithms.  To a first approximation, simple data mining algorithms (e.g., page rank; market baskets; clustering)  take time roughly linear in the size of their input data, while more complicated machine learning algorithms take time that is quadratic or even cubic in the size of their input data (e.g., Support Vector Machines; Singular Value Decomposition). As a simple rule of thumb, simple linear algorithms scale to large data sets (hundreds of gigabytes to terabytes), while quadratic and cubic algorithms cannot scale and therefore need some data size reduction as a preprocessing step.

For those familiar with Computer Science basics, scalable algorithms involve only  a fixed number of sequential scans and sorts of data (since large data sets must necessarily reside on disk and not RAM). Most algorithms that require random access to data or take time greater than O(N log N) are not scalable to large data sets. For example, they cannot easily be implemented using methods such as MapReduce. Thus, choosing a more complex algorithm can close the door to using large data sets, at least at reasonable budgets that don't involve terabytes of RAM.
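
To make the distinction concrete, here is a toy contrast between a computation that needs only one sequential scan and one that needs all-pairs comparisons; only the first kind survives contact with a dataset that lives on disk rather than in RAM.

```python
import itertools

def streaming_mean(records):
    """One sequential scan, constant memory: scales to data of any size."""
    total, count = 0.0, 0
    for x in records:
        total += x
        count += 1
    return total / count

def all_pairs_max_distance(records):
    """O(N^2) comparisons: fine for thousands of points, hopeless for billions."""
    return max(abs(a - b) for a, b in itertools.combinations(records, 2))

data = [3.0, 7.5, 1.2, 9.9]
print(streaming_mean(data))
print(all_pairs_max_distance(data))
```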

2. Does your argument hold only for the addition of an additional independent data set (as in the IMDB/Netflix case), or does it hold also when we add more of the same kind of data (e.g., more user ratings in the case of Netflix)?

Adding independent data usually makes a huge difference. For example, in the case of web search, Google made a big leap by adding links and anchor text, which are independent data sets from the text of web pages. In the case of AdWords, the CTR  data was an independent data set from the bid data. And Google went even a step further: they became a domain registrar so they could add even more data about domain ownership and transfers into their ranking scheme. Google consistently has believed and bet on more data, while trumpeting the power of their algorithms.

You might think that adding more of the same data won't help much, because diminishing returns would set in. This is true in some cases; however, in many important cases, adding more of the same data makes a bigger difference than you'd think. These are cases where the application sees the data embedded in a very high-dimensional space. For example, in the case of Netflix, we can think of the dimensions as users (or movies); and in the case of web search, the dimensions are k-grams of terms.

In such cases, the available data is usually very sparse and has a long-tailed distribution. For example, in the Netflix example, we have ratings available for less than 1% of all movie-user pairs. We can add lots of ratings data before diminishing returns becomes a real issue -- say add 10 times more. In such cases, adding lots more of the same kind of data makes a big difference.

3. Can you provide more specifics about the algorithms used in the student projects?

I don't plan to do so at this time, because I don't have permission from the student teams and don't wish to jeopardize their chances in the Netflix competition.

April 03, 2008 in Data Mining | Permalink | Comments (11) | TrackBack (0)
