Data, Algorithms, and Powerset

Powerset finally launched their search engine last night. It's been over two years in the making, and there has been much hype surrounding the company's natural language technology.

The new search engine has received mixed reviews. For example, while Mike Arrington at TechCrunch fell in love at first sight, the comments following the post tell a very different story. The most balanced review I've seen so far is from Danny Sullivan at Search Engine Land.

Disclosure:
My company Kosmix also builds search technology, albeit of a very different kind. I don't consider Kosmix in any way to be a PowerSet competitor, and in this post I'm wearing my blogger hat and not my Kosmix hat. Moreover, this post is not really about PowerSet at all, but about data and algorithms, using PowerSet as the example du jour.

To boil down both my personal experience playing with PowerSet and what I'm seeing in the reviews: some of the features (like the Factz) are cool. The biggest weakness is that PowerSet today searches only two data sets: Wikipedia and Freebase. And while these are both very large and very useful datasets, they're not nearly enough to answer many real-world user queries.

The TechCrunch comments and Danny Sullivan's posts both contain examples that point out the strengths and weaknesses of PowerSet's search. Here's an example from my own personal experience: in my previous post, I used the phrase "new wine in an old bottle". I was curious about the origin of this phrase (since I had also heard people say "old wine in a new bottle" -- which came first?) So I typed the search "origin of expression new wine in old bottle" into both PowerSet and Google. Google nailed it in the first result (from Yahoo Answers), while Powerset was lost. Ditto for "How old was Gandhi when he was killed?"

While people can argue about the applicability of natural language processing to search, about the quality of PowerSet's implementation, and so on, I have a much simpler point of view. I believe PowerSet has some pretty cool IP in its algorithms and has done as good a job as it possibly can with it. The problem is, they don't have enough data to work with. (Another way of saying this is, PowerSet's index is a SubSet of Google's index.)

The primary reason Google's search is useful is that there is lots and lots of data on the web, and Google indexes so much of it. Yes, Google's search algorithms are fantastic, but they wouldn't work anywhere near as well if they didn't have so much data underneath them. Consider my search about new wine and old bottles. The reason Google nails it is that there's a page on the web that uses the exact phrase I typed into the search box. Same for the Gandhi example. The cool thing for Google is, people who search often use phrases in the same way as people who write web pages (especially on community sites such as Yahoo Answers). Instead of doing any NLP, Google just needs to index a really huge corpus of text and look for near-exact matches.

To use a phrase from an earlier post on this blog, "More data usually beats better algorithms". And nowhere is that more true than it is in search. (The natural question to ask is, why not both more data and better algorithms? The answer in this case is that many of the techniques seem to rely on the structure and general cleanliness of Wikipedia and Freebase, so it's not clear how well they will scale to the web as a whole.)

What of PowerSet? I for one really love their Wikipedia search, and will use them to search Wikipedia and Freebase. As Danny Sullivan points out in his post, perhaps the right business model for PowerSet lies in enterprise search rather than in web search.

Why Yahoo Glue is a Bigger Deal than You Think

Yahoo's India team quietly launched Yahoo Glue out of the glare of the media circus around Yahoo these days. Glue has been noticed by a few commentators (e.g., TechCrunch), who mostly see it as a response to Google's Universal Search (or Ask's search interface). They might be missing the point. Glue is fundamentally different from Universal Search and represents a whole new way of thinking about search.

The fundamental distinction is: where do the results come from? Google's Universal Search searches across Google's properties: web search, image search, YouTube, Scholar, Google News, and so on.  Glue, on the other hand, includes not just Yahoo properties (web search, Yahoo! Answers, Flickr), but also pulls in results from WebMD, HowStuffWorks, and even (ironically) Google Blog Search and YouTube. For example, compare Universal Search (diabetes) with Glue (diabetes). In this respect, Glue bears more similarity to mashups such  as Addict-o-Matic than to Universal Search.

The Web today is a far different place from what it was when Google's search paradigm was invented. The web was then a collection of documents; it is now a collection of applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Each application has its own deep collection of data, and we tend to think of them as being different information types rather than just "web pages". Yet the search model flattens each of these rich interactive services into a collection of web pages that can be indexed -- that's really putting very new wine in a very old bottle.

The one-index-fits-all model forces a linear ranking of incomparable types, such as images, videos, how-to content, facts, and opinion. That's really comparing apples to oranges. The correct way is to deal with each of these information sources as a first class citizen, with its own kind of data and interaction paradigm. Let them compete for real estate on a 2-dimensional search results page (as opposed to 1-dimensional list), and may the best ones win.

Of course, there are many technical challenges to be solved in getting there.  Sending each query to every web application is a recipe for disaster; and dealing with many different APIs is not conducive to scaling. To keep themselves relevant, however, search engines must evolve from indexes to intelligent routers of searches to third-party applications. 
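To make the "router" idea concrete, here is a minimal sketch of routing a query to a few sources instead of broadcasting it to all of them. Everything in it is an assumption made for illustration: the source names, the topic sets, and the overlap-based scoring are hypothetical, and a real router would need far better signals than keyword overlap.

```python
from typing import Dict, List, Set

# Hypothetical registry: each third-party source advertises the topics it covers.
SOURCE_TOPICS: Dict[str, Set[str]] = {
    "health_site": {"diabetes", "asthma", "symptoms"},
    "video_site": {"trailer", "music", "tutorial"},
    "howto_site": {"repair", "install", "tutorial"},
}


def route(query: str, max_sources: int = 2) -> List[str]:
    """Pick the sources whose topics overlap the query terms the most."""
    terms = set(query.lower().split())
    scored = [(len(terms & topics), name) for name, topics in SOURCE_TOPICS.items()]
    # Drop sources with no overlap, then rank by overlap (ties broken by name).
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:max_sources]]


health_routes = route("diabetes symptoms")
repair_routes = route("bike repair tutorial")
print(health_routes)
print(repair_routes)
```

The point of the sketch is only that the query goes to the handful of sources likely to have an answer, so the number of API calls per query stays bounded no matter how many sources are registered.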

Removing Friction at the Bottom of the Pyramid

Dharavi is a slum of over a million souls packed into a single square mile in the heart of Mumbai, India. It holds the dubious distinction of being the largest slum in Asia. Families of 15 crowd into 300-square-foot tenements, sharing the space with many more mice. It's hard to imagine the people of Dharavi as consumers moving up the economic ladder, but that is precisely what is happening.

Kashyap Deorah moved back to India from the US recently, and has spent the last several months visiting Dharavi to understand its micro-economy. Every street in Dharavi is home to an electronics dealer. The main business is used cell phones and prepaid SIM cards; India now has over 246 million cell phone subscribers, with the number growing at a scorching pace. One of the hottest items is -- hold your breath -- used flat screen LCD televisions! Surprising, yet clear enough when you think about it: space is at a premium, so slum-dwellers behave rationally in opting for flat screen televisions.

This anecdote illustrates the demand for used consumer durables of all kinds among the upwardly-mobile masses in India. McKinsey has published a fantastic study on how the rapidly expanding Indian economy is creating new consumers. This study divides Indian households into 5 segments based on household income: Globals, Strivers, Seekers, Aspirers, and Deprived. The Globals are the super-rich elite; the Strivers and Seekers constitute the middle class; and the Deprived are the destitute outside the pale of consumption.

The most interesting class are the Aspirers: these are not quite destitute or middle-class, but are upwardly mobile and aspire to enter the middle class. Today, only 5% of Indian households are in the middle class, 41% are Aspirers, and 54% are Deprived. In 2025, the study expects the middle class to have swelled to 41% of households, the Aspirers to remain steady at 36%, while the Deprived drop to 22% of households. What is happening is a massive shift of households from Aspirers to middle class and from Deprived to Aspirers. The people of Dharavi are among today's Aspirers and tomorrow's middle class.

Aspirers cannot afford new cell phones, or televisions, or washing machines. But there is huge demand for used cell phones, televisions, and consumer durables of all kinds. In India today, the market for used consumer durables is extremely inefficient, and relies primarily on word-of-mouth. A free flow of information about demand and supply can make the market efficient, and also help millions of people take their first steps to becoming consumers.

Given the almost universal penetration of cell phones among Aspirers in India, the natural solution would seem to be one that uses mobile phones to help people buy and sell used goods. Kashyap has started a company named Chaupaati Bazaar (named after a famous beachside bazaar in Mumbai) to do just this. The problem is challenging: create a used goods market, make it work entirely through SMS and voice (no web interface), make it work for a semi-literate user base speaking many languages, and figure out the business model. Not a challenge for the faint-hearted, but something that could really make a difference if it can be made to work. I'm proud to join Chaupaati's Board of Directors as its lead investor.

Dr. C.K. Prahalad is famous for coining the phrase "fortune at the bottom of the pyramid." The bottom of the pyramid consists of the poorest section of the world's population, who are not viewed as a viable market by most consumer products companies. There are 4 billion people at the bottom of the pyramid, out of a world population of 6 billion. This is an opportunity to use technology to eliminate friction at the bottom of the pyramid, enabling at least some of those people to climb a rung toward the middle class.

Using Twitter to Share Factoids

I started using Twitter a couple of weeks ago. Many people use Twitter to post status updates, but here's how I've decided to use it: I'll post factoids about data I find interesting: either through personal observation, or because I read or heard it somewhere. For example, I spent the day today at the Digital Hollywood conference in LA, where I got an opportunity to post (from my cell phone) factoids as I heard them from various speakers and panelists. I was also on a panel called "Advertising NEXT" at Digital Hollywood today, on which more in a separate post.

My Twitter posts appear on the right hand column of the blog. You can also see them (and, if you like, subscribe to them either via RSS or your mobile) via this link. And let me know what you think.

Is Search Advertising a Giffen Good?

The Giffen good is a strange beast from economic theory.  For most goods, demand decreases as price increases. A Giffen good defies this normal market behavior -- the demand for it increases even as its price increases.

Giffen goods have a very interesting history. They were postulated originally by Alfred Marshall in his 1895 book The Principles of Economics. The classic example is staple foods such as rice, wheat, and potatoes. As their price goes up, poor people on a tight budget actually consume more of them, because they are forced to cut back on luxuries such as meat, but still need the same number of calories to survive. Until recently, Giffen goods remained a theoretical beast, with no real documented examples -- until 2007, when two Harvard economists demonstrated that rice and noodles behave as Giffen goods in certain poor parts of China.

Google's recent results raise the possibility that search advertising might be a Giffen good. Here's a simple model. Company X spends marketing dollars on two channels: search advertising and brand advertising (on the web or on TV and magazines). Search advertising drives customers directly to their site, resulting in immediate sales. Brand advertising drives organic traffic, albeit in a less measurable way.

In an economic downturn,  companies get more cautious with their marketing budgets, moving more dollars into measurable and direct channels such as search advertising while cutting back on less-measurable brand advertising. Thus, there is more competition for the clicks, driving up the price (cost-per-click, or CPC) of search ads.

Company X, therefore, finds all their increased spend on search marketing actually drives the same or even fewer visitors to their site. At the same time, since they have cut back on brand advertising, organic traffic is decreasing. But wait -- we need to make this quarter's numbers! The easiest way to do that is cut back even more on brand advertising and channel even more dollars into search, which can drive immediate clicks towards the end of the quarter. Brand marketing's ROI is longer-term, while this quarter's revenue is a more pressing concern.

Witness the result: company X spends more on search marketing, driving more search ad clicks to its site, at a higher price point.  The definition of a Giffen good! Interestingly, unlike the rice-and-noodles example, the increased consumption directly leads to the increased price, because of the auction pricing model.

Google's recent results seem to confirm this hypothesis: paid clicks increased by 20% from Q1 2007, while ad revenues increased by 40%, implying a CPC increase of roughly 17% (1.40 / 1.20 ≈ 1.167). Of course, there's a limit to this phenomenon: companies cannot pay more for their ad clicks than their profit margins allow. Until that time, the sucking sound you hear is everyone's profit margins going into Google. We're going to see a lot of low-margin revenue increases at online retailers and other companies that rely on paid search.
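The implied CPC figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes ad revenue is simply paid clicks times average CPC, so the growth factors multiply; the function name is made up for illustration.

```python
def implied_cpc_change(click_growth: float, revenue_growth: float) -> float:
    """Return the fractional CPC change implied by click and revenue growth.

    Assumes revenue = paid clicks * average CPC, so growth factors multiply.
    """
    return (1.0 + revenue_growth) / (1.0 + click_growth) - 1.0


# Figures quoted in the post: paid clicks up 20%, ad revenue up 40%.
change = implied_cpc_change(click_growth=0.20, revenue_growth=0.40)
print(f"Implied CPC change: {change:.1%}")
```

With those inputs the implied increase comes to about 16.7%, in line with the figure quoted above.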

Network effects propel cloud computing

Erick Schonfeld over at TechCrunch has a very interesting post revealing that big enterprises are adopting Amazon's EC2/S3 services at a far faster pace than previously imagined:

A high-ranking Amazon executive told me there are 60,000 different customers across the various Amazon Web Services, and most of them are not the startups that are normally associated with on-demand computing. Rather the biggest customers in both number and amount of computing resources consumed are divisions of banks, pharmaceuticals companies and other large corporations who try AWS once for a temporary project, and then get hooked.

This is epochal stuff -- banks and pharma are notorious late adopters of early-stage technology, so to see them in the vanguard of cloud computing (or perhaps I should say utility computing, but everyone says cloud) is astonishing. But it illustrates a very important detail that's been overlooked: that there are significant network effects to the cloud computing business.

There are two basic underlying forces behind the network effects:

  1. Code that works with large amounts of data needs to be close to the data (in the network topology sense).
  2. Any processing that consumes data generates data.

So, once one enterprising group within a company decides to place some data in S3 and the code to process it on EC2, it becomes a whole lot easier for someone else within the company who needs to run some other code on the data, to move the processing to EC2. And since all this processing generates even more data, we have a virtuous cycle building up. The stable state is for all of a company's data processing tasks to move into the same utility computing cloud, to take advantage of the co-location and minimize data transfer latency and costs.

The network effect extends across companies as well. Often data created by company A is consumed by company B. When this "data interface" is voluminous, it makes economic sense for company B to move into the same utility cloud as company A. There are some ecosystems where utility computing players are already exploiting this trend; for example, AppNexus is creating a utility cloud optimized for the use of ad networks and their associated ecosystem: analytics for publishers and advertisers. There is so much data being shared here (on ad campaigns and their performance) that there is significant advantage to being in the same cloud.

The network effects argument leads to the interesting possibility that cloud computing becomes a winner-take-all game, like auctions; we might end up with one winner (maybe Amazon?). A more likely outcome is that we end up with a couple of big general-purpose clouds (Amazon and Google, perhaps?) and a few niche clouds optimized for different ecosystems (such as ad networks and social networks).

More data beats better algorithm at predicting Google earnings

Readers of this blog will be familiar with my belief that more data usually beats better algorithms. Here's another proof point.

Google announced earnings today, and it was a shocker -- for most of Wall Street, which was in a tizzy based on ComScore's report that paid clicks grew by a mere 1.8% year-over-year. In the event, paid clicks grew by a healthy 20% from last year and revenue grew by 30%.

In comparison, SEM optimizer Efficient Frontier released their Search Performance Report on their blog a few hours ahead of Google's earnings call. EF manages the SEM campaigns of some of the largest direct marketers, handling more SEM spend than anyone in the world outside of the search engines themselves. Their huge volumes of data give them more insight into Google's marketplace than anyone outside of Google.

EF reported a 19.2% increase in paid clicks and 11.2% increase in CPCs at Google Y-O-Y. Do the math (1.192*1.112 = 1.325), that's a 32.5% Y-O-Y revenue increase. That's the closest anyone got to the real numbers!  And this quarter is not a flash in the pan: in January, EF reported a 29% Y-O-Y increase in SEM spend, with 97% of the increased spend going to Google: that is, about a 28% Y-O-Y revenue increase for Google. That compares very favorably with the actual reported increase of 30%.
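The "simple math" here is easy to reproduce. The sketch below assumes, as the post does, that revenue growth is the product of click growth and CPC growth, and it follows the post's shortcut of multiplying the January spend increase by Google's share of it; the function name is hypothetical.

```python
def revenue_growth(click_growth: float, cpc_growth: float) -> float:
    """Revenue growth implied by click and CPC growth (revenue = clicks * CPC)."""
    return (1.0 + click_growth) * (1.0 + cpc_growth) - 1.0


# This quarter: EF saw paid clicks up 19.2% and CPCs up 11.2% at Google.
quarter_estimate = revenue_growth(0.192, 0.112)

# January: 29% more SEM spend, 97% of the increase going to Google.
# This mirrors the post's back-of-the-envelope shortcut, not a precise model.
january_estimate = 0.29 * 0.97

print(f"Quarter estimate: {quarter_estimate:.1%}")
print(f"January estimate: {january_estimate:.1%}")
```

Both estimates land within a few points of the 30% revenue growth Google actually reported.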

As Paul Kedrosky points out, this is a huge indictment of ComScore's methodology (ComScore's shares are trading down 8% after-hours post the Google earnings call). ComScore sets a lot of store by their "panel-based" approach, which collects data from a panel of users, similar to Nielsen's method of collecting data on TV viewing using data from a few households that have their set-top boxes installed. ComScore has been in this business longer than anyone else, and has arguably the best methodology (i.e., algorithm) in town to analyze the data. They're just not looking at the right data, or enough of it. Some simple math using the mountain of data from EF handily beats the analysis methodology developed over several years using data from a not-so-large panel.

To my mind, this also puts in doubt the validity of ComScore's traffic measurement numbers. For websites where I personally know the numbers (based on server logs), both Quantcast and Hitwise come far closer to reality than ComScore. The latter two don't rely as heavily on a small panel. ComScore's value today is largely driven by the fact that advertisers and ad agencies trust their numbers more than the upstarts. Advertiser inertia will carry them for a while; but a few more high-profile misses could change that quickly.

Disclosure: Cambrian Ventures is an investor in EF. However, I don't have access to any information beyond that published in their public report.

Can SMS be a publishing medium?

When we think of SMS (Short Message Service), we think of short  text messages sent between friends, or to small groups via services like Twitter. This is how the Internet started too -- primarily as a communication medium. But soon the World-Wide Web made the internet a publishing medium as well.

The question is, can SMS messages to mobile phones become a free, ad-supported publishing medium? It seems unlikely, given the limitations: text messages must be no longer than 160 characters. In addition, they also cost both the sender and the receiver.

Necessity, however, is the mother of invention. By a coincidence of three factors, India seems to have the perfect conditions for the emergence of SMS publishing:

  • Huge cell phone penetration -- 246 million and counting.
  • Very low internet penetration -- about 20 million internet connections.
  • A cost structure where senders pay for text messages but recipients don't.

I met today with Rajesh Jain, whose company Netcore offers a service called MyToday in India that is in effect the first SMS publisher. Rajesh Jain is an Internet pioneer, having started India's first internet portal, indiaworld.in, back in 1994 -- he sold it for over $115 million in 1999, in what was perhaps the first big Internet deal in India.

MyToday "publishes" a number of "MyToday Dailies", on topics such as News, Cricket, Health, Gossip, Local News, and so on. You subscribe to one of these dailies by sending a text message to a published phone number from your cell phone. Subsequently, you will receive daily (or sometimes more frequent) text message "articles" on the topic you indicated interest in. The articles are editorially assembled, and also include an embedded ad. How, you might well ask, can you fit in a story and an ad within the 160-character limit? There are two parts to this answer:

  1. You can do a lot within 160 characters, if you try really hard and your expectations are not very high. I remember fitting very playable video games into 16K of RAM on my Sinclair Spectrum when 48K was way too expensive. I looked at some of the MyToday examples, and I'm impressed.
  2. It's possible to split stories across multiple text messages. More and more phones have the ability to handle messages larger than 160 characters: the SMS transport breaks them up into chunks of at most 160 characters (slightly fewer in practice, since each chunk carries a small reassembly header) and the phone re-assembles the message.
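The splitting described in the second point can be sketched in a few lines. This is an illustration under stated assumptions, not a description of MyToday's system: it assumes plain GSM 7-bit text, where a single message holds 160 characters and each part of a concatenated message typically holds 153 because a small header is reserved for reassembly. The function name is made up.

```python
from typing import List

SINGLE_LIMIT = 160   # characters in one standalone GSM 7-bit message
SEGMENT_LIMIT = 153  # characters per part once a concatenation header is added


def split_sms(text: str) -> List[str]:
    """Split text into SMS-sized parts (assumes basic GSM 7-bit characters only)."""
    if len(text) <= SINGLE_LIMIT:
        return [text]
    return [text[i:i + SEGMENT_LIMIT] for i in range(0, len(text), SEGMENT_LIMIT)]


story = "x" * 400  # stand-in for a 400-character story plus ad
parts = split_sms(story)
print(len(parts), [len(p) for p in parts])
```

A 400-character story therefore costs the sender three messages rather than one, which is why the per-message cost matters so much to the economics below.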

The real economic key to this whole enterprise, though, is the cost structure of SMS in India. There you don't pay to receive a message, only to send it. So for subscribers the service is entirely free, after the initial message to subscribe. And the numbers prove it. MyToday has over 3 million unique subscribers, each with 3 subscriptions on average. They send more than 10 million text messages a day, making them India's largest sender of text messages.

For MyToday, ad rates today are still below the cost of sending a text message, but text messaging rates -- especially for huge bulk purchases -- are falling dramatically, so it's not hard to see the crossover happening next year. And that could usher in the era of the SMS publishers.

The story behind Google's crawler upgrade

Alon Halevy and Jayant Madhavan have a post on Google's Webmaster blog disclosing that Google is now harvesting data that is hidden behind HTML forms. I'm very pleased about this, because it's something that I had a hand in making happen.

I've known Alon since 1995, when he was a researcher and I was his summer intern at Bell Labs in New Jersey (those were the days when Bell Labs was still relevant to Computer Science research). Until that summer, Alon and I had been working independently of each other on the topic of integrating information from across different data sources (such as websites) -- Alon at Bell Labs and I at Stanford, for my PhD dissertation. We put together our ideas and created the Information Manifold, an information integration system that introduced a key idea: you could describe each data source (declaratively) by a local data schema; map that schema to a global schema that unified all the sources; and process queries across the global schema.
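As a toy illustration of the local-schema-to-global-schema idea, the sketch below describes two hypothetical sources by a mapping from their own field names to a shared global schema, then answers a query written purely against the global schema. The source names, fields, and data are invented, and the real system was far richer (it reasoned over declarative source descriptions rather than translating rows eagerly as this does).

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical sources, each described by a local-to-global field mapping.
SOURCES: Dict[str, Dict[str, Any]] = {
    "store_a": {
        "to_global": {"name": "title", "cost_usd": "price"},
        "rows": [{"name": "Dune", "cost_usd": 9.99},
                 {"name": "Emma", "cost_usd": 4.50}],
    },
    "store_b": {
        "to_global": {"book": "title", "amount": "price"},
        "rows": [{"book": "Dune", "amount": 8.75}],
    },
}


def to_global(row: Row, mapping: Dict[str, str]) -> Row:
    """Rename a source row's local fields to their global-schema names."""
    return {mapping[field]: value for field, value in row.items()}


def query_global(predicate: Callable[[Row], bool]) -> List[Row]:
    """Evaluate a predicate written against the global schema over all sources."""
    results = []
    for source_name, source in SOURCES.items():
        for row in source["rows"]:
            unified = to_global(row, source["to_global"])
            if predicate(unified):
                results.append({**unified, "source": source_name})
    return results


cheap_dune = query_global(lambda r: r["title"] == "Dune" and r["price"] < 9.0)
print(cheap_dune)
```

The query never mentions "name", "book", "cost_usd", or "amount"; adding a third source only requires writing its mapping.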

This task is complicated by the fact that each data source exposes different query processing capabilities, by virtue of the different HTML forms it uses. We came up with some simple solutions to this problem, but several difficulties remained. In the meantime, Alon and I went our ways with our careers; Alon joined the CS faculty at the University of Washington, while I started a company, Junglee, that applied information integration ideas to build the first comparison shopping engine and online job bank. Junglee was acquired by Amazon.com in 1998; in 2000, I started a Venture Capital firm called Cambrian Ventures and in late 2004 a startup, Kosmix.

Alon and I kept in touch through all this, and in 2004, he and his student Jayant Madhavan had made great progress on "schema matching", a research area that has application to the HTML forms problem we had first encountered in the Information Manifold. I was very happy to egg them on to commercialize their breakthrough, and to provide the funding for the resulting company, Transformic, through Cambrian Ventures.

Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. Called the Invisible Web, this data is often estimated to be even larger in size and usefulness than the "visible web" that web crawlers usually index. The key problems in indexing the Invisible Web are:

  1. Determining which web forms are worth penetrating.
  2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with checkboxes, radiobuttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging -- we need to understand the semantics of the input box to guess possible valid inputs.
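For the "fairly straightforward" case in problem 2, one naive approach is to enumerate every combination of the closed-choice values. The sketch below is only that: an assumption-laden illustration, not Transformic's or Google's method. The URL and field names are hypothetical, free-text inputs are ignored entirely, and a real crawler would need to cap the number of combinations.

```python
from itertools import product
from typing import Dict, Iterator, List
from urllib.parse import urlencode


def enumerate_submissions(action_url: str, fields: Dict[str, List[str]]) -> Iterator[str]:
    """Yield one GET URL per combination of values for closed-choice fields."""
    names = sorted(fields)  # fixed order so the output is deterministic
    for combo in product(*(fields[name] for name in names)):
        yield f"{action_url}?{urlencode(dict(zip(names, combo)))}"


# Hypothetical form with two drop-down menus.
form_fields = {"make": ["honda", "toyota"], "condition": ["new", "used"]}
urls = list(enumerate_submissions("https://example.com/search", form_fields))
for url in urls:
    print(url)
```

The number of URLs is the product of the option counts, which is exactly why problem 1 (deciding which forms are worth penetrating) matters.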

Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic -- a nice return for Cambrian, but also a great place for the Transformic team to make a real difference with their ideas. The Transformic team have been working hard for the past two years perfecting the technology and integrating it into the Google crawler. I'm very happy to have played a small role in their success story.

An aside on the world of academic research. Alon and I had some difficulty getting our Information Manifold research published: it was rejected the first time we submitted it to a leading academic conference; we had to address a lot of criticism and skepticism from the establishment, and it was finally published at the VLDB conference in 1996. Remarkably, this paper has since become one of the most cited and influential papers in its field (see this survey, page 52). In 2006, the paper received the 10-year Best Paper Award at VLDB, given retrospectively to the publication from 10 years ago that made the most impact. The moral of the story is, sometimes it pays to swim against the current in academic research; what is fashionable today is rarely what will ultimately make the most impact.

Affinity and Herding Determine the Effectiveness of Social Media Advertising

A recent piece in The Economist raises a provocative question: social networking sites such as Facebook, MySpace, and Bebo have grown tremendously in usage, but are they viable businesses? In other words, is it possible to monetize these services in an effective fashion? To answer this question, it helps to take a step back and look at the monetizability of social media as a whole.

Since most social media sites rely on advertising revenues, let us restrict ourselves to advertising as the monetization mechanism. Regardless of the model (CPM, CPC, CPA), advertisers value three key measures: reach, frequency, and targeting. Many social media sites certainly score high on reach and frequency, but how do they fare on targeting? Targeting is key, because it determines the CPM rates advertisers are willing to pay. And CPM rates vary very widely: from $16-20 for TripAdvisor to $0.10 for Facebook and MySpace. See, for example, this media plan.

What drives such a wide divergence in CPM rates among social media sites? Are the low rates at social networking sites a transient aberration, with higher rates around the corner as advertisers get more comfortable with the medium? And is there a simple model to predict the targetability of different forms of social media?

Remarkably, there appears to be a single factor that explains a great deal of the available data. Consider the difference between a Facebook profile and a TripAdvisor travel review. A typical pageview on the former is by someone known very well to the creator of the profile – a close friend or acquaintance. On the other hand, a TripAdvisor travel review is seen by people completely unrelated in any way to the person or persons who wrote the reviews on the page.

We quantify this distinction with a measure called affinity. The “affinity” of a social media service is the average closeness of relationship between a content creator and someone who views that content.  The affinity of Facebook is very high, while the affinity of TripAdvisor is very low.

Here’s the key observation: There is an inverse relationship between the affinity of a social media service and its targetability. Why is this true? The act of viewing a Facebook profile gives us very little information about the viewer, other than the fact that she is friends with the profile creator; when someone views a TripAdvisor travel review, she is definitely interested in traveling to that location.

I estimated the affinity of several forms of social media, and plotted affinity against CPM (which I used as a proxy for targetability). The resulting graph (click for a larger image) shows the landscape of Affinity versus Targetability for several forms of social media. Some of these data points are from published data and others are extrapolated. We can see that there is a strong inverse proportionality, with a couple of outliers. We’ll get to the outliers in a moment; for now, note that Social Networks and Photo Sharing sites are even higher affinity (and therefore lower targetability) than email. This is because we often email people we don’t know or know only in passing. Instant messaging has the very highest of affinities: my IM buddy list includes only my very closest friends, who I trust with the ability to interrupt me any time of the day.

What about the outliers? Video sharing sites, such as YouTube, have low affinity, because the majority of people see videos posted by people they don’t know. However, the targetability is lower than we would expect, because of a compensating factor: herding. Most people see videos featured on lists such as “Most Popular”, which reduces the targeting value of such videos. This is also true of social news sites such as Digg.

A couple of caveats:

  • This is a broad brush-stroke, and individual services might well differ from the overall category. For example, popular blogs have much lower affinity and therefore much higher CPMs than the typical blog.
  • Targetability is not the only factor determining CPM; there are others. For example, certain viewer intents are inherently more valuable than others.

But with these caveats, this simple model is highly instructive. We may conclude that, when all the dust settles, the CPM rates of instant messaging services will not exceed those of social networks, which will not exceed those of email. These are inherently low CPM businesses.

What can social media sites do to increase their CPMs? There appear to be two options:

  • Create sections of the network that are more topic-oriented, and less about individuals. For example, band pages and groups on MySpace, and Facebook groups.
  • Mine individuals’ profiles, or their off-site behaviors, to target them behaviorally rather than contextually. This approach carries with it dangers of privacy violations, as the Facebook Beacon fiasco demonstrates.

If social networks are to become a viable business, they need to pursue aggressively one or both of these approaches. Of course, it may be possible for some services to sidestep this question entirely and develop business models that don’t depend on advertising. We haven’t seen such a model emerge yet, but there is so much creativity and ferment in this space that it might just happen.

Update: I received some questions about the affinity versus targetability landscape. Here's a brief description of the methodology. I used published CPM numbers where they were available; e.g., Yahoo Mail ($3-4), TripAdvisor ($16), Facebook ($0.10-0.15). Note that published CPMs are generally to be taken with a pinch of salt, since they may apply only to small portions of the overall publisher inventory and not represent real market-clearing prices. One example is Google's stated goal of $20 CPM for YouTube -- only a very small number of YouTube videos show ads today. I've used Metacafe's $5 net CPM payout to video producers as a more reasonable benchmark -- this likely represents a gross CPM of $10 assuming a 50% rev share. For blogs, the numbers are all over the place: BlogAds ratecards for various blogs vary from $1-$4 CPM, Valleywag reports $6.50-$9.75, and Federated Media has ratecards charging $7-$40. I took $10 to be a median for blogs with reasonably high traffic. Some of the other data points are based on guesses and informal conversations, since these sites typically don't publish their CPMs. Please email me if you have additional data on these; I will update the graph accordingly.
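The Metacafe conversion above is a one-line calculation. A minimal sketch, assuming the payout is a fixed share of gross ad revenue (the 50% share is the post's own assumption, and the function name is made up):

```python
def gross_cpm(net_payout_cpm: float, producer_share: float) -> float:
    """Gross CPM implied by a net payout, given the producer's revenue share."""
    return net_payout_cpm / producer_share


# $5 net payout at an assumed 50% revenue share.
metacafe_gross = gross_cpm(net_payout_cpm=5.0, producer_share=0.5)
print(f"Implied gross CPM: ${metacafe_gross:.2f}")
```

A different revenue-share assumption moves the estimate proportionally, so this data point is only as good as the 50% guess.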