Datawocky

On Teasing Patterns from Data, with Applications to Search, Social Media, and Advertising

Stanford Big Data Course Now Open to the World!

We live today in a world flooded with data. In just one short decade, we have gone from a data-poor world to a data-rich one. The buzzword Big Data captures this phenomenon, and it’s one of the few cases where the reality actually can match the hype. Big Data is transforming every industry and human activity, including commerce, entertainment, agriculture, government, and the sciences.

For the past decade at Stanford, Jure Leskovec, Jeff Ullman, and I have been teaching a popular course called “Mining of Massive Datasets,” where we teach the fundamental techniques and tools to deal with Big Data. This class has trained a whole generation of data scientists and engineers who work at many major Silicon Valley companies and startups.

The Stanford course is popular, and attracts hundreds of students. But the course textbook, also called Mining of Massive Datasets and published by Cambridge University Press, has been downloaded by hundreds of thousands of students and practitioners. This helped us realize that our Stanford students are just a small fraction of the vast number of people worldwide who might benefit from the course.

So we are now making this class available online, on Coursera, for the entire world. In this class we will introduce fundamental algorithms and techniques to deal with Big Data, such as MapReduce, Locality Sensitive Hashing, Page Rank, and algorithms for Large Graphs and Data Streams. We will also show how to apply our toolkit to important practical applications, such as Web Search, Recommender Systems and Online Advertising.
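To give a tiny taste of the material, here is a minimal power-iteration sketch of PageRank on a toy three-page web graph. The graph, damping factor, and iteration count are illustrative choices of mine, not course materials:

```python
# Toy web graph: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3
damping = 0.85  # standard damping factor; an illustrative choice here

# Start with a uniform rank distribution.
rank = [1.0 / n] * n

# Power iteration: each page distributes its rank equally among its outlinks,
# and every page receives a (1 - damping) / n "teleport" share.
for _ in range(100):
    new_rank = [(1 - damping) / n] * n
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print([round(r, 3) for r in rank])  # the ranks sum to 1
```

Page 2, with two in-links, ends up with the highest rank; page 1, reachable only via half of page 0's rank, has the lowest.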

The class starts September 29 and runs for 9 weeks. One of the key decisions we made is to not “water down” the material in any way from the course we teach at Stanford; the MOOC contains exactly the same material as the Stanford class. You can sign up for the class on the Coursera page. Here’s a short introductory video we recorded for the class.

In addition to the materials provided with this MOOC, the second edition of our textbook is now available for free download here. If you liked the first edition of the book, you should definitely check out the second edition -- we’ve added lots of new material, including graph algorithms, social network analysis, large-scale machine learning, and dimensionality reduction.

Marc Andreessen has famously pointed out that software is eating the world. Data is the fuel that powers software’s conquests. Data is created whenever humans and software interact, or when software interacts with other software. This virtuous cycle -- the success of software creates more data, and more data makes software even more powerful -- is a dynamic that is transforming the world we live in. Join us on Coursera to learn how to harness the power of data so that you can be an active participant, rather than a mere spectator, in this transformation.

September 16, 2014 | Permalink | Comments (3) | TrackBack (0)

Goodbye, Kosmix. Hello, @WalmartLabs

Two weeks ago, I announced the exciting news that Kosmix had agreed to be acquired by Walmart, the world’s largest retailer. At that point, the deal was signed but subject to customary closing conditions. Today, I’m delighted to announce that the closing conditions have been fulfilled and that the deal is officially closed.

Today is the last day in the life of Kosmix as an independent company. It is a day to look back with pride at our accomplishments. I’m proud of our pioneering development of breakthrough semantic analysis technology, and of applying it to real-time social media streams to create the Social Genome. I’m proud that we built RightHealth into one of the top three health and medical information sites by global reach. I’m extremely proud that we touched so many people each and every day: in March, our properties RightHealth, Tweetbeat, and Kosmix.com together served over 17.5 million unique visitors worldwide, who spent over 5.5 billion seconds on our services. But most of all, I’m proud of being part of the Kosmix team, and having the privilege of working with a team of extremely talented and passionate individuals who, over the course of the past several years, went from coworkers to family.

Today also is the first day in the life of @WalmartLabs. As I wrote in my prior post, our mission is to invent the next generation of ecommerce: integrated experiences that leverage the store, the web, and mobile, with social identity being the glue. We are at an inflection point in the development of retailing. Social media and the mobile phone will have as profound an effect on the trajectory of retail in the early years of the 21st century as did the development of highways in the early part of the 20th century. @WalmartLabs, which combines Walmart’s scale with Kosmix’s social genome platform, is in a unique position to invent and build this future.

I’m also delighted that the entire Kosmix crew, without exception, will be part of this exciting new journey. If you are interested in joining a talented and passionate team working at the crossroads of social media, mobile, and commerce, give us a holler. @WalmartLabs is hiring!

P.S. To stay tuned, follow @WalmartLabs on Twitter.

May 03, 2011 | Permalink | Comments (21) | TrackBack (0)

Retail + Social + Mobile = @WalmartLabs

Eric Schmidt famously observed that every two days now, we create as much data as we did from the dawn of civilization until 2003. A lot of the new data is not locked away in enterprise databases, but is freely available to the world in the form of social media: status updates, tweets, blogs, and videos.

At Kosmix, we’ve been building a platform, called the Social Genome, to organize this data deluge by adding a layer of semantic understanding. Conversations in social media revolve around “social elements” such as people, places, topics, products, and events. For example, when I tweet “Loved Angelina Jolie in Salt,” the tweet connects me (a user) to Angelina Jolie (an actress) and Salt (a movie). By analyzing the huge volume of data produced every day on social media, the Social Genome builds rich profiles of users, topics, products, places, and events.
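As an illustration (this is my own sketch, not Kosmix's implementation; the handle and entity types are hypothetical), the extracted elements and their connections can be thought of as a tiny typed graph:

```python
# Illustrative only: the kind of typed graph the Social Genome builds
# from a tweet. A real system resolves mentions against a knowledge
# base rather than hard-coding them as done here.
tweet = "Loved Angelina Jolie in Salt"
author = ("user", "@moviefan")  # hypothetical author handle

# Entities a semantic analyzer might extract from the tweet text.
entities = [("actress", "Angelina Jolie"), ("movie", "Salt")]

# Each extracted entity becomes an edge from the author to that element.
edges = [(author, "mentioned", entity) for entity in entities]

for source, relation, (etype, name) in edges:
    print(f"{source[1]} --{relation}--> {name} ({etype})")
```

Aggregating millions of such edges per day is what turns raw status updates into the rich profiles described above.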

The Social Genome platform powers the sites Kosmix operates today: TweetBeat, a real-time social media filter for live events; Kosmix.com, a site to discover content by topic; and RightHealth, one of the top three health and medical information sites by global reach.  In March, these properties together served over 17.5 million unique visitors worldwide, who spent over 5.5 billion seconds on our services.

Quite a few of us at Kosmix have backgrounds in ecommerce, having worked at companies such as Amazon.com and eBay. As we worked on the Social Genome platform, it became apparent to us that this platform could transform ecommerce by providing an unprecedented level of understanding about customers and products, going well beyond purchase data. The Social Genome enables us to take search, personalization and recommendations to the next level.

That’s why we were so excited when Walmart invited us to share with them our vision for the future of retailing. Walmart is the world’s largest retailer, with 10.5 billion customer visits every year to their stores and 1.5 billion online – 1 in 10 customers around the world shop Walmart online, and that proportion is growing. More and more visitors to the retail stores are armed with powerful mobile phones, which they use both to discover products and to connect with their friends and with the world. It was very soon apparent that the Walmart leadership shared our vision and our enthusiasm. And so @WalmartLabs was born (official announcement here).

We are at an inflection point in the development of ecommerce. The first generation of ecommerce was about bringing the store to the web. The next generation will be about building integrated experiences that leverage the store, the web, and mobile, with social identity being the glue that binds the experience. Walmart’s enormous global reach and incredible scale of operations -- from the United States and Europe to growing markets like China and India -- is unprecedented. @WalmartLabs, which combines Walmart’s scale with Kosmix’s social genome platform, is in a unique position to invent and build this future.

It is every technologist’s dream that the products they build will impact billions and will continue on to the next generation. The social commerce opportunity is huge, and today is day zero. We have liftoff!

April 18, 2011 | Permalink | Comments (28) | TrackBack (0)

Creating a Culture of Innovation: Why 20% Time is not Enough

Google has garnered a lot of attention and some success with its "20% time" idea, which enables every engineer to spend one day a week working on projects that don't fit in their job description. In my observation, just announcing that every engineer is expected to spend a certain fraction of their time on innovative ideas won't magically lead to innovation. Plus, it's very hard to implement the 20% time model at a startup: most startups just don't have the luxury of 20% excess engineering capacity.

At my (startup) company Kosmix, we take a somewhat different approach to create a culture of innovation, which I described to Taylor Buley of Forbes in a recent video interview. I think the video is terrific, and encourage you to watch it (it's also embedded at the bottom of this post), but there's only so much that can be said in a 90-second video. So I collected together some of my thoughts into this blog post.

At Kosmix, we don't specify a set fraction of time for people to spend on new ideas. Instead, we have focused on creating a culture that engenders new ideas and rewards innovators, encouraging them to tackle new projects above and beyond their 100% contribution to mainline company execution. The three key building blocks that we've used to create a culture of innovation at Kosmix are Team, Environment, and Incentives.

Team.
It always starts with the people. At Kosmix we are fortunate to have a team of rock-star Computer Science graduates from top universities; it's hard to throw a brick without hitting a PhD. Since CS skills are taken for granted, the interview process emphasizes creativity, problem-solving skills, and teamwork skills.

Very importantly, many Kosmixers are multi-dimensional people with interests and passions that extend well beyond work. For example, one of our Operations gurus has a deep interest in (hold your breath) knitting, and runs knitting classes at work (they're called Knitting Knights). Our office manager also happens to teach Art History.

Environment.
There's something about the graduate school environment that seems to bring out great ideas. Many of the great technology companies (e.g., Yahoo and Google) have been created by graduate students. We have strived to maintain a grad school environment at Kosmix. Walk around and you'll hear plenty of heated hallway discussions and intellectual free-for-alls; nerf gun fights erupt over details of relevance algorithms.

When I was a grad student, I used to get ideas for whole new lines of research by attending talks by other students and faculty. The Infolab, the research group I was part of at Stanford, has a tradition of Friday lunches where a student leads a discussion on their ongoing work. We have copied this model at Kosmix: every Friday, we have a communal lunch gathering, and a Kosmixer leads a discussion -- either on something cool they've been working on, or on some topic that's just cool but completely unrelated to Kosmix -- such as muscle cars, alternative fuels, or astronomy.

Incentives.
Given the right environment, the next piece is incentives for people to go above and beyond the call of duty. At Kosmix the biggest reward is peer recognition through a system of awards:
  • The Kosmix Kreed award is peer recognition at its purest. Any Kosmixer can nominate any other for doing something interesting and inventive that helps Kosmix users, or for going out of their way to help out another team or person working on a different project. Giving this award is as easy as sending an email to HR, with a clear description of the achievement that merits the award.
  • The Just Do It! award is given by management, and recognizes an individual who did a substantial project that goes above and beyond their job description. We stole this idea from Amazon.com, where some of us used to work. For example, one recent awardee dreamt up, designed, and implemented the feature that allows users to customize the Kosmix homepage, without any directive from management. Another implemented the ability to edit any topic page on Kosmix.
We also have other awards that recognize teams that execute really well on their core priorities. These awards are read out at monthly company meetings, to warm applause in front of the entire Kosmix team. Each award is also posted on the internal Kosmix blog, which is read by everyone at Kosmix. Awards carry nominal prizes, such as gift certificates; but the real prize is the peer recognition, which acts as a terrific incentive in a high-octane team.

One of the big successes of the Kosmix culture of innovation has been Meehive. A while back, a Kosmix developer thought it would be cool to take Kosmix's core categorization technology and apply it to the problem of filtering news and blogs. He worked on it for a bit to create a first version, which convinced management that this was important enough to create a full team around. We then staffed an official Kosmix project to create Meehive, a personalized newspaper, which we launched last month. You can specify your interests very easily (I have over 40, including technology and cricket), and Meehive scours thousands of newspapers and millions of blogs to create your own personalized newspaper. Early adopters love Meehive; I now use it as my main source of news every day. Check out what people are saying about Meehive on Twitter.

Oh, and by the way, the most recent Just Do It! award went to a developer on the Meehive team who took it upon himself to create the Meehive iPhone app. It's now rising in popularity among news applications in the App Store, and has been a bigger success than any of us imagined. Best of all, no one told the developer to do it.

April 15, 2009 in Entrepreneurship: views from the trenches, kosmix | Permalink | Comments (12) | TrackBack (0)

Reboot: How to Reinvent a Technology Startup

Three years ago, Odeo was a struggling startup on a path to nowhere. Odeo's core offering--a set of tools for users to create, record and share podcasts--was facing serious competition from Apple and other heavyweights. The management team made a radical decision to "reboot" the company, and Twitter was born.

As I read the Twitter story, narrated eloquently by Dom Sagolla, I can't help but look back over the many startups that I've been associated with over the past twelve years.  In my various roles as a founder, an investor, a board member, and an advisor to startups in Silicon Valley, I'm constantly fascinated by the mechanics of reinvention. Which approaches to reinvention succeed and which ones fail?

Startups flounder for countless reasons. Perhaps the market opportunity is not as big as imagined, or perhaps there is a mismatch between the technology and the market. Maybe the world changed in some significant way, invalidating the key assumptions on which the startup was based. For example, an established company such as Google or Microsoft might enter the market. Or perhaps the deepest recession in recent history dried up demand for the original product or service. In these cases, the founders and management team have to ask themselves the question: should we push ahead, assuming superior execution will win the day against long odds? Or should we change what we're doing? 

Companies that decide to reinvent need to acknowledge the bad news first: most startups fail, even the reincarnated ones. Those are just the odds. The good news is that certain approaches to reinvention work better than others, and companies can increase their chances of success by carefully calculating their reboot strategy.

Every technology startup has four core components: team, technology/product, market, and business model. Rebooting involves changing at least one of these components, while leaving the other factors unchanged. Let us look at each component in turn:

1. Team. Reinvention usually leads to changes in the team. To qualify as a reboot rather than an entirely new company, however, at least part of the team -- usually including at least one of the founders -- must remain with the company through the transition. In my experience, one model that usually does not work is when VC investors replace the entire founding team with new management. I've never seen a startup succeed with none of its founders remaining.

2. Market. Many startups try the most tempting option: to keep the same technology/product and look for a new market.  After all, the investment in product development has already been made.  Unfortunately, while this approach seems the most logical, it is also the least likely to succeed. Why? The hardest part of a startup is understanding the requirements of the market, not building the product. After the dot-com bust in 2000, many consumer internet startups tried to reinvent themselves as enterprise technology providers (remember Chemdex?). The startup junkyard is littered with the carcasses of dot-coms that took this route and failed.

3. Business Model. A very attractive strategy is to keep the same product and market, but change the business model. In my experience, this is the most likely option to succeed. For example, enterprise software companies can reinvent themselves by open-sourcing their software and providing consulting services, or a premium version. A software vendor can reboot as a software as a service (SaaS) provider on the Web. Consumer websites can move to a subscription model from an advertising model, or vice versa.

4. Product. Another smart reinvention approach is to address the same market (or a closely related one) but change the product. This option works best when the market need is real, but the product does not adequately address the opportunity. I've found that the key to success is to throw away the old product completely and start from scratch, using the hard-won learnings about the market acquired from the first iteration. In some cases, it makes sense to move the old product to "maintenance mode" and reassign the bulk of the team to developing the new product.

I've applied this particular model of reinvention to both companies where I have been a founder -- Junglee in 1997 and Kosmix ten years later, in 2007.

We started Junglee in 1996 to create virtual databases that integrated data from multiple websites. Although we had some initial success, we quickly realized that the architecture of our first product limited our ability to deal with rapidly-changing information, a key success factor in certain markets. We completely rebuilt the product from scratch in 1997, and created the world's first comparison shopping service.  This service was enormously popular and led to Junglee's acquisition by Amazon.com in 1998.

We introduced Kosmix as a vertical search engine, initially in the health sector. Our idea was to find a better way to help users understand open-ended queries such as "diabetes", which have no single right answer; that is, explore topics rather than find the needle in the haystack. We'd planned to take a vertical-by-vertical strategy, launching sites named RightHealth, RightAutos, and RightTrips. Very soon, however, we realized that the vertical approach carries severe limitations, because it's hard for consumers to remember to go to different sites for different topics of interest. We decided to rewrite the product from scratch, and we relaunched Kosmix.com as a horizontal site. Kosmix lets you explore any topic and gives you a 360-degree view of anything that interests you -- including information from the Deep Web that is inaccessible to the usual search engines. This transition from vertical to horizontal was much harder than it sounds; it required us to rewrite our technology from scratch. But we did it because of our passionate belief that the problem is real and the market opportunity is vast.

While most startup reboots involve rethinking only one or two of the four core components, in some rare cases it makes sense to go the whole hog. Sometimes it pays to be bold: go after an entirely new market opportunity, create a new product, find a new business model, and make large-scale team changes. This approach is fraught with risk; but there have been a couple of spectacular successes. One clear example is Twitter. Another is Twitter's cousin SMS GupShup, a similar service in India. SMS GupShup was born as Webaroo, a company that wanted to create offline copies of large parts of the web so you could browse while offline. A couple of engineers there launched the SMS GupShup service as a lark and it took off; once the management team saw the traction of GupShup, they re-oriented the company around the new idea.

Some startups are born great: the right team starts with the right idea at the right time, and the rest is history. Some have greatness thrust upon them: the right conjunction of market forces propels an unlikely startup to dizzying heights. Other startups, not so lucky as those in the first two categories, need to earn their greatness. And sometimes that requires a reboot.

February 24, 2009 in Entrepreneurship: views from the trenches, Venture Capital | Permalink | Comments (7) | TrackBack (0)

Oscar Halo: Academy Awards and the Matthew Effect

Slumdog Millionaire is one of my favorite movies of all time. And I have followed the career of A.R. Rahman, who composed the movie's music, ever since his debut in 1992. So I was quite thrilled when Slumdog was nominated for 10 Academy Awards -- and Rahman in two categories, Original Score and Original Song. Thrilled, and a little surprised: while I like Rahman's work in Slumdog, I don't think it's his best work. There is of course nothing wrong with that, as long as Rahman's work is better than that of his competitors this year.

But it got me to thinking: if Rahman had composed the same music for an obscure film this year, rather than for Slumdog Millionaire, would he have been nominated? And even if he had been nominated, what are his chances of winning? In other words, is there a Matthew Effect in Oscar nominations -- to them that have, more shall be given? And, once nominated, is there a halo surrounding movies with many nominations that improves the odds of winning across many award categories? I thought it might be fun to run the numbers based on past years' nominees and winners to see if I could find answers to these questions; it turned out to be somewhat instructive as well, since it required an extension of the standard Market Basket analysis from the world of data mining.

To get the data, I went straight to the source: the official Academy Awards database, which lists all the nominations and winners for the past 80 years. Unfortunately there is not a single page that lists all this information, but it was fairly straightforward to write Python scripts that queried the website a few times and collated the data in tabular form. The result: a table that lists every nomination and winner in every category between 1927 and 2007. There were 8616 nominations in that period, representing 4215 distinct movies; so each movie was nominated on average in about 2 award categories.
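The collation step might look something like the sketch below. The file name is my own choice and the records shown are just a few real examples hand-entered for illustration; the actual scripts parsed the database's query results:

```python
import csv

# Records of the form parsed out of the awards database's query results:
# (year, category, movie, won). These three rows are illustrative examples.
records = [
    (1927, "Best Picture", "Wings", True),
    (2007, "Best Picture", "No Country for Old Men", True),
    (2007, "Best Director", "There Will Be Blood", False),
]

# Collate into tabular form for the analysis that follows.
with open("oscars.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "category", "movie", "won"])
    writer.writerows(records)
```

With every nomination and win in one flat table, the counting and grouping below become one-liners.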

Let's start first with the nominations, to see if there is any evidence of the Matthew Effect. Let's say N(k) is the number of movies with exactly k nominations. The table below shows k and N(k) for k between 1 and 10. If we ignore two outliers (k=1 and k=7), it appears that N(k+1)/N(k) stays close to 0.6 for k between 2 and 10; the distribution has a heavy tail, with dozens of movies collecting 8 or more nominations. Such heavy-tailed distributions are the classic embodiment of the Matthew Effect, arising in contexts such as income and wealth distribution.

Nominations   Movies
1               2796
2                513
3                260
4                195
5                128
6                 81
7                 87
8                 50
9                 31
10                29
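The decay ratios are easy to check directly from the table (k=1 and k=7 are the two outliers noted above):

```python
# N[k]: number of movies with exactly k nominations, from the table above.
N = {1: 2796, 2: 513, 3: 260, 4: 195, 5: 128,
     6: 81, 7: 87, 8: 50, 9: 31, 10: 29}

# Decay ratio N(k+1)/N(k) for each k from 1 to 9.
ratios = {k: round(N[k + 1] / N[k], 2) for k in range(1, 10)}
print(ratios)
```

Most ratios cluster around 0.6, while the k=1 ratio is far smaller (one-nomination movies dominate) and the k=7 ratio exceeds 1.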


The next step is to enquire whether there are Oscar categories for which the effect is much stronger than for others. To study this, we divide the nominated movies into two groups: movies with 4 or fewer nominations (the "poor" group) and movies with 5 or more nominations (the "rich" group). Overall, 5382 nominations, or 62.5%, went to movies in the poor group, and 3234 nominations, or 37.5%, went to movies in the rich group. Now, let's look at the major Oscar categories. The major outliers are Best Picture and Best Director -- nominations in both categories went overwhelmingly to movies in the rich group (70% and 73%, respectively, compared to the average of 37.5%). This is not surprising, because the best picture is typically one that is strong in many disciplines. There is some bias in the acting categories as well, but the big surprise is Film Editing: 68% of the nominations in this category went to "rich" movies. At the other extreme are Music and Special Effects: approximately 70% of the nominated movies are in the "poor" group. So it appears that in these categories at least, talent gets its due without help from Matthew.

Moving from nominations to actual winners, the obvious question is: does being nominated in many categories boost the chances of winning in a disproportionate manner? To study this, I used the Market Baskets approach from Data Mining. In a classic Market Baskets scenario, we ask which items are often purchased together: such as milk and eggs. In this case, we model each movie as a basket: the contents of a movie's basket are its nominations and wins. Do movies with many nominations in their baskets have a disproportionate number of wins?

We must first deal with a technicality. In a normal market basket scenario, the contents of each basket are independent of every other basket, but in this case there are dependencies. Consider the set of market baskets of the movies that have all been nominated in a single award category in a particular year; clearly, one of these has to be the winner in that category, and so the basket of that movie will also contain a win in that category.

It's easy to extend the Market Baskets model to capture this idea. I'll call the new model Constrained Market Baskets. Consider a subset S of market baskets; say, the set of market baskets corresponding to the "rich" movies with 5 or more nominations. Suppose movie M is in this set, and has been nominated in award category C. If there are (say) a total of 5 nominees in this category, then the prior probability of movie M's basket containing a win is 1/5 or 0.2. We can repeat this for all the categories M is nominated in, and add up the priors; this gives the "prior expected value" of the number of wins in M's basket. We add up the expected wins for all the movies in set S to get the total number of wins we expect the set S of movies to have; call this EW. Now, if OW is the actual number of "Observed Wins" across the movies in set S, we want to see if there is a discrepancy between EW and OW. In particular, we define the "win boost" of set S to be OW/EW. If the win boost is higher than 1, then the set S of market baskets has a disproportionate number of wins, and if it's much less than 1, then it has fewer wins than expected.

When we ran the analysis, the set of "poor" movies, with 4 or fewer nominations, had a total of 5382 nominations, with 1143 "expected wins" but only 840 "observed wins" -- a win boost of 0.73. The "rich" movies, by contrast, with 3234 nominations, were expected to win 657 Oscars but actually won 958, a win boost of 1.46. In other words: the rich movies, which represent only 37.5% of all nominations, actually won more than half of all the Oscar awards! Matthew!
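These numbers follow directly from the definitions above; here is a quick recomputation:

```python
# Prior expected wins for one movie: sum over its nominated categories
# of 1 / (number of nominees in that category). For example, a movie
# nominated in two five-nominee categories expects 0.2 + 0.2 = 0.4 wins.
def expected_wins(nominee_counts):
    return sum(1.0 / n for n in nominee_counts)

# Win boost of a set of movies: observed wins over prior-expected wins.
def win_boost(observed, expected):
    return observed / expected

print(expected_wins([5, 5]))           # 0.4
print(round(win_boost(840, 1143), 2))  # 0.73 -- the "poor" movies
print(round(win_boost(958, 657), 2))   # 1.46 -- the "rich" movies
```

Summing the per-category priors before dividing is what makes this a Constrained Market Baskets computation rather than a plain frequency count.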

Once again, we can break up the results by category, and look at the win boosts for specific categories of awards. For most major award categories, the win boosts for the rich and poor categories are in line with the overall average boosts. As in the case of nominations, the effect is very significant in the best picture and best director categories: in these categories, the "poor" movies have a win boost of just 0.30! We noted that the Music category seemed resilient to Matthew in the case of nominations; but in the case of wins, this category has a win boost of 1.7 for the rich movies, in line with the overall average. The surprising and significant outlier in this case is the Best Supporting Actor category, with win boosts very close to 1.0 for both the rich and the poor movies. It appears that the Best Supporting Actor award shows no evidence of Matthew; the other acting categories, however, are in line with the overall averages.

I don't have a deep enough understanding of the movie industry and the Academy Awards process to speculate on the reasons for these effects. Perhaps great talent attracts other great talent, and the Awards reflect that reality. And perhaps the difference between the behavior of wins and of nominations has to do with the fact that the former uses simple plurality voting while the latter uses a preferential voting scheme. In any case, I'm happy on two counts. The statistics on the Music category say that the Matthew effect likely did not help Mr Rahman in securing his nominations; but now that he has been nominated, his chances of winning are greatly boosted because he is associated with Slumdog's 10 nominations. Jai Ho!

Update: A big night for Slumdog, winning 8 awards, including both the music and song awards for A. R. Rahman. While 8 awards is not the best Oscar performance ever, it is the most awards ever won by a movie with 10 nominations (the movies that won more awards all had more nominations). Matthew must be pleased.

February 21, 2009 in Data Mining | Permalink | Comments (13) | TrackBack (0)

Kosmix Adds Rocketfuel to Power Voyage of Exploration

Today I'm delighted to share some fantastic news. My company Kosmix has raised $20 million in new financing to power our growth. Even more than the amount of financing, I'm especially proud that the lead investor in this round is Time Warner, the world's largest media company. Our existing investors Lightspeed, Accel, and DAG participated in the round as well. The Kosmix team also is greatly strengthened by the addition of Ed Zander as investor and strategic advisor. In an amazing career that spans Sun Microsystems and Motorola, Ed has repeatedly demonstrated leadership that grew good ideas into great products and businesses. His counsel will be invaluable as we take Kosmix to the next level as a business.

In these perilous economic times, the funding is a big vote of confidence in Kosmix's product and business. Kosmix web sites attract 11 million visits every month, and we have a proven revenue model with significant revenues and robust growth. RightHealth, the proof-of-concept we launched in 2007, grew with astonishing rapidity to become the #2 health web site in the US. These factors played a big role in helping us close this round of funding with a healthy uptick in valuation from our prior round. Together with the money already in the bank from our prior rounds, we now have more than enough runway to take the company to profitability and beyond.

A few months ago, we put out an alpha version of Kosmix.com. Many people used it and gave us valuable feedback; thank you! We listened, and made changes. Lots of changes. The result is the beta version of Kosmix.com, which we launched today. What's changed? More information sources (many thousands), huge improvements in our relevance algorithms, a much-improved user interface, and a completely new homepage. Give it a whirl and let us know what you think.

To those of you new to Kosmix, the easiest way to explain what Kosmix does is by analogy. Google and Yahoo are search engines; Kosmix is an explore engine. Search engines work really well if your goal is to find a specific piece of information -- a train schedule, a company website, and so on. In other words, they are great at finding needles in the haystack. When you're looking for a single fact, a single definitive web page, or the answer to a specific question, then the needle-in-haystack search engine model works really well. Where it breaks down is when the objective is to learn about, explore, or understand a broad topic. For example:

  • Looking to bake a chocolate cake? We have recipes, nutrition information, a dessert burn rate calculator, blog posts from chow.com, even a how-to video from Martha Stewart!
  • Loved one diagnosed with diabetes? Doctor-reviewed guide, blood sugar and insulin pump slide shows, calculators and risk checkers, quizzes, alternative medications, community.
  • Traveling to San Francisco? Maps, hotels, events, sports teams, attractions, travel blogs, trip plans, guidebooks, videos.
  • Writing an article on Hillary Clinton? Bio, news, CNN videos, personal financial assets and lawmaker stats, Wonkette posts, even satire from The Onion.
  • Into Radiohead? Bio, team members, albums, tracks, music player, concert schedule, videos, similar artists, news and gossip from TMZ.
  • Follow the San Francisco 49ers? Players, news from Yahoo Sports and other sources, official NFL videos and team profiles, tickets, and the official NFL standings widget.


In the examples above, I'm especially pleased about the way Kosmix picks great niche sources for topics. For example, I hadn't heard about chow.com or known that Martha Stewart has how-to videos on her website. Other "gems" of this kind include Jambase, TMZ, The Onion, DailyPlate, MamaHerb, and Wonkette. Part of the goal of Kosmix is to bring you such gems: information sources or sites you have either not heard of, or just not thought about in the current context.

In other words: Google = Search + Find. Kosmix = Explore + Browse.  Browsing sometimes uncovers surprising connections that you might not even have thought about. The power of the model was brought home to me last week as I was traveling around in England. I'd heard a lot about Stonehenge and wanted to visit; so of course I went to the Kosmix topic page on Stonehenge. In addition to the usual comprehensive overview of Stonehenge, the topic page showed me places to stay in Bath, Somerset (which happens to be the best place to stay when you're visiting Stonehenge). It also showed me other ancient monuments in the same area I could visit while I was there. Score one for serendipity. 

Some of us remember the early days of the World Wide Web: the thrill of just browsing around, following links, and discovering new sites that surprise, entertain, and sometimes even inform. We have lost some of that joy now with our workmanlike use of search engines for precision-guided information finding. We built the new Kosmix homepage to capture some of the pleasure of aimless browsing -- exploring for pure pleasure. The homepage shows you the hot news, topics, videos, slide shows, and gossip of the moment. If you find something interesting you can dive right in and start browsing around that topic. We compile this page in the same manner as our topic pages: by aggregating information from many other sources and then applying a healthy dose of algorithms. Dig in; who knows what surprises await?

How does Kosmix work its magic? As I wrote when we put out the alpha, the problem we're solving is fundamentally different from search, and we've taken a fundamentally different approach. The web has evolved from a collection of documents that neatly fit in a search engine index to a collection of rich interactive applications. Applications such as Facebook, MySpace, YouTube, and Yelp. Instead of serving results from an index, Kosmix builds topic pages by querying these applications and assembling the results on-the-fly into a 2-dimensional grid. We have partnered with many of the services that appear in the results pages, and use publicly available APIs in other cases. The secret sauce is our algorithmic categorization technology. Given a topic, categorization tells us where the topic fits in a really big taxonomy, what the related topics are, and so on. In turn, other algorithms use this information to figure out the right set of information sources for a topic from among the thousands we know about. And then other algorithms figure out how to lay the information on the page in a 2-dimensional grid.

While we are proud of what we have built, we know there is still a long way to go. And we cannot do it without your feedback. So join the USS Kosmix on our maiden voyage. Our mission: to explore strange new topics; to discover surprising new connections; to boldly go where no search engine has gone before!

Update: Vijay Chittoor has posted more details on the new product features on the Kosmix blog. Coverage on TechCrunch, GigaOM, VentureBeat. I'm particularly pleased that Om Malik thinks his page on Kosmix is better than the bio on his site!

December 08, 2008 in Data Mining, kosmix, Search | Permalink | Comments (3) | TrackBack (0)

For Startups, Survival is not a Strategy

Note: As I was working on this post, I ran into Om Malik and showed him a draft. He liked it and asked to post it simultaneously on GigaOM. If you've read it on GigaOM, you can skip reading it here.

In these perilous economic times, the layoff memos often follow a familiar refrain: We have cut costs by 20%. That gives us an additional year's runway. Or two. Yes, startups can cut costs and thereby survive for longer. But just because they can, does not mean they should.

Let me state at the very outset that this article applies only to venture-backed startups, which are a small minority of businesses in the economy. The sole purpose of most businesses is to create a steady income stream for their owners and operators. Venture-backed startups, on the other hand, are created with the sole purpose of leading to a meaningful exit for founders, investors, and employees. Such an exit might be either an IPO or an acquisition.

The raison d'être for such startups is therefore a successful exit, not mere survival. And the lifeblood of any startup is growth. Growth along some dimension: customers, usage, revenues, or profits. Under most economic conditions, an IPO is impossible without revenue and profit growth -- and we are unlikely to see a return soon of the times when it was. From an acquisition point of view, stagnant companies are valued at low multiples of revenue -- say 1x to 2x. The comparables are utilities.

A popular meme suggests that "flat is the new up." Given the downturn in the economy, the argument goes, even keeping revenues flat is sufficient. This argument, however, does not apply to startups. By definition, startups are supposed to be attacking nascent market opportunities and unsaturated markets, and so should be able to grow even through a downturn. If a startup cannot find growth in this environment, it's a clear message that the market opportunity might be better served by an established company. Of course, growth in profits or revenues is way better than growth just in usage; but even growth in usage is better than stagnation on all three fronts. There is at least the possibility that a company with strong usage growth might one day be attractive to an acquirer with a good monetization engine.

From a subjective point of view, it's no fun to work at a startup that is not growing along some dimension. Growth is necessary for everyone to enjoy the experience, and feel they are accomplishing something. Stagnation leads to low morale, and people sit around waiting for the axe to fall. It's a slow, agonizing way to die. Rather than let the company become a zombie, management would be doing their investors and employees a favor by advocating in such cases that they pull the plug on the company and return the remaining capital to investors.

Why VCs don’t put the zombies out of their misery

Founders and executives have a lot of emotional capital invested in their companies, and so it is understandable that they shy away from making the ultimate decision. However, the surprising thing is that VCs often allow the zombies to survive for far too long. The reason for this is a subtle misalignment of interests between VCs and their investors. As long as a startup is still alive, VCs can carry the company on their books at the valuation set by the last round of financing. Once they pull the plug, the fund will receive pennies on the dollar, a loss that has to be recorded on the books and doesn't look good when the firm goes to raise their next fund. That’s why every VC portfolio has its fair share of zombies.

Another contributing factor is excessive preference overhangs. Investors receive preferred stock with the right to get back their invested capital ahead of common shareholders in an exit; in some cases they have the right to receive a multiple of their invested capital ahead of common shareholders. The total amount that investors need to receive before common shareholders can participate in an exit is called the "preference overhang."

If a company has raised so much capital that any realistic acquisition will be below the overhang, then common shareholders stand to receive nothing from the sale; and so company management has no incentive to look for such an exit. In such cases, it's important for the VCs and management to agree to restructure the preference overhangs to make such exits attractive to management. Otherwise the company is destined to become a zombie.

Every startup founder and employee has to consider three possible outcomes. Success, failure, and zombiehood. Success is much better than failure, but quick failure beats wasting years of your life on a zombie. If you are a company founder, and you are considering layoffs to extend the runway (perhaps on the advice of your venture investor), you should look at yourself in the mirror and ask whether you are cutting away your growth opportunity and just choosing a lingering death over a quick one.

November 21, 2008 in Venture Capital | Permalink | Comments (2) | TrackBack (0)

Google Chrome: A Masterstroke or a Blunder?

The internet world has been agog over Google's entry into the browser wars with Chrome. When we look back on this event several years from now with the benefit of hindsight, we might see it either as a masterstroke, or as Google's biggest strategic misstep.

The potential advantages to the internet community as a whole are considerable. The web has evolved beyond its roots as a collection of HTML documents and dumb frontends to database applications. We now expect everything from a web application that we do from a desktop application, plus the added bonus of connectivity to vast computing resources in the cloud. In this context, browsers need to evolve from HTML renderers to runtime containers, much as web servers evolved from simple servers of static files and CGI scripts to modern application servers with an array of plugins that provide a variety of services. Chrome is the first browser to explicitly acknowledge this transition and make it the centerpiece of its design, and it will force other browsers to follow suit. We will all benefit.

The potential advantages to Google also are considerable. If the stars and planets align, they can challenge Microsoft's dominance on the desktop by making the desktop irrelevant. Even otherwise, they can hope to use their dominance in search to promote Chrome, gaining significant browser marketshare and ensuring that Microsoft cannot challenge Google's search dominance by building features into Internet Explorer and Windows that integrate MSN's search and other services.

Therein, however, lies the first and perhaps the biggest risk to Google. Until now, Microsoft has been unable to really use IE and Windows to funnel traffic to MSN services and choke off Google. Given their antitrust woes, they have been treading carefully on this matter. Any overt attempt by them would evoke cries of foul from many market participants. Google has been in a great position to lead the outcry, because it has been purely a service accessible from the browser, without any toehold in the browser market itself.

Chrome, however, eases some of the pressure on Microsoft. If Microsoft integrates MSN search or other services tightly into IE, it will be harder for Google to cry foul -- Microsoft could point to Chrome, and any steps taken by Google to integrate their services into Chrome, as counter-arguments. In addition, any outcry from Google can now be characterized as sour grapes from a loser -- Microsoft can say, we both have browsers out there, they have one too, ours is just better, and let consumers decide for themselves.

In some sense, regardless of the actual market penetration of Chrome, Google has lost the moral high ground in future arguments with Microsoft. I wonder whether Google might have achieved all their aims better not by releasing a Google-branded browser, but by working with Mozilla to improve Firefox from within.

Second, while Google has shown impressive technological wizardry in search and advertising, the desktop application game is very different from the internet service game. While users are very forgiving about beta tags that stay for years on services such as Gmail, user expectations on matters such as compatibility and security bugs are very high for desktop applications. It remains to be seen whether Google has the culture to succeed in this game, going beyond whiz-bang features that thrill developers -- such as a blazingly fast JavaScript engine -- to deliver a mainstream browser that competes on stability, security, and features.

The third problem is one of data contagion. Google has the largest "database of intentions" in the world today: our search histories, which form the basis of Google's ad targeting. The thing that keeps me from freaking out that Google knows so much about me is that I access Google using a third-party browser. If Google has access to my desktop, and can tie my search history to that, the company can learn much about me that I keep isolated from my search behavior. The cornerstone of privacy on the web today is that we can use products from different companies to create isolation: desktop from Microsoft, browser from Mozilla, search from Google. These companies have no incentive to share information. This is one instance where information silos serve us well as consumers. Any kind of vertical integration has the potential to erode privacy.

I'm not suggesting that Google would do anything evil with this data, or indeed that the thought has even crossed their minds; thus far Google has behaved with admirable restraint in their usage of the database of intentions, staying away, for example, from behavioral targeting. But we should all be cognizant of the fact that companies are in business purely to benefit their shareholders. At some point, someone at Google might realize that the contents of my desktop can be used to target advertising, and it might prove tempting in a period of slow revenue growth under a different management team.

Two striking historical parallels come to mind, one a masterstroke and the other a blunder, in both cases setting into motion events that could not be undone. In 49 BC, Julius Caesar crossed the Rubicon with his army, triggering a civil war where he triumphed over the forces of Pompey and became the master of Rome. And in 1812, Napoleon Bonaparte had Europe at his feet when he made the fateful decision to invade Russia, greatly weakening his power and leading ultimately to his defeat at Waterloo. It will be interesting to see whether Chrome ends up being Google's Rubicon or its Moscow. Alea iacta est.

September 07, 2008 in Advertising, Search | Permalink | Comments (18) | TrackBack (0)

Bridging the Gap between Relational Databases and MapReduce: Three New Approaches

Popularized by Google, the MapReduce paradigm has proven to be a powerful way to analyze large datasets by harnessing the power of commodity clusters. While it provides a straightforward computational model, the approach suffers from certain key limitations, as discussed in a prior post:

  • The restriction to a rigid data flow model (Map followed by Reduce). Sometimes you need other flows, e.g., map-reduce-map, union-map-reduce, or join-reduce.
  • Common data analysis operations, which are provided by database systems as primitives, need to be recoded by hand each time in Java or C/C++: e.g., join, filter, common aggregates, group by, union, distinct. 
  • The programmer has to hand-optimize the execution plan, for example by deciding how many map and reduce nodes are needed. For complex chained flows, this can become a nightmare. Databases provide query optimizers for this purpose -- the precise sequence of operations is decided by the optimizer rather than by a programmer.
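To make the rigid flow concrete, here is a minimal single-process Python sketch of the classic word count (the data and function names are illustrative): a fixed map phase, a shuffle that groups by key, and a reduce phase. In a real Hadoop job these would be distributed Java classes, and everything in this list -- the flow, the aggregates, the optimization -- would have to be hand-coded around them.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Emit (word, 1) for every word -- the classic word-count mapper.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key; in Hadoop the framework does this across the cluster.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum the counts for one word -- a "common aggregate" that SQL
    # would provide for free via SUM ... GROUP BY.
    return (key, sum(values))

docs = {1: "big data big ideas", 2: "big clusters"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

Note that the map-shuffle-reduce sequence is baked in: to chain a second map after the reduce, you would launch a whole new job.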

Three approaches have emerged to bridge the gap between relational databases and Map Reduce. Let's examine each approach in turn and then discuss their pros and cons.

The first approach is to create a new higher-level scripting language that uses Map and Reduce as primitive operations. Using such a scripting language, one can express operations that require multiple MapReduce steps, together with joins and other set-oriented data processing operations. This approach is exemplified by Pig Latin, being developed by a team at Yahoo. Pig Latin provides primitive operations that are commonly found in database systems, such as Group By, Join, Filter, Union, ForEach, and Distinct. Each Pig Latin operator can take a User Defined Function (UDF) as a parameter.

The programmer creates a script that chains these operators to achieve the desired effect. In effect, the programmer codes by hand the query execution plan that might have been generated by a SQL engine. The effect of a single MapReduce step can be simulated by a Filter step followed by a Group By step. In many common cases, we don't even need UDFs, if the filtering and grouping criteria are straightforward ones supported natively in Pig Latin. The Pig Latin engine translates each script into a sequence of jobs on a Hadoop cluster. The Pig Latin team reports that 25% of Hadoop jobs at Yahoo today originate as Pig Latin scripts. That's impressive adoption.
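As a rough illustration of that Filter-then-Group-By simulation (plain Python rather than actual Pig Latin syntax, with made-up data), a chain of set-oriented operators reproduces one MapReduce pass:

```python
from itertools import groupby

# Rows of (user, count), the kind of relation a Pig Latin script would load.
records = [("alice", 3), ("bob", 0), ("alice", 2), ("carol", 5), ("bob", 4)]

# FILTER: keep rows with a positive count (playing the role of the Map phase).
filtered = [r for r in records if r[1] > 0]

# GROUP BY user, then FOREACH group apply an aggregate (the Reduce phase).
filtered.sort(key=lambda r: r[0])  # groupby needs sorted input
totals = {user: sum(count for _, count in rows)
          for user, rows in groupby(filtered, key=lambda r: r[0])}

print(totals)  # {'alice': 5, 'bob': 4, 'carol': 5}
```

The point of the script-language approach is that longer chains -- joins, unions, multiple group-bys -- compose just as easily, whereas raw MapReduce would need one full job per step.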

Another interesting solution in this category is Sawzall, a scripting language developed at Google. Sawzall allows MapReduce operations to be coded in a language reminiscent of awk. If your computation fits the Sawzall model, the code is much shorter and more elegant than C/C++/Java Map and Reduce functions. Sawzall, however, suffers from two drawbacks: it limits the programmer to a predefined set of aggregations in the Reduce phase (although it supplies a big library of these), and it offers no support for data analysis that goes beyond a single MapReduce step, as Pig Latin does. Most important, Sawzall is not available outside of Google, while Pig Latin has been open-sourced by Yahoo.

The second approach is to integrate MapReduce with a SQL database. Two database companies have recently announced support for MapReduce: Greenplum and Aster Data. Interestingly, they have taken two very different approaches. I will call Greenplum's approach "loose coupling" and Aster Data's approach "tight coupling". Let's examine each in turn.

Greenplum's loose-coupling approach ties together Greenplum's database with Hadoop's implementation of MapReduce. A Hadoop MapReduce operation is visible as a database view within Greenplum's SQL interpreter. Conversely, Hadoop map and reduce functions can access data in the database by iterating over the results of database queries. Issuing a SQL query that uses a MapReduce view launches the corresponding MapReduce operation, whose results can then be processed by the rest of the SQL query.

Aster Data's tight-coupling approach is more interesting: the database natively supports MapReduce (with no need for Hadoop). Map and reduce functions can be written in a variety of programming languages (C/C++, Java, Python). Aster has extended the SQL language itself to control how these functions are invoked, creating a new SQL dialect called SQL/MR. One of the cool features is that map and reduce functions are automatically polymorphic, just like native SQL functions such as SUM and COUNT: the programmer writes them once, and the database engine can invoke them on rows with different numbers of columns and columns of different types. This is a huge convenience over the Hadoop approach.
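To illustrate just the polymorphism idea (this is a Python stand-in, not Aster's actual SQL/MR API; the function and data are made up), a row function written once can handle rows of any width and any column types:

```python
def row_map(row):
    # Emit one (position, value) output pair per non-null field,
    # whatever the row's shape -- no per-schema rewrite needed.
    return [(i, value) for i, value in enumerate(row) if value is not None]

# The same function applies unchanged to a 2-column row of strings and ints...
print(row_map(("a", 1)))                # [(0, 'a'), (1, 1)]
# ...and to a 4-column row mixing floats, strings, booleans, and NULLs.
print(row_map((None, 2.5, "x", True)))  # [(1, 2.5), (2, 'x'), (3, True)]
```

In Hadoop, by contrast, a change in the input schema typically means editing and recompiling the mapper.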

What are the pros and cons of these three different approaches? The advantage of the Pig Latin approach is that it works directly at the file level, and therefore it can express MapReduce computations that don't fit the relational data model. An example of such an operation is building an inverted index on a collection of text documents. Databases in general are bad at handling large text and image data, which are treated as "blobs."
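Here is a small Python sketch of that inverted-index computation (illustrative data), expressed in MapReduce style over raw text rather than relational tables:

```python
from collections import defaultdict

# Raw text documents -- exactly the kind of data that databases treat as "blobs".
docs = {"d1": "to be or not to be", "d2": "to do is to be"}

postings = defaultdict(set)
for doc_id, text in docs.items():   # map phase: emit (word, doc_id) pairs
    for word in text.split():
        postings[word].add(doc_id)  # shuffle + reduce: union of doc ids per word

index = {word: sorted(ids) for word, ids in postings.items()}
print(index["to"])   # ['d1', 'd2']
print(index["not"])  # ['d1']
```

A file-level tool like Pig Latin can run this directly over a document collection; a relational engine would first have to force the text into tables.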

The biggest disadvantage of the Pig Latin approach is the need to learn an entirely new programming language. There is a large group of developers and DBAs familiar with SQL, and Pig Latin does not have this support base. The second disadvantage is that the developer has to code query execution plans by hand, while SQL programmers can rely on two decades of work on query optimizers, which automatically decide the order of operations, the degree of parallelism, and when to use indexes.

The advantages and disadvantages of the SQL integration approach in general mirror those of the Pig Latin approach. The loose coupling approach of Greenplum allows the use of files as well as relations, and therefore in principle supports file-based computations. The burden is on the application programmer, however, to decide on the scheduling and optimization of the Hadoop portion of the computation, without much help from the database.

Aster's tight-coupling approach, on the other hand, allows a much greater degree of automatic query optimization. The database system is intimately involved in the way map and reduce operations are scheduled across the cluster, and can decide on the degree of parallelism, as well as use strategies such as pipelining across MapReduce and relational operators. In addition, since the database system is solely in charge of overall resource allocation and usage, it also ensures sandboxing of user-defined code, preventing it from consuming too many resources and slowing down other tasks. For computations that use only data in the relational database, Aster has by far the most elegant solution; the weakness, of course, is that data stored outside the database is off-limits.

Update: Tassos Argyros from Aster Data points out that Aster's implementation does in fact allow access to data stored outside the database. The developer needs to write a UDF that exposes the data to the database engine.

All three approaches thus have their strengths and weaknesses. It's exciting to see the emergence of fresh thinking on data analytics, going beyond the initial file-oriented MapReduce model. Over time, these approaches will evolve, borrowing learnings from one another. In time, one or more will become the dominant paradigm for data analytics; I will be watching this space with great interest.

Disclosure: I'm an investor in Aster Data and sit on their Board of Directors. 

September 05, 2008 in Data Mining | Permalink | Comments (14) | TrackBack (0)

About

  • Anand Rajaraman
  • Datawocky
