Alon Halevy and Jayant Madhavan have a post on Google's Webmaster blog disclosing that Google is now harvesting data that is hidden behind HTML forms. I'm very pleased about this, because it's something that I had a hand in making happen.
I've known Alon since 1995, when he was a researcher and I was his summer intern at Bell Labs in New Jersey (those were the days when Bell Labs was still relevant to Computer Science research). Until that summer, Alon and I had been working independently of each other on the topic of integrating information from across different data sources (such as websites) -- Alon at Bell Labs and I at Stanford, for my PhD dissertation. We put together our ideas and created the Information Manifold, an information integration system that introduced a key idea: you could describe each data source (declaratively) by a local data schema; map that schema to a global schema that unified all the sources; and process queries across the global schema.
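To make the local-schema/global-schema idea concrete, here is a minimal Python sketch. It is not the Information Manifold's actual algorithm, which used much richer declarative source descriptions; it reduces each description to a simple field renaming. The store names, field names, and the `Source` and `query` helpers are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Source:
    """A data source described declaratively (hypothetical helper)."""
    name: str
    mapping: Dict[str, str]              # local field name -> global field name
    fetch: Callable[[], Iterable[Row]]   # returns rows in the source's local schema

    def rows_in_global_schema(self) -> List[Row]:
        # Rename each local field to its global counterpart; drop unmapped fields.
        return [
            {self.mapping[k]: v for k, v in row.items() if k in self.mapping}
            for row in self.fetch()
        ]


def query(sources: Iterable[Source], **conditions: object) -> List[Row]:
    """Answer an equality query posed against the global schema."""
    results = []
    for source in sources:
        for row in source.rows_in_global_schema():
            if all(row.get(field) == value for field, value in conditions.items()):
                results.append({**row, "source": source.name})
    return results


# Two made-up bookstores that expose the same kind of data under different schemas.
SOURCES = [
    Source(
        name="store_a",
        mapping={"book_title": "title", "cost": "price"},
        fetch=lambda: [
            {"book_title": "Dune", "cost": 9.99},
            {"book_title": "Emma", "cost": 5.00},
        ],
    ),
    Source(
        name="store_b",
        mapping={"name": "title", "price_usd": "price"},
        fetch=lambda: [{"name": "Dune", "price_usd": 8.50}],
    ),
]

matches = query(SOURCES, title="Dune")
```

The caller only ever sees the global fields (`title`, `price`); adding a third source means writing one more mapping, not changing the query.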
This task is complicated by the fact that each data source exposes different query processing capabilities, by virtue of the different HTML forms it uses. We came up with some simple solutions to this problem, but several difficulties remained. In the meantime, Alon and I went our separate ways with our careers; Alon joined the CS faculty at the University of Washington, while I started a company, Junglee, that applied information integration ideas to build the first comparison shopping engine and online job bank. Junglee was acquired by Amazon.com in 1998; in 2000, I started a Venture Capital firm called Cambrian Ventures and in late 2004 a startup, Kosmix.
Alon and I kept in touch through all this, and by 2004, he and his student Jayant Madhavan had made great progress on "schema matching", a research area that has applications to the HTML forms problem we had first encountered in the Information Manifold. I was very happy to egg them on to commercialize their breakthrough, and to provide the funding for the resulting company, Transformic, through Cambrian Ventures.
Between 1995 and 2005, Web search had become the dominant mechanism for finding information. Search engines, however, had a blind spot: the data behind HTML forms. Called the Invisible Web, this data is often estimated to be even larger in size and usefulness than the "visible web" that web crawlers usually index. The key problems in indexing the Invisible Web are:
1. Determining which web forms are worth penetrating.
2. If we decide to crawl behind a form, how do we fill in values in the form to get at the data behind it? In the case of fields with checkboxes, radio buttons, and drop-down menus, the solution is fairly straightforward. In the case of free-text inputs, the problem is quite challenging -- we need to understand the semantics of the input box to guess possible valid inputs.
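For the "fairly straightforward" case of inputs with a fixed set of choices, one plausible approach is simply to enumerate combinations of the listed values. The sketch below is my own illustration and not a description of how Transformic or Google does it: it assumes the form submits via GET, that its choices have already been extracted into a dictionary, and that a cap on the number of URLs is wanted. The `candidate_urls` function, the example.com URL, and the field names are hypothetical. Free-text inputs are deliberately left out, since they are the hard part.

```python
from itertools import product
from typing import Dict, List
from urllib.parse import urlencode


def candidate_urls(action: str, choices: Dict[str, List[str]], limit: int = 100) -> List[str]:
    """Build GET URLs for combinations of a form's fixed-choice inputs.

    `choices` maps each input name (drop-down, radio group, checkbox) to the
    values listed in the form. At most `limit` URLs are returned, because the
    number of combinations grows multiplicatively with each added input.
    """
    names = sorted(choices)  # fixed order so the output is deterministic
    urls: List[str] = []
    for combo in product(*(choices[name] for name in names)):
        if len(urls) >= limit:
            break
        urls.append(action + "?" + urlencode(dict(zip(names, combo))))
    return urls


# A made-up used-car search form with two drop-down menus.
urls = candidate_urls(
    "https://example.com/search",
    {"make": ["honda", "toyota"], "condition": ["new", "used"]},
)
```

Two menus with two values each already yield four URLs, which is why deciding which forms are worth crawling, and how deeply, matters as much as filling them in.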
Transformic's technology addressed both problems (1) and (2). It was always clear to us that Google would be a great home for Transformic, and in 2005 Google acquired Transformic -- a nice return for Cambrian, but also a great place for the Transformic team to make a real difference with their ideas. The Transformic team has been working hard for the past two years perfecting the technology and integrating it into the Google crawler. I'm very happy to have played a small role in their success story.
An aside on the world of academic research. Alon and I had some difficulty getting our Information Manifold research published: it was rejected the first time we submitted it to a leading academic conference; we had to address a lot of criticism and skepticism from the establishment, and it was finally published at the VLDB conference in 1996. Remarkably, this paper has since become one of the most cited and influential papers in its field (see this survey, page 52). In 2006, the paper received the 10-year Best Paper Award at VLDB, given retrospectively to the publication from 10 years ago that made the most impact. The moral of the story is, sometimes it pays to swim against the current in academic research; what is fashionable today is rarely what will ultimately make the most impact.