« Angel, VC, or Bootstrap? | Main | Change the algorithm, not the dataset »

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83471bc3153ef00e5534b7f958833

Listed below are links to weblogs that reference How Google Measures Search Quality:

Comments

Feed You can follow this conversation by subscribing to the comment feed for this post.

Ethan Herdrick

Is that last vignette (about the Yahoo and Google results) verifiable? It shocks me.

Anand Rajaraman

Ethan: Peter Norvig told me so, and I trust him, so I assume it's true. I haven't personally verified it.

Dan Lewis

This quote from Sutton and Barto comes to mind:

"In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game. The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved."

http://www.cs.ualberta.ca/~sutton/book/ebook/node29.html

There's a paradoxical quality to not being able to use user clicks to train the ranker. On the surface, it seems like you'd be training toward the real goal by only using user clicks for feedback on the results page. But it turns out it's another case of finding a low-value local maximum.

Another problem is that the reward signal for clicking results, or a query/click history, is ambiguous. There are some easy cases: for instance, I do a query, click the first result and stop searching for a while. My result was probably pretty good. I do a query, don't click anything, then do another query on related keywords. Odds are, I didn't like the results of the first query.

But it can get difficult to interpret these results in a hurry. I do a query, click the first result, then click the second result, then click the third result... I was fooled by the snippets into making some clicks on inadequate pages maybe? I'm really interested in the subject and I want to see multiple pages? And so on.

Worse, there's no unambiguous way to say "these results stink" on the results page, or to spread around the negative reward if they do stink. (Except for the "Dissatisfied" report at the bottom of the page, which takes a pretty committed crazy person to write out consistently, and doesn't scale.)

If Google wants to get a lot of free booing data, they should implement this: if you think a result is unhelpful, click the thumbs down next to the result. You could restrict it to people who are signed in with Personalized Search History, if you wanted to prevent griefing.

Nisha Pillai

Clearly, there is no such thing as an objective ranking of search quality in existence today. I would expect that using panels of human raters allows for capturing some subjectivity in that ranking. And precisely because of that, the ranking data would be noisy enough that it would be difficult to directly use as training data for a ML model to emulate. If this is indeed the prevailing thinking, one can see why ML results would be considered a good input into the hand-tweaked scoring system, but not a reliable replacement.

@Dan -- subjectivity in rating search results is the best reason I can think of not to use a thumbs up/down model. That model works for a subjective "like/dislike" rating a la Digg, but may not quite work when the goal is objective ranking. What additional information would this provide beyond what can be inferred from analyzing click data?

Dan Lewis

I'd say that Google's current page can reward for extraordinary success but can't reward for extraordinary failure (again, without filling out the Dissatisfied report. I would be curious to know how many of those they get).

For instance, in the current model, you do a query and don't click on anything in the first page. That's more passive (and hard to interpret) than seeing ten results worth of SEO spam and clicking "Report Spam". It's more passive than searching for, say, boating equipment, and voting down a bunch of results about the structure of English tea parties. I also think you get a better story on clicking on a result, coming back to the results page, and then downvoting it. You capture a subtlety that someone thought this might be a good result for them, clicked it, and it turned out it wasn't. That's a human being vetting the actual content of a result in your search engine, and it's all automatic.

The beauty is, unless you think the ultimate goal is to teach Google's search engine "the truth", however you define it (and bootstrap Skynet, but that is another story...), it doesn't totally matter why people are downvoting something they don't want to see in their results. They just are, and it's another signal you can use to train the search engine to show people what they want to see and not what they don't.

Also, let me point out that if you don't believe in a certain amount of subjectivity in results (balanced by the sanity check of a certain amount of objectivity), you don't believe in PageRank either. But, contrary to this intuitive distrust of subjectivity, it works! The ad hoc connections and opinions and judgments of the internet imply a fruitful, orderly structure.

I haven't said much about upvoting, either, but I think you could use that too.

The larger point is that Google could be enlisting its customers to give it free feedback on their algorithm, just another important signal to roll in and use with their other couple hundred signals.

Daniel Tunkelang

Very nice post! I just blogged about it at The Noisy Channel: http://thenoisychannel.blogspot.com/

I am surprised that they are taking such a retro Cranfield approach. Granted, relying naively on clickthrough data as implicit user relevance assessments can lead to positive feedback loops, but I'd think that Google is in a position to perform controlled experiments on its vast audience of users.

But I guess they're happy enough with the results.

Sumit

Anand,

What about using tonnes of user endorsed web-links that Google has via Gmail, Notebook, Bookmarks. The comments / conversations / tags around those weblinks can be used to glean the keywords or atleast validate the offline indexing while keeping privacy issues at bay.

I would think that a good ML model working on such data would be way better than professional raters.

PJ Brunet

Back-button doesn't indicate anything. Like if I'm looking for comments about some Wordpress function and I skip over a few sites that doesn't mean those websites weren't informative or didn't have the answer, it just means I'm browsing the web which is the whole point of the Internet.

balajis

> "Informational queries, where the user has a broader topic. e.g., "diabetes pregnancy". In this case, there is no single right answer. Suppose there's a really fantastic result on page 4, that provides better information any of the results on the first three pages. Most users will not even know this result exists! Therefore, their usage behavior does not actually provide the best feedback on the rankings."

It seems to me that an obvious approach is to slightly "degrade" service via infrequent randomization in the short run in return for information that will improve search results in the long run.

To be precise:

One observed dependent variable is click rank (i.e. URLs ranked by number of clicks), which is a function of both display rank (an observed independent variable) and link quality (an unobserved variable). There are lots of other variables (query refinements, etc.) but consider this simple model for now.

By plotting click rank vs. display rank for each query, what you are looking for are URLs which significantly deviate from the regression line. Those above the line have more clicks than their display rank would predict, indicating that they are high quality and should be promoted. Those below the line have fewer clicks and should be demoted.

Now, the technical difficulty is disentangling the extremely strong effect of display rank on click rank (the "don't know what you missed" phenomenon).

To account for this, you can introduce some controlled randomness into query ranking. The idea is that for the *same query*, you occasionally shift results from pages 2-4 into the front page and vice versa[1]. Some kind of Boltzmann-ish scheme with an exponentially decreasing probability of displacing highly ranked results would likely work[2].

Suppose further that you have also already pre-clustered users by search history/email/reader history/etc.

You can now ask the question: how does the "same" user (= groups of related users) respond to different search ranking schemes for the same query? One way to visualize this would be to imagine 10 different randomized URL lists for the same query. You collect data from a defined user group on this query. Now you make a plot with three axes: click rank, display rank, and user group.

You are now varying display ranks for the same URL and measuring click performance on similar users. A high quality URL will consistently punch above its weight for a given display rank, and you can promote it (and conversely for a low quality URL).

The main disadvantage of such a method is that it would be best applied to queries attempted by at least 1000-10000 different users. It's much harder to do statistics on infrequent queries, but there are a lot more tricks that you can try there (e.g. aggregating rare queries or using priors derived empirically from more frequent queries).


[1] Most people will never notice this as people expect query results to change from time to time, but you can implement something which freezes the randomization if the same person attempts the same query within a few hours or days (no caching needed, just make the random number seed a function of userid and time rounded into -- say -- two day chunks).

[2] You'd likely have to modify this to take into account link similarity (i.e. Goog uses diversity based selection on their front page, where a query for Usher returns Usher syndrome, House of Usher, Usher the artist, etc.). That is, two similar URLs would be more likely to swap than two dissimilar URLs.

Ted Dunning


Analyzing click-through versus search rank is much harder than it looks. The problem with simple minded approaches is that a single document can appear in results for many queries and can (will) appear at different ranks for the same query. Also, the clicking behaviors of different communities users are highly divergent so you have trouble even describing the overall relation of rank to click-through.

The best method I know of for doing this analysis is to build a specialized generalized regression that fits parameters of a mixture model to explain the click-through. You wind up having to do EM-like iterations where you fit models for the queries regarding the users who issued that query and the amount and kind of content available. Then you fit models for the documents based on their click-through for different queries. Then you rinse and repeat.

This process is not easy to get right.

Ted Dunning


It should be pointed out that randomization of query results is exactly analogous to the random step in the classical page rank algorithm. The purpose is pretty much the same as well ... you need to smooth the graph structure out a bit in order to improve mixing between multiple modes.

I can personally attest to the virtues of randomization from the standpoint of data collection, but it also has a secondary effect. Far from being disturbed by variation in recommendations or search results, users actually pay *more* attention to results that vary over time. Or rather, users pay much less attention to completely static results. Randomization on results pages that have high repeat viewing can paradoxically increase click through if only because it is a way to get users to actually look at some of the good results that are on the 4th page.

james touso

This process is not easy to get right..

Flea Circus Research Library

> we have all been trained to trust Google and click on the first result no matter what.

I don't do this, I must have been away the day that training course was run. Looking at yesterday's searches some of them I have used the first hit but nearly all of them I've looked at the URL and/or summary. Several of them (approx 20%) the links I clicked were 2 or 3 down the list. This is typical of my day to day usage.

When I've been searching for Flea Circus material I've often been looking at the tail end of the queries and the hit I needed was perhaps on page 40 or 50.

I'm not aware of the either the percentage of users who do blindly select option 1 or the percentage of users required to be about to use the word "all" (presuming it's less than 100%). But I presume you meant "most" or "nearly all" rather than 100% of users.

Do more users select the top item than use the "I feel lucky" option?

One more observation, there's more than one person using Google hence there's more than one version of "The Truth". If you use the Google Webmaster tools then you will see that for the same keywords the different versions (countries) of google return different results so a German typing in "funk" might get a list of radio shops where a Brit would get a list of podcasts and CD shops. Even allowing for this facility there's got the be different versions of "truth" dependant on the people entering the queries.

Nirose

Thanks

Gordon Rios

Actually, there's been a lot of good research on accounting for rank bias in click data -- for example, see this 2008 paper from MSR:

http://research.microsoft.com/~onnoz/craswell_wsdm08.pdf

and there's a video as well:

http://videolectures.net/wsdm08_craswell_eccp

The "hidden metric" issue (e.g. "optimize a metric hidden from my adversary") is a serious problem at the big search engines but luckily there's market share to observe. Search engines like Google that significantly reduce the uncertainty of finding the right result (navigational) or a set of useful results (informational) attract and hold users more than other engines; over time that leads to increased market share.

Metrics that relate to market share such as "searches per user" and "return rates" are useful benchmarks. Of course, some people believe that search engine market share has a significant dependency on "marketing" but that's probably just wishful thinking :)

The comments to this entry are closed.