This blog post is authored by Chris Burges, Principal Research Manager at Microsoft Research, Redmond.

Hi, I’m Chris Burges. Over my last 14 years at Microsoft, and my previous 14 at Bell Labs, I’ve spent much of my time dabbling with machine learning (ML), with some of that time spent on solving industrial-strength problems. Since interest in ML, especially in industrial settings, has blossomed lately, it seems like a good time to think about the big picture of how ML works, both from a practical view and from an algorithmic one.

In 2004, Microsoft Research (MSR) and Microsoft’s Web Search team decided to see if we could jointly improve the relevance of our web search results. The system used at the time was called The Flying Dutchman. Over a period of several months we designed a system that not only improved relevance but also proved much easier to experiment with: whereas The Flying Dutchman took several days and a cluster to produce a model (to rank web search results), a simple neural net ranker called RankNet was able to produce a ranking model in about an hour using just one machine.

What unfolded over the next few years is a fascinating story of the interplay between science, research, algorithm design and product engineering. In this first post, I’m hoping to give you a feel for that interplay, and in later posts, I’ll explain how the basic algorithms used today actually work, assuming no prior knowledge of ML. We have already touched on one of the keystones of progress: the ability to do rapid experimentation. If you have what you think is a good idea, ideally you’d like experimental evidence, one way or the other, immediately. Thus, even if a new model initially does not perform quite as well as the one you already have, being much faster to train and test can make overall progress much faster, and this alone will often enable the new model to quickly surpass the current one in accuracy and speed.

Today, a family of models called Boosted Decision Trees (BDTs) is particularly popular. BDTs are flexible in that they can be used to solve different kinds of predictive tasks, for example:

  • Ranking, e.g. placing the most relevant web search results at the top of the list,

  • Classification, e.g. determining if a particular email is spam or not, and

  • Regression, e.g. predicting what price your house might sell for.

Flexibility is great, but how useful are BDTs, really? Logs collected on an ML service that is used internally within Microsoft show that, over the past year alone, there were over 670,000 training runs using BDTs throughout Microsoft. This number is inflated because a given experimenter will typically perform model selection (i.e. train multiple models, each with a different parameter setting, and use a held-out data set to pick the best model), but it gives the general picture. Is this preoccupation with BDTs a Microsoft proclivity, or do people elsewhere like them too? In 2010, Yahoo! organized a learning to rank challenge, one track of which was designed to see who had the best web search ranking algorithm. Over one thousand teams registered for the challenge. While it was gratifying that the Microsoft team won, the rankings were close, and for me the most interesting takeaway was that the top 5 systems all used ensembles of decision trees, and boosting, in one form or another (in fact, our system was an ensemble of BDTs and neural nets). So, if you’re thinking of training a fixed model to solve a predictive task, it’s worth considering BDTs.

Let’s use web search ranking as our canonical example to explore a typical research / product cycle. The hardest challenges of doing research are asking the right question, and getting good validation of the ideas (in addition to the time-honored method of publication, which as a validation test can be quite noisy). Working on real problems that matter to millions of people is a pretty good way of getting help with both of these challenges.

When you issue a query to Bing, we will effectively scan all the documents in our index. A large number of candidate documents are weeded out by applying some very fast filters (e.g. we may not even consider documents that have no words in common with your query). This reduces the set of candidate documents to a manageable size. For each such candidate document, we generate several thousand features that indicate how relevant that document might be for your query. For example, one feature might be “does the document title contain any words in the query?” or, at a higher level, “does the document refer to an entity mentioned in the query?” The task of the ranking model is to take this list of features and map it to a single score that encodes the relevance of that document for that query. Sorting by this score, in combination with the initial filtering process, allows us to rank all documents on the web by their relevance to your query. We used to measure the quality of the search results using a single metric called NDCG, or Normalized Discounted Cumulative Gain (we now use several metrics to try to gauge user satisfaction). The NDCG value for a given query depends on the entire ranked list and it takes values between 0 and 1, where 1 indicates the best ranking achievable on a special, labeled set of data (which we’ll call D).
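To make the NDCG description above concrete, here is a minimal sketch of one commonly used formulation. The post does not give the exact formula, so the gain (2^label - 1), the log2 position discount, and the integer labels and scores below are all assumptions for illustration, not the definition used inside Bing.

```python
import math

def dcg(labels_in_ranked_order):
    """Discounted cumulative gain: higher labels near the top count more.
    Gain and discount are a common textbook choice (an assumption here)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels_in_ranked_order))

def ndcg(scores, labels):
    """NDCG of the ranking obtained by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal = dcg(sorted(labels, reverse=True))  # best ordering achievable
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels for four candidate documents of one query.
labels = [3, 0, 2, 1]
ndcg_ideal = ndcg([0.9, 0.1, 0.7, 0.3], labels)     # scores agree with labels
ndcg_reversed = ndcg([0.1, 0.9, 0.3, 0.7], labels)  # scores invert the labels
print(ndcg_ideal, ndcg_reversed)
```

Dividing by the DCG of the ideal ordering is what keeps the value between 0 and 1, with 1 reached only when the model's scores reproduce the best ordering of the labeled documents.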

So, how did we get from RankNet to BDTs? RankNet, although a breakthrough at the time, is not well adapted to the task: in particular, it ignores the NDCG measure, and just tries to get the pairwise ordering of the documents correct. So if, for a given query, you had a pair of documents from D, one of which had been labeled a perfect match for the query and the other terrible, RankNet would spend just as much effort trying to place the perfect document above the terrible one as it would trying to place a good document above one that is not quite as good (I should add that these are not the actual labels we use!).

The problem in creating a model that directly optimizes for NDCG is that NDCG is ill-behaved, mathematically: if you think of each document as having a score (assigned by your ranking model), such that the ranking is obtained by ordering the documents by their score, then the NDCG changes discontinuously as those scores change continuously. To address this problem we used the fact that, when you train a neural net, you don’t have to provide actual values of the function you’re optimizing, just the gradients (values that indicate how that function would change as the neural net’s output score changes). For the ranking task, you can think of these values as little arrows or forces, pulling each document up or down in the ranked list. We can model these little forces between a pair of documents as the change in NDCG you’d get by swapping the two documents (for the set D), then add up all the forces for each document for a given query, and then use these as gradients to train the neural net.

Thus was born LambdaRank, which, while still a neural net model, gave better relevance performance than RankNet. Later we extended this idea to boosted tree models with an algorithm called LambdaMART, to leverage some of the advantages that BDTs offer over neural nets, two of which are:

  1. The ability to more naturally handle features whose ranges vary hugely from one feature to another, and

  2. Faster training, and hence faster experimentation turnaround time. 
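The “little forces” idea described earlier can be sketched in a few lines. This is only an illustration under stated assumptions, not the production LambdaRank code: the NDCG formula, the function names, and the toy scores and labels are hypothetical, and the force between a pair is taken to be just the size of the NDCG change from swapping the two documents, as the text describes. Published formulations of LambdaRank additionally scale each pairwise term by a RankNet-style factor that depends on the score difference, which is omitted here for brevity.

```python
import math

def dcg(labels_in_ranked_order):
    # Common gain/discount choice (an assumption; the post gives no formula).
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels_in_ranked_order))

def ndcg_of_order(order, labels):
    """NDCG when documents appear in the given order (list of doc indices)."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg([labels[i] for i in order]) / ideal if ideal > 0 else 0.0

def lambda_forces(scores, labels):
    """Sum, for each document, the pairwise forces for one query.
    A positive total pulls the document up the list, a negative one down."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    position = {doc: p for p, doc in enumerate(order)}
    base = ndcg_of_order(order, labels)
    forces = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where doc i is labeled more relevant
            swapped = order[:]
            pi, pj = position[i], position[j]
            swapped[pi], swapped[pj] = swapped[pj], swapped[pi]
            delta = abs(ndcg_of_order(swapped, labels) - base)
            forces[i] += delta  # more relevant document is pulled up
            forces[j] -= delta  # less relevant document is pushed down
    return forces

# Toy query: the model currently ranks the worst document (index 1) first.
scores = [0.2, 0.9, 0.5]
labels = [2, 0, 1]
forces = lambda_forces(scores, labels)
print(forces)
```

In training, these per-document totals would play the role of the gradients with respect to each document's score, so the model never needs a smooth version of NDCG itself. Note that a pair with a large label gap near the top of the list produces a larger force than a pair of similar documents further down, which is exactly the distinction RankNet's pairwise objective ignores.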

Subsequently, a team led by Ofer Dekel showed how to engineer BDTs so that training became approximately two orders of magnitude faster than for the neural nets, and so that they could handle much larger datasets.

That, in a nutshell, is how we came to love BDTs. The overall process was a cycle of engineering and product needs driving the research, and the research opening new opportunities for the product. For two of the three steps (RankNet and BDTs), the main contribution was the ability to do faster experimentation with more data. Although I’ve focused here on the ranking story, it should be noted that there is a great deal more that goes into the quality and engineering of Bing than just the ranking algorithms, which are a small but vital part. In my next post, we’ll take a look at how BDTs actually work.

Chris Burges
Learn about my research.