• In Conversation with John Platt

    You’ve heard from John Platt a few times on this site, including his introductory post on What is Machine Learning? and this piece where he talks about Twenty Years of Machine Learning at Microsoft. Larry Larsen from Channel 9 recently got the opportunity to have an in-depth conversation with him. See what John has to say about AI, Machine Learning and his focus on Deep Neural Networks.

    You can catch the full interview with John here, and here's a short clip of him demystifying Deep Learning. The full article is posted to the “Inside Microsoft Research” blog here.

    ML Blog Team

  • Machine Learning, meet Computer Vision

    This is part 1 of a 2 part series, co-authored by Jamie Shotton, Antonio Criminisi and Sebastian Nowozin of Microsoft Research, Cambridge, UK. The second part was later posted here.

    Computer vision, the field of building computer algorithms to automatically understand the contents of images, grew out of AI and cognitive neuroscience around the 1960s. “Solving” vision was famously set as a summer project at MIT in 1966, but it quickly became apparent that it might take a little longer! The general image understanding task remains elusive 50 years later, but the field is thriving. Dramatic progress has been made, and vision algorithms have started to reach a broad audience, with particular commercial successes including interactive segmentation (available as the “Remove Background” feature in Microsoft Office), image search, face detection and alignment, and human motion capture for Kinect. Almost certainly the main reason for this recent surge of progress has been the rapid uptake of machine learning (ML) over the last 15 or 20 years.

    This first post in a two-part series will explore some of the challenges of computer vision and touch on the powerful ML technique of decision forests for pixel-wise classification.

    Image Classification

    Imagine trying to answer the following image classification question: “Is there a car present in this image?” To a computer, an image is just a grid of red, green and blue pixels, where each color channel is typically represented by a number between 0 and 255. These numbers will change radically depending not only on whether the object is present or not, but also on nuisance factors such as camera viewpoint, lighting conditions, the background, and object pose. Furthermore, one has to deal with changes in appearance within the category of cars. For example, the car could be a station wagon, a pickup, or a coupe, and each of these will result in a very different grid of pixels.
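    As a toy sketch of this pixel-grid view (an illustration we've added, not from the original post), here is how an image looks to a program, and how a nuisance factor such as lighting changes every number while the scene stays the same:

```python
import numpy as np

# A tiny 2x2 RGB "image": one value (0-255) per color channel per pixel.
image = np.array(
    [[[255, 0, 0], [0, 255, 0]],        # red pixel, green pixel
     [[0, 0, 255], [128, 128, 128]]],   # blue pixel, gray pixel
    dtype=np.uint8,
)

print(image.shape)   # (height, width, channels) -> (2, 2, 3)

# A nuisance factor such as dimmer lighting changes every number,
# even though the content of the scene is unchanged.
darker = (image * 0.5).astype(np.uint8)
print(darker[0, 0])  # the "red" pixel is now [127, 0, 0]
```

    A classifier that keys on raw pixel values alone would see `image` and `darker` as very different inputs, which is exactly why learned invariance to such factors matters.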

    Supervised ML thankfully offers an alternative to naively attempting to hand-code for these myriad possibilities. By collecting a training dataset of images and hand-labelling each image appropriately, we can use our favorite ML algorithm to work out which patterns of pixels are relevant to our recognition task, and which are nuisance factors. We hope to learn to generalize to new, previously unseen test examples of the objects we care about, while learning invariance to the nuisance factors.  Considerable progress has been made, both in the development of new learning algorithms for vision, and in dataset collection and labeling.

    Decision forests for pixel-wise classification

    Images contain detail at many levels. As mentioned earlier, we can ask a question of the whole image such as whether a particular object category (e.g. a car) is present. But we could instead try to solve a somewhat harder problem that has become known as “semantic image segmentation”: delineating all the objects in the scene. Here’s an example segmentation on a street scene:

    In photographs you could imagine this being used to help selectively edit your photos, or even synthesize entirely new photographs; we’ll see a few more applications in just a minute.

    Solving semantic segmentation can be approached in many ways, but one powerful building block is pixel-wise classification: training a classifier to predict a distribution over object categories (e.g. car, road, tree, wall, etc.) at every pixel. This task poses some computational problems for ML. In particular, images contain a large number of pixels (e.g. the Nokia 1020 smartphone captures 41 million pixels per image). This means that we potentially have multiple-million-times more training and test examples than we had in the whole-image classification task.

    The scale of this problem led us to investigate one particularly efficient classification model, decision forests (also known as random forests or randomized decision forests). A decision forest is a collection of separately-trained decision trees:

    Each tree has a root node, multiple internal “split” nodes, and multiple terminal “leaf” nodes. Test time classification starts at the root node, and computes some binary “split function” of the data, which could be as simple as “is this pixel redder than one of its neighbors?” Depending on that binary decision, it will branch either left or right, look up the next split function, and repeat. When a leaf node is finally reached, a stored prediction – typically a histogram over the category labels – is output. (Also see Chris Burges’ excellent recent post on boosted variants of decision trees for search ranking.)
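    The traversal just described can be sketched in a few lines of Python (a toy illustration with made-up splits and histograms, not the actual implementation used in the work described here):

```python
# A toy decision tree for pixel-wise classification. Internal nodes store a
# binary split function; leaf nodes store a histogram over category labels.

def make_split(feature_index, threshold):
    """Binary split function: branch on whether one feature exceeds a threshold."""
    return lambda x: x[feature_index] > threshold

tree = {
    "split": make_split(0, 0.5),
    "left": {"histogram": {"road": 0.9, "car": 0.1}},
    "right": {
        "split": make_split(1, 0.2),
        "left": {"histogram": {"car": 0.7, "tree": 0.3}},
        "right": {"histogram": {"tree": 0.8, "car": 0.2}},
    },
}

def classify(node, x):
    # Walk a single root-to-leaf path: only the splits on that path are
    # evaluated, then the stored histogram at the leaf is returned.
    while "split" in node:
        node = node["right"] if node["split"](x) else node["left"]
    return node["histogram"]

print(classify(tree, [0.9, 0.1]))  # -> {'car': 0.7, 'tree': 0.3}
```

    A forest simply averages the leaf histograms of several independently-trained trees like this one.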

    The beauty of decision trees lies in their test-time efficiency: while there can be exponentially many possible paths from the root to the leaves, any individual test pixel passes down just one path. Furthermore, the split function’s computation is conditional on what has come before: the classifier hopefully asks just the right question depending on what the answers to the previous questions have been. This is exactly the same trick as in the game of “twenty questions”: while you’re only allowed to ask a small number of questions, you can quickly home in on the right answer by adapting each question to the previous answers.

    Armed with this technique, we’ve had considerable success in tackling such diverse problems as semantic segmentation in photographs, segmentation of street scenes, segmentation of the human anatomy in 3D medical scans, camera relocalization, and segmenting the parts of the body in Kinect depth images. For Kinect, the test-time efficiency of decision forests was crucial: we had an incredibly tight computational budget, but the conditional computation, paired with the ability to parallelize across pixels on the Xbox GPU, meant we were able to fit within it [1].

    In the second part of this series, we’ll discuss the recent excitement around “deep learning” for image classification, and gaze into the crystal ball to see what might come next. In the meantime, if you wish to get started with ML in the cloud, do visit the Machine Learning Center. 

    Thanks for tuning in.

    Jamie, Antonio and Sebastian

    [1] Body part classification was only one stage in the full skeletal tracking pipeline put together by this fantastic team of engineers in Xbox.

  • Vowpal Wabbit for Fast Learning

    This blog post is authored by John Langford, Principal Researcher at Microsoft Research, New York City.

    Vowpal Wabbit is an open source machine learning (ML) system sponsored by Microsoft. VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease. Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms. 

    The name has three references – the vorpal blade of Jabberwocky, the rabbit of Monty Python, and Elmer Fudd, who hunted the wascally wabbit throughout my childhood.

    VW sees use inside of Microsoft for ad relevance and other natural-language related tasks. Its external footprint is quite large with known applications across a broad spectrum of companies including Amazon, American Express, AOL, Baidu, eHarmony, Facebook, FTI Consulting, GraphLab, IBM, Twitter, Yahoo! and Yandex.

    Why? Several tricks started with or are unique to VW:

    • VW supports online learning and optimization by default. Online learning is an old approach which is becoming much more common. Various alterations to standard stochastic gradient descent make the default rule more robust across many datasets, and progressive validation allows debugging learning applications in sub-linear time.

    • VW does Feature Hashing which allows learning from dramatically rawer representations, reducing the need to preprocess data, speeding execution, and sometimes even improving accuracy.

    • The conjunction of online learning and feature hashing implies the ability to learn from any amount of information via network streaming. This makes the system a reliable baseline tool.

    • VW has also been parallelized to be the most scalable public ML algorithm, as measured by the quantity of data effectively learned from; more information here.

    • VW has a reduction stack which allows the basic core learning algorithm to address many advanced problem types, such as cost-sensitive multiclass classification. Some of these advanced problem types, such as those for interactive learning, exist only in VW.

    There is more, of course, but the above gives you the general idea – VW has several advanced designs and technologies which make it particularly compelling for some applications. 
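    To make the feature hashing idea from the list above concrete, here is a minimal Python sketch (illustrative only; it does not reproduce VW’s actual hash function or internals): raw tokens are mapped straight into a fixed number of buckets, so no vocabulary dictionary ever needs to be built or stored.

```python
import hashlib

NUM_BUCKETS = 2 ** 18  # model size is fixed, independent of vocabulary size

def hashed_features(tokens, num_buckets=NUM_BUCKETS):
    """Map raw string tokens to a sparse dict of bucket index -> count."""
    x = {}
    for tok in tokens:
        # Hash the raw token directly into a bucket; collisions are tolerated.
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) % num_buckets
        x[h] = x.get(h, 0) + 1
    return x

# Raw text goes straight in -- no preprocessing or vocabulary pass needed.
x = hashed_features("the quick brown fox".split())
print(len(x))  # at most 4 buckets receive weight
```

    A linear learner then simply keeps one weight per bucket, which is what lets raw, unpreprocessed representations be consumed at speed.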

    In terms of deployment, VW runs as a library or a standalone daemon, but Microsoft Azure ML creates the possibility of cloud deployment. Imagine operationalizing a learned model for traffic from all over the world in seconds.  Azure ML presently exposes the feature hashing capability inside of VW via a module of the same name.

    What of the future? We hunt the Wascally Wabbit. Technology-wise, I intend to use VW to experiment with other advanced ML ideas: 

    Good answers to these questions can change the scope of future ML applications wadically.

    John
    Follow my personal blog here.

  • Exploration: Data Science... with Mario Garzia

    This blog post is authored by Mario Garzia, Partner Data Sciences Architect, Technology and Research

     Data Science and “big data” have become 21st century buzzwords in the tech industry. Yet in many ways the term “Big Data” is relative to our ability to collect, store and process data. Big data challenges are not new; historically, there have been several notable encounters with big data. One interesting example is the US census. The 1880 census took eight years to tabulate, and at the time estimates were that the 1890 census would take more than 10 years to complete as a result of population expansion. This was a Big Data problem at the time, and a man by the name of Herman Hollerith came to the rescue with an invention that tabulated the 1890 census in a single year and under budget; his company eventually went on to become IBM. Hollerith accomplished this by developing a new and efficient way to collect and store the increasing volumes of data (punch cards) and an electric tabulating machine that could read the punch cards and compute the needed results. There are other similarly interesting big data challenges that occurred both before and after Dr. Hollerith’s time. So is today’s Big Data challenge any different from past ones?

     Data volumes are growing at a rate that continues to challenge our ability to collect, store and process data, leading to the development of new technologies. But now the variety of data and the speed at which we collect it are also accelerating, and these trends have no visible end in sight. In a 2011 report, Ericsson estimated there would be 50 billion connected devices worldwide by 2020, each generating its own data in addition to the data exhaust generated by the systems that manage the collection and processing of the device data. Another big difference today, one that presents tremendous opportunity, is the ability to collect data directly from each of our end customers to learn about their experience with a device or service to a degree never before possible. This allows us to imagine entirely new ways to help and delight customers with products and services that were previously unimaginable, ones that better understand what they need now and predict what they might need next. To date, high tech companies have been the leaders in this data space, where in some cases the data itself is the product, as with Bing Search or social networks, but a great aspect of today’s world is that technology is facilitating the democratization of data and analytics to derive insights across the full spectrum of human endeavor. So not only Big Data leaders but also more traditional businesses and other institutions can now leverage big data to improve their services and delight their customers. We live at a fascinating point in time where the once unimaginable is becoming possible, guided by data and analytics.

    Microsoft has a very rich tradition of using data to gain insights and drive product decisions going back many years, long before Data Sciences and Big Data became terms de rigueur. I joined Microsoft in 1997 and have seen first-hand how we have evolved and grown in the data space. One of the things that I have loved most about working here is the ability to surround myself with and learn from very talented and passionate people. This is a culture where learning, gaining new understandings and striving to be the best are very much engrained. Because of this, data has always played an important role at Microsoft but that role has evolved and expanded over the past decade. We have grown from focusing on having a deep understanding of the product being shipped to also developing a deep understanding of customer experiences with our products and services.

    In 2000 I came to the Windows team to form the Reliability group. Right from the start, Windows Reliability was a data driven effort. For example, by the time we shipped Windows Server 2000, we already had approximately 100 years of reliability runtime data on internal Microsoft production servers. After Windows Server 2000 shipped, we expanded data collection to other enterprises by developing a Reliability service for which companies could sign up free of charge, use it to collect reliability data from their datacenter servers, and upload the data to Microsoft. This data would then be automatically analyzed and the results made available to each company individually in a website containing availability and reliability results and trends, segmented by server type and computing environment. In many cases this was the first time these companies had access to such detailed data on the reliability of their data centers. This data could then also be leveraged by Windows to gain insights into operating system (OS) reliability and failure modes, set release criteria for new versions of the OS, and prioritize and drive product fixes based on failure frequency and gravity. We also used the insights from this data to develop new OS features like diagnostic services. This data driven approach allowed us to make decisions about when the product was ready to ship based on actual production system runtime criteria. While deep and comprehensive, this data was focused on product quality and ship readiness. Today the Windows operating system, and indeed all our products and services, are focused not only on product quality attributes but also on better understanding customer needs. There is a renewed and expanded emphasis on building a data driven culture at the company, where service and product quality remain critical but just as critical is a deep understanding of customer satisfaction, engagement and wants. Insights derived from data are used across all Microsoft products and services to deliver new, powerful features and capabilities.

    Being a data driven culture means that understanding the product and customer data is not just for Data Scientists: everyone at Microsoft needs to be data aware and data driven. Big data is used for product and service experimentation and improvement, and also to deliver enhanced and customized services leveraging techniques such as Machine Learning. Bing and Bing Ads are completely data driven. There is also a very deep heritage in Machine Learning at Microsoft over the past 20 years, from its beginnings with Bayesian Networks and speech recognition research to products such as SQL Server Data Mining. We now give companies the ability to build machine learning models and easily deploy them to the cloud with Microsoft Azure ML.

    An exciting aspect of being a data scientist at Microsoft is the unparalleled breadth of customer touch points we have, from computers and tablets to phones, devices, gaming, Search and a myriad of services, allowing us to better understand customer wants and experiences and use those insights to impact their everyday lives in new and meaningful ways. The Data Sciences disciplines are at the core of our data driven corporate strategy. At Microsoft we recognize this and have a full engineering career path for Data Scientists, Machine Learning Scientists and Applied Scientists that can reach the most senior levels in the company. We have multiple data scientist groups throughout the company, resulting in a very vibrant and growing community. I believe there is no better place than Microsoft for a Data Scientist to learn, grow, have fun and make an impact.

    An important event that many Microsoft Data Scientists attend each year is the Knowledge Discovery and Data Mining (KDD) conference that takes place in August, this year in New York City. It is a premier conference for data science. I am very much looking forward to attending this year’s KDD conference; I have been attending for many years now. It is great to share in the energy and excitement, exchange ideas with colleagues and meet new people. I always come out of the conference totally charged by the new ideas and people I’ve met. Microsoft is a Gold Sponsor at KDD this year and we are very excited to be there. Please make sure to stop by our Microsoft exhibitor booth to view demos from our Data Scientists in the Azure Machine Learning team, Bing team, MSR and many others. I hope to meet some of you at the conference.

    Mario

    Connect with me on LinkedIn

    If you are interested in a career at Microsoft, check out our openings:

    Data Science Jobs at Microsoft

    Machine Learning Jobs at Microsoft

  • Machine Learning, meet Computer Vision – Part 2

    This blog post is co-authored by Jamie Shotton, Antonio Criminisi and Sebastian Nowozin of Microsoft Research, Cambridge, UK.

    In our last post, we introduced you to the field of computer vision and talked about a powerful approach, classifying pixels using decision forests, which has found broad application in medical imaging and Kinect. In this second post we will look at some of the recent excitement around deep neural networks and their successes in computer vision, followed by a look at what might be next for computer vision and machine learning.

    Deep neural networks

    The last few years have seen rapid progress in the quality and quantity of training datasets we have access to as vision researchers. The improvements are to a large extent due to the uptake of crowdsourcing which has allowed us to scale our datasets to millions of labelled images. One challenging dataset, ImageNet, contains millions of images labeled with image-level labels across tens of thousands of categories.

    After a few years of slow progress in the community on the ImageNet dataset, Krizhevsky et al. rather rocked the field in 2012. They showed how general-purpose GPU computing paired with some seemingly subtle algorithmic changes could be used to train convolutional neural networks much deeper than before. The result was a remarkable step change in accuracy in image classification on the ImageNet 1000-category test. This also garnered a lot of attention in the popular press and even resulted in some very large start-up buyouts. Since then “deep learning” has become a very hot topic in computer vision, with recent papers extending the approach to object localization, face recognition, and human pose estimation.

    The Future

    While clearly very powerful, are deep convolutional networks the end of the road for computer vision? We’re sure they’ll continue to be popular and push the state of the art in the next few years but we believe there’s still another step change or two to come. We can only speculate as to what these changes will be, but we finish up by highlighting some of the opportunities as we see them.

    Representations: These networks learn to predict a relatively simple representation of the image contents. There’s no deep understanding of where individual objects live in the image, how they relate to one another, or the role of particular objects in our lives (e.g. we couldn’t easily combine the cue that a person’s hair looks slightly glossy with the fact that they are holding a hair-dryer to get a more confident estimate that their hair is wet). New datasets such as Microsoft CoCo may help push this forward by providing very detailed labeling of individual object segmentations in “non-iconic” images – i.e. images where there’s more than one object present that are not front-and-center.

    Efficiency: While the evaluation of a deep network on a test image can be performed relatively quickly through parallelization, neural networks don’t have the notion of conditional computation that we encountered in our last post: every test example ends up traversing every single node in the network to produce its output. Furthermore, even with fast GPUs, training a network can take days or weeks, which limits the ability to experiment rapidly.

    Structure learning: Deep convolutional networks currently have a carefully hand-designed and rather rigid structure that has evolved over many years of research. Changing, say, the size of a particular layer or the number of layers can have undesirable consequences to the quality of the predictor. Beyond simply brute-force parameter sweeps to optimize the form of the network, we hope there is opportunity to really learn a much more flexible network structure directly from data.

    Recently, we have been taking baby steps towards addressing particularly the latter two of these opportunities. We’re particularly excited by our recent work on decision jungles: ensembles of rooted decision DAGs. You can think of a decision DAG as a decision tree in which child nodes have been merged together so that nodes are allowed to have multiple parents. Compared to decision trees, we’ve shown that they can reduce memory consumption by an order of magnitude while also resulting in considerably improved generalization. A DAG also starts to look a lot like a neural network, but does have two important differences: firstly, the structure is learned jointly with the parameters of the model; and secondly, the DAG retains the idea from decision trees of efficient conditional computation: a single test example follows a single path through the DAG, rather than traversing all nodes as would be the case with a neural network. We’re actively investigating whether decision jungles, perhaps in conjunction with other forms of deep learning including stacking and entanglement, can offer an efficient alternative to deep neural networks.
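    To make the tree-versus-DAG distinction concrete, here is a toy Python sketch (hypothetical splits and histograms; not the actual decision jungles implementation): distinct parents share a merged child node, so the number of nodes per level can stay small, yet a test example still follows exactly one path.

```python
# One leaf node shared by two parents: this is the "merged children" idea.
shared_leaf = {"histogram": {"car": 0.6, "road": 0.4}}

dag_root = {
    "split": lambda x: x[0] > 0.5,
    "left": {
        "split": lambda x: x[1] > 0.5,
        "left": {"histogram": {"road": 0.9, "car": 0.1}},
        "right": shared_leaf,   # same node object as below...
    },
    "right": {
        "split": lambda x: x[1] > 0.9,
        "left": shared_leaf,    # ...so this node has two parents
        "right": {"histogram": {"tree": 1.0}},
    },
}

def classify(node, x):
    # Conditional computation is retained: each example walks a single
    # root-to-leaf path, never touching the rest of the DAG.
    while "split" in node:
        node = node["right"] if node["split"](x) else node["left"]
    return node["histogram"]

print(classify(dag_root, [0.7, 0.2]))  # -> {'car': 0.6, 'road': 0.4}
```

    Because nodes are shared, two quite different inputs can reach the same prediction node, which is how the memory savings arise relative to a full tree.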

    If you’re interested in trying out decision jungles for your problem, the Gemini module in Azure ML will let you investigate further.

    Overall, computer vision has a bright future thanks in no small part to machine learning. The rapid recent progress in vision has been fantastic, but we believe the future of computer vision research remains an exciting open book.

    Jamie, Antonio and Sebastian