Catapult: Moving Beyond CPUs in the Cloud


Posted by Rob Knies

(Image: a field-programmable gate array)

Operating a datacenter at web scale requires managing many conflicting requirements. Delivering computation at high volume and speed is a given, but because of the demands such a facility must meet, a datacenter also needs flexibility. Additionally, it must be efficient in its use of power, keeping costs as low as possible.

Balancing these often conflicting goals is a challenge, one that leads datacenter providers to seek constant performance and efficiency improvements and to weigh general-purpose hardware against task-tuned alternatives, particularly in an era in which, as some suggest, Moore's Law is nearing its end.

Microsoft researchers and colleagues from Bing have been collaborating with others from industry and academia to examine datacenter hardware alternatives, and their work, a project known as Catapult, was presented in Minneapolis on June 16 during the 41st International Symposium on Computer Architecture (ISCA).

Their paper, titled "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," describes an effort to combine programmable hardware and software, using field-programmable gate arrays (FPGAs) to deliver throughput improvements of as much as 95 percent.

The significance of this work, says Peter Lee, head of Microsoft Research, could be dramatic.

“Going into production with this new technology will be a watershed moment for Bing search,” he says. “For the first time ever, the quality of Bing’s page ranking will be driven not only by great algorithms but also by hardware—incredibly advanced hardware that can be made more highly specialized than anything ever seen before at datacenter scale.”

Microsoft researcher Doug Burger, one of 23 co-authors of the ISCA paper, explains the motivation behind this project.

“We are addressing two problems,” he says. “First, how do we keep accelerating services and reducing costs in the cloud as the performance gains from CPUs continue to flatten?

“Second, we wanted to enable Bing to run computations at a scale that was not possible in software alone, for much better results at lower cost.”

(Image: members of the Project Catapult team)

Derek Chiou, a Bing hardware architect, discusses the benefits of the collaboration.

“The partnership between Doug and his team at Microsoft Research and Bing has been fantastic and has produced significant results that will have real impact on Bing,” Chiou says. “The factor-of-two throughput improvement demonstrated in the pilot means we can do the same amount of work with half the number of servers or double the amount of work with the same number of servers—or some mix of the two.

“Those kinds of numbers are especially significant at the scale of a datacenter. The potential benefits go beyond simple dollars. To give some examples, Bing’s ranking could be further enhanced to provide an even better customer experience, power could be saved, and the size of the datacenters could be reduced. The strength of the pilot results has led to Bing deploying this technology in one datacenter for customers, starting in early 2015.”
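The trade-off Chiou describes can be sketched with some simple arithmetic. The throughput numbers below are hypothetical (only the factor-of-two gain comes from the pilot); they show how a 2x per-server improvement lets an operator halve the fleet for a fixed load, or serve double the load on a fixed fleet.

```python
def servers_needed(total_load, per_server_throughput):
    """Servers required to handle total_load, rounding up (ceiling division)."""
    return -(-total_load // per_server_throughput)

baseline_throughput = 100      # requests/s per software-only server (assumed)
accelerated_throughput = 200   # 2x with the FPGA fabric, per the pilot
load = 100_000                 # requests/s across the service (assumed)

print(servers_needed(load, baseline_throughput))     # 1000 servers
print(servers_needed(load, accelerated_throughput))  # 500 servers for the same load
```

The same halving applies at any assumed load, which is why the gain compounds into power and floor-space savings at datacenter scale.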

As the ISCA paper notes, FPGAs have become powerful computing devices in recent years, making them particularly suited for use as fine-grained accelerators.

“We designed a platform that permits the software in the cloud, which is inherently programmable, to partner with programmable hardware,” Burger says. “You can move functions into custom hardware, but rather than burning them into fixed chips [application-specific integrated circuits], we map them to Altera FPGAs, which can run hardware designs but can be changed by reconfiguring the FPGA.

“We’ve demonstrated a ‘programmable hardware’ enhanced cloud, running smoothly and reliably at large scale.”

In the evaluation deployment outlined in the paper, the reconfigurable fabric—interconnected nodes linked by high-bandwidth connections—was tested on a collection of 1,632 servers to measure its efficacy in accelerating the workload of a production web-search service. The results were impressive: a 95 percent improvement in throughput at a latency comparable to a software-only solution. With the added power consumption and total per-server cost amounting to an increase of less than 30 percent, the net result is substantial savings and efficiency gains.
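The efficiency claim follows directly from those two figures. Taking the paper's 95 percent throughput gain and treating the sub-30-percent figure as an upper bound on the per-server cost increase, throughput per dollar improves by roughly half:

```python
# Cost-efficiency arithmetic implied by the paper's evaluation numbers.
throughput_gain = 1.95   # 95% more throughput than software-only
cost_increase = 1.30     # upper bound: <30% added per-server cost

# Relative throughput per dollar of server cost.
throughput_per_dollar = throughput_gain / cost_increase
print(f"{throughput_per_dollar:.2f}x")  # 1.50x
```

Any actual cost increase below the 30 percent bound would push the per-dollar gain even higher.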

The results also demonstrated the system's ability to run stably for long periods, with every stage of the pipeline exceeding the overall throughput goal. In addition, a failure-handling service quickly reconfigures the fabric after errors or machine failures.

The ISCA paper concludes by underscoring the belief that distributed reconfigurable fabrics will play a critical role as server performance increases level off. Such techniques could become indispensable to datacenter managers balancing their conflicting goals.

“This portends a future where systems are specialized dynamically by compiling a good chunk of demanding workloads into hardware,” Burger says. “I would imagine that a decade hence, it will be common to compile applications into a mix of programmable hardware and programmable software.

“This is a radical shift that will offer continued performance improvements past the end of Moore’s Law as we move more and more of our applications and services into hardware.”

Leave a Comment
  • Greetings,

    I am excited about this approach. What is not clear to me from the article is how the part of Bing’s page-ranking algorithm that runs on the FPGAs gets there: is it automatically compiled to HDL and then burned onto the FPGA boards, or does it have to be programmed separately? If it must be programmed separately, the innovation I see is access to HDL programming in the cloud, as opposed to a system that automatically compiles the parallelizable parts of an algorithm to the FPGA to improve performance. The first approach is already standard, is used in many scenarios (even mobile phones), and has the great disadvantage of requiring too much development time. If it is automatic, that would be a fantastic achievement.