
The MinuteSort benchmark is considered the “World Cup” of data sorting: it measures how much data a system can sort in sixty seconds. Using a new technique called Flat Datacenter Storage (FDS), a team from Microsoft Research (MSR) has just sorted almost three times as much data as the previous record holder, a team from Yahoo!, while using only one-sixth of the hardware resources.

In raw numbers, the team’s system sorted 1,401 gigabytes in just 60 seconds, using 1,033 disks across 250 machines. [The sortbench website has now been updated to reflect this.]
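A quick back-of-the-envelope calculation puts those raw numbers in perspective. The figures below are the published ones (1,401 GB, 60 seconds, 1,033 disks, 250 machines); the derived rates are simple averages, and note that the per-disk figure counts sorted output only, since a sort must read and write every byte, the real I/O load per disk is at least double.

```python
# Back-of-the-envelope throughput from the published MinuteSort record run.
total_gb = 1401   # gigabytes sorted
seconds = 60      # MinuteSort time budget
disks = 1033      # disks in the cluster
machines = 250    # machines in the cluster

aggregate_gb_per_s = total_gb / seconds                  # cluster-wide average
per_disk_mb_per_s = aggregate_gb_per_s * 1000 / disks    # output rate per disk
per_machine_gb = total_gb / machines                     # data sorted per machine

print(f"{aggregate_gb_per_s:.1f} GB/s aggregate")   # ~23.4 GB/s
print(f"{per_disk_mb_per_s:.1f} MB/s per disk")     # ~22.6 MB/s (output only)
print(f"{per_machine_gb:.2f} GB per machine")       # ~5.60 GB
```

The striking part is how modest the per-disk average is: commodity disks can comfortably stream at that rate, which suggests the record came from keeping every disk busy in parallel rather than from exotic hardware.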

The MSR team, led by Jeremy Elson in the Distributed Systems and Networking group, included Johnson Apacible, Rich Draves, Jinliang Fan, Owen Hofmann, Jon Howell, Ed Nightingale, Reuben Olinsky, and Yutaka Suzue. The team will be presented with an award for the achievement this week at the 2012 SIGMOD/PODS Conference in Scottsdale, Arizona.

FDS is the first general-purpose system to sort more than a terabyte of data in under a minute – fulfilling the late Jim Gray’s long-term vision from a 1994 paper. In a feature story on the Microsoft Research site, Elson remarked on how this moves the game on from today’s state-of-the-art MapReduce and Hadoop systems. The post provides a lot more detail for the data geeks, but the key for me is what this means in the emerging world of Big Data.

It’s another example of technology transfer and collaboration between Microsoft Research and our product teams - and it’s no surprise to hear that the Bing team is helping to sponsor this work, as this isn’t just about academic tests to make faster databases. This type of capability can be applied to many computing areas, such as Bing search results, gene sequencing, and the crunching of ecological data, to name a few.

In the quest to gain real-time insight from the enormous amounts of data we’re generating as a society, this has the potential to offer a huge leap forward - and allied with the deep expertise in machine learning at MSR, it could be a game changer.