Microsoft Research Connections Blog
Next at Microsoft
Social Media Collective
Windows on Theory
Posted by Frank McSherry, senior researcher at Microsoft Research Silicon Valley
Big data is pretty popular at the moment. Systems such as MapReduce, Hadoop, Dryad, and DryadLINQ have made writing and executing ad hoc big-data analyses easy. Still, there are several programming patterns such systems don't support especially well.
The two we repeatedly heard about from users are incremental and iterative computation. Users want to be able to see small changes quickly, both when starting a computation and when updating previously computed results, and users want to be able to write programs that iterate a subcomputation multiple times, perhaps until convergence.
That is where Naiad comes in.
Researchers have known how to do either incremental or iterative data-parallel processing for quite a while. View-maintenance and continuous-query work in the database community demonstrates incremental computation, and semi-naive data-log evaluation and recursive SQL are capable of efficient iterative computation, among many others. But to the best of our knowledge, no one has managed to do both in a scalable distributed system.
Naiad is a project based on a new computational model called differential data flow, aimed at supporting efficient incremental iterative data flow. Like a lot of previous work on incremental data flow, Naiad is based on processing differences between collections. Unlike prior work in which collections evolve in only one direction, such as with time or iteration, collections in Naiad can evolve in multiple, independent directions, such as both time and iteration.
This can seem complicated at first; if a collection varies in many dimensions, it is less clear from which collection each difference is taken. But with a bit of mathematics involving lattices, a simple definition emerges, supporting efficient incremental and arbitrarily nestable iterative computation.
We are bringing Naiad online at Microsoft Research Silicon Valley, testing its feasibility for complex data-parallel analyses such as maintaining the strongly connected component structure of the Twitter “mention” graph with sub-second update latency. We've also prototyped numerous other computations, including shortest-path computation, PageRank, stable matching, graph partitioning, kernel k-means, and others, all automatically parallelized and incrementalized.
As we get more experience with the system and sort out the bugs, we expect to make it available publicly. Until then, check the Naiad project page.