Microsoft Research (MSR) and the Windows File Server team worked together to build the new Data Deduplication feature in Windows Server 2012. The feature grew out of two years of collaboration with MSR on the design. The architecture and the deduplication algorithms were shaped, in part, by analysis of data in a large global enterprise. The USENIX Annual Technical Conference (ATC) was held on June 13-15, and we submitted a large-scale study and system design paper and gave a talk about our findings. The paper and presentation video have just gone public on the USENIX website.
The paper describes the algorithms used to chunk data and to identify unique chunks using indexes on chunk hashes, and it explains how to scale deduplication resources to large amounts of data, with performance evaluation numbers. The paper and talk review the analysis carried out on the datasets and show how the resulting insights determined the design points that address the challenges of primary data deduplication. Many of the design decisions were made to balance on-disk space savings, resource usage, performance, and transparency. The key point is that deduplication can be enabled on primary data volumes without impacting the server’s regular workload and still offer significant savings.
Primary data serving, reliability, and resiliency aspects of the system are not covered in this paper.
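For readers who want a feel for the general idea of chunking data and indexing chunks by hash, here is a minimal Python sketch. It is not the algorithm from the paper or the Windows Server implementation: the rolling hash, the chunk size limits, the in-memory dictionary used as the index, and every function name below are assumptions chosen only for illustration.

```python
import hashlib
import random

# Hypothetical 256-entry table of random 32-bit values for a simple
# "gear"-style rolling hash (an illustrative choice, not the paper's).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_boundaries(data, mask=0xFFF, min_size=1024, max_size=16384):
    """Yield (start, end) offsets of content-defined chunks.

    A boundary is declared where the low bits of the rolling hash are
    all zero, so cut points follow the content rather than fixed offsets.
    The size limits here are arbitrary example values.
    """
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Each step shifts older bytes out of the 32-bit state, so h
        # depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def deduplicate(files):
    """Store each unique chunk once, keyed by its SHA-256 digest.

    `files` maps a name to its bytes. Returns (store, recipes), where
    store maps digest -> chunk bytes and recipes maps name -> digest list.
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        recipe = []
        for s, e in chunk_boundaries(data):
            chunk = data[s:e]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep the first copy only
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def restore(recipe, store):
    """Rebuild a file's bytes from its list of chunk digests."""
    return b"".join(store[d] for d in recipe)
```

In this sketch, two files with identical content produce identical recipes and add no new chunks to the store the second time around, which is where the space savings come from. A real system has to solve much harder problems, such as keeping the index small enough to scale, which is what the paper covers.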
Check out the live video of the talk given by Sudipta Sengupta and Adi Oltean and download the PDF of the paper here: https://www.usenix.org/conference/usenixfederatedconferencesweek/primary-data-deduplication%E2%80%94large-scale-study-and-system
Cheers,
Scott M. Johnson
Program Manager II
Data Deduplication Team