Introduction to Data Deduplication in Windows Server 2012

Hi, this is Scott Johnson and I’m a Program Manager on the Windows File Server team. I’ve been at Microsoft for 17 years and I’ve seen a lot of cool technology in that time. Windows Server 2012 includes a pretty cool new feature called Data Deduplication that enables you to store data more efficiently and to transfer and back up less of it.

This is the result of an extensive collaboration with Microsoft Research, and after two years of development and testing we now have state-of-the-art deduplication that uses variable-size chunking and compression and can be applied to your primary data. The feature is designed for industry-standard hardware and can run on a very small server with as little as a single CPU, one SATA drive and 4GB of memory. Data Deduplication scales nicely as you add cores and memory. This team has some of the smartest people I have worked with at Microsoft and we are all very excited about this release.

Does Deduplication Matter?

Hard disk drives are getting bigger and cheaper every year, so why would you need deduplication? The problem is growth: data is growing so fast that IT departments everywhere will face serious challenges keeping up with demand. Check out the chart below, where IDC forecasts that we are entering a period of massive storage growth. Can you imagine a world that consumes 90 million terabytes in one year? We are about 18 months away!


[Figure: IDC forecast of worldwide file-based storage capacity shipped, 2011-2015]

Source: IDC Worldwide File-Based Storage 2011-2015 Forecast: Foundation Solutions for Content Delivery, Archiving and Big Data, doc #231910, December 2011

Welcome to Windows Server 2012!

This new Data Deduplication feature is a fresh approach. We just submitted a large-scale study and system design paper on primary data deduplication to USENIX, to be presented at the upcoming Annual Technical Conference in June.


Typical Savings:

We analyzed many terabytes of real data inside Microsoft to get estimates of the savings you should expect if you turned on deduplication for different types of data. We focused on the core deployment scenarios that we support, including libraries, deployment shares, file shares and user/group shares. The Data Analysis table below shows the typical savings we were able to get from each type:

[Table: Data Analysis - typical deduplication savings by data type]

Microsoft IT has been deploying Windows Server with deduplication for the last year and they reported some actual savings numbers. These numbers validate that our analysis of typical data is pretty accurate. In the Live Deployments table below we have three very popular server workloads at Microsoft including:

  • A build lab server: These are servers that build a new version of Windows every day so that we can test it. The debug symbols they collect allow developers to trace the exact line of source code that corresponds to the machine code a system is running. A lot of duplicates are created because only a small amount of code changes on any given day, so when teams release the same group of files under a new folder each day, the contents are largely similar from one day to the next.
  • Product release shares: There are internal servers at Microsoft that hold every product we’ve ever shipped, in every language. As you might expect, when you slice it up, 70% of the data is redundant and can be distilled down nicely.
  • Group shares: These include regular file shares that a team might use for storing data, as well as environments that use Folder Redirection to seamlessly redirect the path of a folder (like a Documents folder) to a central location.

[Table: Live Deployments - actual savings reported by Microsoft IT for the three workloads above]

Below is a screenshot from the new Server Manager ‘Volumes’ interface on one of the build lab servers; notice how much data we are saving on these 2TB volumes. The lab is saving over 6TB on each of these 2TB volumes and still has about 400GB free on each drive. These are some pretty fun numbers.

[Screenshot: Server Manager 'Volumes' view showing deduplication savings on the build lab server's 2TB volumes]

There is a clear return on investment, measured in dollars, when using deduplication. The space savings are dramatic, and the dollars saved are easy to calculate when you pay by the gigabyte. Many people have told me they want Windows Server 2012 just for this feature, and that it could let them delay purchases of new storage arrays.

Data Deduplication Characteristics:

1) Transparent and easy to use: Deduplication can be installed and enabled on selected data volumes in a few seconds. Applications and end users will not know that the data has been transformed on the disk, and when a user requests a file it is transparently served up right away. The file system as a whole supports all of the NTFS semantics that you would expect. Some files are not processed by deduplication, such as files encrypted using the Encrypting File System (EFS), files smaller than 32KB, or files that have Extended Attributes (EAs). In these cases, the interaction with the files is entirely through NTFS and the deduplication filter driver does not get involved. If a file has an alternate data stream, only the primary data stream will be deduplicated and the alternate stream will be left on the disk.
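
If you want to try it, the whole thing can be driven from PowerShell. Here is a minimal sketch, assuming a data volume mounted as E: (the drive letter is just an example); you can also do the same thing from the Add Roles and Features wizard and the Volumes pages in Server Manager:

    # Add the Data Deduplication role service (part of the File and Storage Services role)
    Install-WindowsFeature -Name FS-Data-Deduplication

    # Enable deduplication on the E: data volume
    Enable-DedupVolume -Volume E:

    # Confirm the volume is now under deduplication control
    Get-DedupVolume -Volume E: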

2) Designed for Primary Data: The feature can be installed on your primary data volumes without interfering with the server’s primary objective. Hot data (files that are being written to) will be passed over by deduplication until the file reaches a certain age. This way you can get optimal performance for active files and great savings on the rest of the files. Files that meet the deduplication criteria are referred to as “in-policy” files.

a. Post Processing: Deduplication is not in the write-path when new files come along. New files write directly to the NTFS volume and the files are evaluated by a file groveler on a regular schedule. The background processing mode checks for files that are eligible for deduplication every hour and you can add additional schedules if you need them.

b. File Age: Deduplication has a setting called MinimumFileAgeDays that controls how old a file should be before processing the file. The default setting is 5 days. This setting is configurable by the user and can be set to “0” to process files regardless of how old they are.

c. File Type and File Location Exclusions: You can tell the system not to process files of a specific type, like PNG files that already have great compression or compressed CAB files that may not benefit from deduplication. You can also tell the system not to process a certain folder.
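
As a rough sketch of how you might tune these knobs from PowerShell (the drive letter, folder and extensions below are just examples, and the parameter names are as I remember them, so verify with Get-Help Set-DedupVolume on your system):

    # Process files regardless of age, skip file types that are already compressed,
    # and leave a scratch folder alone
    Set-DedupVolume -Volume E: -MinimumFileAgeDays 0 -ExcludeFileType png,cab -ExcludeFolder "E:\Scratch"

    # Review the effective per-volume settings
    Get-DedupVolume -Volume E: | Format-List *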

3) Portability: A volume that is under deduplication control is an atomic unit. You can back up the volume and restore it to another server. You can rip it out of one Windows Server 2012 server and move it to another. Everything that is required to access your data is located on the drive. All of the deduplication settings are maintained on the volume and will be picked up by the deduplication filter when the volume is mounted. The only things that are not retained on the volume are the schedule settings, which are part of the task-scheduler engine. If you move the volume to a server that is not running the Data Deduplication feature, you will only be able to access the files that have not been deduplicated.
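
For example, after moving a deduplicated volume to another Windows Server 2012 machine, something like the following is all that should be needed (the F: drive letter is just an example):

    # On the destination server, make sure the deduplication feature is present
    Install-WindowsFeature -Name FS-Data-Deduplication

    # The per-volume settings travel with the volume; verify they were picked up after mounting
    Get-DedupVolume -Volume F:
    Get-DedupStatus -Volume F: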

4) Focused on using low resources: The feature was built to automatically yield system resources to the primary server’s workload and back off until resources are available again. Most people agree that their servers have a job to do and that the storage is just there to facilitate their data requirements.

a. The chunk store’s hash index is designed to use low resources and to reduce read/write disk IOPS so that it can scale to large datasets and deliver high insert/lookup performance. The index footprint is extremely low, at about 6 bytes of RAM per chunk, and it uses temporary partitioning to support very high scale.

b. Deduplication jobs verify that there is enough memory to do the work; if not, they stop and try again at the next scheduled interval.

c. Administrators can schedule and run any of the deduplication jobs during off-peak hours or during idle time.
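
For example, you can kick off jobs by hand or add an off-peak window. This is a sketch; the New-DedupSchedule parameters shown are from memory, so verify them with Get-Help New-DedupSchedule:

    # Start an optimization pass right now (it still yields to the foreground workload)
    Start-DedupJob -Volume E: -Type Optimization

    # Add an extra optimization window on weeknights
    New-DedupSchedule -Name "NightlyOptimization" -Type Optimization -Start "23:00" -DurationHours 6 -Days Monday,Tuesday,Wednesday,Thursday,Friday

    # List everything that is scheduled
    Get-DedupSchedule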

5) Sub-file chunking: Deduplication segments files into variable-size chunks (32-128KB) using a new algorithm developed in conjunction with Microsoft Research. The chunking module splits a file into a sequence of chunks in a content-dependent manner, using a Rabin fingerprint-based sliding-window hash on the data stream to identify chunk boundaries. The chunks have an average size of 64KB and are compressed and placed into a chunk store located in a hidden folder at the root of the volume called the System Volume Information, or “SVI” folder. The original file is replaced by a small reparse point, which has a pointer to a map of all the data streams and chunks required to “rehydrate” the file and serve it up when it is requested.

Imagine that you have a file that looks something like this to NTFS:
[Figure: a file as NTFS sees it, divided into a sequence of chunks]

And you also have another file that has some of the same chunks:
[Figure: a second file that shares some of the same chunks]

After being processed, the files are now reparse points with metadata and links that point to where the file data is located in the chunk-store.
[Figure: after optimization, both files are reparse points with metadata and links to chunks in the chunk store]
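
If you are curious, you can peek at the result on a live volume. A quick sketch, assuming a deduplicated E: volume; the file path below is made up for illustration:

    # Volume-wide chunk store statistics (chunk counts, average chunk size, container usage)
    Get-DedupMetadata -Volume E:

    # An optimized file becomes a reparse point: its logical size stays the same,
    # but its size on disk drops to almost nothing (path is an example)
    fsutil reparsepoint query "E:\builds\20120601\symbols\ntdll.pdb"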

6) BranchCache™: Another benefit is that the sub-file chunking and indexing engine is shared with the BranchCache feature. When a Windows Server at the main office is running deduplication, the data chunks are already indexed and ready to be sent quickly over the WAN if needed. This saves a ton of WAN traffic to a branch office.

What about the data access impact?

Deduplication creates fragmentation for the files on your disk, since chunks may end up spread apart, and this increases seek time as the disk heads must move around more to gather all the required data. As each file is processed, the filter driver works to keep the sequence of unique chunks together, preserving on-disk locality, so it isn’t a completely random distribution. Deduplication also has a cache to avoid going to disk for repeat chunks. The file system has another layer of caching that is leveraged for file access. If multiple users are accessing similar files at the same time, the access pattern enables deduplication to speed things up for all of them.

  • There are no noticeable differences for opening an Office document. Users will never know that the underlying volume is running deduplication. 
  • When copying a single large file, we see end-to-end copy times that can be 1.5 times what it takes on a non-deduplicated volume.
  • When copying multiple large files at the same time we have seen gains due to caching that can cause the copy time to be faster by up to 30%.
  • Under our file-server load simulator (the File Server Capacity Tool) set to simulate 5000 users simultaneously accessing the system we only see about a 10% reduction in the number of users that can be supported over SMB 3.0.
  • Data can be optimized at 20-35 MB/Sec within a single job, which comes out to about 100GB/hour for a single 2TB volume using a single CPU core and 1GB of free RAM. Multiple volumes can be processed in parallel if additional CPU, memory and disk resources are available.
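
If you want to watch this on your own hardware, a quick sketch (the drive letter is just an example):

    # Show any deduplication jobs that are currently running, with their progress
    Get-DedupJob

    # After the optimization pass completes, check the savings and the file counts
    Get-DedupStatus -Volume E: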


Reliability and Risk Mitigations

Even with RAID and redundancy implemented in your system, data corruption risks exist due to disk anomalies, controller errors, firmware bugs or even environmental factors like radiation or disk vibrations. Deduplication raises the impact of a single chunk corruption, since a popular chunk can be referenced by a large number of files. Imagine that a chunk referenced by 1,000 files is lost due to a sector error; you would instantly lose all 1,000 files.

  • Backup Support: We have support for fully-optimized backup using the in-box Windows Server Backup tool and we have several major vendors working on adding support for optimized backup and un-optimized backup. We have a selective file restore API to enable backup applications to pull files out of an optimized backup.
  • Reporting and Detection: Any time the deduplication filter notices a corruption it logs it in the event log, so it can be scrubbed. Checksum validation is done on all data and metadata when it is read and written. Deduplication will recognize when data that is being accessed has been corrupted, reducing silent corruptions.
  • Redundancy: Extra copies of critical metadata are created automatically, and any data chunk that is referenced 100 times receives a full duplicate copy. We call this area “the hotspot”, a collection of the most popular chunks.
  • Repair: A weekly scrubbing job inspects the event log for logged corruptions and fixes the affected data chunks from alternate copies if they exist. There is also an optional deep scrub job that walks the entire data set looking for corruptions and tries to fix them. When using a mirrored Storage Spaces disk pool, deduplication will reach over to the other side of the mirror and grab the good version. Otherwise, the data will have to be recovered from a backup. Deduplication also continually scans incoming chunks, looking for ones that can be used to fix a corruption. A sketch of how to run these jobs by hand appears below.
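
To exercise the repair path yourself, something like the following should work; the -Full switch is from memory, and the event log lookup uses a wildcard so you can find the exact channel name on your own box:

    # Run the standard scrubbing job, which repairs the corruptions logged in the event log
    Start-DedupJob -Volume E: -Type Scrubbing

    # Or walk the entire chunk store looking for (and fixing) corruptions
    Start-DedupJob -Volume E: -Type Scrubbing -Full

    # Find the deduplication event log channels to review what was detected and repaired
    Get-WinEvent -ListLog "*Dedup*"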


It slices, it dices, and it cleans your floors!

Well, the Data Deduplication feature doesn’t do everything in this version. It is only available in certain Windows Server 2012 editions and has some limitations. Deduplication was built for NTFS data volumes; it does not support boot or system drives and cannot be used with Cluster Shared Volumes (CSV). We don’t support deduplicating live VMs or running SQL databases. See how to determine which volumes are candidates for deduplication on TechNet.

Try out the Deduplication Data Evaluation Tool

To aid in the evaluation of datasets we created a portable evaluation tool. When the feature is installed, DDPEval.exe is installed to the \Windows\System32\ directory. This tool can be copied and run on Windows 7 or later systems to determine the expected savings that you would get if deduplication was enabled on a particular volume. DDPEval.exe supports local drives and also mapped or unmapped remote shares. You can run it against a remote share on your Windows NAS, or an EMC / NetApp NAS and compare the savings.
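
Usage is a one-liner; the server and share names below are just examples:

    # Estimate the savings for a local volume
    DDPEval.exe E:\

    # Estimate the savings for a remote share
    DDPEval.exe \\fileserver01\builds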


Summary:

I think that this new deduplication feature in Windows Server 2012 will be very popular. It is the kind of technology that people need and I can’t wait to see it in production deployments. I would love to see your reports in the comments below on how much hard disk space and money you saved. Just copy the output of this PowerShell command: PS> Get-DedupVolume
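
If you have several volumes enabled, something like this gives a copy-and-paste-friendly summary (a sketch; SavedSpace and SavingsRate are the property names you see in the default output):

    # One line per deduplicated volume, plus a rough total of the space saved
    Get-DedupVolume | Format-Table Volume, SavedSpace, SavingsRate -AutoSize
    "{0:N1} GB saved in total" -f ((Get-DedupVolume | Measure-Object -Property SavedSpace -Sum).Sum / 1GB)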

  • 30-90%+ savings can be achieved with deduplication on most types of data. I have a 200GB drive that I keep throwing data at and now it has 1.7TB of data on it. It is easy to forget that it is a 200GB drive.
  • Deduplication is easy to install and the default settings won’t let you shoot yourself in the foot.
  • Deduplication works hard to detect, report and repair disk corruptions.
  • You can experience faster file download times and reduced bandwidth consumption over a WAN through integration with BranchCache.
  • Try the evaluation tool to see how much space you would save if you upgrade to Windows Server 2012!


Links:

Online Help: http://technet.microsoft.com/en-us/library/hh831602.aspx
PowerShell Cmdlets: http://technet.microsoft.com/en-us/library/hh848450.aspx

Comments:
  • Very very nice! So this is a complete overhaul of what SiS was/is if I follow correctly. Also, if I'm not mistaken, while SiS worked more or less at the duplicate file level, this goes deeper and de-dupes chunks of identical data from one or more different files?

  • Yes, Dedup in Windows 8 works at the sub-file level. Hence, if completely different files happen to share some chunks, the identical chunks are stored only once.

  • Some really great stuff in here !

    There are some things that still remain unclear to me, perhaps you can shed a light on it:

    Will it work with DFSR?

    Will it work with BitLocker?

    Will it work on dynamic disks? (mirror volumes etc.)

    What happens when I take a drive with DD enabled and plug it into a machine with an older operating system like 2008 R2? Especially in a situation where the ‘non-deduplicated’ data would be bigger than the drive’s real capacity.

  • Hi Bob!

    DFS-R Support: Yes, there is interoperability with DFS-R.  Optimizing or un-optimizing a file will not trigger a re-replication, since the file didn't change. DFS-R will still use RDC for over-the-wire savings and not the chunks in the chunk store. The files can be optimized using deduplication on the replica if it is running Windows Server 2012.

    BitLocker = Yes, BitLocker sits below us. Deduplication and NTFS don’t know that there is encryption on the disk and they function normally.

    Dynamic Disks = Yes. You can still create Dynamic Disks and put NTFS volumes on them. NTFS volumes can have deduplication applied as long as it isn't system, boot, etc.

    The down-level OS experience is briefly mentioned in the "Portability" section above. You can only read the files that have not been processed by Deduplication.

    Cheers!

    Scott

  • Thank you for answering that DFS-R question, but too bad I can’t take advantage of the chunks if all replicated members use deduplication.

    Just a little scenario: let’s imagine I have an Excel file that is 200MB and is deduplicated (it occupies around 30MB on disk), and a user opens it, changes 10MB of data in it and saves it. If I get it correctly, the file will occupy around 40MB on disk, at least until the next optimization pass, am I right?

    And a last question: Hyper-V hosts are listed among the not-so-good candidates for dedup. Why is that? Because of the performance impact, or because the VHD files are locked by the hypervisor and optimization would need the VM to be offline to succeed?

    Most of the data on a server’s system partition is static, so it would definitely benefit from deduplication.

  • Excellent feature!

    Can I control the memory cache for the Dedup/File Server feature, like the CSV memory cache?

  • Hi Scott,

    Nice write-up. I have been experimenting with dedupe on Windows Server 8 Beta: I deduped a 320GB SATA disk, then removed it from my Windows Server 8 Beta computer and plugged that drive (while it was deduped) into my new Windows Server 2012 box.

    On the second box, the deduped drive was coming up as a Foreign Disk under Disk Management; I selected Import Foreign Disks and it came up fine.

    Now the interesting part: since this drive and its data were deduped on my previous Windows Server 8 box, I noticed that I couldn't open some .rar and .zip files, a few photos I had on it and a few other file types; video files were fine... this was a bit scary...

    I then installed the Deduplication feature under the File Services role, and when I ran a PowerShell "Get-DedupVolume" it showed me the same drive; after this I was able to access all the data that was definitely not opening before installing the dedup role.

    With SIS on Windows Storage Server 2003 R2 through Storage Server 2008, un-linking was not very clean; I think this dedup feature is far more stable and reliable...

    Your thoughts will be very helpful !!

    Thanks

  • Hi Scott, I have a Server 2012 RC hyper-v host with a Server 2012 RC hyper-v guest on it.

    The guest is running DFS replication and Deduplication.

    Using DFS-R we replicated roughly 250 GB of data; the vhdx grew to 275 GB as expected.

    We then applied deduplication and the volume shrunk by roughly 50% down to 125 GB (yay!).

    Shut down the guest and ran a Compact operation on the vhdx expecting it to shrink - it remains at 275 GB.

    This effectively negates any benefit to the deduplication - is there something special we need to do to actually reclaim this space and shrink the vhd?

    Thanks!

    Wes

  • Here ya go Scott.

    Volume is a typical 39GB home folder volume for 200 users:

    PS C:\Windows\system32> Get-DedupVolume

    Enabled            SavedSpace           SavingsRate          Volume

    -------            ----------           -----------          ------

    True               10.9 GB              35 %                 E:

  • This is an interesting implementation in the OS. Below are the results from my small server environment. This is of course using "real" data, on live servers.

    Drive T: contains all my applications and drivers, as well as ISOs, pictures, music, and roaming profiles.

    FreeSpace    SavedSpace   OptimizedFiles     InPolicyFiles      Volume                            

    ---------    ----------   --------------     -------------      ------                            

    797.53 GB    234.04 GB    165757             165757             T:                                

    I suspect my results are about typical for real data, as opposed to test environments where the same files are copied several times. From my understanding, a production environment would not see as large a deduplication percentage as an archiving or backup environment. I can see that if you have a bunch of archived VHDs this would save space, but even more can be realized when it comes to a backup environment.

    Compressed videos and music in general have a great deal of variation, so I would expect minimal savings from them.

    If your company is in the habit of duplicating files and modifying only a small portion (say PowerPoint files) this could be a large savings in those instances.

  • Hi Jorge D.

    For your Excel file example, it depends on how Excel updates the file with 10MB of additional data. If Excel writes to a new file of size 110MB, then deletes the original file and renames the new file to the original name, the system will consume 140MB. The 30MB of storage may be reused when the new file is deduplicated, or may be reclaimed when garbage collection runs. Eventually, after deduplication and garbage collection have run, the Excel file should consume 40MB or less.

    If Excel simply appends 10MB to the original file, then the system will consume 40MB immediately after the append. When the file is re-deduplicated, it may consume less than or equal to 40MB, depending on the deduplication and compression ratio.

    For Hyper-V hosts, yes, deduplication is not recommended because of the performance impact and because deduplication requires that a file not be in use.

  • Hi Yoshihiro

    For the Windows Server 2012 Release Candidate, there is no such feature in dedup as the CSV cache.

  • Hi Mutahir,

    Regarding the issue where you couldn't access some files on the Windows Server 2012 box before installing the deduplication feature on that box: this is expected. Some of the files on a deduplicated volume are converted to deduplication-specific reparse point files. They are not accessible unless the deduplication feature is installed. The files that were accessible without the deduplication feature installed were likely not yet converted to deduplication-specific reparse point files.

  • Hi Scott, do you know if there is a bug in 2012 RTM's Windows Server Backup?  I am trying to do a dedup-aware "optimized" backup using WSB but no matter what I do, when I select a deduped volume I get the error noted here (have replicated this in a few environments now): g0b3ars.wordpress.com/.../hyper-v-3-0-server-2012-deduplication-yay-and-vhdx-files