The Storage Team Blog about file services and storage features in Windows and Windows Server.
This post is a part of the nine-part “What’s New in Windows Server & System Center 2012 R2” series that is featured on Brad Anderson’s In the Cloud blog. Today’s blog post covers Data Deduplication and how it applies to the larger topic of “Transform the Datacenter.” To read that post and see the other technologies discussed, read today’s post: “Delivering Infrastructure as a Service (IAAS).”
With the Windows Server 2012 R2 Preview, Data Deduplication is extended to the remote storage of the VDI workload:
CSV Volume support
Faster deduplication of data
Deduplication of open (in use) files
Faster read/write performance of deduplicated files
See http://blogs.technet.com/b/filecab/archive/2013/07/31/extending-data-deduplication-to-new-workloads-in-windows-server-2012-r2.aspx for more details.
To start with: You will save space! Space savings for VDI deployments can reach as high as 95%. Of course, that number will vary depending on the amount of user data and other factors, and it will also change over the course of any one day.
Data Deduplication optimizes files as a post-processing operation. That means that as data is added over the course of a day, it is not optimized immediately and takes up extra space on disk until a background deduplication job processes it. As a result, the optimization ratio of a VDI deployment will fluctuate a bit over the course of a day, depending on how much new data is added. By the time the next optimization is done, savings will be high again.
Saving space is great on its own, but it has an interesting side effect. Volumes that were always too small, but had other advantages, are suddenly viable. One such example is SSD volumes. Traditionally, you had to deploy a large number of these drives to reach volume sizes that were viable for a VDI deployment. This was expensive not only because of the disks themselves, but also because of the increased need for JBODs, power, cooling, etc. With Data Deduplication in the picture, SSD-based volumes can suddenly hold vastly more data, and we can finally utilize more of their IO capabilities without incurring additional infrastructure costs.
In addition, because Data Deduplication consolidates files, more efficient caching mechanisms are possible. This improves the IO characteristics of the storage subsystem for some types of operations.
As a result of both effects, we can often stretch the VM capacity of the storage subsystem without buying additional hardware or infrastructure.
Deploying Data Deduplication for VDI turns out to be relatively straightforward, assuming you know how to set up VDI, of course. The generic VDI setup will not be covered here; instead, we will cover how Data Deduplication changes things. Let’s go through the steps:
First and foremost, to deploy Data Deduplication with VDI, the storage and compute responsibilities must be provided by separate machines.
The good news is that the Hyper-V and VDI infrastructure can remain as it is today. The setup and configuration of both is pretty much unaltered. The exception is that all VHD files for the VMs must be stored on a file server running Windows Server 2012 R2 Preview. The storage on that file server may be directly attached disks or provided by a SAN/iSCSI.
In the interest of ensuring that storage stays available, the file server should ideally be clustered with CSV volumes providing the storage locations for the VHD files.
Create a new CSV volume on the File Server Cluster using your favorite tool (we would suggest System Center Virtual Machine Manager). Then enable Data Deduplication on that volume. This is very easy to do in PowerShell:
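A minimal sketch of that step, assuming the new CSV is mounted at C:\ClusterStorage\Volume1 (a placeholder path; substitute your own) and that the HyperV usage type of Enable-DedupVolume is the one intended for the VDI workload:

```powershell
# Placeholder path: replace with the mount point of your own CSV volume.
$volume = 'C:\ClusterStorage\Volume1'

# Enable Data Deduplication on the volume with the Hyper-V (VDI) usage type.
Enable-DedupVolume -Volume $volume -UsageType HyperV
```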
This is basically the same way Data Deduplication is enabled for a general file share; however, enabling it for the VDI workload ensures that various advanced settings (such as whether open files should be optimized) are configured appropriately.
In the Windows Server 2012 R2 Preview, one additional step has to be done that will not be required in the future. The default policy for Data Deduplication is now to only optimize files that are older than 3 days. This of course does not work for open VHD files, since they are constantly being updated. In the future, Data Deduplication will address this by enabling “Partial File Optimization” mode, in which it optimizes parts of the file that are older than 3 days. To enable this mode in the Preview, run the following command:
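A sketch of what that command could look like, assuming the CSV is mounted at C:\ClusterStorage\Volume1 (a placeholder path) and that the mode is turned on through the -OptimizePartialFiles switch of Set-DedupVolume:

```powershell
# Placeholder path: replace with the mount point of your own CSV volume.
$volume = 'C:\ClusterStorage\Volume1'

# Turn on Partial File Optimization so that older regions of open VHD files
# are optimized even though the files themselves are modified constantly.
Set-DedupVolume -Volume $volume -OptimizePartialFiles
```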
Deploy VDI VMs as normal using the new share as the storage location for VHDs.
With one caveat.
If you made a volume smaller than the amount of data you are about to deploy on it, you need some special handling. Data Deduplication runs as a post-processing operation, so new data takes up its full size on disk until an optimization job has processed it.
Let us say we want to deploy 120GB of VHD files (6 VHD files of 20GB each) onto a 60GB volume with Data Deduplication enabled.
To do this, deploy as many VMs onto the volume as will fit while leaving at least 10GB of space available. In this case, we would deploy 2 VMs (20GB + 20GB + 10GB < 60GB). Then run a manual deduplication optimization job:
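A minimal sketch of such a manual run, assuming the CSV is mounted at C:\ClusterStorage\Volume1 (a placeholder path). The -Wait switch is an addition here so that the prompt only returns when the job has finished:

```powershell
# Placeholder path: replace with the mount point of your own CSV volume.
$volume = 'C:\ClusterStorage\Volume1'

# Start an optimization job on the volume and block until it completes,
# so the next batch of VMs is only deployed after space has been reclaimed.
Start-DedupJob -Volume $volume -Type Optimization -Wait
```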
Once this completes, deploy more VMs. Most likely, after the first optimization, there will be around 10GB of space used. That leaves room for another 2 VMs. Deploy these 2 VMs and repeat the optimization run.
Repeat this procedure until all VMs are deployed. After this, the default background deduplication job will handle future changes.
Once everything is deployed, managing Data Deduplication for VDI is no different than managing it for a general file server. For example, to get optimization savings and status:
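As a sketch, assuming the CSV is mounted at C:\ClusterStorage\Volume1 (a placeholder path), the standard deduplication cmdlets can be queried like this:

```powershell
# Placeholder path: replace with the mount point of your own CSV volume.
$volume = 'C:\ClusterStorage\Volume1'

# Optimization status for the volume (saved space, file counts, last job results).
Get-DedupStatus -Volume $volume | Format-List

# Volume-level settings and savings figures.
Get-DedupVolume -Volume $volume | Format-List

# Deduplication jobs that are currently queued or running.
Get-DedupJob
```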
It may at times occur that a lot of new data is added to the volume and the standard background task is not able to keep up (since it stops when the server gets busy). In that case you can start a “throughput” optimization job that will simply keep going until the work is done:
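One plausible form of such a job is sketched below. The volume path is a placeholder, and the choice of switches is an assumption rather than the authors' exact command: the job is started manually at high priority and without the -StopWhenSystemBusy switch, so it is not expected to back off when the server gets busy.

```powershell
# Placeholder path: replace with the mount point of your own CSV volume.
$volume = 'C:\ClusterStorage\Volume1'

# Start a manual optimization job at high priority. Because -StopWhenSystemBusy
# is not specified, the job should keep running until the backlog is processed.
Start-DedupJob -Volume $volume -Type Optimization -Priority High
```

Since a job like this competes with the VDI workload for resources, it is probably best run during a quiet period.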
Overall, deploying Data Deduplication for VDI is a relatively simple operation, though it may require some additional planning along the way.
To see all of the posts in this series, check out the What’s New in Windows Server & System Center 2012 R2 archive.
Can I use VDI deduplication when the storage is direct attached to the Hyper-V Cluster Hosts?