The Storage Team Blog about file services and storage features in Windows and Windows Server.
This post is a part of the nine-part “What’s New in Windows Server & System Center 2012 R2” series that is featured on Brad Anderson’s In the Cloud blog. Today’s blog post covers Data Deduplication and how it applies to the larger topic of “Transform the Datacenter.” To read that post and see the other technologies discussed, read today’s post: “Delivering Infrastructure as a Service (IAAS).”
In Windows Server 2012 we introduced the new Data Deduplication feature set, which quickly became one of the standard things to consider when deploying file servers. More space on existing hardware at no cost other than running Windows Server 2012? Seems like a pretty good deal.
Not to mention we saw great space savings on various types of real-world data at rest. Some of the most common types of data include:
These numbers are based on measuring the savings rates on various customer deployments of Data Deduplication on Windows Server 2012. However, we saw some interesting trends:
In both cases we see people trying to put more data under Data Deduplication and to take better advantage of the huge savings seen on static VHD libraries. However, Data Deduplication in Windows Server 2012 was not really designed to deal with data that changes frequently or is even in active use.
The customer feedback we were getting showed a clear need to reduce storage costs in private clouds (see http://blogs.technet.com/b/in_the_cloud/archive/2013/07/31/what-s-new-in-2012-r2-delivering-infrastructure-as-a-service.aspx for an overview of all the other new things around storage) and specifically to extend Data Deduplication for new workloads.
Specifically we needed to start supporting storage of live VHDs for some scenarios.
It turns out that there were a few key changes that had to be made to even consider using Data Deduplication for open files:
We also realized that all of this would take up resources on the server running Data Deduplication. If we were to run this on the same server as the VMs, then we’d be competing with them for resources. Especially memory. So we quickly came to the conclusion that we needed to separate out storage and computation nodes when Data Deduplication was involved with virtualization.
Of course that meant we had to use a scale out file share and therefore needed to support CSV volumes for deduplication.
Then we came to the question of how fast do we have to get all of these things working to be successful? Well… as fast as possible. However, we know that Data Deduplication has to incur some costs. So we needed real goals. It turns out that deciding that you are fast enough for all virtualization scenarios is very difficult. So we decided to take a first step with a virtualization workload that was well understood:
Data Deduplication in Windows Server 2012 R2 would support optimization of storage for Virtual Desktop Infrastructure (VDI) deployments as long as the storage and compute nodes were connected remotely.
With the Windows Server 2012 R2 Preview, Data Deduplication is extended to the remote storage of the VDI workload:
We spent a lot of time ensuring that Data Deduplication performs correctly on general virtualization workloads. However, we focused our efforts on ensuring that the performance of optimized files is adequate for VDI scenarios. For non-VDI scenarios (general Hyper-V VMs), we cannot provide the same performance guarantees.
As a result, we do not support deduplication of arbitrary in-use VHDs in Windows Server 2012 R2. However, since Data Deduplication is a core part of the storage stack, there is no explicit block in place that prevents it from being enabled on arbitrary workloads.
We will start with the easy one: You will save space! And of course, saving space translates into saving money. Deduplication savings for VDI deployments can reach as high as 95%. This allows for deployments of SSD-based volumes for VDI, leveraging all the improved IO characteristics while mitigating their low capacity.
This also allows for simplification of the surrounding infrastructure such as JBODs, cooling, power, etc.
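To make the savings figure concrete, here is a small back-of-the-envelope sketch in Python. The function name `effective_capacity_gb` and the 1,000 GB volume are illustrative assumptions, not numbers from any deployment; only the 95% upper bound comes from the text above.

```python
def effective_capacity_gb(physical_gb: float, savings_rate: float) -> float:
    """Logical data that fits on a volume at a given deduplication savings rate.

    savings_rate is the fraction of logical data that does not need to be
    stored physically (0.95 means 95% savings).
    """
    if not 0.0 <= savings_rate < 1.0:
        raise ValueError("savings_rate must be in [0, 1)")
    return physical_gb / (1.0 - savings_rate)


# Hypothetical 1,000 GB SSD volume at several savings rates.
for rate in (0.50, 0.80, 0.95):
    logical = effective_capacity_gb(1000, rate)
    print(f"{rate:.0%} savings: about {logical:,.0f} GB of logical data")
```

At a 95% savings rate the same physical volume holds roughly twenty times as much logical data, which is why a small SSD tier can become practical for VDI.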
In addition, because Data Deduplication consolidates files, more efficient caching mechanisms are possible. This improves the IO characteristics of the storage subsystem for some types of operations. So not only does deduplication save money, it can make things go faster.
As a result of these benefits, we can often stretch the VM capacity of the storage subsystem without buying additional hardware or infrastructure.
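The idea that consolidating files saves space and helps caching can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it uses fixed-size 4 KB chunks and SHA-256 hashes, and the `ChunkStore` class and file names are hypothetical. It is not how Data Deduplication is implemented in Windows Server; it only shows why identical regions in many desktop images end up as a single stored copy.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks, chosen only for simplicity


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # chunk hash -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of chunk hashes

    def add_file(self, name, data):
        hashes = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            hashes.append(digest)
        self.files[name] = hashes

    def read_file(self, name):
        # Reassemble the file from its chunk references.
        return b"".join(self.chunks[h] for h in self.files[name])

    def savings_rate(self):
        # Fraction of logical bytes that did not need physical storage.
        logical = sum(len(self.chunks[h]) for hs in self.files.values() for h in hs)
        physical = sum(len(c) for c in self.chunks.values())
        return 1.0 - physical / logical if logical else 0.0


# Two hypothetical desktop images that share the same eight "OS" chunks
# and differ only in one trailing chunk of user data.
os_image = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
store = ChunkStore()
store.add_file("desktop1.vhdx", os_image + b"A" * CHUNK_SIZE)
store.add_file("desktop2.vhdx", os_image + b"B" * CHUNK_SIZE)
print(len(store.chunks), "unique chunks,", f"{store.savings_rate():.0%} saved")
```

Because both images reference the same shared chunks, a cache that holds one copy of a chunk can serve reads for every desktop that uses it, which is the intuition behind the caching benefit described above.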
Data Deduplication in Windows Server 2012 R2 enables optimization of live VHDs for the VDI workloads and allows for deduplicated CSV volumes. It also significantly improves the performance of optimization as well as IO on optimized files. This will allow better utilization of existing storage subsystems for general file servers as well as for VDI storage and simplify future infrastructure investments.
We hope you find these new capabilities as exciting as we find them and look forward to hearing from you.
To see all of the posts in this series, check out the What’s New in Windows Server & System Center 2012 R2 archive.
Is it now possible to use deduplication on ReFS (redundant) volumes?
There is one specific scenario in addition to VDI that I would love to see supported. And that would be enabling DeDupe on the storage VHDXs for a virtualized DPM server. If the DPM server is backing up the system drives of a lot of servers, the space savings would be enormous because 90% of the data would be common.
Agree 100% with Michael.... SQL file backups, server system drive backups... I can think of a ton of space savings in our environment. We keep having to add storage just to allow backups to happen efficiently. We need this support or we will have to start looking at SAN storage that does de-dupe for us (NetApp)
Will Microsoft support deduplication for DPM 2012 R2 storage pools on VHDX disks? We're forced to look at other backup alternatives due to inefficient DPM storage utilization. We'd also like to see DPM utilize more than one tape library per protection group - that would make it usable in larger environments.
What is considered a VDI workload? How many IOPs or bytes read/write? Less than 20MB/sec total (which is the rate at which de-dupe can work?)