In recent studies, we’ve seen enterprise organizations report that as much as 60% of their annual capital IT budget goes to storage hardware.  This may come as no surprise, given the ever-increasing trend in most businesses to keep more and more data online and directly accessible to users. 

Of course, a large portion of consumed storage may be due to duplicated files and other redundant data stored by users on network file shares.  To help reclaim the disk space (and storage costs) consumed by duplicated data, Windows Server 2012 now includes a built-in Data Deduplication feature for any NTFS volume hosted on a Windows Server 2012 file server.

This month, we’re joined by Guido van Brakel, Principal Consultant and Subject Matter Expert at Enduria.  As a frequent contributor to the IT Pro community at large, Guido is always eager to share new information.  In this article, Guido will walk us through our new Data Deduplication feature so that you can see how easily you can leverage it in your own environment to save on disk space and storage costs!

Keith

- - - - - - - - - -

What is Data Deduplication?

Windows Server 2012 includes a cool new feature called Data Deduplication. Data deduplication involves finding and removing duplication within data without compromising its fidelity or integrity. The goal is to store more data in less space by segmenting files into small variable-sized chunks (32–128 KB), identifying duplicate chunks, and maintaining a single copy of each chunk. Redundant copies of a chunk are replaced by a reference to the single copy. The chunks are then compressed for further space optimization and organized into special container files in the System Volume Information folder.
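
To make the chunk-and-reference idea concrete, here is a toy PowerShell sketch that measures how much of a file consists of duplicate chunks. It uses fixed-size 64 KB chunks and SHA-256 hashes purely for illustration; the real feature uses variable-sized chunks and its own chunk store format, and the function name and parameters here are my own invention:

# Toy model only: fixed-size chunks plus SHA-256 hashing,
# not the actual variable-size chunking algorithm.
function Measure-ToyDedup {
    param(
        [Parameter(Mandatory=$true)][string]$Path,  # file to analyze
        [int]$ChunkSize = 64KB                      # fixed chunk size for this sketch
    )
    $sha    = [System.Security.Cryptography.SHA256]::Create()
    $unique = @{}   # one entry per distinct chunk hash
    $total  = 0
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buffer = New-Object byte[] $ChunkSize
        while (($read = $stream.Read($buffer, 0, $ChunkSize)) -gt 0) {
            $hash = [BitConverter]::ToString($sha.ComputeHash($buffer, 0, $read))
            $unique[$hash] = $true
            $total++
        }
    } finally {
        $stream.Dispose()
    }
    # Duplicate chunks would be stored only once, so the savings rate
    # is the fraction of chunks that are redundant copies.
    $savings = if ($total -gt 0) { 1 - ($unique.Count / $total) } else { 0 }
    [pscustomobject]@{ TotalChunks = $total; UniqueChunks = $unique.Count; SavingsRate = $savings }
}

Running Measure-ToyDedup against a file with lots of repeated content (a hypothetical software image, say) reports a high SavingsRate, which is exactly the redundancy the real feature exploits.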

The result is an on-disk transformation of each file, as shown in the picture below. After deduplication, files are no longer stored as independent streams of data; instead, they are replaced with stubs that point to data blocks stored within a common chunk store. Because these files share blocks, each block is stored only once, which reduces the disk space needed to store all of the files. During file access, the correct blocks are transparently assembled to serve the data, without the calling application or user having any knowledge of the on-disk transformation to the file. This enables administrators to apply deduplication to files without having to worry about any change in behavior to the applications or impact to users who are accessing those files.

Data Deduplication in Windows Server 2012 is designed to be installed on primary data volumes without requiring additional dedicated hardware. This means that you can install and use the feature without impacting the primary workload on the server. Ideal workloads include software deployment shares, virtual machine template folders, and archived data folders, where data is relatively static and changes infrequently. Data Deduplication requires the NTFS file system and is not supported on the new ReFS file system introduced in Windows Server 2012.
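
Because NTFS is a hard requirement, it’s worth confirming the file system of a candidate volume before you enable deduplication. A quick check using the Storage cmdlets in Windows Server 2012, assuming E: is the volume in question:

PS C:\> Get-Volume -DriveLetter E | Select-Object DriveLetter, FileSystem, Size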

Cool! How Do I Install Data Deduplication?

Step #1: Add the “Data Deduplication” role service using Server Manager.

  1. From the Add Roles and Features Wizard, under Server Roles, select File and Storage Services (if it has not already been installed).
  2. Select the File Services check box, and then select the Data Deduplication check box.
  3. Click the Next button until the Install button is active, and then click the Install button.

As an alternative to the steps above, Data Deduplication can also be installed using PowerShell as follows:

PS C:\> Import-Module ServerManager

PS C:\> Add-WindowsFeature -Name FS-Data-Deduplication

PS C:\> Import-Module Deduplication
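
To verify that the role service installed successfully before moving on, you can check its install state:

PS C:\> Get-WindowsFeature -Name FS-Data-Deduplication

The Install State column should read Installed.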

Step #2: Enable Data Deduplication for one or more NTFS data volumes.

  1. From the Server Manager dashboard, right-click on an existing data volume and choose Configure Data Deduplication. The Deduplication Settings page appears.
  2. Select the Enable Data Deduplication check box and enter the number of days that should elapse from the date of file modification until files are deduplicated. 
  3. Optionally, enter the file extensions of any file types that should not be deduplicated, and click the Add button to select any folders whose files should be excluded from deduplication.
  4. Click the Set Deduplication Schedule button to modify the default schedule for scanning and deduplicating file data.
  5. Click the Apply button to apply these settings and return to the Server Manager dashboard.

This can also be done using PowerShell:

PS C:\> Enable-DedupVolume E: 

PS C:\> Set-DedupVolume E: -MinimumFileAgeDays 20

NOTE: If you set MinimumFileAgeDays to 0, deduplication will process all files, regardless of their age. This is suitable for a test environment, where you want to exercise maximum deduplication. In a production environment, however, it is preferable to wait for a number of days (the default is 5 days), because files tend to change a lot for a brief period of time before the change rate slows. This allows for the most efficient use of your server resources.
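
The exclusions and schedule from steps 3 and 4 above can also be configured from PowerShell. A sketch, where the excluded folder and file types are examples of my own choosing:

PS C:\> Set-DedupVolume E: -ExcludeFolder "E:\Scratch" -ExcludeFileType "tmp","log"

PS C:\> New-DedupSchedule -Name "WeekendOptimization" -Type Optimization -Days Saturday,Sunday -Start 08:00 -DurationHours 9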

Step #3: Manage Data Deduplication Optimization Jobs

In Windows Server 2012 Data Deduplication, Optimization Jobs perform the work of deduplicating data and optimizing a volume.  These jobs can be run on-demand (manually) or on a scheduled basis (as configured in Step #2 above).

You can trigger an optimization job on-demand in Windows PowerShell by using the Start-DedupJob cmdlet. For example:

PS C:\> Start-DedupJob -Volume E: -Type Optimization
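
By default, Start-DedupJob queues the job and returns immediately. If you would rather have the command block until the job completes (handy in scripts), you can add the -Wait parameter:

PS C:\> Start-DedupJob -Volume E: -Type Optimization -Wait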

You can query the progress of the job on the volume by using the Get-DedupJob cmdlet:

PS C:\> Get-DedupJob

The Get-DedupJob cmdlet shows the current jobs that are running or queued to run.
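
Once an optimization job completes, the Get-DedupStatus cmdlet reports the savings achieved on the volume:

PS C:\> Get-DedupStatus -Volume E: | Format-List

Look for the SavedSpace value to see how much disk space deduplication has reclaimed.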

How Do I Check the Integrity of a Volume?

Data Deduplication has built-in data integrity features such as checksum validation and metadata consistency checking. It also has built-in redundancy for critical metadata and the most popular data chunks. As data is accessed or processed by deduplication jobs, these features may detect corruption, which is recorded in a log file.  Special scrubbing jobs then analyze the chunk store corruption logs and make repairs.

Repair operations can leverage three sources of redundant data:

  1. Deduplication keeps backup copies of popular chunks (those referenced more than 100 times) in an area called the hotspot. If the working copy is corrupted, deduplication falls back to this backup copy.
  2. When using Storage Spaces in a mirrored configuration, deduplication can use the mirror image of the redundant chunk to serve the I/O and fix the corruption.
  3. If a file is processed with a chunk that is corrupted, the corrupted chunk is eliminated, and the new incoming chunk is used to fix the corruption.

Scrubbing jobs output a summary report in the Windows event log located here:

Event Viewer\Applications and Services Logs\Microsoft\Windows\Deduplication\Scrubbing    
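
You can also read the same scrubbing events from PowerShell. Assuming the event channel name follows the Event Viewer path above:

PS C:\> Get-WinEvent -LogName "Microsoft-Windows-Deduplication/Scrubbing" -MaxEvents 10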

The default Data Deduplication schedule runs a data integrity scrubbing job on a weekly basis, but you can also trigger one on-demand by using the following PowerShell command:

PS C:\> Start-DedupJob E: -Type Scrubbing

This initiates a job that attempts to repair all corruptions that were logged to the internal deduplication corruption log during I/O operations to deduplicated files.

To check the data integrity of all deduplicated data on the volume, use the -Full parameter:

PS C:\> Start-DedupJob E: -Type Scrubbing -Full

Known as deep scrubbing, a full scrubbing job walks the entire set of deduplicated data and looks for all corruptions that could cause data access failures.

How Much Disk Space Can I Expect to Reclaim?

When you install the Data Deduplication role service on a server running Windows Server 2012, DDPEVAL.EXE is also installed in the C:\Windows\System32 folder as an additional command-line tool.  DDPEVAL.EXE can be run against any local NTFS volumes or NTFS network shares to estimate the amount of disk space that can potentially be reclaimed by moving that data to a Windows Server 2012 NTFS volume with Data Deduplication enabled. 

C:\> DDPEVAL \\server\folder /V /O:logfile.txt

When I’ve executed this against various shared folders on my servers, I’ve seen it estimate anywhere between 30% and 80% of disk space reclaimed, depending on the level of duplication and the staleness of the data on a volume.  Wouldn’t it be great to have 30% or more of your storage budget returned next year to spend on other projects?  Results vary from volume to volume, so I’d be very interested in hearing about yours!
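
Once you’ve enabled deduplication on a volume and the first optimization job has run, you can compare DDPEVAL’s estimate against the actual results. Get-DedupVolume reports a SavingsRate for each enabled volume:

PS C:\> Get-DedupVolume E: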

What’s Next? Try it Yourself!

Below are steps you can take to test Data Deduplication in your own lab environment.  Let me know your results!

  • Build Your Lab Environment with Windows Server 2012 using these steps.
  • Don’t Have a Lab? Build Your Lab in the Cloud with Windows Azure!
  • Install and Enable Data Deduplication using the steps above.
  • Join the Windows Server 2012 “Early Experts” study group to learn about the other end-to-end new features in Windows Server 2012!

About Guido …

Guido van Brakel is an experienced Principal Consultant and Subject Matter Expert at Enduria.  Guido is certified on several Microsoft technologies, including Windows Server, SharePoint, and Office 365, and he recently assisted in the development of training courseware for Microsoft SharePoint Server 2013.  A frequent contributor to the IT Pro community, Guido blogs at http://www.enduria.eu/.  Be sure to check out his blog for other great articles on Windows Server 2012!