DFS Replication Improvements in Windows Server 2012

Hi folks, Ned Pyle here. As promised when I left AskDS and MS Support for greener pastures, I’m still in the blogging game – I told you I’d be back! Let’s start things off talking about improvements in Windows Server 2012 and DFS Replication (DFSR).

Windows Server 2012 DFSR focuses on reliability and supportability changes based on direct field and MS Support feedback. This release doesn’t contain many new features but is much easier to troubleshoot and is more resilient to environmental issues. In the end, that makes your life easier. And every IT department could use some easier…

[Image: If this is your daily routine, we can help]

I can only assume you already know DFSR from all of my old write-ups, so let’s dive into the details.

Unexpected shutdown worker progress

DFSR uses a per-volume ESE (aka “Jet”) database to track all file changes in replicated folders on their individual volumes. DFSR contains code to attempt graceful and dirty recovery of the database after an unexpected shutdown. Mallikarjun Chadalapaka has a great write-up on dirty shutdown recovery here.

Previous OS behavior

On detecting a dirty shutdown, DFSR begins a recovery process. This starts with logging event 2212:

Event ID=2212

Severity=Warning

The DFS Replication service has detected an unexpected shutdown on volume %2. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. The service has automatically initiated a recovery process. The service will rebuild the database if it determines it cannot reliably recover. No user action is required.

Additional Information:

Volume: %2

GUID: %1

If the recovery is successful, DFSR logs event 2214:

Event ID=2214

Severity=Informational

The DFS Replication service successfully recovered from an unexpected shutdown on volume %2. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. No user action is required.

Additional Information:

Volume: %2

GUID: %1

If the recovery is unsuccessful, DFSR logs event 2216:

Event ID=2216

Severity=Error

The DFS Replication service failed to recover from an unexpected shutdown on volume %2. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. Recovery will be attempted periodically in %3 seconds. No user action is required.

Additional Information:

Error: %4 (%5)

Volume: %2

Guid: %1

DFSR didn’t log how a recovery was progressing, though. This made troubleshooting tricky, and we found that customers would sometimes think the recovery had hung or halted and start trying to fix things (perhaps making things worse).

Windows Server 2012 behavior

Two new event log messages now appear that describe where the internal repair process stands. You now know that DFSR has moved past the detection phase and into the consistency checking and rebuilding phase.

Event ID=2218

Severity=Informational

Message=

The DFS Replication service is in the second step of replication database consistency checks after an unexpected shutdown. The database will be rebuilt if it cannot be recovered. No user action is required.

Additional Information:

Volume: %2

GUID: %1

Event ID=2220

Severity=Informational

Message=

The DFS Replication service is in the third step of replication database consistency checks after an unexpected shutdown. Database recovery is currently in progress. No user action is required.

Additional Information:

Volume: %2

GUID: %1

Just be patient – it will complete. If in doubt, contact Microsoft Support – don’t try to get out and push.
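
If you collect these events with a script, the sequence above can be turned into a simple status check. Here is a minimal Python sketch; the event IDs and their meanings come from the messages quoted above, while the function name and status labels are purely illustrative and not anything DFSR itself exposes. It assumes you already have the event IDs for one volume in log order.

```python
# Illustrative only: interpret DFSR dirty-shutdown recovery events for one volume.
# Event IDs are taken from the messages quoted in this post; the labels are made up.
RECOVERY_EVENTS = {
    2212: "unexpected shutdown detected, recovery started",
    2218: "step 2: database consistency checks",
    2220: "step 3: database recovery in progress",
    2214: "recovered successfully",
    2216: "recovery failed, will be retried",
}


def recovery_status(event_ids):
    """Return the status implied by the most recent recovery event (log order)."""
    status = "no recovery events seen"
    for event_id in event_ids:
        if event_id in RECOVERY_EVENTS:
            status = RECOVERY_EVENTS[event_id]
    return status


print(recovery_status([2212, 2218, 2220]))  # step 3: database recovery in progress
```

A status of step 2 or step 3 means the repair is still underway, which is exactly the case where waiting is the right move.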

Performance registry defaults

DFSR contains registry overrides to control behaviors like the number of files to replicate simultaneously, stage simultaneously, etc.

Previous OS behavior

The default settings in Windows Server 2008 R2 were a bit too conservative. After release, we tested tweaked registry settings that resulted in roughly double the performance of the default settings.

Windows Server 2012 behavior

These more aggressive settings are now the default in Windows Server 2012 (if not overridden in the registry by you):

  • AsyncIoMaxBufferSizeBytes
      • New default value: 8388608
  • RpcFileBufferSize
      • New default value: 524288
  • StagingThreadCount
      • New default value: 8
  • TotalCreditsMaxCount
      • New default value: 4096
  • UpdateWorkerThreadCount
      • New default value: 32

The allowed ranges are unchanged except for UpdateWorkerThreadCount (see below).
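
If you carried registry overrides forward from an older server, it can help to see which of them still differ from the new defaults. The following Python sketch only models that comparison; the default values are the ones listed above, but the `overrides` dictionary is a hypothetical stand-in for whatever you have actually set, and reading the registry itself is deliberately left out.

```python
# Windows Server 2012 defaults as listed in this post.
WS2012_DEFAULTS = {
    "AsyncIoMaxBufferSizeBytes": 8388608,  # 8 * 1024 * 1024
    "RpcFileBufferSize": 524288,           # 512 * 1024
    "StagingThreadCount": 8,
    "TotalCreditsMaxCount": 4096,
    "UpdateWorkerThreadCount": 32,
}


def effective_settings(overrides):
    """Overrides win over the defaults; names not listed above are ignored."""
    settings = dict(WS2012_DEFAULTS)
    for name, value in overrides.items():
        if name in settings:
            settings[name] = value
    return settings


def non_default(overrides):
    """Return only the overrides that actually differ from the 2012 defaults."""
    return {
        name: value
        for name, value in overrides.items()
        if name in WS2012_DEFAULTS and WS2012_DEFAULTS[name] != value
    }


# Hypothetical overrides carried over from an older, hand-tuned server.
overrides = {"StagingThreadCount": 6, "UpdateWorkerThreadCount": 32}
print(non_default(overrides))  # {'StagingThreadCount': 6}
```

In this made-up example the UpdateWorkerThreadCount override is redundant, while the StagingThreadCount override now holds the server below the new default.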

UpdateWorkerThreadCount max

UpdateWorkerThreadCount controls the number of simultaneously inbound-replicating files to a DFSR server.

Previous OS behavior

The maximum configurable value in Windows Server 2008 R2 is 64. If you set UpdateWorkerThreadCount to that maximum of 64, it is possible to see intermittent DFSR service deadlocks. This manifests as a hung service, which for customers is nearly impossible to troubleshoot (you need a debugger and private symbols). Because the issue may not happen for days or weeks, there is no easy way to correlate cause and effect.

Windows Server 2012 behavior

The maximum value is now 63. Voila!
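
If you audit settings across a mixed fleet, a tiny check like the one below can flag the risky value. This is a sketch under stated assumptions: the two maximums come from this post, the function name and messages are invented for illustration, and it says nothing about what the service does with an out-of-range value.

```python
MAX_2008R2 = 64  # allowed on Windows Server 2008 R2, but deadlock-prone
MAX_2012 = 63    # new maximum in Windows Server 2012


def check_update_worker_thread_count(value, windows_server_2012):
    """Return 'ok' or a short warning for an UpdateWorkerThreadCount value."""
    if windows_server_2012:
        if value > MAX_2012:
            return f"{value} exceeds the Windows Server 2012 maximum of {MAX_2012}"
        return "ok"
    if value > MAX_2008R2:
        return f"{value} exceeds the Windows Server 2008 R2 maximum of {MAX_2008R2}"
    if value == MAX_2008R2:
        return "64 is allowed on Windows Server 2008 R2 but risks intermittent deadlocks"
    return "ok"


print(check_update_worker_thread_count(64, windows_server_2012=False))
```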

Read Only Domain Controller support for DFS Management

Administrators use the DFS Management snap-in (Dfsmgmt.msc) for all graphical configuration of DFSR.

Previous OS behavior

DFS Management was introduced in Windows Server 2003 Service Pack 1, long before read-only domain controllers (RODCs) existed. It expected all domain controllers to be writable when creating a replication group or any other AD objects. When DFS Management tries to write to an RODC, it fails with an access denied error. This issue has existed since Windows Server 2008, but since RODC usage was lower then and RODCs tend to exist mainly in branch offices, we never saw it until much later. Now that RODCs are everywhere, well…

Windows Server 2012 behavior

DFS Management now requests only writable domain controllers when making DC queries.

Read-only disconnected topology detection

DFS Management contains a topology checking routine to alert administrators when they have created an incomplete (aka "disconnected") DFS replication topology. A disconnected topology prevents eventual replication of data, leading to divergence, user confusion, and potential data loss.

Previous OS behavior

A bridged topology of A <-> B <-> C is not flagged as disconnected when B is a read-only replicated folder. Because there is no outbound replication on a read-only member, any files created on A or C will not replicate further than B, so users on A and C will potentially see different versions of files, or no files at all.

Windows Server 2012 behavior

The topology checker code now understands the bridged read-only replicated folder scenario and appropriately warns you when detected.
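
To see why the bridged case counts as disconnected, here is a minimal Python sketch of the idea. It is not the code DFS Management uses; it is a plain breadth-first reachability check, assuming that a read-only member never replicates outbound and never originates changes. The function and parameter names are illustrative.

```python
from collections import deque


def unreachable_pairs(members, connections, read_only):
    """Return (source, target) pairs where changes made on source never reach target.

    connections: one-way (sender, receiver) pairs.
    read_only:   members hosting a read-only replicated folder (no outbound replication).
    """
    outbound = {member: set() for member in members}
    for sender, receiver in connections:
        if sender not in read_only:  # read-only members never send changes
            outbound[sender].add(receiver)

    problems = []
    for source in members:
        if source in read_only:
            continue  # no local changes originate on a read-only member
        seen = {source}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in outbound[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        for target in members:
            if target not in seen:
                problems.append((source, target))
    return problems


members = ["A", "B", "C"]
connections = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B")]  # A <-> B <-> C

print(unreachable_pairs(members, connections, read_only=set()))  # []
print(unreachable_pairs(members, connections, read_only={"B"}))  # [('A', 'C'), ('C', 'A')]
```

With B writable, everything reaches everything. With B read-only, changes on A stop at B and never reach C, and the same holds in the other direction.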

4412 conflict event data

DFSR uses a series of conflict resolution algorithms to detect file collisions and appropriately handle a winning and losing file. DFSR notes these in a per-collision 4412 informational event log entry.

Previous OS behavior

The 4412 event did not contain quite enough information to easily troubleshoot unexpected collisions. For example:

Message=

The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

Additional Information:

Original File Path: D:\Windows\SYSVOL\domain\Policies\{E75E8CC5-27B3-483F-AA79-FFF726236A0A}\Adm

New Name in Conflict Folder: Adm-{EE271589-88F7-4E8C-A057-013CF75B352B}-v294528

Replicated Folder Root: D:\Windows\SYSVOL\domain

File ID: {3351DB9B-9DAF-4273-90C1-FC347266BBD2}-v29180999

Replicated Folder Name: SYSVOL Share

Replicated Folder ID: 29578A90-233A-48B7-B8C3-1BB0A05873EC

Replication Group Name: Domain System Volume

Replication Group ID: 70AC3FC4-60FC-4D15-964D-AE0F96098E60

Member ID: C6D34675-591E-4FC9-B88E-06AFC659CAED

Windows Server 2012 behavior

The 4412 event message now contains an additional field of Partner Member ID that lists the winning server's identity.

Message=

The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

Additional Information:

Original File Path: D:\Windows\SYSVOL\domain\Policies\{E75E8CC5-27B3-483F-AA79-FFF726236A0A}\Adm

New Name in Conflict Folder: Adm-{EE271589-88F7-4E8C-A057-013CF75B352B}-v294528

Replicated Folder Root: D:\Windows\SYSVOL\domain

File ID: {3351DB9B-9DAF-4273-90C1-FC347266BBD2}-v29180999

Replicated Folder Name: SYSVOL Share

Replicated Folder ID: 29578A90-233A-48B7-B8C3-1BB0A05873EC

Replication Group Name: Domain System Volume

Replication Group ID: 70AC3FC4-60FC-4D15-964D-AE0F96098E60

Member ID: C6D34675-591E-4FC9-B88E-06AFC659CAED

Partner Member ID: 2716E4E2-ED01-4285-9137-FACB4EE84C4A

You can use DFSRDIAG GUID2NAME to translate that partner GUID into a human-friendly name. For example:

[Image: DFSRDIAG GUID2NAME output]
Aha! FSF-02 won.
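
If you process 4412 events in bulk, the "Additional Information" fields are easy to pull apart. The sketch below is a minimal Python example that assumes you already have the event message as text in the "Name: value" layout shown above; the function name is illustrative, and the sample is an abbreviated copy of the event from this post.

```python
def parse_additional_info(message):
    """Parse the 'Name: value' lines after 'Additional Information:' into a dict."""
    fields = {}
    in_info = False
    for line in message.splitlines():
        line = line.strip()
        if line == "Additional Information:":
            in_info = True
            continue
        if in_info and ": " in line:
            # Split on the first ': ' only, so paths such as D:\... stay intact.
            name, value = line.split(": ", 1)
            fields[name] = value
    return fields


sample = r"""The DFS Replication service detected that a file was changed on multiple servers.

Additional Information:
Original File Path: D:\Windows\SYSVOL\domain\Policies\{E75E8CC5-27B3-483F-AA79-FFF726236A0A}\Adm
Member ID: C6D34675-591E-4FC9-B88E-06AFC659CAED
Partner Member ID: 2716E4E2-ED01-4285-9137-FACB4EE84C4A
"""

fields = parse_additional_info(sample)
# On pre-2012 servers the field is absent, so .get() returns None.
print(fields.get("Partner Member ID"))
```

The GUID it prints is the value you would then hand to DFSRDIAG GUID2NAME.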

Editions restrictions removed

There is no Windows Server 2012 Enterprise Edition; instead, you can purchase Windows Server 2012 Standard or Windows Server 2012 Datacenter, which is no longer an OEM-only SKU and exists to provide unlimited virtualization licenses.

Previous OS behavior

DFSR cross-file Remote Differential Compression (RDC) support ties to the server edition being Enterprise or Datacenter. DFSR cluster support ties to Enterprise or Datacenter editions as well, through internal checks. Implicitly, DFSR cluster support requires Enterprise or higher because the Failover Cluster features only exist on those editions.

Windows Server 2012 behavior

All edition checks are removed and Windows Server 2012 has full DFSR capabilities even in Windows Server 2012 Standard.

Initial sync to read-only replicated folders with preexisting data

Read-only (RO) replicated folders are always non-authoritative and do not allow local changes by use of an IO-blocking filter driver named dfsrro.sys. You are encouraged to pre-seed data before initial sync, meaning that data can already exist when DFSR is configured on two or more servers.

Previous OS behavior

Windows Server 2008 R2 SP1 introduced a regression (that we recently fixed) where initial sync from Read Write (RW) to RO does not overwrite file differences on the RO. This leads to data inconsistencies in the replication groups, as these differing files will never be right on RO servers unless they are later modified again on the RW. Which rather defeats the purpose of pre-seeding.

Windows Server 2012 behavior

This is fixed. :)

DC port 5722

DFSR uses TCP/IP and RPC to replicate files, and we finally fixed an old scenario where domain controllers differed in port usage from member servers.

Previous OS behavior

In Windows Server 2008 and Windows Server 2008 R2, a domain controller replicating SYSVOL and/or custom replicated folders with DFSR used TCP port 5722. This was due to a bug I discussed back on AskDS.

Windows Server 2012 behavior

This is also fixed. Now DCs will operate consistently like member servers, listening on a dynamic port in the 49152-65535 range unless you choose to hard code a port. If you have gotten used to 5722 and reaaaaally like using hard-coded ports, you can return to the old behavior with this command:

Dfsrdiag.exe staticrpc /port:5722

I doubt the person who takes over your job someday will thank you for it though…
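
When reviewing firewall rules or netstat output after an upgrade, it can be handy to label which kind of port a DFSR listener is using. Here is a small Python sketch; the port numbers come from this section, while the function name and labels are illustrative assumptions.

```python
LEGACY_DC_PORT = 5722
DYNAMIC_RANGE = range(49152, 65536)  # 49152-65535 inclusive


def classify_dfsr_port(port):
    """Label a DFSR listening port using the ranges described in this post."""
    if port == LEGACY_DC_PORT:
        return "legacy DC port"
    if port in DYNAMIC_RANGE:
        return "dynamic range"
    return "hard-coded port"


print(classify_dfsr_port(5722))   # legacy DC port
print(classify_dfsr_port(49152))  # dynamic range
```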

Fixed missing DFSR migration event 6806

When using DFSRMIG.EXE to migrate your SYSVOL from using FRS to DFSR, event log entries tell you how things are proceeding and if there are any problems you need to investigate before moving to the next phase.

Previous OS behavior

In Windows Server 2008 R2, a timing issue could give you an expected warning 6804 with the rather scary message:

The DFS Replication service has detected that no connections are configured for replication group Domain System Volume. No data is being replicated for this replication group.

Once AD replication and the migration caught up, we should have logged a 6806 event saying everything was fine. But we forgot to. Errp.

Windows Server 2012 behavior

Now we log that missing 6806 event letting you know that all is well and migration is working.

Replicated folder removal and replication

Replicated folders are the base of replication and the top level of a content set in DFSR database terms.

Previous OS behavior

In Windows Server 2008 R2, removing a replicated folder stopped replication of all other RFs until the removal completed.

Windows Server 2012 behavior

Now you can remove a replicated folder (thereby causing DFSR to update its DFSR database and stop replicating that content set) and not see other replicated folders pause replication. This keeps a hub server working efficiently when you decide to decommission a branch node. Faster also implicitly means increased reliability, as we are not spending large amounts of time with replication halted.

Staging messaging

Windows Server 2008 R2 SP1 introduced a little-known hotfix to update the Dfsmgmt.msc wizards for new replication groups and new replicated folders. This provides further guidance around configuring the staging folder quota to prevent performance bottlenecks.

This capability is now native to Windows Server 2012.

Added support for Dedup, FCI, and DAC file modifications

Data Deduplication support

We modified the DFSR allowed reparse point replication rules to support replicating the new IO_REPARSE_TAG_DEDUP tag. This type of reparse point tag is part of the new file deduplication system. This isn’t truly reparse point replication; the file is “rehydrated” and replicated as a normal file, then put back into its dedup’ed state on the downstream server. Slick.

File Classification Infrastructure support

We modified File Classification Infrastructure (FCI) to prevent re-writing unchanged data to the alternate data stream on files during classification passes. This previously caused replication storms in Windows Server 2008 R2. Note: you should still only configure FCI on one server (usually the hub), not multiple servers.

Dynamic Access Control Support

Changes made to APIs used to access new NTFS data structures for auditing and conditional ACE security required updates to DFSR in Windows Server 2012. Because Windows Server 2008 R2 and older operating systems do not implement these APIs though (and therefore cannot use or display these ACLs) they did not require changes. Therefore, there is no back port required to configure replication between a Windows Server 2008 R2 and Windows Server 2012 replicated folder.

But!

Microsoft strongly discourages mixed Windows Server 2012 and legacy operating system DFSR.

There are significant NTFS security data differences between Windows Server 2012 and earlier operating systems, often to facilitate Dynamic Access Control features. Moreover, any claims-based access configuration will not work consistently in a design that allows users to connect to Windows Server 2008 R2 and Windows Server 2012 versions of a replicated file; one server might grant more or less access than the other.

For example, if someone modifies the security of a file on a Win2008 R2 server, DFSR packages that up with the file (this is called “marshalling”) and sends it along as-is. When a user then attempts to access the file on the Win2012 server, the claims-based security elements no longer exist, and the user is denied access. More troubling, if you were letting users access the data from multiple DFSN-provided shares, they would be calling you with the infamous “it sometimes works and sometimes fails” symptom that drives IT pros batty.

However!

Central Access Policies modify individual files and folders to contain a special SID in the tail of the SACL structure when adding the CAP rules the first time. This means that first applying a CAP triggers replication of all folders and files replicated under the auspices of the CAP structure, just like it would with any other security change to the classic DACL.

Subsequent changes to the rules of an already-added CAP do not alter the files, however – this is the beauty of Central Access Policy. This means that once replication completes, you can change the security on files without triggering further replication. This is a seriously cool feature if you are a DFSR administrator, and it means once you deploy CAP, further security changes to an existing policy are completely non-intrusive to replication!

Ideally, configure CAP and File Classification Infrastructure on the file structure before configuring DFSR; that way you only pay the replication price once during DFSR initial sync. And to reiterate, use Windows Server 2012 on all nodes before deploying DAC. If you need help migrating existing DFSR environments, I recommend this series. It goes without saying that when using Windows Server 2012, CAP/DAC will only be effective if you apply the CAP to all nodes being replicated - otherwise you end up with differing security per node.

ReFS

DFSR does not support ReFS volumes, as this new file system removes many critical data types used or supported by DFSR, such as streams, sparse files*, compressed files, 8.3 names, extended attributes, etc.

* Update Jan 9, 2013 - it turns out (despite what you will read on most of the Internet, including the Build 8 blog) that we added sparse file support to ReFS right at the tail end of development. So it's there.

DFSR does not allow you to replicate ReFS volumes. The service checks that the volume uses NTFS and, if it does not, fails gracefully.

Dfsmgmt.msc prevents an administrator from accidentally configuring a ReFS volume. Even if you pre-create the folder and use DFSRADMIN to bypass the check, DFSR prevents replication with event 6404 ("The local path is not the fully qualified path name of an existing, accessible local folder."). The debug log will show error 9225 ("volume was not found").

[Image: No ReFS allowed!]

CSV

Just like Windows Server 2008 R2, DFSR in Windows Server 2012 does not support Cluster Shared Volumes (CSV).

Autorecovery Disabled

Just like Windows Server 2008 R2, DFSR in Windows Server 2012 includes the database autorecovery change:

  • KB 2663685 - Changes that are not replicated to a downstream server are lost on the upstream server after an automatic recovery process occurs in a DFS Replication environment in Windows Server 2008 R2 - http://support.microsoft.com/kb/2663685

Complex nested folder creation-deletion-replication fix

Just like Windows Server 2008 R2, DFSR in Windows Server 2012 includes the latest reliability changes for handling complex nested file and folder creation and deletion on partner nodes:

  • KB 2450944 - Some folders or files are unexpectedly deleted on the upstream server after you restart the DFS Replication service in Windows Server 2003 R2, in Windows Server 2008 or in Windows Server 2008 R2 - http://support.microsoft.com/kb/2450944

File creation conflict algorithm

Windows Server 2012 changes the only disparate file conflict resolution algorithm previously used, from "first creator wins" to "last creator wins", in order to be more consistent. For more information about this topic, see this article.
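
As a rough illustration of the difference, here is a minimal Python sketch that picks a winner between two same-named files created on different servers. It models only the creation-time comparison named above; whatever tie-breaking DFSR really performs is not modeled, and the class, function, and server names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CreatedFile:
    server: str        # hypothetical server name
    created: datetime  # file creation time


def creation_conflict_winner(a, b, last_creator_wins=True):
    """Pick the winner of a file creation conflict by creation time only.

    last_creator_wins=True models the Windows Server 2012 rule,
    False the earlier "first creator wins" rule. Ties return `a` (arbitrary).
    """
    chooser = max if last_creator_wins else min
    return chooser(a, b, key=lambda f: f.created)


first = CreatedFile("SRV-A", datetime(2012, 9, 1, 10, 0))
second = CreatedFile("SRV-B", datetime(2012, 9, 1, 10, 5))

print(creation_conflict_winner(first, second).server)                           # SRV-B
print(creation_conflict_winner(first, second, last_creator_wins=False).server)  # SRV-A
```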

Keep alive support added for huge files

Windows Server 2012 now correctly allows very large (many many GB) files to complete computation of RDC signatures before the RPC server connection times out. In prior OSes the file would never replicate due to timing constraints. This mainly happened with files that were hundreds of GB.

But!

64GB files are still the supported maximum. So this is us being nice and helping you in a scenario that is, technically, still unsupported.
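
If you want to know whether a replicated folder contains files above the supported size, a quick scan will tell you. This Python sketch walks a folder tree and lists files larger than a limit; it assumes "64GB" means 64 * 1024^3 bytes, and the function name is illustrative.

```python
import os

SUPPORTED_MAX_BYTES = 64 * 1024**3  # assumption: "64GB" read as 64 GiB


def files_over_limit(root, limit=SUPPORTED_MAX_BYTES):
    """Return (path, size) for every file under root that is larger than limit."""
    oversized = []
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            if size > limit:
                oversized.append((path, size))
    return oversized
```

Pass the replicated folder root as `root`; an empty list means nothing exceeds the limit.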

As a final note: I didn’t include all the fixes released as updates to Windows Server 2008 R2 that are also part of Windows Server 2012, just the more interesting ones. So as a rule of thumb, if you got a hotfix for Win2008 R2 before Win2012 RTM’ed, the latter has the update built-in.

And that’s it. Nice, eh?

- Ned “it’s all good” Pyle

  • Great post! Thanks, very informative!

  • Second that - great post!

    I'm looking to update a client to 2012 mainly for the Hyper-V improvements and DFSR improvements.  Now I'm just tossing up whether to use DFSR for a chunk of the offsite backup still or just switch to Hyper-V replica.   (they still have onsite backups - I just like to have belts & braces)   ;)

  • Excellent post. Thanks.

    Can a 2012 HUB be introduced into an existing 2008R2 environment? I've a new file server to throw into a country and will make that 2012, would be ideal to have that replicate to a 2012 HUB here at Head Office, where we already have 2008R2.

  • I actually find these "improvements" quite disappointing. These are just bug fixes and performance tweaks, most of which can be applied to Windows Server 2008 R2.

    Where's the GUI for resolving conflict files? Where's cross-server file locking so that we don't have to use third-party apps like PeerLock? Why wasn't the 64 GB file size limit increased? Why can't we use full folder paths for subfolder and file filters?

    Those would be highly useful, very practical real-world improvements for DFSR. Businesses were hoping to get these kinds of improvements with Windows Server 2012...

  • @ Noel - yes, you'd just follow the steps here: blogs.technet.com/.../series-wrap-up-and-downloads-replacing-dfsr-member-hardware-or-os.aspx. Remember the point above that you should not deploy claims-based access/central access policy to these machines until you have all Win2012 though.

    @Taylorbox - Not most, but I understand where you're coming from. This is the iterative process - things take time, and most* of your requests are well-known and desirable; i.e. no one is arguing that they are bad ideas or that we don't want to do them, only that we have finite resource here. A lot of development energy went into other (massive) technologies in Win2012 and that left DFSR a bit starved this last go-around.

    *I've not heard this one before: "Where's the GUI for resolving conflict files?" Can you explain this one to me in more detail? Do you mean restoring files from C&D/Preexisting, like with restoredfsr.vbs, or something else?

  • @ NedPyle - Thanks for replying. I had already supposed that development resources for DFSR didn't get much love this last go-around...despite the billions Microsoft has available.  :o)

    Third-party cross-server file replication/syncing software such as SureSync and PeerLink offer the following features for conflict files:

    -A GUI that admins can access to easily see which files are in conflict, who edited them, when, and which offers conflict resolution options

    -Email alerts for conflicts, which provide the information mentioned above (so much simpler than having to work with the Event logs)

    -The option to not replicate conflict files until an admin manually intervenes

    Such a GUI and features would equip admins with the tools necessary to handle conflicts much more easily than having to dig into the Conflict & Deleted folder and sort through the Event logs.

  • My pleasure - the whole idea here with Comments is to have a dialogue. Even when you are disappointed. ;-P

    That's a useful case to understand. So in technologies where conflicts are unexpected (as they are not DFSR-style multi-master) you like having the ability to resolve conflicts at the point of detection. What size of dataset and churn are you seeing in your environment with this setup? I.e. how many conflicts do you have to mediate a day, on average? When would it be too many?

    I agree that the C&D folder as implemented is not useful for 99% of customers, since it has no way to extract data.

  • Holy cow, lots of behind the scenes stuff. Great writeup Ned!

  • Excellent.

  • @ NedPyle - Yes, the ability to be notified about conflicts at the point of detection with an easily configured email telling the who, what, where, and when would be very beneficial (rather than having to configure Tasks with scripts for Event 4412), as well as being able to resolve conflicts directly in a GUI.

    Although we're not a large enterprise operation, the key point to us is we don't want to risk having a very important business doc overwritten and we only find out after the fact, and then we have to go try digging it out of the C&D folder. Even if that scenario doesn't happen often, it's more about the quality of the DFSR service over the quantity of conflicts that happen...

  • That is great food for thought, Taylorbox.

    Thanks everyone. :)

  • Any new guidance on scaling limits?  This article (technet.microsoft.com/.../cc773238(v=ws.10).aspx) seems to indicate not, but just curious:

    "The following list provides a set of scalability guidelines that have been tested by Microsoft on Windows Server 2012 Windows Server 2008 R2 and Windows Server 2008:

    Size of all replicated files on a server: 10 terabytes.

    Number of replicated files on a volume: 11 million.

    Maximum file size: 64 gigabytes."

  • @ Ed Swindelles

    We're thinking very hard about this. :) We recently changed the "Number of replicated files on a volume:" to 11 million from the previous 8 million, based on some internal testing. Otherwise, for now, it's business as usual. If this changes I will be sure to write a new post and sing it loud as this is by far our most common request these days.

  • @NedPyle[MSFT] Further to what Ed Swindelles was asking, that 10TB hard cap is KILLING me.  This is a chance for Microsoft to really make a definitive feature statement that would elevate them above software vendors like Vision Solutions DoubleTake.   Even the ability to create a DFS namespace up to 10TB *per virtualized storage pool entity* on a single server would be a step forward here. We can have multiple storage pools, so why not allow us to have discrete DFS replication limits per pool instead of per server?  I've got a deal on the stove right now that will probably get tossed because the reseller and distributor were unaware of the 10TB limit on DFS and didn't engage an HP Storage SA (like myself) early on when I could've suggested something else (low end SAN w/remote replication), rather than now when the products have already shipped. They shipped with WSS2008R2, but I could probably save the deal if I could say the problem would go away with WSS2012 next year...

  • I hear you, The_Rob_HP. This is the top priority for us (no exaggeration).