Greetings everyone. Warren here again with a blog post that is a collection of common DFSR issues that I have encountered over the past few years. The goal of this blog post is to list these common DFSR configuration mistakes that have caused problems to prevent you making the same mistake. Knowing what not to do is as important as knowing what to do. Many of these points are addressed in other content so links are provided for more in-depth reading.

 

  • Too small of a Staging Area Quota

Are you seeing a lot of event ID’s 4202 and 4204? If so, your staging area is not sized correctly. The downside to an improperly sized staging area is that replication performance will be negatively affected as the service has to spend time cleaning up the staging area instead of replicating files.

DFSR servers are more efficient with a full staging area for at least these two reasons:

  1. It is much more efficient to stage a file once and send it to all downstream partners than to stage a file, replicate the file, purge for each downstream partner.
  2. If at least one member is running Enterprise Edition the servers can take advantage of Cross File RDC

An improperly sized staging area can also cause a replication “loop” to occur. This condition happens when a file get replicated to the downstream server and is present in the staging however the file is purges by the staging area cleanup process before the file can be installed into the Replicated Folder. The purged file will be replicated again to the server that just purged it from its staging as it never got to install the file. This process will keep repeating until the file gets installed.

Don’t ignore staging area events.

See this blog post for the method to use to determine your minimum staging area size

See the section “Increase Staging Quota” here

See “Remote Differential Compression details” here for information on Cross File RDC

  • Improper or Untested Pre-seeding procedure

Pre-seeding is the act of copying the data that will be replicated to a new replica member before they are added to the Replicated Folder with the goal of reducing the amount of time it takes to complete initial replication. Most failed pre-seeding cases I have worked on failed in 3 ways.

  1. ACL mismatch between source and target.
  2. Changes were made to the files after they were copied to the new member
  3. No testing was done to verify the pre-seeding process they were using worked as expected.

In short the files must be copied in a certain way, you cannot change the files after they are staged and you must test your process.

Click here to read Mr. Pyle’s blog on how to properly pre-seed your DFSR servers

  • High backlogs for extended periods of time

Besides the fact that having high backlogs for extended periods of time means your data is out of sync, it can lead to unwanted conflict resolution where a file with older content wins in a conflict resolution scenario. The most common way I have seen this condition hit is when rolling out new RF’s . Instead of doing a phased rollout some admins will add 20 new RF’s from 20 different branch offices at once overloading their hub server. Stagger your rollouts so that initial replication is finished in a reasonable amount of time.

  • DFSR used as a backup solution

Believe it or not some admins run DFSR without backing up the replicated content offline. DFSR was not designed as a backup solution. One of DFSR’s design goals is to be part of an enterprise backup strategy in that it gets your geographically distributed data to a centralized site for backup, restore and archival. Multiple members do offer protection from server failure; however, this does not protect you from accidental deletions. To be fully protected you must backup your data.

  • One way Replication – Using it, or Fixing One way replication the wrong way

In an attempt to prevent unwanted updates from occurring on servers where they know the data will never be changed (or they don’t want changes made there) many customers have configured one-way replication by removing outbound connections from replica members. One-way replication is not supported on any version of DFSR until Windows Server 2008 R2. On Windows 2008 R2 one-way replication is supported provided you configure Read-Only replicated folders. Using Read –Only DFSR members allows you to accomplish the goal of one-way replication, which is preventing unwanted changes to replicated content. If you must use one-way replication with DFSR then you must use Windows 2008 R2 and mark the members as read-only where changes to content should not occur.

Click here and here to learn about Read-Only DFSR replicas.

Another common problem that occurs is when an Admin discovers that one way replication is not supported they go about fixing it the wrong way. Simply enabling two-way replication again can have undesirable results. See the blog post below on how to recover from one-way replication.

Click here to learn about fixing one-way replication.

  • Hub Server - Single Point of Failure or Overworked Hub Servers

I have seen many deployments with just one hub server. When that hub server fails the entire deployment is at risk. If you’re using Windows Server 2003 or 2008 you should have at least 2 hub servers so that if one fails the other can handle the load while the other is repaired with little to no impact to the end users. Starting with Windows Server 2008 R2 you can deploy DFSR on a Windows Failover Cluster which gives you high availability with half of the storage requirement.

Other times admins will have too many branch office servers replicating with a single hub server. This can lead to delays in replication. Knowing how many branch office servers a single hub server can service is a matter of monitoring your backlogs. There is no magic formula as each environment is unique and there are many variables.

Read the “Topology Tuning” section here for ideas on deploying hub servers.

Click here to learn how to setup DFSR a Windows Server 2008 Fail-Over Cluster.

  • Too many Replicated Folders in a single Jet Database

DFSR maintain one Jet database per volume. As a result placing all of your RFs on the same volume puts them all in the same Jet Database. If that Jet database has a problem that requires repair or recovery of the database all of the RF’s on that drive are affected. It is better to spread the RFs out using as many drives as possible to provide maximum uptime for the data.

  • Bargain Basement iSCSI deployments

I have seen more than one DFSR iSCSI deployment where the cheapest hardware available was used. Usually if you are using DFSR it is for some mission critical purpose such as data redundancy, backup consolidation, pushing out applications and OS upgrades on a schedule. Depending on low-end hardware with little to no vendor support is not a good plan. If the data is important to your business then the hardware that runs the OS and replication engine is important to your business.

  • Did not maintain the DFSR service at the current patch level

DFSR is actively maintained by Microsoft and has updates released for it as needed. You should update DFSR when a new release is available during your normal patch cycle. Make sure your servers are up to date per the KB articles listed below.

Windows 2003 R2 DFSR Hotfixes

Windows 2008 and Windows 2008 R2 DFSR Hotfixes

You will notice that updates are listed for NTFS.SYS and other files besides DFSR.EXE/DFSRS.EXE. For replication always make sure DFSR and NTFS are at least at the highest version listed. The other patches listed mostly deal with UI issues that you will at minimum want installed on the systems you perform DFSR configuration tasks on.

Proactively patching the DFSR servers is advisable even if everything is running normally as it will prevent you from getting effected by a known issue.

  • Did not maintain NIC Drivers

DFSR will only work as well as the network you provide for it. Running drivers that are 5 years old is not smart. Yes, I have talked with more than a few customers with NIC drivers that old who’s DFSR replication issue was resolved by updating the NIC driver.

  • Did not monitor DFSR

Despite the fact that the data DFSR is moving around is usually mission critical many Admins have no idea what DFSR is doing until they discover a problem. Savvy admins have created their own backlog scripts to monitor their servers but a lot of customers just “hoped for the best”. The DFSR Management Pack has been out for almost a year now (and other versions for much longer). Deploy it and use it so you can detect problems and respond before they become a nightmare. If you can’t use the DFSR Ops Management pack at a minimum write a script to track your backlogs on a daily basis so that you know that DFSR is replicating files or not.

Click here for information on the Ops Manager DFSR Management Pack.

Updated Jan 19, 2011:

  • Making changes to disk storage without first backing up the data

    If you must replace or add hard drive space to your DFSR server it is critical that you have a current backup of the data in case something goes wrong. Any number if things can go wrong the most common being conflict event s created due to accidental changes to a parent folder or unintentional deletion of a parent folder that is replicated to all partners. You must backup your data before starting the changes and maintain the backup until the project is completed. 
     
  • Stopping the DFSR service to temporarily prevent replication

    Sometimes there is a need to temporarily stop replication. The proper way to accomplish this is to set the schedule to no replication for the Replication Group in question. The DFSR service must be running to be able to read updates to the USN Journal. Do not stop the DFSR service for long periods of time (days, weeks) as doing so may cause a journal wrap to occur (if many files are modified, added, or deleted in the meantime). DFSR will recover from the journal wrap but in large deployments this will take a long time and replication will not occur or will happen very slowly during the journal wrap recovery. You may also see very high backlogs until the journal wrap recovery completes.

I hope this list has been a help to you. Happy replicating.

Warren “wide net” Williams