A customer recently contacted us about a consistent backlog they were experiencing with DFS Replication. Even on fast links, changes to small files (a few KB) would take 4 to 5 hours to replicate. The customer described the environment as follows:

"We have one hub site server and 25 branch site partners. We use one replication group per site to replicate site-specific data folders and one replication group per site to replicate user-specific folders. We also have another replication group across all sites for common data. This replication however is scheduled to replicate only between 12AM and 4AM. So we end up with 50 replication groups and 50 replicated folders.

Bandwidth is being throttled during the day. There are two replication groups for each site, so both are throttled to the same rate - e.g., a site on a 256KB link will have both connections throttled to 16KB during the day. Here is a breakdown of the links being used:

One 100MB Link = Full Bandwidth Always
One 10MB Link = 2MB (7am - 7pm)
A large number of 512KB Links = 64KB (7am - 7pm)
A large number of 256KB Links = 16KB (7am - 7pm)

The total size at the hub server is about 800GB - each branch server ranges from about 10GB up to 100GB. The average file size is difficult to estimate given the huge range - most of the data is Office files ranging from 500KB to 2MB. The staging area is set to the default of 4096 MB for all replicated folders. A few of the members do log events stating that staging folder cleanup has occurred, so those staging quotas will now be increased.

Note that on most links, bandwidth is throttled only between 7am and 7pm. We have observed that the backlog seems to clear overnight."
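
An aside on the staging quotas mentioned above: a commonly cited rule of thumb is that the staging quota for a replicated folder should be at least as large as the combined size of its largest files, with the exact file count depending on the Windows Server version. The sketch below is a minimal, hypothetical way to estimate that figure for comparison against the 4096 MB default; the folder path and the choice of 32 files are assumptions, not values from this case.

```python
import heapq
import os

def sum_of_largest_files(root, n=32):
    """Combined size (bytes) of the n largest files under root.

    n=32 is only an assumption here; check the staging-quota guidance
    for your Windows Server version before relying on a specific count.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                pass  # skip files we cannot stat (locked, permissions, etc.)
    return sum(heapq.nlargest(n, sizes))

# Hypothetical replicated folder path -- substitute your own.
needed = sum_of_largest_files(r"D:\ReplicatedData\SiteData")
print(f"Suggested minimum staging quota: {needed / 2**20:.0f} MB "
      f"(default is 4096 MB)")
```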

Because the customer reported that the backlog went down at night, when throttling was not in effect, and built up again during the day, when it was, we believed that more data was being modified each day than the throttled pipe could replicate. Some specific observations made by our DFSR gurus who reviewed this case:

  • With throttling set at 64 KB and 16 KB, the customer was severely restricting the flow of data for 12 of every 24 hours - that is, during the very hours when data was being generated or modified, it was not allowed to replicate up at more than a trickle.
  • The remaining 12 hours are unthrottled but still run over fairly low-bandwidth pipes, so DFSR can do some catching up overnight; it can’t perform miracles, though, and it certainly can’t do much during the day with only 16 KB or 64 KB (see the back-of-the-envelope numbers after this list).
  • The amount of space allocated to staging was insufficient, which was probably generating extra churn on the disks and limiting the benefit the customer was getting from cross-file RDC.
  • If CPU on the hub server is not consistently high, the customer is probably not hitting any connection limits (which are not hard-coded) but is simply backlogged, possibly with a limit on the number of files being served; throttling is the likely culprit.
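
To put rough numbers on the first two observations, here is a back-of-the-envelope sketch. It takes the customer's "16KB" figure at face value as 16 kilobytes per second; if the throttle was actually configured in kilobits per second (the unit DFSR's bandwidth levels use), the daytime capacity is eight times smaller still. Real throughput also depends on RDC and compression savings, file counts, and protocol overhead, so treat these figures as an upper bound on raw bytes moved.

```python
WINDOW = 12 * 3600  # seconds in each 12-hour window (throttled or not)

def window_capacity(bytes_per_second, seconds=WINDOW):
    """Upper bound on raw bytes that can cross a link in one window."""
    return bytes_per_second * seconds

# Daytime (7am - 7pm), throttled to 16 KB/s as described by the customer.
day = window_capacity(16 * 1024)
# Overnight, unthrottled, but still limited by the 256 Kb/s link itself.
night = window_capacity(256_000 / 8)

print(f"Daytime capacity:   {day / 2**30:.2f} GiB")    # ~0.66 GiB
print(f"Overnight capacity: {night / 2**30:.2f} GiB")  # ~1.29 GiB
# If a branch modifies more than roughly 2 GiB of (post-RDC) data per day,
# the backlog on a 256 Kb/s link can never fully drain.
```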

We recommended a few tests for the customer to determine whether throttling was the cause of the backlog and slow replication:

  • Instead of throttling the low-bandwidth connections to 16 KB and 64 KB during the day, adjust the replication schedule so that those connections do not replicate during the day at all. At night, leave the connections at full bandwidth (unthrottled) or throttle them, depending on the volume of traffic being generated. This helps address an issue known as “starvation,” which can result from keeping several low-bandwidth (throttled) connections open alongside high-bandwidth connections. Starvation can cause backlogs even on the high-bandwidth connections, as the customer experienced.
  • Use the diagnostic report, WMI, or the “dfsrdiag backlog” command to determine whether there is a permanent backlog (even in the morning, when replication should have caught up). A small scripted example follows this list.
  • Use the “Bytes replicated per second” performance counter of the “DFS Replication Connection” object for the inbound connection in question to verify whether the configured maximum bandwidth is being fully consumed.
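
For the second test, a small script makes it easy to poll the backlog for every branch at once. The sketch below simply shells out to "dfsrdiag backlog" and parses its output; the replication group, folder, and server names are placeholders, and the parsing assumes the tool's usual "Backlog File Count" / "No Backlog" output, so adjust as needed for your environment.

```python
import re
import subprocess

def get_backlog_count(rg, rf, sending, receiving):
    """Run 'dfsrdiag backlog' for one connection and return the file count.

    Assumes the DFSR command-line tools are installed and that the output
    contains either a 'Backlog File Count: N' line or a 'No Backlog' message.
    """
    cmd = [
        "dfsrdiag", "backlog",
        f"/rgname:{rg}", f"/rfname:{rf}",
        f"/smem:{sending}", f"/rmem:{receiving}",
    ]
    output = subprocess.run(cmd, capture_output=True, text=True).stdout
    if "No Backlog" in output:
        return 0
    match = re.search(r"Backlog File Count:\s*(\d+)", output)
    return int(match.group(1)) if match else None

# Hypothetical names -- substitute the real replication group, replicated
# folder, hub server, and branch servers from your environment.
for branch in ["BRANCH01", "BRANCH02"]:
    count = get_backlog_count("SiteData-RG", "SiteData", "HUB01", branch)
    print(f"{branch}: backlog = {count}")
```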

The customer reported that disabling bandwidth throttling eliminated the backlogs during working hours and that replication was occurring at acceptable speeds. We also advised the customer to keep checking backlogs across the servers, by using WMI or by scripting the health report (via Dfsradmin) to run off-hours; a rough sketch of the latter follows.
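
The sketch below builds one "dfsradmin health new" command per replication group and could be run from a scheduled task overnight. The switch names are from memory and should be verified against "dfsradmin health new /?" on your version; the group names, hub server, and report path are placeholders.

```python
import subprocess
from datetime import date

# Hypothetical replication groups, hub server, and report path -- substitute your own.
GROUPS = ["SiteData-Branch01", "UserData-Branch01", "CommonData"]
HUB = "HUB01"
REPORT_DIR = r"C:\DfsrReports"

for group in GROUPS:
    report = rf"{REPORT_DIR}\{group}-{date.today():%Y%m%d}"
    # Verify these switch names with 'dfsradmin health new /?' before use.
    cmd = [
        "dfsradmin", "health", "new",
        f"/rgname:{group}",
        f"/refmemname:{HUB}",
        f"/repname:{report}",
    ]
    subprocess.run(cmd, check=False)
```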

--Jill (with many thanks to Shobana Balakrishnan, Dan Boldo, and Roland Nitsch for working with the customer on this case)