The Storage Team Blog about file services and storage features in Windows and Windows Server.
A customer recently contacted us about a persistent backlog they were seeing with DFS Replication. Even on fast links, changes to small files (a few KB) took 4 to 5 hours to replicate. The customer described the environment as follows:
"We have one hub site server and 25 branch site partners. We use one replication group per site to replicate site-specific data folders and one replication group per site to replicate user-specific folders. We also have another replication group across all sites for common data. This replication however is scheduled to replicate only between 12AM and 4AM. So we end up with 50 replication groups and 50 replicated folders.
Bandwidth is being throttled during the day. There are two replication groups for each site so they are both throttled to the same rate - e.g. a 256K link site will have both connections each throttled to 16KB during the day. Here is a breakdown of the links being used:
One 100MB Link = Full Bandwidth Always
One 10MB Link = 2MB (7am - 7pm)
A large number of 512KB Links = 64K (7am - 7pm)
A large number of 256KB Links = 16KB (7am - 7pm)
The total size at the hub server is about 800GB - each branch server ranges from about 10GB up to 100GB. The average file size is difficult to estimate given the huge range - most of the data is Office files ranging from 500KB to 2 MB. Staging area is set to the default of 4096 MB for all replicated folders. A few of the members do log events stating that staging folder cleanup has occurred. They will now be increased.
Note that on most links, bandwidth is throttled only between 7am and 7 PM. We have observed that the backlog seems to clear overnight."
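To put the daytime throttle settings in perspective, here is a rough back-of-the-envelope sketch of how much data fits through a throttled connection during the 12-hour window. It assumes the throttle figures are kilobits per second; the customer's description does not make the unit clear, so if the values are really kilobytes per second the results are eight times larger. The function names are illustrative only, and the estimate ignores protocol overhead and any savings from DFS Replication's compression.

```python
# Rough capacity of a throttled connection over one throttling window.
# Assumption: throttle values are kilobits per second (1 kilobit = 1000 bits).
# If the customer's figures are kilobytes per second, multiply results by 8.


def window_capacity_mb(throttle_kbps: float, window_hours: float) -> float:
    """Megabytes (10^6 bytes) that fit through the throttle in the window."""
    total_bits = throttle_kbps * 1000 * window_hours * 3600
    return total_bits / 8 / 1_000_000


def files_per_window(throttle_kbps: float, window_hours: float, file_mb: float) -> float:
    """How many files of the given size fit in the window, ignoring overhead."""
    return window_capacity_mb(throttle_kbps, window_hours) / file_mb


if __name__ == "__main__":
    # 7am to 7pm is a 12-hour throttling window.
    for kbps in (16, 64):
        cap = window_capacity_mb(kbps, 12)
        print(
            f"{kbps} Kbps for 12 h: about {cap:.1f} MB, or "
            f"{files_per_window(kbps, 12, 2.0):.0f} to "
            f"{files_per_window(kbps, 12, 0.5):.0f} files of 2 MB to 500 KB"
        )
```

Under these assumptions, a connection throttled to 16 Kbps carries roughly 86 MB in a 12-hour day, which is only a few dozen to a couple of hundred Office documents. A small file that is queued behind larger changes can therefore wait hours for its turn.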
Because the customer reported that the backlog shrinks at night, when throttling is not in effect, and builds up again during the day, when it is, we suspected that the volume of data modified during working hours exceeded what the throttled connections could carry. Some specific observations made by our DFSR gurus who reviewed this case:
We recommended a few tests for the customer to determine whether throttling was the cause of the backlog and slow replication:
The customer reported that disabling bandwidth throttling eliminated the backlogs during working hours and that replication was then occurring at acceptable speeds. We advised the customer to monitor backlogs across the servers by using WMI or by scripting the health report (via Dfsradmin) to run off-hours.
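As one possible way to script a backlog check, the sketch below builds command lines for the Dfsrdiag backlog command in both directions between a hub and a branch. Note that this swaps in Dfsrdiag rather than the WMI or Dfsradmin approaches mentioned above, and that the group, folder, and server names are hypothetical placeholders. The code only assembles the arguments; it does not run anything.

```python
from typing import List


def backlog_command(rg_name: str, rf_name: str,
                    sending_member: str, receiving_member: str) -> List[str]:
    """Build a 'dfsrdiag backlog' command line for one direction of replication."""
    return [
        "dfsrdiag", "backlog",
        f"/rgname:{rg_name}",         # replication group name
        f"/rfname:{rf_name}",         # replicated folder name
        f"/smem:{sending_member}",    # sending member
        f"/rmem:{receiving_member}",  # receiving member
    ]


def both_directions(rg_name: str, rf_name: str,
                    hub: str, branch: str) -> List[List[str]]:
    """Backlog can differ per direction, so check hub-to-branch and branch-to-hub."""
    return [
        backlog_command(rg_name, rf_name, hub, branch),
        backlog_command(rg_name, rf_name, branch, hub),
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in both_directions("Site01-Data", "Data", "HUB01", "BRANCH01"):
        print(" ".join(cmd))
```

On a Windows server with the DFS management tools installed, each argument list could be passed to `subprocess.run` from a scheduled task that runs off-hours, with the output saved for later review.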
--Jill (with many thanks to Shobana Balakrishnan, Dan Boldo, and Roland Nitsch for working with the customer on this case)