Tuning replication performance in DFSR (especially on Win2008 R2)

Hi all, Ned here again. There are a number of ways that DFSR can be tuned for better performance. This article will go through these configurations and explain the caveats. Even if you cannot deploy Windows Server 2008 R2 - for the absolute best performance - you can at least remove common bottlenecks from your older environments. If you are really serious about performance in higher node count DFSR environments though, Win2008 R2’s 3rd generation DFSR is the answer.

If you’ve been following DFSR for the past few years, you already know about some improvements that were made to performance and scalability starting in Windows Server 2008:

| Windows Server 2003 R2 | Windows Server 2008 |
| --- | --- |
| Multiple RPC calls | RPC Async Pipes (when replicating with other servers running Windows Server 2008) |
| Synchronous inputs/outputs (I/Os) | Asynchronous I/Os |
| Buffered I/Os | Unbuffered I/Os |
| Normal priority I/Os | Low priority I/Os (this reduces the load on the system as a result of replication) |
| 4 concurrent file downloads | 16 concurrent file downloads |

But there’s more you can do, especially in 2008 R2.

Registry tuning

All registry values are REG_DWORD (and in the explanations below, are always in decimal). All registry tuning for DFSR in Win2008 and Win2008 R2 is made here:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DFSR\Parameters\Settings

A restart of the DFSR service is required for the settings to take effect, but a reboot is not required. The list below is not complete, but instead covers the important values for performance. Do not assume that setting a value to the max will make it faster; some settings have a practical limitation before other bottlenecks make higher values irrelevant.

Important Note: None of these registry settings apply to Windows Server 2003 R2.

AsyncIoMaxBufferSizeBytes
Default value: 2097152
Possible values: 1048576, 2097152, 4194304, 8388608
Tested high performance value: 8388608
Set on: All DFSR nodes

RpcFileBufferSize
Default value: 262144
Possible values: 262144, 524288
Tested high performance value: 524288
Set on: All DFSR nodes

StagingThreadCount
Default value: 6
(Win2008 R2 only; cannot be changed on Win2008)
Possible values: 4-16
Tested high performance value: 8
Set on: All DFSR nodes. Setting to 16 may generate too much disk I/O to be useful.

TotalCreditsMaxCount
Default value: 1024
Possible values: 256-4096
Tested high performance value: 4096
Set on: All DFSR nodes that are generally inbound replicating (so hubs if doing data collection, branches if doing data distribution, all servers if using no specific replication flow)

UpdateWorkerThreadCount
Default value: 16
Possible values (Win2008): 4-32
Possible values (Win2008 R2): 4-63*
Tested high performance value: 32

Set on: All DFSR nodes that are generally inbound replicating (so hubs if doing data collection, branches if doing data distribution, all servers if using no specific replication flow). Raising this value only helps when replicating in from more servers than the value; i.e., if replicating in from 32 servers, set to 32, and if replicating in from 45 servers, set to 45.

*Important note: The actual upper limit is 64, but we have found that under certain circumstances setting 64 can cause a deadlock that prevents DFSR replication altogether: replication periodically stops working, and the dfsrdiag replstate and dfsrdiag backlog commands hang without returning results. If you exceed the maximum tested value of 32, set to 63 or lower; never set 64. The 32 limit is recommended because we tested it carefully, and higher values were not rigorously tested.
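As a sanity check before touching the registry, the documented ranges above can be encoded and validated in a small script. This is an illustrative sketch, not an official API: the table of ranges is transcribed from this article, and the `validate` helper is a hypothetical name; actually applying the values would still be done with reg.exe or your deployment tooling.

```python
# Illustrative sketch: check proposed DFSR tuning values against the
# documented ranges above before applying them (e.g., with reg.exe).
# The names and ranges are transcribed from this article.

DFSR_SETTINGS = {
    # value name: (default, range check)
    "AsyncIoMaxBufferSizeBytes": (2097152, lambda v: v in (1048576, 2097152, 4194304, 8388608)),
    "RpcFileBufferSize":         (262144,  lambda v: v in (262144, 524288)),
    "StagingThreadCount":        (6,       lambda v: 4 <= v <= 16),    # Win2008 R2 only
    "TotalCreditsMaxCount":      (1024,    lambda v: 256 <= v <= 4096),
    # Hard ceiling is 64, but 64 can deadlock replication -- never allow it.
    "UpdateWorkerThreadCount":   (16,      lambda v: 4 <= v <= 63),
}

def validate(name, value):
    """Return True if 'value' is within the documented safe range for 'name'."""
    _default, in_range = DFSR_SETTINGS[name]
    return in_range(value)
```

For example, `validate("UpdateWorkerThreadCount", 64)` comes back false, which is exactly the deadlock case the note above warns about.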

 

When using all the above registry tuning on Windows Server 2008 R2, testing revealed that initial sync replication time was sometimes twice as fast compared to having no registry settings in place. This test used 32 servers replicating a "data collection" topology to a single hub over thirty-two non-LAN networks, with 32 RGs containing unique branch office data. On average, the slower the network, the greater the relative improvement:

| Test | Spokes | Hubs | Topology | GB/node | Unique | RG | Tuned | Network | Time to sync |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C1 | 32 | 1 | Collect | 1 | Yes | 32 | N | 1Gbps | 0:57:27 |
| C2 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 1Gbps | 0:53:09 |
| C3 | 32 | 1 | Collect | 1 | Yes | 32 | N | 1.5Mbps | 3:31:36 |
| C4 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 1.5Mbps | 2:24:09 |
| C5 | 32 | 1 | Collect | 1 | Yes | 32 | N | 512Kbps | 10:56:42 |
| C6 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 512Kbps | 5:57:09 |
| C7 | 32 | 1 | Collect | 1 | Yes | 32 | N | 256Kbps | 21:43:02 |
| C8 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 256Kbps | 10:46:46 |

On Windows Server 2008 the same registry values showed considerably less performance improvement; this is partly due to additional service improvements made to DFSR in Win2008 R2, especially around the Credit Manager. Just like your phone, “3G” DFSR is going to work better than older models…

Note: do not use this table to predict replication times. It is designed to show behavior trends only!

Topology tuning

Even if you are not using Windows Server 2008 R2, there are plenty of other factors that affect replication speed. Some of these I’ve talked about before, some are new. All are important:

  • Minimize mixing of Win2003 and Win2008/Win2008 R2 - Windows Server 2008 introduced significant DFSR changes for RPC, inbound and outbound threading, and other aspects. However, if a Win2008 server is partnered with a Win2003 server for DFSR, most of those improvements are disabled for backwards compatibility. An ideal environment is 100% Windows Server 2008 R2, but a Win2008-only environment is still a huge improvement. Windows Server 2003 should be phased out of use as quickly as possible, as it has numerous "1G" design issues that were improved with experience in later OSes. The Windows Server 2008 R2 Credit Manager and update worker improvements are most efficient when all operating systems are homogeneous. If you are replacing Win2003 servers with a newer OS, do the hub servers first, as their increased file-handling capacity will provide some benefit even when talking to Win2003 spokes.
  • Consider multiple hubs - If using a large number of branch servers in a hub-and-spoke topology, adding “subsidiary hub” servers will help reduce load on the main hubs.

    So for example, this configuration would cause more bottlenecking:

[Diagram: all spokes replicating with a single hub]

And this configuration would cause less bottlenecking:

[Diagram: spokes divided among subsidiary hubs that replicate with the main hub]

  • Increase staging quota - The larger the replicated folder staging quotas are on each server, the less often files must be restaged when replicating inbound changes. In a perfect world, staging quota would be configured to match the size of the data being replicated. Since this is typically impossible, it should be made as large as is reasonable. At minimum, it must be at least as large as the combined size of the N largest files in the replicated folder, where N = UpdateWorkerThreadCount + 16 on Win2008 and Win2008 R2. Why 16? Because that is the number of files that can be replicated outbound simultaneously.

This means that by default on Win2008/Win2008 R2, the quota must be as large as the 32 largest files. If UpdateWorkerThreadCount is increased to 32, it must be as large as the 48 largest files (32+16). If it is any smaller, staging can become blocked when 32 files are being replicated inbound and 16 outbound, preventing further replication until that queue is cleared. Frequent 4202 and 4204 staging events indicate an inappropriately configured staging quota, especially if you are no longer in the initial sync phase of setting up DFSR for the first time.
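The quota floor described above is simple arithmetic: sum the sizes of the N largest files, where N = UpdateWorkerThreadCount + 16. Here is a minimal sketch of that calculation; the function names are hypothetical, and the `read_only` branch reflects the read-only rule discussed further below (no outbound replication, so the 16 outbound slots drop out).

```python
# Sketch of the staging-quota minimum: the combined size of the N largest
# files, where N = UpdateWorkerThreadCount + 16 (the 16 covers the files
# that can be replicated outbound simultaneously).
import os

def min_staging_quota_bytes(file_sizes, update_worker_thread_count=16, read_only=False):
    """Sum of the N largest file sizes. A read-only replicated folder has
    no outbound replication, so the 16 outbound slots are excluded."""
    n = update_worker_thread_count + (0 if read_only else 16)
    return sum(sorted(file_sizes, reverse=True)[:n])

def folder_file_sizes(path):
    """Collect file sizes recursively, for use against a real replicated folder."""
    sizes = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            sizes.append(os.path.getsize(os.path.join(root, name)))
    return sizes
```

With the defaults this returns the combined size of the 32 largest files; with UpdateWorkerThreadCount raised to 32, the 48 largest, matching the recap below.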

Source : DFSR
Category : None
Event ID : 4202
Type : Warning
Description :
The DFS Replication service has detected that the staging space in use for the replicated folder at local path c:\foo is above the high watermark. The service will attempt to delete the oldest staging files. Performance may be affected.   

Source : DFSR
Category : None
Event ID : 4204
Type : Information
Description :
The DFS Replication service has successfully deleted old staging files for the replicated folder at local path c:\foo. The staging space is now below the high watermark.

If you get 4206 staging events, you have significantly undersized your staging quota, as replication is now blocked behind large files.

Event Type: Warning
Event Source: DFSR
Event Category: None
Event ID: 4206
Date: 4/4/2009
Time: 3:57:21 PM
User: N/A
Computer: SRV
Description:
The DFS Replication service failed to clean up old staging files for the replicated folder at local path c:\foo. The service might fail to replicate some large files and the replicated folder might get out of sync. The service will automatically retry staging space cleanup in 1 minutes. The service may start cleanup earlier if it detects some staging files have been unlocked.

If still using Win2003 R2, the staging quota needs to be as large as the 9 largest files. And if using read-only replication on Windows Server 2008 R2, at least as large as the 16 largest files (or the number specified in UpdateWorkerThreadCount) - after all, a read-only replicated folder has no outbound replication.

So to recap the staging quota minimum recommendations:

- Windows Server 2003 R2: 9 largest files
- Windows Server 2008: 32 largest files (default registry)
- Windows Server 2008 R2: 32 largest files (default registry)
- Windows Server 2008 R2 Read-Only: 16 largest files

If you want to find the 32 largest files in a replicated folder, here’s a sample PowerShell command:

Get-ChildItem <replicatedfolderpath> -Recurse | Sort-Object Length -Descending | Select-Object -First 32 | Format-Table Name,Length -Wrap -AutoSize

  • Consider read-only - Deploy Windows Server 2008 R2 read-only replication when possible. If users are not supposed to change data, mark those replicated folders as read-only. A read-only server cannot originate data and will prevent unwanted replication or change orders from occurring outbound to other servers. Unwanted changes generate load and lead to data overwrites – which to fix you will need to replicate back out from backups, consuming time and replication resources.
  • Latest QFE and SP - Always run the latest service pack for that OS, and the latest DFSR.EXE/DFSRS.EXE for that OS. There are also updates for NTFS and other components that DFSR relies on. Hotfixes have been released that remove performance bugs or make DFSR more reliable; a more reliable DFSR is naturally faster too. These are documented in KB968429 and KB958802 but the articles aren’t always perfectly up to date, so here’s a trick: If you want to find the latest DFSR service updates, use these three searches and look for the highest KB number in the results:

Win2008 R2: http://www.bing.com/search?q=%22windows+server+2008+r2%22+%22dfsrs.exe%22+kbqfe+site%3Asupport.microsoft.com&go=&form=QBRE

Win2008: http://www.bing.com/search?q=%22windows+server+2008%22+%22dfsrs.exe%22+kbqfe+site%3Asupport.microsoft.com&form=QBRE&qs=n

Win2003 R2: http://www.bing.com/search?q=%22windows+server+2003+r2%22+%22dfsr.exe%22+kbqfe+site%3Asupport.microsoft.com&form=QBRE&qs=n

Remember, Win2003 mainstream support ends July 13, 2010. That’s the end of non-security updates for that OS.

People ask me all the time why I take such a hard line on DFSR hotfixes. I ask in return “Why don’t you take such a hard line?” These fixes cost us a fortune, we’re not writing them for our health. And that goes for all other components too, not just DFSR. It’s an issue intrinsic to all software. DFSR is not less reliable than many other Windows components – after all, NTFS is considered an extremely reliable file system but that hasn’t stopped it from having 168 hotfixes in its lifetime; DFSR just has a passionate group of Support Engineers and developers here at MS that want you to have the best experience.

  • Turn off RDC on fast connections with mostly smaller files - Later testing (not addressed in the charts in this post) showed 3-4 times faster replication when using LAN-speed networks (i.e. 1Gbps or faster) on Win2008 R2. This is because it was faster to send files in their entirety than to send deltas when the files were smaller and more dynamic and the network was very fast. The improvement was roughly twice as fast on Win2008 non-R2. This should absolutely not be done on WAN networks under 100Mbit, though, as it will likely have a very negative effect.
  • Consider and test anti-virus exclusions – Most anti-virus software has no concept of the data types that make up DFSR’s working files and database. Additionally, those file types are not executables and are therefore very unlikely to contain a useful malicious payload. If you are seeing slow performance within DFSR, test the following anti-virus file exclusions; if DFSR performs considerably better, contact your AV vendor for an updated version of their software and an explanation of the performance gap.

<drive>:\system volume information\DFSR\

   $db_normal$
   FileIDTable_* 
   SimilarityTable_*

<drive>:\system volume information\DFSR\database_<guid>\

   $db_dirty$
   Dfsr.db
   Fsr.chk
   *.log
   Fsr*.jrs
   Tmp.edb

<drive>:\system volume information\DFSR\config\

   *.xml

<drive>:\<replicated folder>\dfsrprivate\staging\*

   *.frx

This should be validated carefully; many anti-virus products allow exclusions to be set but then do not actually abide by them. For maximum performance you would exclude scanning of all replicated files, but that is obviously infeasible for most customers.
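If you manage many replication members, the exclusion list above can be generated per server rather than typed by hand. This is an illustrative helper only: the function name is hypothetical, the `database_<guid>` folder is represented with a wildcard because the GUID varies per volume, and you should verify the generated paths against your own servers before feeding them to AV tooling.

```python
# Illustrative helper: build the DFSR anti-virus exclusion list described
# above for a given drive and set of replicated folders. The database_<guid>
# folder name varies per volume, so it is expressed as a wildcard here.

def dfsr_av_exclusions(drive, replicated_folders):
    sysvol = f"{drive}:\\system volume information\\DFSR"
    exclusions = [
        f"{sysvol}\\$db_normal$",
        f"{sysvol}\\FileIDTable_*",
        f"{sysvol}\\SimilarityTable_*",
        f"{sysvol}\\database_*\\$db_dirty$",
        f"{sysvol}\\database_*\\Dfsr.db",
        f"{sysvol}\\database_*\\Fsr.chk",
        f"{sysvol}\\database_*\\*.log",
        f"{sysvol}\\database_*\\Fsr*.jrs",
        f"{sysvol}\\database_*\\Tmp.edb",
        f"{sysvol}\\config\\*.xml",
    ]
    # One staging exclusion per replicated folder.
    for rf in replicated_folders:
        exclusions.append(f"{drive}:\\{rf}\\DfsrPrivate\\Staging\\*.frx")
    return exclusions
```

For example, `dfsr_av_exclusions("c", ["foo"])` produces the ten system-volume exclusions plus one staging exclusion for `c:\foo`.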

  • Pre-seed the data when setting up a new replicated folder - Pre-seeding - often referred to as "pre-staging" - data on servers can lead to huge performance gains during initial sync. This is especially useful when creating new branch office servers; if they are being built in the home office, they can be quickly pre-seeded with data and then sent out to the field for replication of the change delta. See the following article for pre-seeding recommendations.

Going back to those same tests I showed earlier with 32 spokes replicating back to a single hub, note the average performance behavior when the data was perfectly pre-seeded:

| Test | Spokes | Hubs | Topology | GB/node | Unique | RG | Tuned | Staging | Net | Time to sync |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C9 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 4GB | 1Gbps | 0:49:21 |
| C11 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 4GB | 512Kbps | 0:46:34 |
| C12 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 4GB | 256Kbps | 0:46:08 |
| C13 | 32 | 1 | Collect | 1 | Yes | 32 | Y | 4GB | 64Kbps | 0:48:29 |

Even the 64Kbps frame relay connection was nearly as fast as the LAN! This is because no files had to be sent, only file hashes.

Note: do not use this table to predict replication times. It is designed to show behavior trends only.

  • Go native Windows Server 2008 R2 – Not to beat a dead horse but the highest performance gains - including registry tuning and the greatly improved Credit Manager code - will be realized by using Windows Server 2008 R2. Win2003 R2 was first generation DFSR, Win2008 was second generation, and Win2008 R2 is third generation; if you are serious about performance you must get to 2008 R2.

Hardware tuning

  • Use 64-bit OS with as much RAM as possible on hubs - DFSR can become bound by RAM availability on busy hub servers, especially when using the registry performance values above. There is absolutely no reason to run a 32-bit file server in this day and age, and with the coming of Windows Server 2008 R2, it’s no longer possible. For spoke servers that tend to have far less load, you can cut more corners of course; the ten-user sales team in Hicksville doesn’t need 16GB of RAM in their file server.

As a side note, customers periodically open cases to report “memory leaks” in DFSR. What we explain is that DFSR intentionally caches as much RAM as it can get its hands on – really, it’s the ESE (Jet) database doing this. So the more idle the other processes on a DFSR server are, the more memory the DFSR process can gobble up. You can see the same behavior with LSASS’s database on DCs.

  • Use the fastest disk subsystem you can afford on hubs - Much of DFSR will be disk bound - especially in staging and RDC operations - so high disk throughput will dramatically lower bottlenecks; this is especially true on hub servers. As always, a disk queue length greater than 2 in PerfMon is an indication of an over-used or under-powered disk subsystem. Talk to your hardware vendors about the performance and cost differences of SATA, SCSI, and FC. Don’t forget about reliability too – I have a job here for life thanks to all the customers that use the least expensive, off-brand, no-warranty, low-parity, practically consumer-grade iSCSI products they can find. You get what you pay for, and ultimately your users do not care about anything but their data. The OS is just a thing that lets applications access files so that the business can make money. Someday the Linux desktop folks will figure this out and get some applications; then we may actually be in trouble here.

If using iSCSI, make sure you have redundant network paths to the disks, using multiple switches and NICs. We have had quite a few cases lately of non-fault-tolerant iSCSI configs that would go down for hours in the middle of DFSR updating the database and transaction logs, and the results were obviously not pretty.

  • Use reliable networks - They don't necessarily have to be fast, but they do need to stay up. Many DFSR performance issues are caused by using old network card drivers, using malfunctioning "Scalable Network" (TCP offload, RSS, etc.) settings, or using defective WANs. Network card vendors release frequent driver updates to increase performance and resolve problems; just like Windows service packs, the drivers should be installed to improve reliability and performance. Companies often deploy cost saving WAN solutions (with VPN tunnels, frame relay circuits, etc.) that in the end cost the company more in lost productivity than they ever saved in monthly expense. DFSR - like all RPC applications - is sensitive to constant network instability.
  • Review our performance tuning guides – For much more detail on squeezing performance out of your hardware, including network, storage, and the rest, review:

And that’s it.

- Ned “fork” Pyle

  • Thanks Ned!

    There's a ton of useful tips in this post. But listen. Nearly all of them say we have to monitor this, calculate that, and so on... Wouldn't it be nice if we could offload this burden to some smart software?

    You know what I mean. You have all-new Windows File Server Management Pack for OpsMgr currently in development. Does it make our life easier in respect to the issues and baselines you're talking about here? Or it's 100% IP of AskDS blog and those folks didn't have a chance to make use of it? :)

  • *Great* question Artem. I think a ton of these could be added through the BPA tool that shipped in Win2008 R2. A number of these things (that throw events especially) are being considered for the new FS management pack also. So the answer right now is a qualified yes. :-)

  • Great article and I've been doing some tuning but need some advice. My grand idea was to use Hyper-V and Windows Server 2008 R2 as my hub DFSR servers. We got stuck with Windows NAS and then Windows Unified Data Storage server and now here we sit not able to upgrade to Windows Storage Server 2008. Bygones be bygones and I wanted to give the DFSR work to 2008 R2 Hyper-V machines. I have 100 or so DFS Shares with just over 1TB of user profile and general document data and I'm in the process of moving them to the new DFSR Hyper-V machines. I’m finding that with the maximum tested values outlined the 4 CPU's that I can allocate with Hyper-V spike to 100% and stay there for long periods of time and it's taken a few weeks for the initial replication to catch up. I have had a few reboots in there and switched them to Read Only trying to get the data in there simply populated so I’m sure that isn’t helping things with the speed of the replication.

    My question is what settings, if any, should I back off on given that it looks like I’m CPU bound.

    Any insights would be appreciated.

  • Start by backing off:

    StagingThreadCount

    As that will generate a lot of CPU time (more files being staged leads to more RDC calculation being done simultaneously leads to more CPU time needed). If still high, consider returning these to defaults:

    TotalCreditsMaxCount

    UpdateWorkerThreadCount

    Both mean that more files are being worked on simultaneously. The other two reg values are more about memory than CPU.

  • You're discussing multi hub configuration in this blog. I'm having a question about a possibility. Is this possible.

    It looks a lot like blogs.technet.com/.../image_4.png but then without the 'master' hub.

    I want to replicate a company wide share to 150 branch offices. This share needs to be replicated to and from about 4 central file servers to two DFS hub servers from where it is distributed to the 150 branch offices. I want to equally share the load among these 2 DFS servers. What is the best way of configuring this? Set up 1 RG among all servers, let the Central fileservers replicate with both hubs and divide the brach office servers among the 2 hubs by means of manually editing connections or create 1 RG for the 2 hubs which include the central fileservers and separate RG's for the Branch office servers?

    Please bear in mind that there are already 160 RG's divided between the 2 HUB's for backup purposes of the branch office servers (data is replicated to the hubs from where it is backed up).

    Regards,

    Koen

  • Yes, that would work. Prior to Win2008 R2 - where clustering became possible - that was a fairly common scenario to prevent a single point of failure in the hub site taking out the whole topology; in this case it would take out only half. And with clever use of disabled connections (so that a branch would only replicate with one hub all the time, unless some disaster necessitated enabling the alternate hub connections as a partner), you could avoid even that half point of failure.

    So in your specific case, you will need to manually configure your topology. Choose custom when configuring this, not hub/spoke. Then you can make this all work the way you describe.

  • Are these registry entries that are already supposed to be there or ones we need to add?

    (I don't have the "Settings" folder under Parameters)

  • You would create the Settings key and the value names/data yourself. None of that exists by default.

  • Gee Ned, you almost seem to be saying these registry tweaks put DFSR in a better place in general... at least for initial sync. Maybe you could put the bug (err, DCR) in the product group's ear to change the defaults in Windows next...

    Curious, did you do any other benchmarking of these tweaks besides init-sync?

  • Indeed.

    I tend to use initial sync because it is the harshest, slowest, most comprehensive test. Further replication is generally less stressful. So... nope.