
Ask the Directory Services Team

  • Top 10 Common Causes of Slow Replication with DFSR

    Hi, Ned again. Today I’d like to talk about troubleshooting DFS Replication (i.e. the DFSR service included with Windows Server 2003 R2, not to be confused with the File Replication Service). Specifically, I’ll cover the most common causes of slow replication and what you can do about them.

    Update: Make sure you also read this much newer post  to avoid common mistakes that can lead to instability or poor performance: http://blogs.technet.com/b/askds/archive/2010/11/01/common-dfsr-configuration-mistakes-and-oversights.aspx

    Let’s start with ‘slow’. This loaded word is largely a matter of perception. Maybe DFSR was once much faster and you see it degrading over time? Has it always been too slow for your needs and now you’ve just gotten fed up? What will you consider acceptable performance so that you know when you’ve gotten it fixed? There are some methods that we can use to quantify what ‘slow’ really means:

    · DFSMGMT.MSC Health Reports

    We can use the DFSR Diagnostic Reports to see how big the backlog is between servers and if that indicates a slowdown problem:

    [Screenshot: generating a DFSR health report in DFSMGMT.MSC]

    The generated report will tell you sending and receiving backlogs in an easy to read HTML format.

    · DFSRDIAG.EXE BACKLOG command

    If you’re into the command line, you can use the DFSRDIAG BACKLOG command (with options) to see how far behind servers are in replication and whether that indicates a slowdown. Dfsrdiag is installed when you install DFSR on the server. For example:

    dfsrdiag backlog /rgname:slowrepro /rfname:slowrf /sendingmember:2003srv13 /receivingmember:2003srv17

    Member <2003srv17> Backlog File Count: 10
    Backlog File Names (first 10 files)
         1. File name: UPDINI.EXE
         2. File name: win2000
         3. File name: setupcl.exe
         4. File name: sysprep.exe
         5. File name: sysprep.inf.pro
         6. File name: sysprep.inf.srv
         7. File name: sysprep_pro.cmd
         8. File name: sysprep_srv.cmd
         9. File name: win2003
         10. File name: setupcl.exe

    This command shows up to the first 100 file names, and also gives an accurate snapshot count. Running it a few times over an hour will give you some basic trends. Note that hotfix 925377 resolves an error you may receive when continuously querying backlog, although you may want to consider installing the more current DFSR.EXE hotfix, which is 931685. Review the recommended hotfix list for more information.
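    If you want to capture those trends without babysitting a console, here is a minimal PowerShell sketch that polls the count on a schedule. The group, folder, and member names are the ones from the example above, and the output parsing is an assumption - adjust both for your topology:

    # Poll the backlog count every 10 minutes and append it to a log for trending.
    $argList = '/rgname:slowrepro', '/rfname:slowrf',
               '/sendingmember:2003srv13', '/receivingmember:2003srv17'
    for ($i = 0; $i -lt 6; $i++) {
        $output = & dfsrdiag.exe backlog $argList
        $match = $output | Select-String 'Backlog File Count: (\d+)'
        if ($match) {
            $count = $match.Matches[0].Groups[1].Value
            "{0:u}  backlog={1}" -f (Get-Date), $count | Add-Content C:\backlogtrend.log
        }
        Start-Sleep -Seconds 600
    }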

    · Performance Monitor with DFSR Counters enabled

    DFSR updates the Perfmon counters on your R2 servers to include three new objects:

    • DFS Replicated Folders
    • DFS Replication Connections
    • DFS Replication Service Volumes

    Using these allows you to see historical and real-time statistics on your replication performance, including things like total files received, staging bytes cleaned up, and file installs retried – all useful in determining what true performance is as opposed to end user perception. Check out the Windows Server 2003 Technical Reference for plenty of detail on Perfmon and visit our sister AskPerf blog.

    · DFSRDIAG.EXE PropagationTest and PropagationReport

    By running DFSRDIAG.EXE you can create test files then measure their replication times in a very granular way. So for example, here I have three DFSR servers – 2003SRV13, 2003SRV16, and 2003SRV17. I can execute from a CMD line:

    dfsrdiag propagationtest /rgname:slowrepro /rfname:slowrf /testfile:canarytest2

    (wait a few minutes)

    dfsrdiag propagationreport /rgname:slowrepro /rfname:slowrf /testfile:canarytest2
    /reportfile:c:\proprep.xml

    PROCESSING MEMBER 2003SRV17 [1 OUT OF 3]
    PROCESSING MEMBER 2003SRV13 [2 OUT OF 3]
    PROCESSING MEMBER 2003SRV16 [3 OUT OF 3]

    Total number of members            : 3
    Number of disabled members         : 0
    Number of unsubscribed members     : 0
    Number of invalid AD member objects: 0
    Test file access failures          : 0
    WMI access failures                : 0
    ID record search failures          : 0
    Test file mismatches               : 0
    Members with valid test file       : 3

    This generates an XML file with time stamps for when a file was created on 2003SRV13 and when it was replicated to the other two nodes.

    [Screenshot: propagation report XML output]

    The time stamp is in FILETIME format, which we can convert with the W32tm tool included in Windows Server 2003.

    <MemberName>2003srv17</MemberName>
    <CreateTime>128357420888794190</CreateTime>
    <UpdateTime>128357422068608450</UpdateTime>

    C:\>w32tm /ntte 128357420888794190
    148561 19:54:48.8794190 - 10/1/2007 3:54:48 PM (local time)

    C:\>w32tm /ntte 128357422068608450
    148561 19:56:46.8608450 - 10/1/2007 3:56:46 PM (local time)
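    If you’d rather skip W32tm, .NET can do the same conversion and compute the delta in one shot; a quick PowerShell equivalent using the time stamps above:

    # Convert the FILETIME stamps from the propagation report and diff them.
    $create = [DateTime]::FromFileTimeUtc(128357420888794190)
    $update = [DateTime]::FromFileTimeUtc(128357422068608450)
    "Created:  $create"
    "Updated:  $update"
    "Elapsed:  $($update - $create)"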


    So around two minutes later our file showed up. Incidentally, this is something you can do in the GUI on Windows Server 2008 and it even gives you the replication time in a format designed for human beings!

    [Screenshot: Windows Server 2008 propagation report showing replication time]

    Based on the above steps, let’s say we’re seeing a significant backlog and slower than expected replication of files. Let’s break down the most common causes as seen by MS Support:

    1. Missing Windows Server 2003 Network QFE Hotfixes or Service Pack 2

    Over the course of its lifetime there have been a few hotfixes for Windows Server 2003 that resolved intermittent issues with network connectivity. Those issues generally affected RPC, and DFSR (which relies heavily on RPC) became a casualty. To close these gaps you can install KB938751 and KB922972 if you are on Service Pack 1 or 2. I highly recommend (in fact, I pretty much demand!) that you also install KB950224 to prevent a variety of DFSR issues - in fact, this hotfix should be on every Win2003 computer in your company.

    2. Missing the latest DFSR service binary

    The most recent version of DFSR.EXE always contains updates that not only fix bugs but also generally improve replication performance. We now have KB articles that we keep up to date with the latest files we recommend running for DFSR:

    KB 958802 - List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2003 R2
    KB 968429 - List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2008 and in Windows Server 2008 R2

    3. Out-of-date Network Card and Storage drivers

    You would never run Windows Server 2003 with no Service Packs and no security updates, right? So why run it without updated NIC and storage drivers? A large number of performance issues can be resolved simply by keeping your drivers current. Trust me when I say that vendors don’t release new binaries at heavy cost to themselves unless there’s a reason. Check your vendor web pages at least once a quarter, and test, test, test.

    Important note: If you are in the middle of an initial sync, you should not be rebooting your server! All of the above fixes will require reboots. Wait it out, or assume the risk that you may need to run through initial sync again.

    4. DFSR Staging directory is too small for the amount of data being modified

    DFSR lives and dies by its inbound/outbound Staging directory (stored under <your replicated folder>\DfsrPrivate\Staging in R2). By default, it has a 4GB elastic quota set that controls the size of files stored there for further replication. Why elastic? Because experience with FRS showed us having a hard-limit quota that prevented replication was A Bad Idea™.

    Why is this quota so important? Because as long as staging usage stays below the high watermark - 90% of the quota by default - DFSR replicates at its maximum rate of 9 concurrent files (5 outbound, 4 inbound) for the entire server. If the staging quota of a replicated folder is exceeded, then depending on the number of files currently being replicated for that folder, DFSR may end up slowing replication for the entire server until staging usage drops below the low watermark, which is computed by multiplying the staging quota by the low watermark percentage (60% by default).

    If the staging quota of a replicated folder is exceeded and the current number of inbound replicated files in progress for that folder exceeds 3 (15 in Win2008), then one task is used by staging cleanup and the remaining three (15 in Win2008) tasks wait for staging cleanup to complete. Since there is a maximum of four (16 in Win2008) concurrent tasks, no further inbound replication can take place for the entire system.

    If the staging quota of a replicated folder is exceeded and the current number of outbound replicated files in progress for that folder exceeds 5 (16 in Win2008), then the RPC server cannot serve any more RPC requests: the maximum number of RPC requests processed at the same time is five (16 in Win2008), and all of them end up waiting for staging cleanup to complete.

    You will see DFS Replication 4202, 4204, 4206 and 4208 events about this activity, and if it happens often (multiple times per day) your quota is too small. See the section Optimize the staging folder quota and replication throughput in the Designing Distributed File Systems guidelines for tuning this correctly. You can change the quota using the DFSR Management MMC (dfsmgmt.msc). Select Replication in the left pane, then the Memberships tab in the right pane. Double-click a replicated folder and select the Advanced tab to view or change the Quota (in megabytes) setting. Your event will look like:

    Event Type: Warning
    Event Source: DFSR
    Event Category: None
    Event ID: 4202
    Date: 10/1/2007
    Time: 10:51:59 PM
    User: N/A
    Computer: 2003SRV17
    Description:
    The DFS Replication service has detected that the staging space in use for the
    replicated folder at local path D:\Data\General is above the high watermark. The
    service will attempt to delete the oldest staging files. Performance may be
    affected.

    Additional Information:
    Staging Folder:
    D:\Data\General\DfsrPrivate\Staging\ContentSet{9430D589-0BE2-400C-B39B-D0F2B6CC972E}
    -{A84AAD19-3BE2-4932-B438-D770B54B8216}
    Configured Size: 4096 MB
    Space in Use: 3691 MB
    High Watermark: 90%

    Low Watermark: 60%

    Replicated Folder Name: general
    Replicated Folder ID: 9430D589-0BE2-400C-B39B-D0F2B6CC972E
    Replication Group Name: General
    Replication Group ID: 0FC153F9-CC91-47D0-94AD-65AA0FB6AB3D
    Member ID: A84AAD19-3BE2-4932-B438-D770B54B8216
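    To gauge how often these events are firing without scrolling the event log by hand, you can tally them per day. A minimal sketch, assuming the event log name 'DFS Replication' and that PowerShell is available on the server:

    # Count staging watermark/cleanup events per day for the past week.
    Get-EventLog -LogName 'DFS Replication' -After (Get-Date).AddDays(-7) |
        Where-Object { 4202, 4204, 4206, 4208 -contains $_.EventID } |
        Group-Object { $_.TimeGenerated.ToShortDateString() } |
        Sort-Object Name |
        Format-Table Name, Count -AutoSize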

    5. Bandwidth Throttling or Schedule windows are too aggressive

    If your replication schedule on the Replication Group or the Connections is set to not replicate from 9-5, you can bet replication will appear slow! If you’ve artificially throttled the bandwidth to 16Kbps on a T3 line things will get pokey. You would be surprised at the number of cases we’ve gotten here where one administrator called about slow replication and it turned out that one of his colleagues had made this change and not told him. You can view and adjust these in DFSMGMT.MSC.

    [Screenshot: replication schedule settings in DFSMGMT.MSC]

    You can also use the Dfsradmin.exe tool to export the schedule to a text file from the command-line. Like Dfsrdiag.exe, Dfsradmin is installed when you install DFSR on a server.

    Dfsradmin rg export sched /rgname:testrg /file:rgschedule.txt

    You can also export the connection-specific schedules:

    Dfsradmin conn export sched /rgname:testrg /sendmem:fabrikam\2003srv16 /recvmem:fabrikam\2003srv17
    /file:connschedule.txt

    The output is concise but can be unintuitive. Each row represents a day of the week; each column represents an hour in the day. A hex value (0-F) represents the bandwidth setting for each 15-minute interval in an hour: F=Full, E=256M, D=128M, C=64M, B=32M, A=16M, 9=8M, 8=4M, 7=2M, 6=1M, 5=512K, 4=256K, 3=128K, 2=64K, 1=16K, 0=No replication. The values are in megabits per second (M) or kilobits per second (K).
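    To make a row human-readable, you can decode each hex digit back into its bandwidth label. A small sketch - the sample row is hypothetical (full bandwidth except no replication from 9:00 to 17:00):

    # Map each hex digit of one day's schedule row to its bandwidth setting.
    $map = 'None','16K','64K','128K','256K','512K','1M','2M',
           '4M','8M','16M','32M','64M','128M','256M','Full'
    $row = ('F' * 36) + ('0' * 32) + ('F' * 28)   # 24 hours x 4 fifteen-minute slots
    for ($hour = 0; $hour -lt 24; $hour++) {
        $slots = $row.Substring($hour * 4, 4).ToCharArray() |
                 ForEach-Object { $map[[Convert]::ToInt32([string]$_, 16)] }
        '{0,2}:00  {1}' -f $hour, ($slots -join ' ')
    }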

    And a bit more about throttling - DFS Replication does not perform bandwidth sensing. You can configure DFS Replication to use a limited amount of bandwidth on a per-connection basis, and DFS Replication can saturate the link for short periods of time. Also, the bandwidth throttling is not perfectly accurate, though it may be “close enough.” This is because we throttle bandwidth by throttling our RPC calls. Since DFSR is as high as you can get in the network stack, we are at the mercy of various buffers in lower levels of the stack, including RPC. The net result is that if one analyzes the raw network traffic, it will tend to be extremely ‘bursty’.

    6. Large amounts of sharing violations

    Sharing violations are a fact of life in a distributed network - users open files and gain exclusive WRITE locks in order to modify their data. Periodically those changes are written within NTFS by the application and the USN Change Journal is updated. DFSR monitors that journal and will attempt to replicate the file, only to find that it cannot because the file is still open. This is a good thing – we wouldn’t want to replicate a file that’s still being modified, naturally.

    With enough sharing violations though, DFSR can start spending more time retrying locked files than it does replicating unlocked ones, to the detriment of performance. If you see a considerable number of DFS Replication event log entries for 4302 and 4304 like the one below, you may want to start examining how files are being used.

    Event ID: 4302 Source DFSR Type Warning
    Description
    The DFS Replication service has been repeatedly prevented from replicating a file due to consistent sharing violations encountered on the file. A local sharing violation occurs when the service fails to receive an updated file because the local file is currently in use.

    Additional Information:
    File Path: <drive letter path to folder\subfolder>
    Replicated Folder Root: <drive letter path to folder>
    File ID: {<guid>}-v<version>
    Replicated Folder Name: <folder>
    Replicated Folder ID: <guid2>
    Replication Group Name: <dfs path to folder>
    Replication Group ID: <guid3>
    Member ID: <guid4>

    Many applications can create a large number of spurious sharing violations, because they create temporary files that shouldn’t be replicated. If they have a predictable extension, you can prevent DFSR from trying to replicate them by setting an exception in DFSMGMT.MSC. The default file filter excludes files matching ~*, *.bak, and *.tmp, so for example the Microsoft Office temporary files (~*) are excluded by default.

    [Screenshot: file filter settings in DFSMGMT.MSC]

    Some applications will allow you to specify an alternate location for temporary and working files, or will simply follow the working path as specified in their shortcuts. But sometimes, this type of behavior may be unavoidable, and you will be forced to live with it or stop storing that type of data in a DFSR-replicated location. This is why our recommendation is that DFSR be used to store primarily static data, and not highly dynamic files like Roaming Profiles, Redirected Folders, Home Directories, and the like. This also helps with conflict resolution scenarios where the same or multiple users update files on two servers in between replication, and one set of changes is lost.
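    Before you decide how to handle this, it can help to see what temporary files are actually accumulating in a replicated folder. A quick sweep for the default exclusion patterns (the path below is hypothetical):

    # Count files matching the default DFSR exclusions under a replicated folder.
    Get-ChildItem -Path 'D:\Data\General' -Recurse -Force -Include '~*', '*.bak', '*.tmp' |
        Group-Object Extension |
        Sort-Object Count -Descending |
        Format-Table Name, Count -AutoSize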

    7. RDC has been disabled over a WAN link.

    Remote Differential Compression is DFSR’s coolest feature – instead of replicating an entire file like FRS did, it replicates only the changed portions. This means your 20MB spreadsheet that had one row modified might only replicate a few KB over the wire. If you disable RDC though, changing any portion of a file’s data will cause the entire file to replicate, and if the connection is bandwidth-constrained this can lead to much slower performance. You can set this in DFSMGMT.MSC.

    [Screenshot: RDC setting on a connection in DFSMGMT.MSC]

    As a side note, in an extremely high bandwidth (Gigabit+) scenario where files are changed significantly, it may actually be faster to turn RDC off. Computing RDC signatures and staging that data is computationally expensive, and the CPU time needed to calculate everything may actually be slower than just moving the whole file in that scenario. You really need to test in your environment to see what works for you, using the PerfMon objects and counters included for DFSR.

    8. Incompatible Anti-Virus software or other file system filter drivers

    It’s a problem that goes back to FRS and Windows 2000 in 1999 – some anti-virus applications were simply not written with the concept of file replication in mind. If an AV product uses its own alternate data streams to store ‘this file is scanned and safe’ information, for example, it can cause that file to replicate out even though to an end-user it is completely unchanged. AV software may also quarantine or reanimate files so that older versions reappear and replicate out. Older open-file Backup solutions that don’t use VSS-compliant methods also have filter drivers that can cause this. When you have a few hundred thousand files doing this, replication can definitely slow down!

    You can use Auditing to see if the originating change is coming from the SYSTEM account and not an end user. Be careful here – auditing can be expensive for performance. Also make sure that you are looking at the original change, not the downstream replication change result (which will always come from SYSTEM, since that’s the account running the DFSR service).

    There are only a couple things you can do about this if you find that your AV/Backup software filter drivers are at fault:

    • Don’t scan your Replicated Folders (not a recommended option except for troubleshooting your slow performance).
    • Take a hard line with your vendor about getting this fixed for that particular version. They have often done so in the past, but issues can creep back in over time and in newer versions.

    9. File Server Resource Manager (FSRM) configured with quotas/screens that block replication.

    So insidious! FSRM is another component that shipped with R2; it can be used to block file types from being copied to a server, or to limit the quantity of files. It has no real tie-in to DFSR though, so it’s possible to configure DFSR to replicate all files while FSRM prevents certain files from being replicated in. Since DFSR keeps retrying, this can lead to backlogs and situations where too much time is spent retrying files that can never move, slowing down files that could move as a consequence.

    When this is happening, debug logs (%systemroot%\debug\dfsr*.*) will show entries like:

    20070605 09:33:36.440 5456 MEET 1243 <Meet::Install> -> WAIT Error processing update. updateName:teenagersfrommars.mp3 uid:{3806F08C-5D57-41E9-85FF-99924DD0438F}-v333459
    gvsn:{3806F08C-5D57-41E9-85FF-99924DD0438F}-v333459
    connId:{6040D1AC-184D-49DF-8464-35F43218DB78} csName:Users
    csId:{C86E5BCE-7EBF-4F89-8D1D-387EDAE33002} code:5 Error:
    + [Error:5(0x5) <Meet::InstallRename> meet.cpp:2244 5456 W66 Access is denied.]

    Here we can see that teenagersfrommars.mp3 is supposed to be replicated in, but it failed with an Access Denied. If we run the following from CMD on that server:

    filescrn.exe screen list

    We see that…

    File screens on machine 2003SRV17:

    File Screen Path: C:\sharedrf
    Source Template: Block Audio and Video Files (Matches template)
    File Groups: Audio and Video Files (Block)
    Notifications: E-mail, Event Log

    … someone has configured FSRM using the default Audio/Video template, which blocks MP3 files, and it happens to be applied to the c:\sharedrf folder we are replicating. To fix this we can do one or more of the following:

    • Make the DFSR filters match the FSRM filters
    • Delete any files that cannot be replicated due to the FSRM rules.
    • Prevent FSRM from actually blocking by switching it from "Active Screening" to “Passive Screening” by using its snap-in. This will generate events and email warnings to the administrator, but not prevent the files from being moved in.
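    If you want to know how many distinct files are stuck this way, you can mine the debug logs for those install failures. A rough sketch - it assumes uncompressed logs and file names without spaces:

    # Pull the updateName out of each failed update entry and tally the top offenders.
    Select-String -Path "$env:systemroot\debug\dfsr*.log" -Pattern 'Error processing update. updateName:(\S+)' |
        ForEach-Object { $_.Matches[0].Groups[1].Value } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object -First 20 Name, Count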

    10. Un-staged or improperly pre-staged data leading to slow initial replication.

    Wake up, this is the last one!

    Sometimes replication is only slow in the initial sync phase. This can have a number of causes:

    • Users are modifying files while initial replication is going on – ideally, you should set up your replication over a change control window like a weekend or overnight.
    • You don’t have the latest DFSR.EXE from #2 above.
    • You have not pre-staged data, or you’ve done it in a way that actually alters the files, forcing most or all of each file to replicate initially.

    Here are the recommendations for pre-staging data that will give you the best bang for your buck, so that initial sync flies by and replication can start doing its real day-to-day job:

    (Make sure you have latest DFSR.EXE installed on all nodes before starting!)

    • ROBOCOPY.EXE - works fine as long as you follow the rules in this blog post.
    • XCOPY.EXE - Xcopy with the /X switch will copy the ACL correctly and not modify the files in any way.
    • Windows Backup (NTBACKUP) - The Windows Backup tool by default will restore the ACLs correctly (unless you uncheck the Advanced Restore Option for Restore security setting, which is checked by default) and not modify the files in any way. [Ned - if using NTBACKUP, please examine guidance here]

    I prefer NTBACKUP because it also compresses the data and is less synchronous than XCOPY or ROBOCOPY [Ned - see above]. Some people ask ‘why should I pre-stage, shouldn’t DFSR just take care of all this for me?’. The answer is yes and no: DFSR can handle this, but when you add in all the overhead of effectively every file being ‘modified’ in the database (they are new files as far as DFSR is concerned), a huge volume of data may lead to slow initial replication times. If you take all the heavy lifting out and let DFSR just maintain, things may go far faster for you.
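    If you do go the ROBOCOPY route, a pre-seeding copy might look like the sketch below. The paths and server name reuse earlier examples, and per the pre-seeding notes later in this document, the root folder permissions must already be identical on both ends:

    # Mirror the data and security to the new member, and log the run for review.
    robocopy.exe D:\Data\General \\2003srv17\D$\Data\General /COPYALL /MIR /Z /R:0 /LOG:C:\preseed.log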

    As always, we welcome your comments and questions,

    - Ned Pyle

  • New DFSR Data Restoration Script

    Hi, Ned here. Just a quick heads up - there is a new DFSR data recovery script posted below. This allows you to restore data from the ConflictAndDeleted or PreExisting folders within DFSR, primarily during disaster recovery. As always, we prefer you use your backup system to do this, as the script is 'at your own risk' and unsupported.

    Updated 10/15/10

    The latest script is now hosted on Code Gallery: http://code.msdn.microsoft.com/restoredfsr

    Update 6/12/14

    Well that gallery is kaput. Now hosted from here.

    This old script is really gross; I recommend using our new Windows PowerShell cmdlet Restore-DfsrPreservedFiles instead. You can restore from a Windows 8.1 client if you install RSAT, or from a WS2012 R2 server. You can either run it locally or just map a drive to \\dfsrserver\h$ or whatever the root drive is, then restore.

    Take a look at http://blogs.technet.com/b/filecab/archive/2013/08/23/dfs-replication-in-windows-server-2012-r2-restoring-conflicted-deleted-and-preexisting-files-with-windows-powershell.aspx for steps on using this cmdlet.
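    For the impatient, the cmdlet boils down to a one-liner like this (the manifest path is illustrative; see the linked post for the full walkthrough):

    # Restore everything listed in the ConflictAndDeleted manifest back to its
    # original location inside the replicated folder.
    Restore-DfsrPreservedFiles -Path 'D:\rf01\DfsrPrivate\ConflictAndDeletedManifest.xml' -RestoreToOrigin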

    -----------------------------------------------

    Remember, this script must be run from a CMD prompt using cscript. Don't just double-click it.

    CSCRIPT.EXE RESTOREDFSR.VBS

    The script also requires you to edit three paths (your source files, a new destination path, and the XML manifest you are calling). If you fail to edit those, the script will exit with an error:


    '=======================================================================
    ' Section must be operator-edited to provide valid paths
    '=======================================================================

    ' Change path to specify location of XML Manifest
    ' Example 1: "C:\Data\DfsrPrivate\ConflictAndDeletedManifest.xml"
    ' Example 2: "C:\Data\DfsrPrivate\preexistingManifest.xml"

    objXMLDoc.load("C:\your_replicated_folder\DfsrPrivate\ConflictAndDeletedManifest.xml")

    ' Change path to specify location of source files

    ' Example 1: "C:\data\DfsrPrivate\ConflictAndDeleted"
    ' Example 2: "C:\data\DfsrPrivate\preexisting"

    SourceFolder = ("C:\your_replicated_folder\DfsrPrivate\ConflictAndDeleted")

    ' Change path to specify output folder

    OutputFolder = ("c:\your_dfsr_repair_tree")

    '========================================================================

    - Ned Pyle

  • Remote Server Administration Tools (RSAT) Available For Windows 7 Beta

    Ned here. For those testing Windows 7 administration capabilities, this is for you.

    Download here

    This is the list of Windows Server 2008 administration tools which are included in Win7 RSAT Client:

    Server Administration Tools:
    • Server Manager

    Role Administration Tools:
    • Active Directory Certificate Services (AD CS) Tools
    • Active Directory Domain Services (AD DS) Tools
    • Active Directory Lightweight Directory Services (AD LDS) Tools
    • DHCP Server Tools
    • DNS Server Tools
    • File Services Tools
    • Hyper-V Tools
    • Terminal Services Tools

    Feature Administration Tools:
    • BitLocker Password Recovery Viewer
    • Failover Clustering Tools
    • Group Policy Management Tools
    • Network Load Balancing Tools
    • SMTP Server Tools
    • Storage Explorer Tools
    • Storage Manager for SANs Tools
    • Windows System Resource Manager Tools

    If you need any kind of support, head on over to the TechNet forums or drop us a line here.

    - Ned Pyle

  • Get out and push! Getting the most out of DFSR pre-staging

    Hi, Ned here again. Today I am going to explain the inner workings of DFSR pre-staging in Windows Server 2003 R2, debunk some myths, and hand out some best practices. Let’s get started.

    To begin, this is the last time I will say ‘pre-staging’. While the term is commonly used, it’s a bit confusing once you start mixing in terminology like the Staging directories. So from here on I will refer to this as ‘pre-seeding’ and hope that it enters your vernacular.

    Pre-seeding is the act of getting a recent copy of replicated data to a new DFSR downstream node before you add that server to the Replicated Folder content set. This means that we can minimize the amount of data we transfer over the wire during the initial sync process and hopefully have that downstream server be available much quicker than simply letting DFSR copy all the files in their entirety over potentially latent network links. Administrators typically do this with NTBACKUP or ROBOCOPY.

    How Initial Sync works

    Before we can start pre-seeding, we need to understand how this initial sync system works under the covers. The diagram below is grossly simplified, but gets across the gist of the process:

    [Diagram: how initial sync works]

    Take a long look here and tell me if you can see a performance pitfall for pre-seeding. Give up? In step 6 on the upstream server, files need to be added to the staging directory before the downstream server can decide if it needs the whole file, portions of a file, or no file (because they are identical between servers). Even if both servers have identical copies, the staging process must cycle through on the upstream server in order to decide what portions of the file to send. So while very little data will be on the wire when all is said and done, there is some inherent churn time upstream while we decide how to give the downstream server what it needs, and it ends up meaning that initial sync might take longer than expected on the first partner. So how can we improve this?

    How initial sync works with pre-seeding

    First let’s take a look at how things will work on our third and all subsequent DFSR members in a Replication Group:

    [Diagram: initial sync with pre-seeded data]

    Since the staging directory upstream is already packed full of files, a big step is skipped for much of the process and the servers can concentrate on actually moving data or file hashes around. This means things go much faster (keeping in mind that the staging directory is a cache and is finite; the longer one waits, the more likely changes are to push out previously staged data). In one repro I did for this post, I found these results in my virtual server environment:

    Environment:

    • Three Windows Server 2003 Enterprise R2 SP2 servers running in Virtual Server 2005 VMs on a private virtual network.
    • 4GB staging (the default).
    • 5.7GB data on a separate volume on upstream server.
    • To determine replication time, I measured the difference between DFSR event log events 4102 and 4104 (like so):

    Event Type:    Warning
    Event Source:    DFSR
    Event Category:    None
    Event ID:    4102
    Date:        2/8/2008
    Time:        11:40:35 AM
    User:        N/A
    Computer:    2003MEM21
    Description:
    The DFS Replication service initialized the replicated folder at local path e:\dbperf and is waiting to perform initial replication. The replicated folder will remain in this state until it has received replicated data, directly or indirectly, from the designated primary member.

    ====

    Event Type:    Information
    Event Source:    DFSR
    Event Category:    None
    Event ID:    4104
    Date:        2/8/2008
    Time:        11:40:36 AM
    User:        N/A
    Computer:    2003MEM21
    Description:
    The DFS Replication service successfully finished initial replication on the replicated folder at local path e:\dbperf.
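    If you want to compute that delta without eyeballing the event log, here is a small sketch (it assumes the newest 4102/4104 pair belongs to the same replicated folder):

    # Get-EventLog returns newest first, so grab the most recent 4102 and 4104.
    $events = Get-EventLog -LogName 'DFS Replication'
    $start = ($events | Where-Object { $_.EventID -eq 4102 } | Select-Object -First 1).TimeGenerated
    $end   = ($events | Where-Object { $_.EventID -eq 4104 } | Select-Object -First 1).TimeGenerated
    "Initial sync took: $($end - $start)"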

    Testing results:

    • New Replication Group with no pre-seeding
    • Initial sync took 28 minutes (baseline speed)

    • New Replication Group with one downstream server
    • Pre-seeded data with NTBACKUP on the downstream server
    • Initial sync took 24 minutes (~15% faster than baseline)

    • Same Replication Group with the original two servers
    • Added a new third DFSR member
    • Pre-seeded data with NTBACKUP on the new downstream server
    • Initial sync took 13 minutes (~55% faster than baseline)

    55% faster is nothing to blow your nose at – and this is just a small amount of low-latency data. If you take a very large set of data on a very slow link with high latency, base initial sync could take, for example, two weeks, of which only two hours are spent staging files and computing hashes, and the rest sending data across the wire. In that case pre-seeding could be (2 weeks - 2 hours) / 2 weeks ≈ 99% faster. As you can see, the fact that data was already staged upstream meant we spent considerably less time rolling through the staging directory and more of it actually verifying that the servers were in sync.

    Optimizing pre-seeding

    Go here:

    http://blogs.technet.com/b/askds/archive/2010/09/07/replacing-dfsr-member-hardware-or-os-part-2-pre-seeding.aspx

    To get the most bang for our buck, we can do some of the following to spend the least amount of time populating the staging directory and the most time syncing files:

    • Set the staging directory quota on your hub servers as close to the size of your data as possible. Since hub servers tend to be beefier boxes and certainly closer to home than your remote branches, this isn’t a problem for most administrators. If you have the disk space, a staging quota that is the same size as the data volume will give the absolute best results.
    • When pre-seeding, always use the most recent backup possible and pre-seed during off-hours. The less data in flux in the staging directory while we run through initial replication, the better. This may seem like a no-brainer, but customers frequently contact us about slow initial sync that they started at 9AM on a Monday with a terabyte of highly dynamic data!
    • The latest firmware, chipset, network and disk drivers from your hardware vendor will usually give an incremental performance increase (and not just with DFSR performance). You wouldn’t dream of running your servers without service packs and security hotfixes – why wouldn’t you treat your hardware the same way?

    Important Technical Notes (updated 2/28/09)

    1. ROBOCOPY - If you use robocopy.exe to pre-seed your data, ensure that the permissions on the replicated folder root (i.e. c:\my_replicated_folder) are identical on the source and target servers before beginning your robocopy commands. Otherwise, when you have robocopy mirror the files and copy the permissions, you will get unnecessary 4412 conflict events and perform redundant replication (your data will be fine). The issue here is in how robocopy.exe handles security inheritance from a root folder, and how that can change the overall hash of a file. So using the command line /COPYALL /MIR /Z /R:0 is perfectly fine as long as the permissions on the source and destination folders are *identical*. After pre-seeding your data with robocopy, you can always use ICACLS.EXE to verify and synchronize the security if necessary.

    2A. NTBACKUP (on Win2003 R2) - If you use NTBACKUP to pre-seed your data on a server that already hosts DFSR data on the same volume (i.e. you are going to use a new Replicated Folder on the E: drive, and some other data was already being replicated to that E: drive), and you plan on restoring from a full disk backup, you need to understand an important behavior. NTBACKUP is aware of DFSR; it will set a restore key under the DFSR service’s key in the registry (HKLM\System\CurrentControlSet\Services\DFSR\Restore\<date time>) and mark the DFSR service with a non-authoritative restore flag for that volume. The DFSR service will be restarted and the Replicated Folders on that volume will do a non-authoritative sync. This should not be destructive to data, but it can mean that your downstream server becomes unresponsive for minutes or hours while it syncs. When DFSR was written, the thought was that NTBACKUP would be used for disaster recovery, where you would certainly be suspicious of the data and the DFSR jet database and want a consistency sync performed at restore time.

    2B. Windows Server Backup (Windows Server 2008 and Windows Server 2008 R2) - same as above, but with newer tools. Do not use NTBACKUP to remotely back up or restore Windows Server 2008 or later. This is unsupported and will mark files HIDDEN and SYSTEM, which you certainly don't want...

    3. XCOPY - The XCOPY /O command works correctly even without having the root folder permissions set identically, unlike robocopy. However, it is certainly not as robust and sophisticated as robocopy in other regards. So XCOPY is a valid option, but maybe not powerful enough for many users.

    4. Third party solutions - be wary of third party tools and test them carefully before committing to using them for wide-scale pre-seeding. The key thing to remember is that the file hash is everything - if DFSR cannot match the upstream and downstream hashes, it will replicate the file on initial sync. This includes file metadata, such as security ACLs (which tools that only calculate checksums do not take into account). In the Windows Server 2008 R2 beta, check out the DFSRDIAG tool to see how we have made this a bit easier for people. If you really need a file hash checking tool, contact us with a support case; we have some internal ones.

    Wrap Up

    Finally – I don’t have numbers here for Windows Server 2008 yet, sorry. I can tell you that DFSR behaves the same way in regards to the staging process. Based on the performance improvements made elsewhere though (specifically the 16 concurrent file downloads combined with asynchronous RPC and IO), it should be much faster, pre-seeded or not; that’s the Win2008 DFSR mandate.

    Happy pollinating,

    - Ned Pyle

  • Accelerating Your IT Career

    Your career isn’t win or lose anymore, it is win or die. The days of guaranteed work, pensions, and sticking with one company for fifty years are gone. Success has returned to something Cro-Magnon man would recognize: if you’re good at what you do, you get to eat.

    I recently spoke to university graduates about their future as new Microsoft engineers. For the first time, that meant organizing my beliefs. They distill into four simple pillars: discipline, technical powerhouse, communication, and legacy. In the tradition of Eric Brechner and in honor of Labor Day, I’d like to share my philosophy.

    Discipline

    Learn constantly, not just when life is forcing you. Read every trustworthy article you can get your hands on, before you need to know it; the time to learn AD replication is not when its failure blocks your schema upgrade. Understanding architecture is the key to deploying and troubleshooting any complex system. If you get nothing else from this post, remember that statement - it can alter your life. For Directory Services, start here.

    Don’t be good at one thing - be amazing at a few things, and good at the rest. We all know someone who's the expert on X. He guards X jealously, making sure he is "indispensable.” Notice how he’s always in a lousy mood: he's not allowing anyone to relieve his boredom and he lives in fear that if anyone does, he'll be replaced. Learn several components inside and out. When you get jaded, move on to a few more and give someone else a turn. You'll still be the expert while they learn and if things get gnarly, you can still save the day. Over time, you become remarkable in many areas. Keep your skills up on the rest so that you can pinch hit when needed. Surround yourself with smart people and absorb their knowledge.

    Admit your mistakes. The only thing worse than making a mistake is trying to cover it up. Eventually, everyone is caught or falls under a career-limiting cloud of suspicion. Now colleagues will remember times they trusted you, and won’t make that "mistake" again. Plead guilty and start serving community service, where you help the team fix the glitch.

    Get a grip. It's never as bad as you think. Losing your composure costs you concentration and brainpower. Remaining emotional and depressed makes you a poor engineer, and a lousy person to be around to boot. Learn how to relax so you can get back to business.

    Never surrender. Your career path is a 45-degree angle leading up to infinity, not an arc - arcs come back down! Keep learning, keep practicing, keep refreshing, keep growing. Keep a journal of "I don't know" topics, and then revisit it weekly to see what you've learned. IT makes this easy: it's the most dynamic industry ever created. In my experience, the Peter Principle is usually a self-induced condition and not the true limit of the individual.

    Technical Powerhouse

    Figure out what makes you remember long term. There is a heck-of-a-lot to know when dealing with complex distributed systems - you can't always stop to look things up. Find a recall technique that works for you and practice it religiously. You’re not cramming for a test; you’re building a library in your brain to serve you for fifty years. No amount of learning will help if you can’t put it to good use.

    Be able to repro anything. When I first came to Microsoft, people had fifteen computers at their desk. Thanks to free virtualization, that nonsense is over and you can run as many test environments as you need, all on one PC. "Oh, but Ned, those virtual machines will cost a fortune!" Gimme a break, it’s walking-around money. A lab pays for itself a thousand times every year, thanks to the rewards of your knowledge and time. It's the best investment you can make. Study and memory are powered by experience.

    Know your dependencies. What does the File Replication Service need to work? DNS, LDAP, Kerberos, RPC. What about AD replication? DNS, LDAP, Kerberos, RPC. Interactive user logon? DNS, LDAP, Kerberos, RPC. Windows developers tend to stick with trusted protocols. If you learn the common building blocks of one component, you become good at many other components. That means you can troubleshoot, design, teach, and recognize risks to them all.

    Understand network captures. It's hard to find an IT system talking only to itself. Notepad, maybe (until you save a file to a network share). There are many free network capture tools out there, and they all have their place. Network analysis is often the only way to know how something works between computers, especially when logging and error messages stink - and they usually do. I'd estimate that network analysis solves a quarter of cases worked in my group. Learn by exploring controlled, working scenarios; the differences become simple to spot in failure captures. Your lab is the key.

    Learn at least one scripting language. PowerShell, CMD, VBS, KiXtart, Perl, Python, WinBatch, etc. – any is fine. Show me an IT pro who cannot script and I'll show you one that grinds too many hours and doesn't get the bonus. Besides making your life easier, scripting may save your business someday and therefore, your career. An introductory programming course often helps, as they teach fundamental computer science and logic that applies to all languages. This also makes dependencies easier to grasp.

    Learn how to search and more importantly, how to judge the results. You can't know everything, and that means looking for help. Most people on the Internet are spewing uninformed nonsense, and you must figure out how to filter them. A vendor is probably trustworthy, but only when talking about their own product. TechNet and KB trump random blogs. Stay skeptical with un-moderated message boards and "enthusiast" websites. Naturally, search results from AskDS are to be trusted implicitly. ;-P

    Communication

    Learn how to converse. I don’t mean talk, I mean converse. This is the trickiest of all my advice: how to be both interesting and interested. The hermit geek in the boiler room - that guy does not get promotions, bonuses, or interesting projects. He doesn't gel with a team. He can't explain his plans or convince anyone to proceed with them. He can't even fill the dead air of waiting… and IT troubleshooting is a lot of waiting. Introverts don’t get the opportunities of extroverts. If I could learn to suppress my fear of heights, you can learn to chat.

    Get comfortable teaching. IT is education. You’re instructing business units in the benefits and behavior of software. You're schooling upper management why they should buy new systems or what you did to fix a broken one. You're coaching your colleagues on network configuration, especially if you don’t want to be stuck maintaining them forever. If you can learn to teach effortlessly and likably, a new aspect to your career opens up. Moreover, there's a tremendous side effect: teaching forces you to learn.

    Learn to like an audience. As you rise in IT, the more often you find yourself speaking to larger groups. Over time they become upper management or experienced peers; an intimidating mix. If you let anxiety or poor skills get in the way, your career will stall. Arm yourself with technique and get out in front of people often. It's easier with practice. Do you think Mark Russinovich gets that fat paycheck for his immaculate hair?

    Project positive. Confidence is highly contagious. When the bullets are flying, people want to follow the guy with the plan and the grin. Even if deep down he's quivering with fear, it doesn’t show and he charges forward, knowing that everyone is behind him. People want to be alongside him when the general hands out medals. Self-assurance spreads throughout an organization and you'll be rewarded for it your whole career. Often by managers who "just can't put their finger" on why they like you.

    Be dominant without domineering. One of the hardest things to teach new employees in Microsoft Support is how to control a conference call. You’re on the phone with a half dozen scared customers, bad ideas are flying everywhere, and managers are interrupting for “status updates”. You can’t be rude; you have to herd the cats gently but decisively. Concentration and firmness are paramount. Not backing down comes with confidence. Steering the useless off to harmless tasks lets you focus (making them think the task is important is the sign of an artist). There's no reason to yell or demand; if you sound decisive and have a plan, everyone will get out of the way. They crave your leadership.

    Legacy

    Share everything. Remember "the expert?" He's on a desert island but doesn't signal passing ships. Share what you learn with your colleagues. Start your own internal company knowledgebase then fill it. Have gab sessions, where you go over interesting topics you learned that week. Talk shop at lunch. Find a reason to hang out with other teams. Set up triages where everyone takes a turn teaching the IT department. Not only do you grow relationships, you're leading and following; everyone is improving, and the team is stronger. A tight team won't crumble under pressure later, and that's good for you.

    Did you ever exist? Invent something. Create documentation, construct training, write scripts, and design new distributed systems. Don’t just consume and maintain - build. When the fifty years have passed, leave some proof that you were on this earth. If a project comes down the pipe, volunteer - then go beyond its vision. If no projects are coming, conceive them yourself and push them through. The world is waiting for you to make your mark.

    I used many synonyms in this post, but not once did I say “job.” Jobs end at quitting time. A career is something that wakes you up at midnight with a solution. I can’t guarantee success with these approaches, but they've kept me happy with my IT career for 15 years. I hope they help with yours.

    Ned "good luck, we're all counting on you" Pyle

  • So long and thanks for all the fish

    My time is up.

    It’s been eight years since a friend suggested I join him on a contract at Microsoft Support (thanks Pete). Eight years since I sat sweating in an interview with Steve Taylor, trying desperately to recall the KDC’s listening port (his hint: “German anti-tank gun”). Eight years since I joined 35 new colleagues in a training room and found that despite my opinion, I knew nothing about Active Directory (“Replication of Absent Linked Object References – what the hell have I gotten myself into?”).

    Eight years later, I’m a Senior Support Escalation Engineer, a blogger of some repute, and a seasoned world traveler who instructs other ‘softies about Windows releases. I’ve created thousands of pages of content and been involved in countless support cases and customer conversations. I am the last of those 35 colleagues still here, but there is proof of my existence even so. It’s been the most satisfying work of my career.

    Just the thought of leaving was scary enough to give me pause – it’s been so long since I knew anything but supporting Windows. It’s a once in a lifetime opportunity though and sometimes you need to reset your career. Now I’ll help create the next generations of Windows Server and the buck will finally stop with me: I’ve been hired as a Program Manager and am on my way to Seattle next week. I’m not leaving Microsoft, just starting a new phase. A phase with a lot more product development, design responsibility, and… meetings. Soooo many meetings.

    There are two types of folks I am going to miss: the first are workmates. Many are support engineers, but also PFEs, Consultants, and TAMs. Even foreigners! Interesting and funny people fill Premier and Commercial Technical Support and make every day here enjoyable, even after the occasional customer assault. There’s nothing like a work environment where you really like your colleagues. I’ve sat next to Dave Fisher since 2004 and he’s made me laugh every single day. He is a brilliant weirdo, like so many other great people here. You all know who you are.

    The other folks are… you. Your comments stayed thought provoking and fresh for five years and 700 posts. Your emails kept me knee deep in mail sacks and articles (I had to learn in order to answer many of them). Your readership has made AskDS into one of the most popular blogs in Microsoft. You unknowingly played an immense part in my career, forcing me to improve my communication; there’s nothing like a few hundred thousand readers to make you learn your craft.

    My time as the so-called “editor in chief” of AskDS is over, but I imagine you will still find me on the Internet in my new role, yammering about things that I think you’ll find interesting. I also have a few posts in the chamber that Jonathan or Mike will unload after I’m gone, and they will keep the site going. AskDS will continue to be a place for unvarnished support information about Windows technologies, where your questions will get answers.

    Thanks for everything, and see you again soon.

    [Photo]
    We are looking forward to Seattle’s famous mud puddles


    - Ned “42” Pyle

  • The yuck that is "PC Recycle Day" at Microsoft

    Hey all, Ned here again. Still no ETA on Win8 word, and we've already discussed everything else on Earth ( ;-P ) so now I will share with you some insider knowledge of working in Microsoft Charlotte: the quarterly "PC Recycle Day". Here's an example of what I just saw on my way to get some coffee.

    A couple of these are fairly hard to identify unless you are as old as Jonathan. Take a stab at them in the Comments, if you dare to date yourself. If you've used them all, give yourself a pat on the back - you are really close to retirement.

    Update: Woo, a particularly crusty late arrival from the Networking team! They may upset the perennial Setup team favorites here and win it all this year, folks.


    Update 2: a funeral pyre for once-dominant protocols


    Have a nice weekend,


    - Ned "spring chicken" Pyle


  • Managing RID Pool Depletion

    Hiya folks, Ned here again. When interviewing a potential support engineer at Microsoft, we usually start with a softball question like “what are the five FSMO roles?” Everyone nails that. Then we ask what each role does. Their face scrunches a bit and they get less assured. “The RID Master… hands out RIDs.” Ok, what are RIDs? “Ehh…”

    That’s trouble, and not just for the interview. Poor understanding of the RID Master prevents you from adding new users, computers, and groups, which can disrupt your business. Uncontrolled RID creation forces you to abandon your domain, which will cost you serious money.

    Today, I discuss how to protect your company from uncontrolled RID pool depletion and keep your domain bustling for decades to come.

    Background

    Relative Identifiers (RID) are the incremental portion of a domain Security Identifier (SID). For instance:

    S-1-5-21-1004336348-1177238915-682003330-2100

    ==>

    S-1-5-Domain Identifier-Relative Identifier

    A SID represents a unique trustee, also known as a "security principal" – typically users, groups, and computers – that Windows uses for access control. Without a matching SID in an access control list, you cannot access a resource or prove your identity. It’s the lynchpin.

    Every domain has a RID Master: a domain controller that hands each DC a pool of 500 RIDs at a time. A domain contains a single RID pool which generates roughly one billion SIDs (because of a 30-bit length, it’s 2^30 – 1, or 1,073,741,823 RIDs). Once issued, RIDs are never reused. You can’t reclaim RIDs after you delete security principals either, as that would lead to unintended access to resources that contained previously issued SIDs.

    Anytime you create a writable DC, it gets 500 new RIDs from the RID Master. Meaning, if you promote 10 domain controllers, you’ve issued 5000 new RIDs. If 8 of those DCs are demoted, then promoted back up, you have now issued 9000 RIDs. If you restore a system state backup onto one of those DCs, you’ve issued 9500 RIDs. The balance of any existing RIDs issued to a DC is never saved – once issued they’re gone forever, even if they aren’t used to create any users. A DC requests more RIDs when it gets low, not just when it is out, so when it grabs another 500 that becomes part of its "standby" pool. When the current pool is empty, the DC switches to the standby pool. Repeat until doomsday.

    Adding more trustees means issuing more blocks of RIDs. When you’ve issued the one billion RIDs, that’s it – your domain cannot create users, groups, computers, or trusts. Your RID Master logs event 16644: “The maximum domain account identifier value has been reached.” Time for a support case.

    You’re now saying something like, “One billion RIDs? Pffft. I only have a thousand users and we only add fifty a year. My domain is safe.” Maybe. Consider all the normal ways you “issue” RIDs:

    • Creating users, computers, and groups (both Security and email Distribution) as part of normal business operations.
    • The promotion of new DCs.
    • A gracefully demoted DC costs its remaining RID pool.
    • System state restore on a DC invalidates the local RID pool.
    • Active Directory domains upgraded from NT 4.0 inherit all the RIDs from that old environment.
    • Seizing the RID Master FSMO role to another server

    Now study the abnormal ways RIDs are wasted:

    • Provisioning systems or admin scripts that accidentally bulk create users, groups, and computers.
    • Attempting to create enabled users that do not meet password requirements
    • DCs turned off longer than tombstone lifetime.
    • DC metadata cleaned.
    • Forest recovery.
    • The InvalidateRidPool operation.
    • Increasing the RID Block Size registry value.

    The normal operations are out of your control and unlikely to cause problems even in the biggest environments. For example, even though Microsoft’s Redmond AD dates to 1999 and holds the vast majority of our resources, it has only consumed ~8 million RIDs - that's 0.7%. In contrast, some of the abnormal operations can lead to squandered RIDs or even deplete the pool altogether, forcing you to migrate to a new domain or recover your forest. We’ll talk more about them later; regardless of how you are using RIDs, the key to avoiding a problem is observation.

    Monitoring

    You now have a new job, IT professional: monitoring your RID usage and ensuring it stays within expected patterns. KB305475 describes the attributes for both the RID Master and the individual DCs. I recommend giving it a read, as the data storage requires conversion for human consumption.

    Monitoring the RID Master in each domain is adequate, and we offer a simple command-line tool I’ve discussed before: DCDIAG.EXE. Part of Windows Server 2008 and later, or a free download for 2003, it has a simple test that shows the translated number of allocated RIDs from the rIDAvailablePool attribute:

    Dcdiag.exe /test:ridmanager /v

    For example, my RID Master has issued 3100 RIDs to my DCs and itself:

    [Screenshot: dcdiag /test:ridmanager output showing the allocated RID pool]

    If you just want the good bit, perhaps for batching:

    Dcdiag.exe /TEST:RidManager /v | find /i "Available RID Pool for the Domain"

    For PowerShell, here is a slightly modified version of Brad Rutkowski's original sample function. It converts the high and low parts of riDAvailablePool into readable values:

    function Get-RIDsRemaining
    {
        param ($domainDN)

        # Bind to the RID Manager object and read its rIDAvailablePool attribute
        $de = [ADSI]"LDAP://CN=RID Manager$,CN=System,$domainDN"
        $searcher = New-Object System.DirectoryServices.DirectorySearcher($de)
        $property = ($searcher.FindOne()).properties.ridavailablepool

        # High 32 bits = total RIDs the domain can issue; low 32 bits = RIDs issued so far
        [int32]$totalSIDS = $($property) / ([math]::Pow(2,32))
        [int64]$temp64val = $totalSIDS * ([math]::Pow(2,32))
        [int32]$currentRIDPoolCount = $($property) - $temp64val
        $ridsremaining = $totalSIDS - $currentRIDPoolCount

        Write-Host "RIDs issued: $currentRIDPoolCount"
        Write-Host "RIDs remaining: $ridsremaining"
    }
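    Calling it just takes your domain’s distinguished name (the DN below is hypothetical):

    Get-RIDsRemaining "dc=fabrikam,dc=com"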

    [Screenshot: Get-RIDsRemaining output]

    Another sample, if you want to use the Active Directory PowerShell module and target the RID Master directly:

    function Get-RIDsremainingAdPsh
    {
        param ($domainDN)

        # Ask the RID Master directly, using the Active Directory PowerShell module
        $property = Get-ADObject "cn=rid manager$,cn=system,$domainDN" -Property ridavailablepool -Server ((Get-ADDomain $domainDN).RidMaster)
        $rid = $property.ridavailablepool

        # High 32 bits = total RIDs the domain can issue; low 32 bits = RIDs issued so far
        [int32]$totalSIDS = $($rid) / ([math]::Pow(2,32))
        [int64]$temp64val = $totalSIDS * ([math]::Pow(2,32))
        [int32]$currentRIDPoolCount = $($rid) - $temp64val
        $ridsremaining = $totalSIDS - $currentRIDPoolCount

        Write-Host "RIDs issued: $currentRIDPoolCount"
        Write-Host "RIDs remaining: $ridsremaining"
    }

    [Screenshot: Get-RIDsremainingAdPsh output]

    Turn one of those PowerShell samples into a script that runs as a scheduled task that updates a log every morning and alerts you to review it. You can also use LDP.EXE to convert the RID pool values manually every day, if you are an insane person.
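    One way to wire that up, assuming you have saved one of the functions above into a script that appends its output to a log (the task name and script path are hypothetical):

    # Create a daily 6:00 AM scheduled task that runs the RID check script.
    schtasks.exe /Create /TN "RID pool check" /SC DAILY /ST 06:00 /TR "powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\Check-RidPool.ps1"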

    You should also consider monitoring the RID Block Size, as any increase exhausts your global RID pool faster. Object Access Auditing can help here. There are legitimate reasons to increase this value on certain DCs. For example, if you are the US Marine Corps and your DCs are in a warzone where they may not be able to talk to the RID Master for weeks. Be smart about picking values - you are unlikely to need five million RIDs before talking to the master again; when the DC comes home, lower the value back to default.

    The critical review points are:

    1. You don’t see an unexpected rise in RID issuance.
    2. You aren’t close to running out of RIDs.

    Let’s explore what might be consuming RIDs unexpectedly.

    Diagnosis

    If you see a large increase in RID allocation, the first step is finding what was created and when. As always, my examples are PowerShell. You can find plenty of others using VBS, free tools, and whatnot on the Internet.

    You need to return all users, computers, and groups in the domain – even if deleted. You need the SAM account name, creation date, SID, and USN of each trustee. There are going to be a lot of these, so filter the returned properties to save time and export to a CSV file for sorting and filtering in Excel. Here’s a sample (it’s one wrapped line):

    Get-ADObject -Filter 'objectclass -eq "user" -or objectclass -eq "computer" -or objectclass -eq "group"' -properties objectclass,samaccountname,whencreated,objectsid,uSNCreated -includeDeletedObjects | select-object objectclass,samaccountname,whencreated,objectsid,uSNCreated | Export-CSV riduse.csv -NoTypeInformation -Encoding UTF8

    Here I ran the command, then opened in Excel and sorted by newest to oldest:

    [Screenshot: exported object list in Excel, sorted newest to oldest]
    Errrp, looks like another episode of “scripts gone wild”…

    Now it’s noodle time:

    • Does the user count match actual + previous user counts (or at least in the ballpark)?
    • Are there sudden, massive blocks of object creation?
    • Is someone creating and deleting objects constantly – or was it just once and you need to examine your audit logs to see who isn’t admitting it?
    • Has your user provisioning system gone berserk (or run by someone who needs… coaching)?
    • Have you changed your password policy and are now trying to create enabled users that do not meet password requirements (this uses up a RID for each failed creation attempt)?
    • Do you use a VDI system that constantly creates and deletes computer accounts when provisioning virtual machines - we’ve seen those too: in one case, a third party solution was burning 4 million computer RIDs a month.

    If the RID allocations are growing massively, but you don’t see a subsequent increase in new trustees, it’s likely someone increased RID Block Size inappropriately. Perhaps they set hexadecimal rather than decimal values – instead of the intended 15,000 RIDs per allocation, for example, you’d end up with 86,016!
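    That mistake is easy to demonstrate from PowerShell:

    # 15000 intended as decimal, but interpreted as hexadecimal:
    [Convert]::ToInt32('15000', 16)   # returns 86016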

    It may also be useful to know where the updates are coming from. Examine each DC’s RidAllocationPool for increases to see if something is running on - or pointed at – a specific domain controller.

    Recovery

    You know there’s a problem. The next step is to stop things getting worse (as you have no way to undo the damage without recovering the entire forest).

    If you identified the cause of the RID exhaustion, stop it immediately; your domain’s health is more important. If that system continues in high enough volume, it’s going to force you to abandon your domain.

    If you can’t find the cause and you are anywhere near the end of your one billion RIDs, get a system state backup on the RID Master immediately. Then transfer the RID Master role to a non-essential DC that you shut down to prevent further issuance. The allocated RID pools on your DCs will run out, but that stops further damage. This gives you breathing space to find the bad guy. The downside is that legitimate trustee creation stops also. If you don’t already have a Forest Recovery process in place, you had better get one going. If you cannot figure out what's happening, open a support case with us immediately.

    No matter what, you cannot let the RID pool run out. If you see:

    • SAM Event 16644
    • riDAvailablePool is “4611686015206162431
    • DCDIAG “Available RID Pool for the Domain is 1073741823 of 1073741823

    ... it is too late. Like having a smoke detector that only goes off when the house has burned down. Now you cannot create a trust for a migration to another domain. If you reach that stage, open a support case with us immediately. This is one of those “your job depends on it” issues, so don’t try to be a lone gunfighter.

    Many thanks to Arren “cowboy killer” Connor for his tireless efforts and excellent internal docs around this scenario.

    Finally, a tip: know all the FSMO roles before you interview with us. If you really want to impress, know that the PDC Emulator does more than just “emulate a PDC”. Oy vey.


    UPDATE 11/14/2011:

    Our seeds to improve the RID Master have begun growing and here's the first ripe fruit - http://support.microsoft.com/kb/2618669


    Until next time.

    Ned “you can’t get RID of me that easily” Pyle