
  • Ask the Directory Services Team

    Introducing Auditing Changes in Windows 2008


    Hi, Dave here. Auditing is a wonderful tool and we highly recommend that everyone use it on their servers.  It can really help out with diagnosing problems and determining the root cause, and of course with protecting your servers and your network.  However, over-auditing your servers can be a bad thing.  The reason that too much auditing is bad is that it can flood the event logs with extraneous information.  This makes finding the events that you DO need very difficult and time consuming.  Not to mention that your security logs can only hold so many events – here in DS, we routinely run up against systems where the security log wraps every few minutes, making those logs almost useless for any troubleshooting.

    The resolution to this problem, of course, is to audit only the things you really need to audit.  That’s easier said than done though.  Or at least it used to be.

    You’re probably familiar with this from Windows Server 2003:

    image

    This is the default audit policy that comes in when you install your first 2003 domain controller.  As you can see, we turn on auditing for several categories.  This generally results in some number of events in the security event logs, but not so much as to cause problems.  Obviously a lot depends on the domain and what people are doing on the network.

    What many administrators do is add to this default policy.  In fact, in order to comply with legal requirements like Sarbanes-Oxley, many companies are required to add to this policy.  It’s often in these scenarios where the amount of auditing on the system begins to cause problems for administrators who need to be able to find and act on events in the logs.

    In Windows 2008 we decided to change the way auditing works to help with this problem.  If you install a brand new Windows 2008 domain controller in a fresh domain, here’s what you get:

    image

    Don’t panic!  You’re still getting an audit policy on your domain controllers; it’s just not visible here.  Open up a command prompt and run auditpol.exe /get /Category:* and here is what you’ll see:

    image

    Obviously we’ve added quite a bit.  You’ll notice that the original top-level categories, like Account Logon and Object Access, are still there.  There are also a few new categories.

    So now we have all these cool new subcategories which you can use to manage auditing in a much more granular way.  This is really helpful for making sure that you’re only auditing the things you need.  For example, if we were trying to troubleshoot a replication problem and wanted to enable some auditing to see what was happening, instead of turning on the entire DS Access category, we would just turn on the subcategory for Directory Service Replication.  Now we can see detail as specific as:

    ===

    Log Name: Security
    Source: Microsoft-Windows-Security-Auditing
    Date: 10/11/2007 6:07:13 PM
    Event ID: 4932
    Task Category: Directory Service Replication
    Level: Information
    Keywords: Audit Success
    User: N/A
    Computer: 2008SRV10.cohowinery.com
    Description:
    Synchronization of a replica of an Active Directory naming context has begun.

    Destination DRA: CN=NTDS Settings,CN=2008SRV10,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=cohowinery,DC=com
    Source DRA: CN=NTDS Settings,CN=2008SRV11,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=cohowinery,DC=com
    Naming Context:    DC=ForestDnsZones,DC=cohowinery,DC=com
    Options: 19
    Session ID: 46
    Start USN: 16449

    ===

    Log Name: Security
    Source: Microsoft-Windows-Security-Auditing
    Date: 10/11/2007 6:07:13 PM
    Event ID: 4933
    Task Category: Directory Service Replication
    Level: Information
    Keywords: Audit Success
    User: N/A
    Computer: 2008SRV10.cohowinery.com
    Description:
    Synchronization of a replica of an Active Directory naming context has ended.

    Destination DRA: CN=NTDS Settings,CN=2008SRV10,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=cohowinery,DC=com
    Source DRA: CN=NTDS Settings,CN=2008SRV11,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=cohowinery,DC=com
    Naming Context:    DC=ForestDnsZones,DC=cohowinery,DC=com
    Options: 19
    Session ID: 46
    End USN: 16483
    Status Code: 0

    ===

    The benefit of doing things this way is that it allows us to get much more specific about what we audit, and the net result is that we’ll be able to audit more meaningful information while generating less noise in the event logs.
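As a rough sketch of the replication example above, the snippet below builds the auditpol.exe command line that turns on success auditing for just one subcategory. The helper name build_auditpol_set is made up for illustration; the /set, /subcategory, /success and /failure switches are auditpol's own. The actual call is left commented out because it needs an elevated prompt on the server you want to change.

```python
import subprocess

def build_auditpol_set(subcategory, success=True, failure=False):
    """Return the argv list that sets auditing for one subcategory.

    build_auditpol_set is a hypothetical helper, not part of any Microsoft tool.
    """
    return [
        "auditpol.exe", "/set",
        f"/subcategory:{subcategory}",
        f"/success:{'enable' if success else 'disable'}",
        f"/failure:{'enable' if failure else 'disable'}",
    ]

cmd = build_auditpol_set("Directory Service Replication")

# Show the command as it would be typed at a prompt.
print(subprocess.list2cmdline(cmd))

# To apply it on the server itself (elevated prompt required):
# subprocess.run(cmd, check=True)
```

Running auditpol.exe /get /Category:* again afterwards should show the change for that one subcategory only.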

    To manage the new auditing you will need to use auditpol.exe.  If you try to set up a group policy you’ll only have the top-level categories available, and turning one of those on turns on all the subcategories underneath it.  However, all hope is not lost.  You can still centrally administer your audit policies; you just have to do it via a script.  Ned Pyle wrote KB article 921469 that talks about how to do this for Vista – the process is exactly the same for 2008.

    The main reason that the policies only manage the top-level categories is backwards compatibility.  2003 and XP don’t understand the granular auditing subcategories, and their security client-side extensions (the engine that reads the policies and implements the settings) wouldn’t really know what to do with a policy containing those settings.  The amount of code work required to enable backwards compatibility for the policy engine and still allow the policy to manage the new subcategories is pretty extensive, and so the product team wasn’t able to get the change in for 2008 RTM.

    If you’d like to examine all the possible audit events that can be generated in Windows Vista or Windows Server 2008, you can execute the following command (in an elevated CMD prompt as an Administrator):

    wevtutil gp Microsoft-Windows-Security-Auditing /ge /gm:true

    Here’s a snippet from the output:

    event:
        value: 4729
        version: 0
        opcode: 0
        channel: 10
        level: 4
        task: 0
        keywords: 0x8000000000000000
        message: A member was removed from a security-enabled global group.
    Subject:
            Security ID:            %6
            Account Name:           %7
            Account Domain:         %8
            Logon ID:               %9
    Member:
            Security ID:            %2
            Account Name:           %1
    Group:
            Security ID:            %5
            Group Name:             %3
            Group Domain:           %4
    Additional Information:
            Privileges:             %10
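The full output is long, so here is a minimal sketch for boiling it down to one line per event. It assumes the text has the same layout as the snippet above (a "value:" line, later followed by a "message:" line for the same event); list_events is a hypothetical helper name, and only the first line of each message is kept.

```python
import re

def list_events(wevtutil_output):
    """Yield (event_id, first_message_line) pairs from the wevtutil text.

    Assumes each event has a 'value:' line followed later by a 'message:' line.
    """
    event_id = None
    for line in wevtutil_output.splitlines():
        m = re.match(r"\s*value:\s*(\d+)\s*$", line)
        if m:
            event_id = int(m.group(1))
            continue
        m = re.match(r"\s*message:\s*(.*)$", line)
        if m and event_id is not None:
            yield event_id, m.group(1).strip()
            event_id = None

# A shortened copy of the snippet above, used as sample input.
sample = """event:
    value: 4729
    version: 0
    message: A member was removed from a security-enabled global group.
Subject:
        Security ID:            %6
"""
events = list(list_events(sample))
print(events)
```

To use it on real data, redirect the wevtutil command to a text file and pass the file contents to list_events.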


    Obviously all of this is just the tip of the iceberg – there are around 360 events here, after all.  You may see us talk about more specific auditing scenarios from time to time here.  You can also find quite a bit more information related to auditing at the Windows Security Logging and Other Esoterica blog.

    - David Beach


    Documenting Active Directory Infrastructure the Easy Way


    Hi, Ned here. From time to time customers ask us what their options are to document their Active Directory environments – site topologies, domains and trusts, where group policies are linked and what their settings are. Until recently we didn’t have an easy way to do this and they were forced to invest a lot of manual labor in creating a map. Today we’ll talk about some free tools we can use to make this task very easy and accurate. I’m going to focus on the most common areas:

    •    Domain and Forest information
    •    OU Structures
    •    Sites
    •    Exchange
    •    Group Policy settings and links

    To do this we’re going to use two automation utilities that you will need to download and install:

    •    Active Directory Topology Diagrammer
    •    Group Policy Management Console

    For the ADTD you will also need Microsoft Office Visio 2003 or 2007 and .NET 2.0 Framework. If you’re using the GPMC that comes with Windows Vista you will need to download the GPMC scripts separately. For this example we’ll assume you’re on XP.

    GPMC is a centralized management and reporting tool for administering group policy. It includes some very useful (and well-hidden) scripts. ADTD is a newly released tool that can interrogate domain controllers about configuration data and create Visio diagrams that document your environment. When combined using the techniques below, that extremely boring and time-consuming documentation project you had in front of you is only going to take hours instead of weeks, leaving you free for more important things.

    So let’s get started:

    1. Install Visio, ADTD, and GPMC on a Windows XP Professional workstation or Windows Server 2003 server.

    2. Start ADTD (it’s called ‘Microsoft Active Directory Topology Diagrammer’ on the Start Menu)

    3. Now we’ll walk through the settings tabs to configure our data collection:

    clip_image002

    Enter a local (to you) Global Catalog Domain Controller that you can interrogate with the tool. The actual LDAP queries to the GC only take a few seconds in most cases and should not generate any appreciable load – most of the heavy lifting in ADTD is local to your client in Visio. Add your trust settings (if you have more than one domain or multiple forests with trusts). You can also count your users per domain and identify all your GC’s. Using the default of ‘Use DNS and connect to each domain’ means that the tool will also connect to one DC in any trusting domains, but again, the amount of data returned will tend to be fairly small.

    image

    On the OU’s tab you can select to draw out all your Organizational Units. Most of the time you’ll want to avoid limiting the depth, since otherwise your diagram will be incomplete.

    clip_image002[6]

    On the Sites tab you can specify that Site Links, Replication Connections, and subnets are drawn. Avoid using the ‘suppress empty sites’ setting as it’s useful to see locations using Automatic Site Coverage.

    image

    If you’re using Microsoft Exchange the Exchange tab can help diagram your Exchange Organization, where the connections are, the number of mailboxes per server, and even tie them to their logical AD sites so that you know which DC/GC combinations are servicing your messaging infrastructure.

    clip_image002[8]

    If you're using Windows Server 2003 domain or forest-based AD-integrated DNS, you can also opt to show which DC’s are hosting those partitions.

    image

    Finally, with ADTD you can get additional server information such as fully qualified domain names, operating systems and service pack, then color-code them for easier reading. This is especially useful in extremely large, complex environments where DC’s from many different domains are collocated in the same AD site within the same forest.

    4. To execute your query, click Discover. After a few moments it will complete the LDAP lookups and will gray out. Click Draw, and go get a cup of coffee (or lunch, if you’re running hundreds of DC’s) – Visio will crank away creating all of the diagrams for some time. When it’s done, control will return to the ADTD application and you can close it.

    So now we have some Visio diagrams that will be in your My Documents folder (by default; you can change this in ADTD's options menu). In the example below we have:

    • A domain called fabrikam.com with two DC’s and an Exchange server.
    • A child domain called fabchild.fabrikam.com with a single DC.
    • An externally-trusted domain called blueyonderairlines.com
    • An externally-trusted child domain called byachild.blueyonderairlines.com

    So let’s look at what Visio gave us:

    image

    Above is the AD Domains.vsd. It shows our four domains and their trusts. Let’s zoom in on the FABRIKAM domain:

    image

    We have 45 total users on our two DC’s. All the FSMO role holders are identified, as well as the schema version and what domain functional mode we’re in. If we move on to the AD Sites.vsd:

    clip_image002[10]

    We can see that my Fabrikam.com forest has two sites, has several subnets bound to them, and there are connections between the DC’s. Let’s zoom in on that Main-Office site:

    image

    Nifty – we can see the GC’s, the subnet details, the intra and inter-site connections, the Site Link costs and schedule, and even the DC running the ISTG. If you want more detail on all these components check out the highly detailed How Active Directory Replication Topology Works.

    Moving on to the AD Application Partitions.vsd, we can see that only two root domain DC’s are using 2003-style integrated DNS:

    clip_image002[12]

    Since we have an Exchange 2003 server in this environment, Ex Organization.vsd shows us that it has affinity with the Main-Office site.

    clip_image004

    By zooming in we can see that server 2003SRV12 is part of the ‘First Administrative Group’ and is running Exchange 2003 Service Pack 2. It has 32 mailboxes. Any DC/GC lookups it’s doing should be happening against the two DC’s in this site.

    clip_image006

    Finally for ADTD, we come to the OU diagram. The diagrammer can list out all the OU’s (below is a snippet), but other than telling us that a Group Policy Object is linked to a given location, it doesn’t tell us much about the policies themselves.

    image

    So here’s where GPMC scripting kicks in:

    1. We open a CMD prompt on our data gathering machine and (assuming we installed to default path) navigate to:

    C:\Program Files\GPMC\Scripts

    2. We type:

    MD c:\GPMCReports

    3. We execute (using our example domain):

    Cscript ListSOMPolicyTree.wsf /domain:fabrikam.com > c:\gpmcreports\fabrikamgpotree.txt

    4. This produces the file c:\gpmcreports\fabrikamgpotree.txt. If we open it we see:

    === GPO Links for domain fabrikam.com ===

    DC=fabrikam
          GPO=Default Domain Policy
          GPO=AllCheck

       OU=Domain Controllers
             GPO=Default Domain Controllers Policy
       OU=csc
       OU=UserRepros
       OU=foo
          OU=rar
       OU=RenamedPuters
       OU=gpotest
       OU=FolderRedir
       OU=RedirectedTest
       OU=nested1
          OU=nested2
                GPO=Logoff Screensaver
             OU=nested3
                OU=nested4
                      GPO=Password Screensaver
       OU=wmi
       OU=Admins
       OU=Exchange
             GPO=No Boot

    === GPO Links for sites in forest DC=fabrikam,DC=com ===

       CN=Main-Office
       CN=Remote-Office

    5. We execute in our command prompt:

    Cscript GetReportsForAllGPOs.wsf c:\gpmcreports /domain:fabrikam.com

    6. This returns all of our policy settings for Fabrikam.com to the c:\gpmcreports folder:

    == Found 6 GPOs in fabrikam.com

    Generating XML report for GPO 'No Boot'
    Generating HTML report for GPO 'No Boot'

    Generating XML report for GPO 'Default Domain Policy'
    Generating HTML report for GPO 'Default Domain Policy'

    Generating XML report for GPO 'Logoff Screensaver'
    Generating HTML report for GPO 'Logoff Screensaver'

    Generating XML report for GPO 'Password Screensaver'
    Generating HTML report for GPO 'Password Screensaver'

    Generating XML report for GPO 'Default Domain Controllers Policy'
    Generating HTML report for GPO 'Default Domain Controllers Policy'

    Generating XML report for GPO 'AllCheck'
    Generating HTML report for GPO 'AllCheck'

    Report generation succeeded for 12 reports.
    Report generation failed for 0 reports.

    7. If we open one of these HTML reports, we can see everything there is to know about that policy. For example, we’ll open the ‘Password Screensaver’ GPO which is linked at OU ‘nested4’; there’s great stuff here…

    clip_image002[14]

    Like settings detail above.

    clip_image004[6]

    Or version history and status.

    clip_image006[5]

    Or delegation.

    Since all this is HTML and XML, you could simply link these live into your OU VSD’s, or get fancier and automate the importation of XML data (using Visio skills far better than mine!). Worst case you’re doing a little copying and pasting from the fabrikamgpotree.txt to update your GPO information instead of hand-crafting thousands of objects and settings.
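If copying and pasting from fabrikamgpotree.txt gets old, the text is regular enough to pull apart with a few lines of script. The sketch below is one possible approach, not part of GPMC: it assumes the layout shown in step 4, where each container line (DC=, OU= or CN=) is followed by the GPO= lines linked to it. The gpo_links name is hypothetical, and two OUs with the same name at different depths would be merged, so treat it as a starting point.

```python
def gpo_links(tree_text):
    """Map each container line (DC=, OU=, CN=) to the GPO names listed under it.

    Assumes a container line is followed by zero or more 'GPO=' lines
    that belong to it, as in the ListSOMPolicyTree.wsf output above.
    """
    links = {}
    current = None
    for raw in tree_text.splitlines():
        line = raw.strip()
        if line.startswith("GPO="):
            if current is not None:
                links[current].append(line[len("GPO="):])
        elif line.startswith(("DC=", "OU=", "CN=")):
            current = line
            links.setdefault(current, [])
    return links

# A shortened copy of the example output, used as sample input.
sample = """\
=== GPO Links for domain fabrikam.com ===

DC=fabrikam
      GPO=Default Domain Policy
      GPO=AllCheck

   OU=Domain Controllers
         GPO=Default Domain Controllers Policy
   OU=csc
   OU=nested1
      OU=nested2
            GPO=Logoff Screensaver
"""
for container, gpos in gpo_links(sample).items():
    if gpos:
        print(container, "->", ", ".join(gpos))
```

The resulting dictionary can then be matched against the report file names in c:\gpmcreports to tie each OU to its HTML report.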

    Now all you need is that $20,000 color plotter so you can print out your diagram wall-sized…

    Further useful group policy information:

    Group Policy Management Console Scripting Samples

    GP Settings Reference

    Thanks for your time. If you come up with some slick Visio tricks please comment here and share them with the world!

    - Ned Pyle


    Top 10 Common Causes of Slow Replication with DFSR


    Hi, Ned again. Today I’d like to talk about troubleshooting DFS Replication (i.e. the DFSR service included with Windows Server 2003 R2, not to be confused with the File Replication Service). Specifically, I’ll cover the most common causes of slow replication and what you can do about them.

    Update: Make sure you also read this much newer post to avoid common mistakes that can lead to instability or poor performance: http://blogs.technet.com/b/askds/archive/2010/11/01/common-dfsr-configuration-mistakes-and-oversights.aspx

    Let’s start with ‘slow’. This loaded word is largely a matter of perception. Maybe DFSR was once much faster and you see it degrading over time? Has it always been too slow for your needs and now you’ve just gotten fed up? What will you consider acceptable performance so that you know when you’ve gotten it fixed? There are some methods that we can use to quantify what ‘slow’ really means:

    · DFSMGMT.MSC Health Reports

    We can use the DFSR Diagnostic Reports to see how big the backlog is between servers and if that indicates a slowdown problem:

    clip_image002

    The generated report will tell you sending and receiving backlogs in an easy to read HTML format.

    · DFSRDIAG.EXE BACKLOG command

    If you’re into the command-line you can use the DFSRDIAG BACKLOG command (with options) to see how far behind servers are in replication and if that indicates a slowdown. Dfsrdiag is installed when you install DFSR on the server. So for example:

    dfsrdiag backlog /rgname:slowrepro /rfname:slowrf /sendingmember:2003srv13 /receivingmember:2003srv17

    Member <2003srv17> Backlog File Count: 10
    Backlog File Names (first 10 files)
         1. File name: UPDINI.EXE
         2. File name: win2000
         3. File name: setupcl.exe
         4. File name: sysprep.exe
         5. File name: sysprep.inf.pro
         6. File name: sysprep.inf.srv
         7. File name: sysprep_pro.cmd
         8. File name: sysprep_srv.cmd
         9. File name: win2003
         10. File name: setupcl.exe

    This command shows up to the first 100 file names, and also gives an accurate snapshot count. Running it a few times over an hour can give you some basic trends. Note that hotfix 925377 resolves an error you may receive when continuously querying backlog, although you may want to consider installing the more current DFSR.EXE hotfix which is 931685. Review the recommended hotfix list for more information.
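If you want to track that trend without eyeballing the output each time, a small script can pull the number out for you. This is only a sketch: backlog_count is a made-up helper, and it relies on nothing more than the "Backlog File Count:" text shown in the sample output above, returning None when that text is not present.

```python
import re

def backlog_count(dfsrdiag_output):
    """Return the number after 'Backlog File Count:', or None if it is absent."""
    m = re.search(r"Backlog File Count:\s*(\d+)", dfsrdiag_output)
    return int(m.group(1)) if m else None

# First two lines of the sample output above.
sample = (
    "Member <2003srv17> Backlog File Count: 10\n"
    "Backlog File Names (first 10 files)\n"
)
print(backlog_count(sample))
```

Capture the output of the dfsrdiag backlog command every ten minutes or so, feed it through backlog_count, and you have the numbers for a simple trend line.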

    · Performance Monitor with DFSR Counters enabled

    DFSR updates the Perfmon counters on your R2 servers to include three new objects:

    • DFS Replicated Folders
    • DFS Replication Connections
    • DFS Replication Service Volumes

    Using these allows you to see historical and real-time statistics on your replication performance, including things like total files received, staging bytes cleaned up, and file installs retried – all useful in determining what true performance is as opposed to end user perception. Check out the Windows Server 2003 Technical Reference for plenty of detail on Perfmon and visit our sister AskPerf blog.

    · DFSRDIAG.EXE PropagationTest and PropagationReport

    By running DFSRDIAG.EXE you can create test files then measure their replication times in a very granular way. So for example, here I have three DFSR servers – 2003SRV13, 2003SRV16, and 2003SRV17. I can execute from a CMD line:

    dfsrdiag propagationtest /rgname:slowrepro /rfname:slowrf /testfile:canarytest2

    (wait a few minutes)

    dfsrdiag propagationreport /rgname:slowrepro /rfname:slowrf /testfile:canarytest2 /reportfile:c:\proprep.xml

    PROCESSING MEMBER 2003SRV17 [1 OUT OF 3]
    PROCESSING MEMBER 2003SRV13 [2 OUT OF 3]
    PROCESSING MEMBER 2003SRV16 [3 OUT OF 3]

    Total number of members            : 3
    Number of disabled members         : 0
    Number of unsubscribed members     : 0
    Number of invalid AD member objects: 0
    Test file access failures          : 0
    WMI access failures                : 0
    ID record search failures          : 0
    Test file mismatches               : 0
    Members with valid test file       : 3

    This generates an XML file with time stamps for when a file was created on 2003SRV13 and when it was replicated to the other two nodes.

    clip_image004

    The time stamp is in FILETIME format which we can convert with the W32tm tool included in Windows Server 2003.

    <MemberName>2003srv17</MemberName>
    <CreateTime>128357420888794190</CreateTime>
    <UpdateTime>128357422068608450</UpdateTime>

    w32tm /ntte 128357420888794190
    148561 19:54:48.8794190 - 10/1/2007 3:54:48 PM (local time)

    w32tm /ntte 128357422068608450
    148561 19:56:46.8608450 - 10/1/2007 3:56:46 PM (local time)

     

    So around two minutes later our file showed up. Incidentally, this is something you can do in the GUI on Windows Server 2008 and it even gives you the replication time in a format designed for human beings!
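If you have a lot of these to convert, the same arithmetic is easy to script. A FILETIME value is a count of 100-nanosecond intervals since January 1, 1601 (UTC), so a minimal sketch looks like the following; filetime_to_utc is just an illustrative name, and the results are in UTC rather than the local time that w32tm prints at the end of each line.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond ticks from this moment.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_utc(filetime):
    """Convert a FILETIME tick count to a timezone-aware UTC datetime."""
    # Ten ticks per microsecond; sub-microsecond precision is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)

created = filetime_to_utc(128357420888794190)  # CreateTime from the report
updated = filetime_to_utc(128357422068608450)  # UpdateTime from the report

print(created)
print(updated)
print((updated - created).total_seconds(), "seconds to replicate")
```

Subtracting the two values gives the replication time directly, which saves comparing the two w32tm lines by eye.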

    clip_image006

    Based on the above steps, let’s say we’re seeing a significant backlog and slower than expected replication of files. Let’s break down the most common causes as seen by MS Support:

    1. Missing Windows Server 2003 Network QFE Hotfixes or Service Pack 2

    Over the course of its lifetime there have been a few hotfixes for Windows Server 2003 that resolved intermittent issues with network connectivity. Those issues generally affected RPC, and DFSR (which relies heavily on RPC) became a casualty. To close these loops you can install KB938751 and KB922972 if you are on Service Pack 1 or 2. I highly recommend (in fact, I pretty much demand!) that you also install KB950224 to prevent a variety of DFSR issues - in fact, this hotfix should be on every Win2003 computer in your company.

    2. Missing DFSR Service’s latest binary

    The most recent version of DFSR.EXE always contains updates that not only fix bugs but also generally improve replication performance. We now have a KB article that we are keeping up to date with the latest files we recommend running for DFSR:

    KB 958802 - List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2003 R2
    KB 968429 - List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2008 and in Windows Server 2008 R2

    3. Out-of-date Network Card and Storage drivers

    You would never run Windows Server 2003 with no Service Packs and no security updates, right? So why run it without updated NIC and storage drivers? A large number of performance issues can be resolved by making sure that you keep your drivers current. Trust me when I say that vendors don’t release new binaries at heavy cost to themselves unless there’s a reason for them. Check your vendor web pages at least once a quarter and test test test.

    Important note: If you are in the middle of an initial sync, you should not be rebooting your server! All of the above fixes will require reboots. Wait it out, or assume the risk that you may need to run through initial sync again.

    4. DFSR Staging directory is too small for the amount of data being modified

    DFSR lives and dies by its inbound/outbound Staging directory (stored under <your replicated folder>\dfsrprivate\staging in R2). By default, it has a 4GB elastic quota set that controls the size of files stored there for further replication. Why elastic? Because experience with FRS showed us having a hard-limit quota that prevented replication was A Bad Idea™.

    Why is this quota so important? Because if Staging is below quota - 90% by default - it will replicate at the maximum rate of 9 files (5 outbound, 4 inbound) for the entire server. If the staging quota of a replicated folder is exceeded then, depending on the number of files currently being replicated for that replicated folder, DFSR may end up slowing replication for the entire server until staging usage for that replicated folder drops below the low water mark, which is computed by multiplying the staging quota by the low water mark percentage (default is 60%).

    If the staging quota of a replicated folder is exceeded and the current number of inbound replicated files in progress for that replicated folder exceeds 3 (15 in Win2008) then one task is used by staging cleanup and the three (15 in Win2008) remaining tasks are waiting for staging cleanup to complete. Since there is a maximum of four (15 in Win2008) concurrent tasks, no further inbound replication can take place for the entire system.

    If the staging quota of a replicated folder is exceeded and the current number of outbound replicated files in progress for that replicated folder exceeds 5 (16 in Win2008) then the RPC server cannot serve any more RPC requests: the maximum number of RPC requests processed at the same time is five (16 in Win2008), and all five (16 in Win2008) requests are waiting for staging cleanup to complete.

    You will see DFS Replication 4202, 4204, 4206 and 4208 events about this activity, and if this happens often (multiple times per day) your quota is too small. See the section Optimize the staging folder quota and replication throughput in the Designing Distributed File Systems guidelines for tuning this correctly. You can change the quota using the DFSR Management MMC (dfsmgmt.msc). Select Replication in the left pane, then the Memberships tab in the right pane. Double-click a replicated folder and select the Advanced tab to view or change the Quota (in megabytes) setting. Your event will look like:

    Event Type: Warning
    Event Source: DFSR
    Event Category: None
    Event ID: 4202
    Date: 10/1/2007
    Time: 10:51:59 PM
    User: N/A
    Computer: 2003SRV17
    Description:
    The DFS Replication service has detected that the staging space in use for the
    replicated folder at local path D:\Data\General is above the high watermark. The
    service will attempt to delete the oldest staging files. Performance may be
    affected.

    Additional Information:
    Staging Folder:
    D:\Data\General\DfsrPrivate\Staging\ContentSet{9430D589-0BE2-400C-B39B-D0F2B6CC972E}
    -{A84AAD19-3BE2-4932-B438-D770B54B8216}
    Configured Size: 4096 MB
    Space in Use: 3691 MB
    High Watermark: 90%

    Low Watermark: 60%

    Replicated Folder Name: general
    Replicated Folder ID: 9430D589-0BE2-400C-B39B-D0F2B6CC972E
    Replication Group Name: General
    Replication Group ID: 0FC153F9-CC91-47D0-94AD-65AA0FB6AB3D
    Member ID: A84AAD19-3BE2-4932-B438-D770B54B8216
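To make the numbers in that event concrete, here is the watermark arithmetic as a small sketch. The function name staging_thresholds is only for illustration; the 90% and 60% defaults and the 4096 MB quota come from the text and the sample event above.

```python
def staging_thresholds(quota_mb, high_pct=90, low_pct=60):
    """Return (high_watermark_mb, low_watermark_mb) for a staging quota."""
    return quota_mb * high_pct / 100, quota_mb * low_pct / 100

high_mb, low_mb = staging_thresholds(4096)  # 'Configured Size' in the event
in_use_mb = 3691                            # 'Space in Use' in the event

print("cleanup starts above", high_mb, "MB")
print("cleanup stops below", low_mb, "MB")
print("over the high watermark:", in_use_mb > high_mb)
```

With the default 4 GB quota the high watermark works out to about 3686 MB, so the 3691 MB in use is just over the line, which is why the 4202 warning was logged.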

    5. Bandwidth Throttling or Schedule windows are too aggressive

    If your replication schedule on the Replication Group or the Connections is set to not replicate from 9-5, you can bet replication will appear slow! If you’ve artificially throttled the bandwidth to 16Kbps on a T3 line things will get pokey. You would be surprised at the number of cases we’ve gotten here where one administrator called about slow replication and it turned out that one of his colleagues had made this change and not told him. You can view and adjust these in DFSMGMT.MSC.

    clip_image008

    You can also use the Dfsradmin.exe tool to export the schedule to a text file from the command-line. Like Dfsrdiag.exe, Dfsradmin is installed when you install DFSR on a server.

    Dfsradmin rg export sched /rgname:testrg /file:rgschedule.txt

    You can also export the connection-specific schedules:

    Dfsradmin conn export sched /rgname:testrg /sendmem:fabrikam\2003srv16 /recvmem:fabrikam\2003srv17 /file:connschedule.txt

    The output is concise but can be un-intuitive. Each row represents a day of the week. Each column represents an hour in the day. A hex value (0-F) represents the bandwidth usage for each 15 min. interval in an hour. F =Full, E=256M, D=128M, C=64M, B=32M, A=16M, 9=8M, 8=4M, 7=2M, 6=1M, 5=512K, 4=256K, 3=128K, 2=64K, 1=16K, 0=No replication. The values are either in megabits per second (M) or kilobits per second (K).
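Since the hex values are not exactly self-explanatory, a lookup table helps. The sketch below decodes the four hex digits that cover one hour into four 15-minute labels, using the mapping listed above. It deliberately makes no assumption about how the rest of the exported file is laid out (separators, headers), so splitting a row into hours is left to you; decode_hour is a hypothetical name.

```python
# Hex digit to bandwidth label, exactly as listed in the text above.
BANDWIDTH = {
    "F": "Full", "E": "256M", "D": "128M", "C": "64M", "B": "32M", "A": "16M",
    "9": "8M", "8": "4M", "7": "2M", "6": "1M", "5": "512K", "4": "256K",
    "3": "128K", "2": "64K", "1": "16K", "0": "No replication",
}

def decode_hour(hex_digits):
    """Decode the four hex digits for one hour into four 15-minute labels."""
    if len(hex_digits) != 4:
        raise ValueError("expected one hex digit per 15-minute interval")
    return [BANDWIDTH[d.upper()] for d in hex_digits]

# Example: full speed for half an hour, then 256 Kbps, then nothing.
print(decode_hour("FF40"))
```

An hour of all zeros in the middle of the business day is usually the first thing to look for when someone reports that replication "stops" at certain times.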

    And a bit more about throttling - DFS Replication does not perform bandwidth sensing. You can configure DFS Replication to use a limited amount of bandwidth on a per-connection basis, and DFS Replication can saturate the link for short periods of time. Also, the bandwidth throttling is not perfectly accurate, though it may be “close enough.” This is because we are trying to throttle bandwidth by throttling our RPC calls. Since DFSR is as high as you can get in the network stack, we are at the mercy of various buffers in lower levels of the stack, including RPC. The net result is that if one analyzes the raw network traffic, it will tend to be extremely ‘bursty’.

    6. Large amounts of sharing violations

    Sharing violations are a fact of life in a distributed network - users open files and gain exclusive WRITE locks in order to modify their data. Periodically those changes are written within NTFS by the application and the USN Change Journal is updated. DFSR monitors that journal and will attempt to replicate the file, only to find that it cannot because the file is still open. This is a good thing – we wouldn’t want to replicate a file that’s still being modified, naturally.

    With enough sharing violations though, DFSR can start spending more time retrying locked files than it does replicating unlocked ones, to the detriment of performance. If you see a considerable amount of DFS Replication event log entries for 4302 and 4304 like below, you may want to start examining how files are being used.

    Event ID: 4302 Source DFSR Type Warning
    Description
    The DFS Replication service has been repeatedly prevented from replicating a file due to consistent sharing violations encountered on the file. A local sharing violation occurs when the service fails to receive an updated file because the local file is currently in use.

    Additional Information:
    File Path: <drive letter path to folder\subfolder>
    Replicated Folder Root: <drive letter path to folder>
    File ID: {<guid>}-v<version>
    Replicated Folder Name: <folder>
    Replicated Folder ID: <guid2>
    Replication Group Name: <dfs path to folder>
    Replication Group ID: <guid3>
    Member ID: <guid4>

    Many applications can create a large number of spurious sharing violations, because they create temporary files that shouldn’t be replicated. If they have a predictable extension, you can prevent DFSR from trying to replicate them by setting an exception in DFSMGMT.MSC. The default file filter excludes files matching ~*, *.bak, and *.tmp, so for example the Microsoft Office temporary files (~*) are excluded by default.

    clip_image010

    Some applications will allow you to specify an alternate location for temporary and working files, or will simply follow the working path as specified in their shortcuts. But sometimes, this type of behavior may be unavoidable, and you will be forced to live with it or stop storing that type of data in a DFSR-replicated location. This is why our recommendation is that DFSR be used to store primarily static data, and not highly dynamic files like Roaming Profiles, Redirected Folders, Home Directories, and the like. This also helps with conflict resolution scenarios where the same or multiple users update files on two servers in between replication, and one set of changes is lost.

    7. RDC has been disabled over a WAN link.

    Remote Differential Compression is DFSR’s coolest feature – instead of replicating an entire file like FRS did, it replicates only the changed portions. This means your 20MB spreadsheet that had one row modified might only replicate a few KB over the wire. If you disable RDC though, changing any portion of a file’s data will cause the entire file to replicate, and if the connection is bandwidth-constrained this can lead to much slower performance. You can set this in DFSMGMT.MSC.

    As a side note, in an extremely high bandwidth (Gigabit+) scenario where files are changed significantly, it may actually be faster to turn RDC off. Computing RDC signatures and staging that data is computationally expensive, and the CPU time needed to calculate everything may actually be slower than just moving the whole file in that scenario. You really need to test in your environment to see what works for you, using the PerfMon objects and counters included for DFSR.

    8. Incompatible Anti-Virus software or other file system filter drivers

    It’s a problem that goes back to FRS and Windows 2000 in 1999 – some anti-virus applications were simply not written with the concept of file replication in mind. If an AV product uses its own alternate data streams to store ‘this file is scanned and safe’ information, for example, it can cause that file to replicate out even though to an end-user it is completely unchanged. AV software may also quarantine or reanimate files so that older versions reappear and replicate out. Older open-file Backup solutions that don’t use VSS-compliant methods also have filter drivers that can cause this. When you have a few hundred thousand files doing this, replication can definitely slow down!

    You can use Auditing to see if the originating change is coming from the SYSTEM account and not an end user. Be careful here – auditing can be expensive for performance. Also make sure that you are looking at the original change, not the downstream replication change result (which will always come from SYSTEM, since that’s the account running the DFSR service).

    There are only a couple things you can do about this if you find that your AV/Backup software filter drivers are at fault:

    • Don’t scan your Replicated Folders (not a recommended option except for troubleshooting your slow performance).
    • Take a hard line with your vendor about getting this fixed for that particular version. They have often done so in the past, but issues can creep back in over time and newer versions.

    9. File Server Resource Manager (FSRM) configured with quotas/screens that block replication.

    So insidious! FSRM is another component that shipped with R2 that can be used to block file types from being copied to a server, or limit the quantity of files. It has no real tie-in to DFSR though, so it’s possible to configure DFSR to replicate all files and FSRM to prevent certain files from being replicated in. Since DFSR keeps retrying, this can lead to backlogs where too much time is spent retrying files that can never move, which in turn slows down the files that could.

    When this is happening, debug logs (%systemroot%\debug\dfsr*.*) will show entries like:

    20070605 09:33:36.440 5456 MEET 1243 <Meet::Install> -> WAIT Error processing update. updateName:teenagersfrommars.mp3 uid:{3806F08C-5D57-41E9-85FF-99924DD0438F}-v333459
    gvsn:{3806F08C-5D57-41E9-85FF-99924DD0438F}-v333459
    connId:{6040D1AC-184D-49DF-8464-35F43218DB78} csName:Users
    csId:{C86E5BCE-7EBF-4F89-8D1D-387EDAE33002} code:5 Error:
    + [Error:5(0x5) <Meet::InstallRename> meet.cpp:2244 5456 W66 Access is denied.]

    Here we can see that teenagersfrommars.mp3 is supposed to be replicated in, but it failed with an Access Denied. If we run the following from CMD on that server:

    filescrn.exe screen list

    We see that…

    File screens on machine 2003SRV17:

    File Screen Path: C:\sharedrf
    Source Template: Block Audio and Video Files (Matches template)
    File Groups: Audio and Video Files (Block)
    Notifications: E-mail, Event Log

    … someone has configured FSRM using the default Audio/Video template which blocks MP3 files and it happens to be against our c:\sharedrf folder we are replicating. To fix this we can do one or more of the following:

    • Make the DFSR filters match the FSRM filters
    • Delete any files that cannot be replicated due to the FSRM rules.
    • Prevent FSRM from actually blocking by switching it from “Active Screening” to “Passive Screening” in its snap-in. This will generate events and email warnings to the administrator, but not prevent the files from being moved in.
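    To find out which files are stuck this way, the debug log entries can be scanned for updates that failed with error 5. This is a rough sketch that assumes the entry layout shown in the sample above (an updateName: field followed later by code:N); other DFSR builds may log differently, and file names containing spaces would need a more careful pattern.

```python
import re

# One "Error processing update" entry: the file name, then (possibly lines later)
# the numeric error code. The layout is assumed from the sample entry above.
ENTRY_RE = re.compile(r"updateName:(?P<name>\S+).*?code:(?P<code>\d+)", re.DOTALL)

ACCESS_DENIED = 5  # Win32 ERROR_ACCESS_DENIED


def blocked_updates(log_text, code=ACCESS_DENIED):
    """Return file names whose inbound update failed with the given error code."""
    return [match.group("name") for match in ENTRY_RE.finditer(log_text)
            if int(match.group("code")) == code]


SAMPLE = """\
20070605 09:33:36.440 5456 MEET 1243 <Meet::Install> -> WAIT Error processing update. updateName:teenagersfrommars.mp3 uid:{3806F08C-5D57-41E9-85FF-99924DD0438F}-v333459
gvsn:{3806F08C-5D57-41E9-85FF-99924DD0438F}-v333459
connId:{6040D1AC-184D-49DF-8464-35F43218DB78} csName:Users
csId:{C86E5BCE-7EBF-4F89-8D1D-387EDAE33002} code:5 Error:
+ [Error:5(0x5) <Meet::InstallRename> meet.cpp:2244 5456 W66 Access is denied.]
"""

if __name__ == "__main__":
    print(blocked_updates(SAMPLE))
```

    Comparing the extensions in that list against the output of filescrn.exe screen list shows which screens the DFSR filters need to match.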

    10. Un-staged or improperly pre-staged data leading to slow initial replication.

    Wake up, this is the last one!

    Sometimes replication is only slow in the initial sync phase. This can have a number of causes:

    • Users are modifying files while initial replication is going on – ideally, you should set up your replication over a change control window like a weekend or overnight.
    • You don’t have the latest DFSR.EXE from #2 above.
    • You have not pre-staged data, or you’ve done it in a way that actually alters the files, forcing most of or the entire file to replicate initially.

    Here are the recommendations for pre-staging data that will give you the best bang for your buck, so that initial sync flies by and replication can start doing its real day-to-day job:

    (Make sure you have the latest DFSR.EXE installed on all nodes before starting!)

    • ROBOCOPY.EXE - works fine as long as you follow the rules in this blog post.
    • XCOPY.EXE - Xcopy with the /X switch will copy the ACL correctly and not modify the files in any way.
    • Windows Backup (NTBACKUP) - The Windows Backup tool by default will restore the ACLs correctly (unless you uncheck the Advanced Restore Option for Restore security setting, which is checked by default) and not modify the files in any way. [Ned - if using NTBACKUP, please examine guidance here]

    I prefer NTBACKUP because it also compresses the data and is less synchronous than XCOPY or ROBOCOPY [Ned - see above]. Some people ask ‘why should I pre-stage, shouldn’t DFSR just take care of all this for me?’. The answer is yes and no: DFSR can handle this, but when you add in all the overhead of effectively every file being ‘modified’ in the database (they are new files as far as DFSR is concerned), a huge volume of data may lead to slow initial replication times. If you take all the heavy lifting out and let DFSR just maintain, things may go far faster for you.

    As always, we welcome your comments and questions,

    - Ned Pyle

    Not enough storage is available to complete this operation

    Wait, don't leave!!! 

    We get a lot of hits thanks to Live and Google searches sending people our way when they search on this error. We highly recommend you follow this link to find all the various causes of this error (it uses the site:microsoft.com option to narrow your results and the exact error verbiage) :

    Return all exact Microsoft site articles on this error via Live Search

    Ok, back to our blog post:

    =================================================== 

    Last week I had a customer that was witnessing the following error each time he attempted to manage his Active Directory environment - including attempting to simply add a domain account to his local client machine’s Administrators group.

    "Not enough storage is available to complete this operation."

     He had confirmed that he was receiving this same pop-up error message on each and every one of his 2003 SP1 Domain Controllers.

    I began troubleshooting this issue as an unresponsive/out-of-resources type of issue in reference to the 'not enough storage' part of the error message.  In doing so, I requested and retrieved both 'NetStat -anb' (http://support.microsoft.com/default.aspx?scid=kb;EN-US;137984) and Server Performance Advisor (SPA) output.

    Note:  If you're not familiar with the SPA tool, the following blog site for 'A Day at the SPA' is helpful - http://blogs.technet.com/ad/  

    Upon reviewing the information gathered by these tools, it became apparent that this was not necessarily an unresponsive/out-of-resources type of issue as first suspected.  Specifically, the SPA data and Netstat output showed no resource shortage or bottleneck at the time the issue occurred.

    However, the only reference to the error that a search of the KnowledgeBase returned was an issue with MaxTokenSize - 935744 (http://support.microsoft.com/default.aspx?scid=kb;EN-US;935744).  So that wasn’t quite it either.

    I then located article 913003 - "The Offer Remote Assistance HelpersCNF group is created on domain controllers that have the SMS 2003 Advanced Client installed and that have the Remote Control Agent service enabled" (http://support.microsoft.com/default.aspx?scid=kb;EN-US;913003). 

    After reviewing this I then confirmed with my customer that they had just recently upgraded their version of Microsoft Systems Management Server (SMS) 2003 Advanced Client and had the Remote Control Agent service enabled (as the article details).  Accessing the Users container within the Active Directory Users and Computers snap-in, we had several (322 actually) 'Offer Remote Assistance HelpersCNF' domain groups.

    Upon following through with the article in disabling both the 'Solicited Remote Assistance' and 'Offer Remote Assistance' settings within the Default Domain Controllers Group Policy, we deleted all of the CNF objects.  Once deleted and the deletion replicated around to the other Domain Controllers, we were no longer witnessing the - "Not enough storage is available to complete this operation" error.

    Problem solved: our customer could now add a domain account to his local client machine’s Administrators group.

    Where’s my file? Root cause analysis of FRS and DFSR data deletion

    Hi, Ned here. In the Directory Services support space here at Microsoft, we are often contacted by customers for disaster recovery scenarios. We’re also brought in for deeper forensic analysis of what led to a problem. Today we’re going to talk about a situation that covers both:

    • A customer has seen some critical data go missing.
    • That data was replicated via the File Replication Service (FRS) or the Distributed File System Replication (DFSR) Service.
    • Before they restore the data with their backup copy, they want to have root cause on who deleted what and where it started. We can’t do this after restoring data because our whole audit trail will of course be destroyed within the respective JET databases.

    FRS Deletion Forensics – The Where and When

    You need to start by determining the name of some folder or file that has been deleted. It's important that this be exact as we will be using it to search. You will need the full original path since it is possible just the name could be duplicated throughout the content set.

    • For this example we have three servers called 2003SRV13, 2003SRV16, and 2003SRV17.
    • We have a folder called c:\frstestlink\importantfolder13 that has been deleted.
    • It contained a file called c:\frstestlink\importantfolder13\importantfile13.doc which was deleted (naturally). Our folder could contain thousands of files but we just need to know one. That’s easy, someone is screaming at you that it’s missing. :)

    Install FRSDIAG on any server that participated in the FRS content set where data was deleted.

    Open a CMD prompt and navigate to the FRSDIAG directory. This will default to:

    c:\program files\windows resource kit\tools\frsdiag

    You’ll see that we have a very useful utility called NTFRSUTL.EXE. Running it with /? will show you its options:

    ntfrsutl [idtable| configtable | inlog | outlog] [computer] = enumerate the service's idtable/configtable/inlog/outlog
    computer = talk to the NtFrs service on this machine.

    ntfrsutl [memory|threads|stage] [computer]
    = list the service's memory usage
    computer = talk to the NtFrs service on this machine.

    ntfrsutl ds [computer]
    = list the service's view of the DS
    computer = talk to the NtFrs service on this machine.

    ntfrsutl sets [computer]
    = list the active replica sets
    computer = talk to the NtFrs service on this machine.

    ntfrsutl version [computer]
    = list the api and service versions
    computer = talk to the NtFrs service on this machine.

    ntfrsutl forcerepl [computer] /r SetName /p DnsName
    = Force FRS to start a replication cycle ignoring the schedule.
    = Specify the SetName and DnsName.
    computer = talk to the NtFrs service on this machine.
    SetName = Name of the replica set.
    DnsName = DNS name of the inbound partner to force repl from.

    ntfrsutl poll [/quickly[=[N]]] [/slowly[=[N]]] [/now] [computer]
    = list the current polling intervals.
    now = Poll now.
    quickly = Poll quickly until stable configuration retrieved.
    quickly= = Poll quickly every default minutes.
    quickly=N = Poll quickly every N minutes.
    slowly = Poll slowly until stable configuration retrieved.
    slowly= = Poll slowly every default minutes.
    slowly=N = Poll slowly every N minutes.
    computer = talk to the NtFrs service on this machine.

    Cool stuff – you can force replication, list out various FRS configuration info, or get performance stats. It’s great for scripting.

    In order to figure out what happened, we will dump out some tables from the FRS JET database by executing:

    NTFRSUTL OUTLOG > outlog.txt
    NTFRSUTL IDTABLE > idtable.txt

    (Note: the output from NTFRSUTL IDTABLE is not the same as collecting IDTABLE information with FRSDIAG’s GUI console).

    We then start FRSDIAG and click the 'Browse' button. Drop down the Replica Set and select the one that contained deleted data. Click 'Add All' to add the members. Then click ok.

    Click 'Tools' then select 'Build GUID2Name for Target Server(s)'.

    This creates a text file that lists each GUID and its associated SERVER NAME, like so:

    ======================================================
    Replica Set GUID : 6f83352f-f404-4eda-a714ae1691e3e9d8
    Replica Set Name : FRSTEST|FRSTESTLINK
    ======================================================

    GUID                                MEMBER NAME                            SERVER NAME
    ----                                -----------                            -----------
    30409f5d-8493-41ad-a98ab03fc1b795e5 {6AEC89F5-24B5-4C1B-B15E-6EFE5A60B75C} 2003srv16
    e8219dee-532a-4dff-83f09f036e331daa {6F617C11-2997-4134-952B-5B3572D4AF70} 2003srv17
    e8feaedc-6bce-41f4-94c39cade3932da8 {88600E1E-AAF8-4C70-A1C3-A36EB471E2B3} 2003srv13
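    If you have more than a handful of members, it is easier to load this table into a lookup than to eyeball it. The following is a minimal sketch that assumes the three-column layout shown above (FRS-style GUID, member name in braces, server name); the function name is my own and not part of FRSDIAG.

```python
import re

# One data row: FRS-style GUID (8-4-4-16 hex), {member GUID}, server name.
ROW_RE = re.compile(
    r"^\s*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{16})\s+\{[^}]+\}\s+(\S+)\s*$",
    re.IGNORECASE,
)


def parse_guid2name(text):
    """Return {originator GUID (lowercase): server name} from GUID2Name output."""
    mapping = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:  # header, separator and blank lines simply do not match
            mapping[match.group(1).lower()] = match.group(2)
    return mapping


SAMPLE = """\
GUID                                MEMBER NAME                            SERVER NAME
----                                -----------                            -----------
30409f5d-8493-41ad-a98ab03fc1b795e5 {6AEC89F5-24B5-4C1B-B15E-6EFE5A60B75C} 2003srv16
e8219dee-532a-4dff-83f09f036e331daa {6F617C11-2997-4134-952B-5B3572D4AF70} 2003srv17
e8feaedc-6bce-41f4-94c39cade3932da8 {88600E1E-AAF8-4C70-A1C3-A36EB471E2B3} 2003srv13
"""

if __name__ == "__main__":
    for guid, server in parse_guid2name(SAMPLE).items():
        print(guid, "=", server)
```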

    We open the IDTABLE.TXT file and search for our file we know was deleted:

    Table Type: ID Table for FRSTEST|FRSTESTLINK (1)
    FileGuid                     : 3647d318-502f-11dc-a1070003ff6813c5
    FileID                       : 00130000 00003e45
    ParentGuid                   : 2d7a8327-7308-4464-a63e367e39c27690 << Folder it was in
    ParentFileID                 : 000d0000 00003e37
    VersionNumber                : 00000001
    EventTime                    : Wed Aug 22, 2007 11:57:26 << when deleted
    OriginatorGuid               : 30409f5d-8493-41ad-a98ab03fc1b795e5 << where deleted
    OriginatorVSN                : 01c7e4c3 ea6dd496
    CurrentFileUsn               : 00000000 001be408
    FileCreateTime               :
    FileWriteTime                :
    FileSize                     : 00000000 000000a0
    FileObjID                    : 3647d318-502f-11dc-a1070003ff6813c5
    FileName                     : importantfile13.doc << here's our file
    FileIsDir                    : 00000000
    FileAttributes               : 00000020 Flags [ARCHIVE ]
    Flags                        : 00000001 Flags [DELETED ] << Proof of deletion
    ReplEnabled                  : 00000001
    TombStoneGC                  : Sun Oct 21, 2007 11:57:26
    OutLogSeqNum                 : 00000000 00000000
    Spare1Ull                    : 00000000 00000001
    MD5CheckSum                  : MD5: a41eea20 979f04e9 dff7592a e8dc3e8b
    RetryCount                   : 0
    FirstTryTime                 :

    We then look in the OUTLOG.TXT to confirm the folder matches up using ParentGuid above:

    Table Type: Outbound Log Table for FRSTEST|FRSTESTLINK (1)
    SequenceNumber               : 0000008d
    Flags                        : 01000024 Flags [Content LclCo CmpresStage ]
    IFlags                       : 00000001 Flags [IFlagVVRetireExec ]
    State                        : 00000014  CO STATE:  IBCO_OUTBOUND_REQUEST
    ContentCmd                   : 00002000 Flags [RenNew ]
    Lcmd                         : 0000000f  D/F 1   NoCmd
    FileAttributes               : 00000010 Flags [DIRECTORY ]
    FileVersionNumber            : 00000001
    PartnerAckSeqNumber          : 00000000
    FileSize                     : 00000000 00000000
    FileOffset                   : 00000000 00000000
    FrsVsn                       : 01c7e44b df699803
    FileUsn                      : 00000000 001bd690
    JrnlUsn                      : 00000000 001bd690
    JrnlFirstUsn                 : 00000000 001bd690
    OriginalReplica              : 1  [???]
    NewReplica                   : 1  [???]
    ChangeOrderGuid              : cf3bc76b-b3e1-4e72-ae8962371bb48501
    OriginatorGuid               : e8feaedc-6bce-41f4-94c39cade3932da8
    FileGuid                     : 2d7a8327-7308-4464-a63e367e39c27690 << there’s our GUID
    OldParentGuid                : 53605485-4dd4-4b9a-bc5a022760515559
    NewParentGuid                : 53605485-4dd4-4b9a-bc5a022760515559
    CxtionGuid                   : 92f5d906-cd45-4639-973021461e454c8b
    Spare1Ull                    :
    MD5CheckSum                  : MD5: b68d5ccf 21f8b5dd e7eb48f1 f45b01d9
    RetryCount                   : 0
    FirstTryTime                 : Wed Aug 22, 2007 11:55:21
    EventTime                    : Wed Aug 22, 2007 11:55:18
    FileNameLength               :       34
    FileName                     : ImportantFolder13 << definitely our folder that was deleted
    Cxtion Name                  : <Jrnl Cxtion> <- <Jrnl Cxtion>\<Jrnl Cxtion>
    Cxtion State                 : Joined
     

    Then we can verify back in the IDTABLE.TXT what the time and source were:

    Table Type: ID Table for FRSTEST|FRSTESTLINK (1)
    FileGuid                     : 2d7a8327-7308-4464-a63e367e39c27690
    FileID                       : 000d0000 00003e37
    ParentGuid                   : 53605485-4dd4-4b9a-bc5a022760515559
    ParentFileID                 : 00030000 000039bc
    VersionNumber                : 00000002
    EventTime                    : Wed Aug 22, 2007 11:57:23 << there's the delete time
    OriginatorGuid               : 30409f5d-8493-41ad-a98ab03fc1b795e5 << here's the source of the delete
    OriginatorVSN                : 01c7e4c3 ea6dd495
    CurrentFileUsn               : 00000000 001beba8
    FileCreateTime               :
    FileWriteTime                :
    FileSize                     : 00000000 00000000
    FileObjID                    : 2d7a8327-7308-4464-a63e367e39c27690
    FileName                     : Dc9
    FileIsDir                    : 00000001
    FileAttributes               : 00000010 Flags [DIRECTORY ]
    Flags                        : 00000001 Flags [DELETED ] << confirmed that it's been deleted
    ReplEnabled                  : 00000001
    TombStoneGC                  : Sun Oct 21, 2007 11:57:23
    OutLogSeqNum                 : 00000000 00000000
    Spare1Ull                    : 00000000 00000000
    MD5CheckSum                  : MD5: b68d5ccf 21f8b5dd e7eb48f1 f45b01d9
    RetryCount                   : 0
    FirstTryTime                 :

    ===

    Looking back at the GUID2Name table we generated earlier, we can see that:

    30409f5d-8493-41ad-a98ab03fc1b795e5 = 2003SRV16.

    So we know that at Wed Aug 22, 2007 11:57:23 AM on server 2003SRV16, something or someone deleted all this data. Wasn’t that easy? :) If we had object access auditing enabled on that server at the time and the folder configured for auditing, we can even see who did it. More on this later…
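    The manual search through IDTABLE.TXT can be sketched in a few lines of script. This assumes the "Field : value" record layout shown above, with each record starting at a "Table Type" line; the "<<" notes in my listings are annotations, not tool output, so the sample below leaves them out. The function names and the trimmed sample are illustrative only.

```python
def parse_records(text):
    """Split an NTFRSUTL table dump into a list of {field: value} dicts."""
    records, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and separators carry no field
        key, value = key.strip(), value.strip()
        if key == "Table Type":  # each record starts with this line
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def find_deletions(records, file_name, guid_to_server):
    """Return (event time, originating server) for deleted records named file_name."""
    hits = []
    for rec in records:
        same_name = rec.get("FileName", "").lower() == file_name.lower()
        if same_name and "DELETED" in rec.get("Flags", ""):
            guid = rec.get("OriginatorGuid", "").lower()
            # Fall back to the raw GUID when the server is not in the lookup.
            hits.append((rec.get("EventTime"), guid_to_server.get(guid, guid)))
    return hits


# Trimmed copy of the file record shown earlier.
SAMPLE = """\
Table Type: ID Table for FRSTEST|FRSTESTLINK (1)
FileGuid                     : 3647d318-502f-11dc-a1070003ff6813c5
EventTime                    : Wed Aug 22, 2007 11:57:26
OriginatorGuid               : 30409f5d-8493-41ad-a98ab03fc1b795e5
FileName                     : importantfile13.doc
FileAttributes               : 00000020 Flags [ARCHIVE ]
Flags                        : 00000001 Flags [DELETED ]
"""

SERVERS = {"30409f5d-8493-41ad-a98ab03fc1b795e5": "2003srv16"}

if __name__ == "__main__":
    print(find_deletions(parse_records(SAMPLE), "importantfile13.doc", SERVERS))
```

    Remember that the file name alone may be duplicated across the content set, so check ParentGuid as well before pointing fingers.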

    DFSR Deletion Forensics – The Where and When

    Since DFSR exposes nearly all its interfaces through WMI, we can use a powerful command-line utility called WMIC that can be used to return useful info from the databases. This way we don’t need to rely on add-on tools and debug logs and such. For my example below I am intentionally not using VBScript as I want everyone to understand exactly what it is we’re doing – but feel free to script it up, all the WMI classes are well documented on MSDN.

    So here we go again: 

    • We have our three servers 2003SRV13, 2003SRV16, and 2003SRV17.
    • All are in a Replication Group called ImportantData
    • They have a Replicated Folder called… wait for it… ReplicatedFolder.
    • That folder contains various files and folders, including a folder called ImportantSubFolder. It contains some files, including one called critical.doc. Naturally, someone has deleted critical.doc… let’s figure out where and when.

    First we open a CMD prompt as an admin and dump the Replicated Folder info like so:

    C:\>wmic /namespace:\\root\microsoftdfs path DfsrReplicatedFolderInfo get ReplicatedFolderGuid,ReplicatedFolderName,ReplicationGroupName > rfinfo.txt

    This returns the following into our rfinfo.txt file:

    ReplicatedFolderGuid                  ReplicatedFolderName  ReplicationGroupName
    8722EF11-6466-4472-888F-11B8A57B68A4  replicatedfolder      importantdata

    Now we have enough info to confirm we're looking at the right data. Let's get the status of the deleted file by running this command and providing it the file name and the ReplicatedFolderGuid from above:

    C:\>wmic /namespace:\\root\microsoftdfs path DfsrIdRecordInfo WHERE (filename='critical.doc' and replicatedfolderguid='8722EF11-6466-4472-888F-11B8A57B68A4') get filename,flags,updatetime,GVsn > file.txt

    Our output file.txt contains:

    FileName Flags GVsn UpdateTime

    critical.doc 4 {BAA4E6D9-BF1A-4C83-ADF4-FDFD481AE2FC}-v113867 20070823233010.774625-000

    We can see that it was deleted on Aug 23, 2007 at 23:30:10 (11:30 PM) GMT. For us this means 7:30 PM EDT.
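    The UpdateTime string is in the WMI datetime format: yyyymmddHHMMSS.ffffff followed by a sign and a three-digit UTC offset in minutes. If you would rather not decode it by eye, here is a small sketch of the conversion. The fixed UTC-4 offset used for EDT is an assumption for this one date; a real script should use proper time zone data.

```python
from datetime import datetime, timedelta, timezone


def parse_cim_datetime(value):
    """Parse a WMI datetime string such as 20070823233010.774625-000."""
    stamp, sign, offset = value[:21], value[21], value[22:25]
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    minutes = int(offset) * (-1 if sign == "-" else 1)
    return naive.replace(tzinfo=timezone(timedelta(minutes=minutes)))


# Assumed fixed offset for Eastern Daylight Time (UTC-4).
EDT = timezone(timedelta(hours=-4), "EDT")

if __name__ == "__main__":
    deleted_at = parse_cim_datetime("20070823233010.774625-000")
    print(deleted_at.isoformat())
    print(deleted_at.astimezone(EDT).strftime("%Y-%m-%d %H:%M %Z"))
```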

    The Flags value of 4 tells us it's been deleted; examine the chart below. An ordinary replicated file will have a Flags value of 5 (0x1 | 0x4, i.e. PRESENT_FLAG plus UID_VISIBLE_FLAG). A value of 4 means the Present flag has been cleared, so the file is tombstoned, i.e. removed from the replica.

    Value                          Meaning
    -----                          -------
    PRESENT_FLAG (0x1)             The resource is not a tombstone; it is available on the computer.
    NAME_CONFLICT_FLAG (0x2)       The tombstone was generated because of a name conflict. This flag is meaningful only for tombstones.
    UID_VISIBLE_FLAG (0x4)         The ID record has already been sent out to other partners; therefore, other partners are aware of this resource.
    JOURNAL_WRAP_FLAG (0x10)       The volume has had a journal wrap and the resource has not been checked to determine if there is any change by the journal wrap recovery process.
    PENDING_TOMBSTONE_FLAG (0x20)  The ID record is in the process of being tombstoned (or deleted).
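    Since Flags is a bitmask, decoding it is a matter of testing each bit. A minimal sketch using only the values from the chart above (the function names are mine):

```python
# Bit values and names from the chart above.
FLAGS = {
    0x1: "PRESENT_FLAG",
    0x2: "NAME_CONFLICT_FLAG",
    0x4: "UID_VISIBLE_FLAG",
    0x10: "JOURNAL_WRAP_FLAG",
    0x20: "PENDING_TOMBSTONE_FLAG",
}


def decode_flags(value):
    """Return the names of all flag bits set in a DfsrIdRecordInfo Flags value."""
    return [name for bit, name in FLAGS.items() if value & bit]


def is_tombstone(value):
    """A record is a tombstone when PRESENT_FLAG (0x1) is not set."""
    return not value & 0x1


if __name__ == "__main__":
    for flags in (5, 4):
        print(flags, decode_flags(flags), "tombstone" if is_tombstone(flags) else "present")
```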

    The GVsn value is important in that the GUID inside those curly brackets will always contain the unique database GUID of the server where the file was last changed. Since deleting the file counts as a change, now we just need to figure out who owns that GUID. So let's use the DFSRDIAG command to find the culprit:

    C:\>dfsrdiag guid2name /guid:BAA4E6D9-BF1A-4C83-ADF4-FDFD481AE2FC /rgname:importantdata

    Which returns:

    Object Type : DfsrVolumeInfo
    Computer    : 2003SRV16.fabrikam.com  << The Server where the delete occurred
    Volume Guid : 346CA491-54BA-11DB-91ED-806E6F6E6963
    Volume Path : C:
    Volume SN   : 1826913329
    DB Guid     : BAA4E6D9-BF1A-4C83-ADF4-FDFD481AE2FC

    Badda-bing! That BAA4E6D9-BF1A-4C83-ADF4-FDFD481AE2FC GUID matches our GVsn above. There's our guy. What is it with people deleting files and folders off this 2003SRV16 server? We should have a chat with their site admin...
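    If you script this, the only fiddly part is pulling the database GUID out of the GVsn before handing it to DFSRDIAG. Here is a hedged sketch that assumes the {GUID}-vNNN layout seen in the output above; it only builds the command line shown earlier and does not run anything.

```python
import re

# GVsn layout assumed from the output above: {database GUID}-v<version number>.
GVSN_RE = re.compile(
    r"^\{(?P<guid>[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12})\}-v(?P<version>\d+)$"
)


def split_gvsn(gvsn):
    """Return (database GUID, version number) from a GVsn string."""
    match = GVSN_RE.match(gvsn.strip())
    if match is None:
        raise ValueError("not a GVsn: %r" % gvsn)
    return match.group("guid").upper(), int(match.group("version"))


def guid2name_command(gvsn, rgname):
    """Build the dfsrdiag guid2name command line for the server that owns gvsn."""
    guid, _version = split_gvsn(gvsn)
    return "dfsrdiag guid2name /guid:%s /rgname:%s" % (guid, rgname)


if __name__ == "__main__":
    print(guid2name_command("{BAA4E6D9-BF1A-4C83-ADF4-FDFD481AE2FC}-v113867", "importantdata"))
```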

    The same exact steps will work for folders. Let's do it real fast this time and figure out where someone deleted the whole ‘importantsubfolder’:

    C:\>wmic /namespace:\\root\microsoftdfs path DfsrReplicatedFolderInfo get ReplicatedFolderGuid,ReplicatedFolderName,ReplicationGroupName > rfinfo.txt

    ReplicatedFolderGuid                  ReplicatedFolderName  ReplicationGroupName 
    8722EF11-6466-4472-888F-11B8A57B68A4  replicatedfolder      importantdata

    C:\>wmic /namespace:\\root\microsoftdfs path DfsrIdRecordInfo WHERE (filename='importantsubfolder' and replicatedfolderguid='8722EF11-6466-4472-888F-11B8A57B68A4') get filename,flags,updatetime,GVsn,clock > file.txt

    FileName            Flags  GVsn                                        UpdateTime                
    importantsubfolder  4      {97DA0CC3-DBB4-437F-BB6F-BE8A970FE318}-v31  20070823235332.792930-000 

    C:\>dfsrdiag guid2name /guid:97DA0CC3-DBB4-437F-BB6F-BE8A970FE318 /rgname:importantdata

       Object Type : DfsrVolumeInfo
       Computer    : 2003SRV13.fabrikam.com
       Volume Guid : 929F0871-54B9-11DB-B293-806E6F6E6963
       Volume Path : C:
       Volume SN   : 1826913329
       DB Guid     : 97DA0CC3-DBB4-437F-BB6F-BE8A970FE318

    Done! Looks a lot easier the second time around, doesn’t it?

    Note that when a deletion is replicated in to a DFSR server, the file by default is moved to \DfsrPrivate\ConflictAndDeleted under the root of the replicated folder. If the delete was not replicated in, but instead was the result of a local deletion, the file is moved to the Windows Recycle Bin (unless you held down the SHIFT key while deleting, in which case the file is deleted for good). By default the quota on ConflictAndDeleted is 660 MB but that is configurable on the Advanced Tab in the replicated folder properties. In the same location un-checking the “Move deleted files to Conflict And Deleted folder” box will make it so deletions that are replicated in are actually deleted for good without being moved to ConflictAndDeleted.

    The information about all the data residing in ConflictAndDeleted is contained in the \DfsrPrivate\ConflictAndDeletedManifest.xml file. When the quota is reached, files are purged from ConflictAndDeleted folder and the ConflictAndDeletedManifest.xml in the order that they were put there. This means you have a limited amount of time to catch a deletion and be able to restore it from ConflictAndDeleted.

    There is a sample script for restoring data from ConflictAndDeleted. This is needed because the folder structure of deleted data is flattened and all data resides directly off the root of ConflictAndDeleted, and the filename is appended with the GVSN. The script reads the ConflictAndDeletedManifest.xml so it knows the original file names and folder structure. But you can also determine that using the DfsrConflictInfo WMI class. For example you can check for the presence of a file in a given server’s ConflictAndDeleted folder by running:

    C:\>wmic /namespace:\\root\microsoftdfs /node:2003srv13 path DfsrConflictInfo where "filename like 'critical.doc%'" get * /format:textvaluelist

    ConflictFileCount=1
    ConflictPath=\\.\C:\replicatedfolder\importantsubfolder\critical.doc
    ConflictSizeInBytes=881211
    ConflictTime=20070823233010.000000-000
    ConflictType=5
    FileAttributes=32
    FileName=critical.doc-{97DA0CC3-DBB4-437F-BB6F-BE8A970FE318}-v27
    GVsn={97DA0CC3-DBB4-437F-BB6F-BE8A970FE318}-v27
    MemberGuid=2C50672F-32A2-4D7D-AF44-88E1812F6E08
    ReplicatedFolderGuid=8722EF11-6466-4472-888F-11B8A57B68A4
    ReplicationGroupGuid=6143BD54-C9CC-42E1-A1FA-03BB34BF87F2
    Uid={97DA0CC3-DBB4-437F-BB6F-BE8A970FE318}-v27

    The trailing % wildcard is needed because the FileName property has the GVSN of the file appended to it. The ConflictPath property contains the original path and file name for the file before it was deleted.
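    The textvaluelist output is just key=value pairs, so recovering the original file name from it takes very little code. This sketch assumes the single-record output shown above and that FileName is the original name plus "-" plus the GVsn; it is not the sample restore script mentioned earlier.

```python
def parse_value_list(text):
    """Turn WMIC /format:textvaluelist output for one record into a dict."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            record[key] = value
    return record


def original_name(record):
    """Strip the '-<GVsn>' suffix that ConflictAndDeleted appends to the file name."""
    suffix = "-" + record["GVsn"]
    name = record["FileName"]
    return name[:-len(suffix)] if name.endswith(suffix) else name


SAMPLE = r"""
ConflictPath=\\.\C:\replicatedfolder\importantsubfolder\critical.doc
FileName=critical.doc-{97DA0CC3-DBB4-437F-BB6F-BE8A970FE318}-v27
GVsn={97DA0CC3-DBB4-437F-BB6F-BE8A970FE318}-v27
"""

if __name__ == "__main__":
    info = parse_value_list(SAMPLE)
    print(original_name(info), "was at", info["ConflictPath"])
```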

    Auditing – The Who

    Now that we’ve covered when and where, the Windows Auditing subsystem can be used to tell you who via the Object Access Auditing setting. The important takeaway is that you need to have it in place and the audit descriptors set on your files and folders before you need them. Setting it up after the files have gone missing isn’t going to buy you anything. I can tell your eyes have started to glaze over on this post so I’m going to wrap it up here. :)

    To set up Object Access Auditing you can follow this checklist and set your critical replicated folders to audit EVERYONE for DELETE and process that with inheritance on down. We really don’t usually care about files being changed and certainly not added, but deletions drive end users nuts.

    An important thing to understand about Auditing in Windows 2000 and 2003 is that it’s bound by some legacy limitations in the Event Log system (no longer true in Windows Server 2008 or Vista). Basically, you want to keep the total size of all your event logs at around 300MB or they will become unstable. You’ll find that enabling Object Access Auditing is going to make your Security event logs wrap pretty often, so if yours was only 256MB you wouldn’t have much time for forensics. You can get around this by using KB312571 to configure AutoBackupLogFiles and automagically save off the logs when they wrap. Then they can be backed up and deleted periodically with a scheduled task or whatever you like.

    Auditing is not free – it costs CPU time, disk I/O, and can increase memory usage within LSASS. Please be sure to test this for your environment. Really.

    For excellent tips and deep explanation of Windows Auditing, check out the Windows Security Logging and Other Esoterica MSDN blog by Eric Fitzgerald (who ran Auditing for many years as a Program Manager).

    For everything Auditing, check out the Windows Server 2003 Auditing web portal.

    - Ned Pyle

    Dynamic Ports in Windows Server 2008 and Windows Vista (or: How I learned to stop worrying and love the IANA)

    Hi, Dave here. I’m a Support Escalation Engineer in Directory Services out of Charlotte, NC. Recently one of our consultants in the field deployed a Windows Server 2008 Beta 3 domain controller at a branch office to test management scenarios.  After doing this, they discovered that the server was not replicating with domain controllers at the main datacenter.  After running some network captures, they discovered that the local firewall was blocking replication traffic.

    The firewall in question was configured properly to allow Windows Server 2003 domain controllers to replicate, but the Win2008 domain controller was blocked.  This is not because we changed the way replication works per se.  Replication is still accomplished by server to server RPC calls, the same as in Win2003.  But we did change the underlying mechanism that the network stack uses to determine which ports those RPC calls use.

    By default, the dynamic port range in Windows Server 2003 was 1025-5000 for both TCP and UDP.

    In Windows Server 2008 (and Windows Vista), the dynamic port range is 49152-65535, for both TCP and UDP.

    What this means is that any server-to-server RPC traffic (including AD replication traffic) is suddenly using an entirely new port range over the wire. We made this change in order to comply with IANA recommendations about port usage. Therefore, if you start deploying Windows Server 2008 on your network, and are using firewalls to restrict traffic on your internal network you will need to update the configuration of those firewalls to compensate for the new port range.

    It doesn’t stop at RPC traffic though.  The dynamic port range is used for any and all outbound requests from your computer that don’t use a specific source port.  This means that if you fire up Internet Explorer and browse to a web page, the network traffic is going to source from port 49152 or higher on Vista or 2008.  Potentially, any application that connects to other machines via the network could be impacted by a firewall that’s not configured for this change.  In Directory Services support here at Microsoft, we really care mostly about Active Directory related traffic, but this is something that everyone should watch out for. So for example, look at this snippet of a NETSTAT command run on a Vista machine where we are simply connected to a web site with IE7:

    C:\Windows\system32>netstat -bn

    Active Connections

    Proto Local Address Foreign Address State
    TCP 10.10.0.10:53556 65.59.234.166:80 TIME_WAIT
    TCP 10.10.0.10:53572 65.59.234.166:80 TIME_WAIT
    [iexplore.exe]
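To make the point concrete, here is a minimal Python sketch that pulls the local (source) ports out of netstat-style output and says which default dynamic range each one falls in. The function names are made up for illustration, and the ranges are simply the defaults quoted in this post, so treat it as a rough check rather than a supported tool.

```python
import re

# Default dynamic port ranges as quoted in this post.
WIN2003_RANGE = range(1024, 5001)      # 1024-5000
WIN2008_RANGE = range(49152, 65536)    # 49152-65535

def local_ports(netstat_text):
    """Return the local port of every TCP/UDP row in 'netstat -n' style text."""
    ports = []
    for line in netstat_text.splitlines():
        # Rows look like: TCP 10.10.0.10:53556 65.59.234.166:80 TIME_WAIT
        match = re.match(r"\s*(TCP|UDP)\s+\S+:(\d+)\s", line)
        if match:
            ports.append(int(match.group(2)))
    return ports

def classify(port):
    """Name the default dynamic range (if any) that a source port belongs to."""
    if port in WIN2008_RANGE:
        return "Vista/2008 default range"
    if port in WIN2003_RANGE:
        return "2003 default range"
    return "outside both default ranges"

SAMPLE = """
Proto Local Address Foreign Address State
TCP 10.10.0.10:53556 65.59.234.166:80 TIME_WAIT
TCP 10.10.0.10:53572 65.59.234.166:80 TIME_WAIT
"""

for port in local_ports(SAMPLE):
    print(port, "->", classify(port))
```

Both source ports in the sample land in the Vista/2008 range, which is exactly the traffic a firewall tuned only for 1024-5000 would drop.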

    In Vista and 2008, most administration of things at the network stack level is handled via NETSH.  Using NETSH, it’s possible to see what your dynamic port range is set to on a per server basis:

    >netsh int ipv4 show dynamicport tcp
    >netsh int ipv4 show dynamicport udp
    >netsh int ipv6 show dynamicport tcp
    >netsh int ipv6 show dynamicport udp

    These commands will output the dynamic port range currently in use.  Kind of a neat fact is that you can have different ranges for TCP and UDP, or for IPv4 and IPv6, although they all start off the same.

    In Windows Server 2003 the range always defaults to starting with TCP port 1024, and that is hard-coded.  But in Vista/2008, you can move the starting point of the range around.  So if you needed to, you could tell your servers to use ports 5000 through 15000 for dynamic port allocations, or any contiguous range of ports you wanted.  To do this, you use NETSH again:

       >netsh int ipv4 set dynamicport tcp start=10000 num=1000
       >netsh int ipv4 set dynamicport udp start=10000 num=1000
       >netsh int ipv6 set dynamicport tcp start=10000 num=1000
       >netsh int ipv6 set dynamicport udp start=10000 num=1000

    The examples above would set your dynamic port range to start at port 10000 and run through port 10999 (1000 ports).

    A few important things to know about the port range:

    · The smallest range of ports you can set is 255.
    · The lowest starting port that you can set is 1025.
    · The highest end port (based on the range you set) cannot exceed 65535.
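The arithmetic behind a start/num pair is easy to get wrong by one, so here is a small Python sketch that applies the three limits listed above and returns the first and last port of the range. The function name and error messages are illustrative only; the limits are the ones stated in this post.

```python
# Limits as stated in the list above.
MIN_PORTS = 255
MIN_START = 1025
MAX_PORT = 65535

def dynamic_port_range(start, num):
    """Return (first_port, last_port) for a start/num pair, or raise ValueError."""
    if num < MIN_PORTS:
        raise ValueError("range must contain at least %d ports" % MIN_PORTS)
    if start < MIN_START:
        raise ValueError("start port must be %d or higher" % MIN_START)
    # The start port itself counts as the first of the num ports.
    last = start + num - 1
    if last > MAX_PORT:
        raise ValueError("range would end at %d, past %d" % (last, MAX_PORT))
    return start, last

print(dynamic_port_range(10000, 1000))    # start=10000 num=1000
print(dynamic_port_range(49152, 16384))   # the Vista/2008 default
```

With start=10000 and num=1000, the last port in the range is 10999, and the default of 49152 with 16384 ports ends exactly at 65535.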

    For more information on this, check out KB 929851.

    At this point you’re probably wondering what our recommendation is for configuring firewalls for AD replication with Windows Server 2008.  Generally speaking, we don’t recommend that you restrict traffic between servers on your internal network.  If you must deploy firewalls between servers, you should use IPSEC or VPN tunnels to allow all traffic between those servers to pass through, regardless of source or destination ports.  However, experience has taught us that some customers are going to want to restrict traffic, which is why it is possible to configure this range and control the ports that will be used.

    Here are two FAQs that have come up internally around this change:

    Q:  How do the changes to the dynamic port ranges affect AD replication?

    A:  AD replication relies on dynamically allocated ports for both sides of the replication connection.  This means that by default, replication traffic will now use ports 49152 and higher on both domain controllers involved in the transaction.

    Q:  Can the port that replication traffic uses be controlled?

    A:  It is still possible to restrict replication traffic to a specific port using the registry values documented in KB 224196.
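As a rough illustration of that last answer, the Python sketch below generates the text of a .reg file that pins replication to one port. It assumes the value described in KB 224196 is the REG_DWORD named "TCP/IP Port" under the NTDS\Parameters key; check the KB before relying on that. The port number 50000 is purely hypothetical, and the helper name is made up for this example.

```python
# Assumption: KB 224196 documents this key and value name for AD replication.
NTDS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "TCP/IP Port"

def replication_port_reg(port):
    """Build .reg file text that sets a fixed AD replication port."""
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % NTDS_KEY,
        # .reg files store DWORD values as eight hexadecimal digits.
        '"%s"=dword:%08x' % (VALUE_NAME, port),
        "",
    ]
    return "\r\n".join(lines)

# 50000 is a hypothetical choice, not a recommendation.
print(replication_port_reg(50000))
```

Generating the file rather than writing to the registry directly keeps the change reviewable before anyone imports it on a domain controller.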

    - Dave Beach

  • Ask the Directory Services Team

    Troubleshooting High LSASS CPU Utilization on a Domain Controller (Part 2 of 2)

    Last time I discussed troubleshooting the most common high CPU scenario within LSASS, which is the server being beaten up by a remote machine. Let’s talk now about the much less common but still possible:

    You find that the problem is coming from the DC itself.

    As I said in the previous post, this is a super rare situation these days. If you are on Windows 2000 Server SP4 or Windows Server 2003 SP1/SP2, we really don’t have any known issues where we can simply hand you a hotfix and send you on your way. The most likely cause is something foreign to the operating system – an add-on security package, a custom password synchronizer, a service running something security-related, etc. A very down and dirty way to check these is:

    Examine this registry key on your Windows Server 2003 (it will be slightly different on Win2000) machine being affected:

    HKEY_LOCAL_MACHINE\system\CurrentControlSet\Control\Lsa

    • Do you see anything in the Authentication Packages value except msv1_0?
    • Do you see anything in the Security Packages value except Kerberos msv1_0 schannel wdigest?
    • Do you see anything in the Notification Packages value except RASSFM KDCSVC WDIGEST scecli?
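The three questions above can be sketched as a quick Python check. The expected package names are the Windows Server 2003 defaults listed in this post; the function names and the "pwdsync" entry in the example are hypothetical. The registry read only works when run on the Windows machine itself, so the comparison is kept separate and can be fed values collected some other way.

```python
# Default entries for Windows Server 2003, as listed in this post (lowercased).
EXPECTED = {
    "Authentication Packages": {"msv1_0"},
    "Security Packages": {"kerberos", "msv1_0", "schannel", "wdigest"},
    "Notification Packages": {"rassfm", "kdcsvc", "wdigest", "scecli"},
}

def unexpected_packages(lsa_values):
    """Return {value name: [entries not in the default list]} for the Lsa key."""
    findings = {}
    for name, expected in EXPECTED.items():
        extras = [entry for entry in lsa_values.get(name, [])
                  if entry and entry.lower() not in expected]
        if extras:
            findings[name] = extras
    return findings

def read_lsa_values():
    """Read the three multi-string values from the local registry (Windows only)."""
    import winreg
    path = r"SYSTEM\CurrentControlSet\Control\Lsa"
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        for name in EXPECTED:
            try:
                data, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                data = []
            # Multi-string values come back as a list; tolerate a plain string too.
            values[name] = data.split() if isinstance(data, str) else list(data)
    return values

# Example with made-up data: "pwdsync" stands in for a third-party package.
example = {
    "Authentication Packages": ["msv1_0"],
    "Security Packages": ["kerberos", "msv1_0", "schannel", "wdigest", "pwdsync"],
    "Notification Packages": ["RASSFM", "KDCSVC", "WDIGEST", "scecli"],
}
print(unexpected_packages(example))
```

On an affected server you would call unexpected_packages(read_lsa_values()) instead of using the example dictionary. Anything it reports is only a lead to investigate, not proof of a problem.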

    Anything else in here on Windows Server 2003 may be suspect, as something or someone has injected non-standard libraries into LSASS. It may be intentional and the DLL is simply malfunctioning or misconfigured. It may be malicious. Find the file (it will nearly always be a DLL in the %windir%\system32 directory) and take a look at its properties:

    • Who made it?
    • Is it new?
    • Is it only on the machines having a problem and never on the ones that don’t?
    • Are there different versions between working and non-working machines?
    • Any of your colleagues recognize it?
    • Is it documented online?

    Once you think you have a handle on it, get a backup of your server and this registry key and remove the entry, then restart the server in a change control window when users are least affected. Does the high CPU utilization come back? It almost never does, trust me…

    If there was nothing of interest there, another good technique is to use MSCONFIG to identify and potentially disable applications that have been added on to the server.

    By checking the ‘Hide All Microsoft Services’ box you can see System Services that were added to the machine that did not ship with the operating system (technically speaking, you may see some services that are from us, such as Exchange). You can then temporarily set them to ‘Disabled’ and restart the server to test for the performance problem. The same can be done with the ‘Startup’ section, for apps that live in the RUN key of the registry. By using the ‘divide by half’ rule (where you disable half and test, disable the other half and test, then narrow down by halves until you find your culprit), you can usually get to the bad guy pretty quickly.
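The ‘divide by half’ rule is just a binary search, and a short Python sketch may make the bookkeeping clearer. It assumes there is exactly one culprit and that the problem reproduces reliably. The callback stands in for the manual step of leaving only the given services enabled, restarting, and checking LSASS CPU, and the service names are invented.

```python
def find_culprit(items, problem_occurs_with):
    """Narrow a list of add-on services down to the single one causing trouble.

    problem_occurs_with(enabled) must return True when the problem shows up
    with only the services in 'enabled' left running.
    """
    candidates = list(items)
    while len(candidates) > 1:
        first_half = candidates[:len(candidates) // 2]
        if problem_occurs_with(first_half):
            candidates = first_half                     # culprit is in this half
        else:
            candidates = candidates[len(first_half):]   # must be in the other half
    # Confirm the last survivor really reproduces the problem on its own.
    if candidates and problem_occurs_with(candidates):
        return candidates[0]
    return None

# Invented service names; "AcmePwdSync" plays the bad guy here.
services = ["BackupAgent", "AVScanner", "MonitorSvc", "AcmePwdSync", "PrintHelper"]
print(find_culprit(services, lambda enabled: "AcmePwdSync" in enabled))
```

With five services this takes four test restarts, including the final confirmation. Disabling one service at a time could take up to five, and the gap widens quickly as the list grows.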

    You can see all of this info using the Microsoft Product Support Reporting Tools (MPSRPT_DirSvc.exe) as well.

    Notes

    This blog post is not about debugging – yes, some of the techniques I use above can be replaced with attaching WINDBG to LSASS, syncing symbols, and going to town to see what’s specifically wrong under the covers. The posting is for folks looking for remediation, not code-level root cause. And let’s be honest – we debug things like this every day on customer request. After all the work is done (and the billing against the customer’s contract – ouch!), we still have the same answer: please contact your vendor about this malfunctioning code, as only they can fix it. If you’d like a quick primer on using a debugger to see which modules are loaded into LSASS and which of them might be suspect, please let us know and we’ll blog it up.

  • Ask the Directory Services Team

    Troubleshooting High LSASS CPU Utilization on a Domain Controller (Part 1 of 2)

    Hi, Ned here. Today I’m going to talk about troubleshooting Domain Controllers that are responding poorly due to high LSASS CPU utilization. I’ve split this article into two parts because there are actually two major forks that happen in this scenario:

    · You find that the problem is coming from the network and affecting the DC remotely.
    · You find that the problem is coming from the DC itself.

    LSASS is the Local Security Authority Subsystem Service. It provides an interface for managing local security, domain authentication, and Active Directory processes. A domain controller’s main purpose in life is to leverage LSASS to provide services to principals in your Active Directory forest. So when LSASS isn’t happy, the DC isn’t happy.

    The first step to any kind of high LSASS CPU troubleshooting is to identify what ‘high’ really means. For us in Microsoft DS Support, we typically consider sustained and repeated CPU utilization at 80% or higher to be trouble if there’s no baseline of comparison. Periodic spikes that last a few seconds aren’t consequential (after all, you want your money’s worth of that new Quad Core), but if it lasts for ten to fifteen minutes straight and repeats constantly you may start seeing other problems: slower or failing logons, replication failures, etc. For you as an administrator of an AD environment, ‘high’ may mean something else – for example, if you are baselining your systems with MOM 2005 or SCOM 2007 you may have already determined that normal CPU load on your DC’s is 20%. Then when all the DC’s start showing 50% CPU, this is aberrant behavior and you want to find out why. So it’s not necessary for the utilization to reach some magic number, just for it to become abnormal compared to your known baseline.
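Both definitions of ‘high’ can be sketched in a few lines of Python. The 80% threshold and the ten-minute duration come from the paragraph above. The one-sample-per-minute interval, the function names, and the "twice the baseline" factor are assumptions made for this example, not DS Support rules.

```python
def sustained_high_runs(samples, threshold=80.0, min_samples=10):
    """Find runs of consecutive samples at or above threshold.

    samples: CPU percentages taken at a fixed interval (assumed one per minute).
    Returns a list of (start_index, run_length) for runs of min_samples or more.
    """
    runs = []
    run_start = None
    # A trailing sentinel of None closes a run that reaches the end of the data.
    for index, value in enumerate(list(samples) + [None]):
        if value is not None and value >= threshold:
            if run_start is None:
                run_start = index
        else:
            if run_start is not None and index - run_start >= min_samples:
                runs.append((run_start, index - run_start))
            run_start = None
    return runs

def abnormal_vs_baseline(current, baseline, factor=2.0):
    """True when current load is at least 'factor' times the known baseline."""
    return current >= baseline * factor

# Made-up samples: a short spike is ignored, a 12-minute plateau is reported.
cpu = [30] * 5 + [90] * 12 + [25] * 3 + [95] * 2
print(sustained_high_runs(cpu))
print(abnormal_vs_baseline(50, 20))
```

The brief spike at the end of the sample data is ignored and the twelve-minute plateau is flagged. A DC that normally sits at 20% also counts as abnormal at 50%, even though it never gets near 80%.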

    The next step is to determine the scope – is this happening to all DC’s or just ones in a particular physical or logical AD site? Is it just the PDC Emulator? This helps us focus our troubleshooting and data collection. If it’s just the PDCE can we temporarily move the role to another server (be careful about this if using external domain trusts that rely on static LMHOSTS entries)? If the utilization follows, the problem is potentially an older application using legacy domain API’s that were designed for NT4. Perhaps this application can be turned off, modified or updated. There could also be down-level legacy OS’s in the environment, such as NT4 workstations, and they are overloading the PDCE. There are also components within AD that focus their attention on the PDCE as a matter of convenience (password chaining, DFS querying, etc). If you are only seeing the issue on the PDCE, examine Summary of "Piling On" Scenarios in Active Directory Domains.

    The next step is to identify if the issue is coming from the network or on the DC itself. I’ll be frank here – 99.999% of the time the issue is going to come off box. If you temporarily pull the network cable from the DC and wait fifteen minutes, LSASS is nearly guaranteed to drop back down to ~1% (why 15 minutes? That’s above the connection timeout for most components and after that time the DC should have given up on trying to service any more requests that were already queued). If it doesn’t drop, we know the problem is local to this DC. So here’s where this blog forks:

    You find that the problem is coming from the network and affecting the DC remotely. It’s not just the PDCE being affected.

    We can take a layered approach to troubleshooting high LSASS coming from the network. This involves a couple of tools and methods:

    · Windows Server 2003 Performance Advisor (SPA)

    · Wireshark (formerly Ethereal) network analysis utility (note: Wireshark is not affiliated with or supported by Microsoft in any way. Wireshark is used by Microsoft Product Support for network trace analysis in certain scenarios like this one. We also use Netmon 3.1, NMCAP, Netcap, and the built-in Netmon 2.x Lite tools. This is in no way an endorsement of Wireshark.).

    Server Performance Advisor (SPA) can be useful for seeing snapshots of Domain Controller performance and getting some hints on what’s causing high CPU. While it’s easier to analyze than a network trace, it’s also more limited in what it can understand. It runs on the affected server, so if the CPU is so busy that the server isn’t really responsive it won’t be helpful. Let’s take a look at a machine which is seeing fairly high CPU, but where it’s still usable:

    We’ve installed SPA, started it up and selected Active Directory for our Data Collector, like so:

    We then execute the data collection, which runs for 15 minutes. It reads performance counters and uses other mechanisms to create some reports for us.

    We open that current report and are greeted with some summary information. We can see here that overall CPU is at 61% and that most of the CPU time is against LSASS. We also see that it’s mainly LDAP requests eating up the processor, and that one particular remote machine is accounting for an abnormally large amount of it.

    So we drill a little deeper into the details from SPA and look in the unique LDAP searches area. The two machines below are sending a deep query (i.e. it searches all subtrees from the base of our domain naming context) using some filters based on attributes of ‘whenCreated’ and ‘whenChanged’. Odd.

    We’re still not convinced though – after all, SPA takes short snapshots and it really focuses on LDAP communication. What if it happened to capture a behavioral red herring? Let’s get some confirmation.

    We start by getting some 100MB full frame network captures. We can use the built-in Netmon Lite tool, use NETCAP from the Windows Support Tools, or anything you feel comfortable with. Doing less than 100MB means our sample will be too small; doing more than 100MB means that the trace filtering becomes unwieldy. Getting more than one capture is advisable.

    So we have our CAP file and we open it up in Wireshark. We click ‘Statistics’ then ‘Conversations’. This executes a built in filter that generates a TSV-formatted output (which you can throw into Excel and graph if you want to be fancy for upper management). Hit the IPv4 tab and we see:

    Whoa, very interesting. 10.80.0.13 and 10.70.0.11 seem to be involved in two massive conversations with our DC, and everything else looks pretty quiet. Looking back at our SPA report we see that .13 address listed, and if we NSLOOKUP the .11 address we find it’s the XPPRO11A machine. I think we’re on to something here.
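If you would rather crunch the conversation list yourself than eyeball it, the idea can be sketched in Python as below. The rows are assumed to be already reduced to (address A, address B, bytes) tuples. The DC address and every byte count in the example are invented, and only the two client addresses come from this post.

```python
from collections import Counter

def top_talkers(conversations, dc_address, top=3):
    """Total the bytes each peer exchanged with the DC and return the heaviest.

    conversations: iterable of (address_a, address_b, byte_count) tuples.
    """
    totals = Counter()
    for address_a, address_b, byte_count in conversations:
        if address_a == dc_address:
            peer = address_b
        elif address_b == dc_address:
            peer = address_a
        else:
            continue  # conversation does not involve the DC at all
        totals[peer] += byte_count
    return totals.most_common(top)

# Hypothetical DC address and byte counts, for illustration only.
DC = "10.70.0.10"
rows = [
    (DC, "10.80.0.13", 48000000),
    ("10.70.0.11", DC, 41000000),
    (DC, "10.70.0.25", 120000),
    ("10.70.0.30", "10.70.0.31", 5000),
]
for peer, total in top_talkers(rows, DC, top=2):
    print(peer, total)
```

A couple of peers dwarfing everyone else, as in this made-up data, is the pattern to look for before drilling into the individual frames.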

    We set a filter for the 10.80.0.13 machine in our CAP file and set it to only care about LDAP traffic, like so:

    We can see that the 10.80.0.13 machine is making constant LDAP requests against our DC. That in itself isn’t very normal – a Windows machine doesn’t typically send a barrage of queries all day, it sends small spurts to do specific things like log on, look up a group membership, or process group policy. What exactly is this thing doing? Let’s look at the details of one of these requests in Wireshark:

    Well, it’s definitely the same thing we saw in SPA. We’re connecting to the base of the domain Litewareinc.com, and we’re searching for every single object’s create time stamp. We know that we have 100,000 users, 100,000 groups, and 100,000 computers in this domain, and using such a wide open query is going to be expensive for LSASS to process. But it still seems that we should be able to handle this? What makes this attribute cost us so much processing?

    We run regsvr32 schmmgmt.dll in order to gain access to the Schema Management snap-in, then run MMC.EXE and add Active Directory Schema. Under the Attributes node we poke around and find createTimeStamp.

    Well isn’t that a kick in the teeth – this attribute isn’t indexed by default! No wonder it’s running so painfully. Luckily DC’s cache frequently used queries or we’d be in even worse shape with disk IO. We don’t want to just go willy-nilly adding indexes in Active Directory as that can have its own set of memory implications. So we have a quick chat with the owner of those machines and he admits that they changed their custom LDAP application yesterday (when the problem started!). It was supposed to be getting back some specific information about user account creation but it had a bug and it was asking about every single object in Active Directory. They change their app and everything returns to normal – high fives for the good guys in Server Administration.
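The post doesn’t show the application’s actual query, so the following Python sketch is only a guess at what a better-behaved one might look like: a filter restricted to user objects via objectCategory (which Active Directory indexes by default) plus a creation-time cutoff in LDAP generalized-time format. The function names and the choice of whenCreated are assumptions made for this example.

```python
from datetime import datetime, timezone

def generalized_time(moment):
    """Format a timezone-aware datetime as an LDAP generalized-time string (UTC)."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S.0Z")

def users_created_since(moment):
    """Build an LDAP filter for user accounts created at or after 'moment'."""
    return "(&(objectCategory=person)(objectClass=user)(whenCreated>=%s))" % (
        generalized_time(moment),
    )

# The cutoff date is arbitrary; pick whatever window the application really needs.
cutoff = datetime(2007, 8, 1, tzinfo=timezone.utc)
print(users_created_since(cutoff))
```

Pairing a filter like this with a search base scoped to the OU that actually holds the accounts, rather than the root of the domain, cuts down further on the objects the DC has to evaluate.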

    So today we learned about troubleshooting high LSASS CPU processing from a remote source. Next time we will diagnose a machine that’s having problems even after we pull it off the network. Stay tuned.

    Notes:

    Read this excellent post on SPA for more info on that tool.

    Read this write-up on query inefficiency. This is what you give to that LDAP developer that was beating up your DC’s!

    LSASS memory utilization is an entirely different story. The JET database used by Domain Controllers is highly optimized for read operations, and consequently LSASS tries to allocate as much virtual memory through caching as it possibly can to make queries fast (and deallocates that memory if requested by other applications). This is why we recommend that whenever possible, a DC’s role should only be a DC – not also a file server, a SQL server, an Exchange server, an ISA box, and the rest. While this isn’t always possible, it’s our best practice advice. For more, read: Memory usage by the Lsass.exe process on domain controllers that are running Windows Server 2003 or Windows 2000 Server

    For Windows 2000 we have an older SPA-like tool called ADPERF, but it’s only available if you open a support case with us.

     

    For part 2, go here.

    - Ned Pyle
