
October, 2008

  • Manually Clearing the ConflictAndDeleted Folder in DFSR

    Ned here again. Today I’m going to talk about a couple of scenarios we run into with the ConflictAndDeleted folder in DFSR. These are real quick and dirty, but they may save you a call to us someday.

    Scenario 1: We need to empty out the ConflictAndDeleted folder in a controlled manner as part of regular administration (i.e. we just lowered quota and we want to reclaim that space).

    Scenario 2: The ConflictAndDeleted folder quota is not being honored due to an error condition and the folder is filling the drive.

    Let’s walk through these now.

    Emptying the folder normally

    It’s possible to clean up the ConflictAndDeleted folder through the DFSMGMT.MSC and SERVICES.MSC snap-ins, but it’s disruptive and kind of gross (you could lower the quota, wait for AD replication, wait for DFSR polling, and then restart the DFSR service). A much faster and slicker way is to call the WMI method CleanupConflictDirectory from the command line or a script:

    1.  Open a CMD prompt as an administrator on the DFSR server.
    2.  Get the GUID of the Replicated Folder you want to clean:

    WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderconfig get replicatedfolderguid,replicatedfoldername

    (This is all one line, wrapped)

    Example output:


    3.  Then call the CleanupConflictDirectory method:

    WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='<RF GUID>'" call cleanupconflictdirectory

    Example output with a sample GUID:

    WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='70bebd41-d5ae-4524-b7df-4eadb89e511e'" call cleanupconflictdirectory


    4.  At this point the ConflictAndDeleted folder will be empty and the ConflictAndDeletedManifest.xml will be deleted.
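    If you clean up regularly, the two WMIC steps above lend themselves to scripting. Here is a rough sketch in Python that just builds the command line from step 3 for a given GUID – the function name is mine, and actually executing the result (with subprocess, say) is left to you:

```python
def cleanup_command(rf_guid):
    """Build the WMIC call from step 3 for a given replicated folder GUID."""
    return ('WMIC.EXE /namespace:\\\\root\\microsoftdfs '
            'path dfsrreplicatedfolderinfo '
            'where "replicatedfolderguid=\'' + rf_guid + '\'" '
            'call cleanupconflictdirectory')
```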

    Emptying the ConflictAndDeleted folder when in an error state

    We’ve also seen a few cases where the ConflictAndDeleted quota was not being honored at all. In every single one of those cases, the customer had recently had hardware problems (specifically with their disk system) where files had become corrupt and the disk was unstable – even after repairing the disk (at least to the best of their knowledge), the ConflictAndDeleted folder quota was not being honored by DFSR.

    Here’s where quota is set:


    Usually when we see this problem, the ConflictAndDeletedManifest.XML file has grown to hundreds of MB in size. When you try to open the file in an XML parser or in Internet Explorer, you will receive an error like “The XML page cannot be displayed” or that there is an error at line X. This is because the file is invalid at some section (with a damaged element, scrambled data, etc).

    To fix this issue:

    1. Follow steps 1-4 from above. This may clean the folder as well as update DFSR to say that cleaning has occurred. We always want to try doing things the 'right' way before we start hacking.
    2. Stop the DFSR service.
    3. Delete the contents of the ConflictAndDeleted folder manually (with explorer.exe or DEL).
    4. Delete the ConflictAndDeletedManifest.xml file.
    5. Start the DFSR service back up.
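    For repeat use, the stop/delete/start sequence above can be captured as data before anything destructive runs. A sketch, assuming the usual DfsrPrivate layout under the replicated folder root – verify the paths on your own server before deleting anything:

```python
def manual_cleanup_commands(rf_root):
    """Steps 2-5 above as an ordered list of CMD one-liners.

    rf_root is the replicated folder root; the DfsrPrivate subfolder
    location is an assumption -- confirm it before running these.
    """
    private = rf_root + '\\DfsrPrivate'
    return [
        'net stop dfsr',                                             # step 2
        'del /s /q "' + private + '\\ConflictAndDeleted\\*"',        # step 3
        'del /q "' + private + '\\ConflictAndDeletedManifest.xml"',  # step 4
        'net start dfsr',                                            # step 5
    ]
```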

    For a bit more info on conflict and deletion handling in DFSR, take a look at:

    Staging folders and Conflict and Deleted folders (TechNet)
    DfsrConflictInfo Class (MSDN)

    Until next time...

    - Ned "Unhealthy love for DFSR" Pyle

  • “Lag site” or “hot site” (aka delayed replication) for Active Directory Disaster Recovery support

    Hi, Gary from Directory Services here and I’m going to talk today about the concept of “lag sites” or “hot sites” as a recovery strategy. I recently had a case where the customer asked if the replication interval for a site link could be set higher than 10,080 minutes (7 days). The quick answer was that Active Directory only supports values from 15 up to 10,080 minutes and the schedule is based on a week. If the replinterval attribute on the site link is manually set to something lower than 15 it will use the default of 15. If it is set to something higher than 10,080, it will be ignored and 10,080 will be used.
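    The clamping behavior described above boils down to a few lines. A quick sketch (the function name is mine):

```python
def effective_repl_interval(minutes):
    """Clamp a site link's replInterval the way AD does: 15..10,080 minutes."""
    if minutes < 15:
        return 15        # below the floor: the default of 15 is used
    if minutes > 10080:
        return 10080     # above one week: the value is ignored, 10,080 is used
    return minutes
```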

    But the underlying question kept coming back to the recommendation of a latent “lag site”.

    First let me give a quick definition of a lag site or hot site and its general intended purpose. A lag site is just an Active Directory site that is configured with a replication schedule of one, two or maybe three days out of the week. That way it will have data that would be intentionally out-of-date as of the last successful inbound replication. It is sometimes used as a quick way to recover accidentally deleted objects without having to resort to finding the most recent successful backup within the tombstone lifetime of the domain that has the data.

    This sounds like a decent idea, in theory. However, Microsoft Support does not recommend a lag site as a disaster recovery strategy. Servicing products such as hotfixes and service packs do not recognize a quasi-offline DC state, and monitoring software may detect the state of a lag site DC as malfunctioning and attempt to re-enable it (or tell an unwitting administrator to do so). Microsoft makes no guarantees that servicing and monitoring products will not re-enable the Netlogon and KDC services in a lag site. In addition, other Microsoft products, such as Exchange Server, are not designed to operate in a lag site and may not function properly with lag site DCs.

    The following lists some reasons why lag sites should not be relied upon as a disaster recovery strategy, especially in lieu of proper Active Directory System State backups:

    Lag sites are not guaranteed to be intact in a disaster:

    • If the disaster is not discovered in time before replication occurs, the problem is replicated to the lag site, and the lag site cannot be used to undo the disaster. A lag site typically needs to be three days latent in order to cover situations that occur during the weekend where visibility is low. However this means that you are actually forced to ‘lose’ more changes than a reliable daily backup being run on domain controllers.
    • Thus, the administrator must act immediately when a disaster occurs: inbound and outbound replication must be disabled and forced replication (with repadmin, for example) must be forbidden.

    Replicating from lag site might have unrecoverable consequences:

    • Since a lag site contains out-of-date data, using it as a replication source may result in data loss depending on the amount of latency between the disaster and the last replication to the lag site.
    • If something goes wrong during recovery from a lag site, a forest recovery might be required in order to rollback the changes.

    Lag sites pose security threats to the corporate environment:

    • For example, when an employee is fired from the company, his/her account is immediately deleted (or disabled) from Active Directory, but the account might still be left behind in the lag site. If the lag site domain controllers allow logons, this could potentially lead to unauthorized users with access to corporate resources during the lag site replication delay “window”.

    Careful consideration must be put in configuring and deploying lag sites:

    • An Administrator needs to decide the number of lag sites to deploy in a forest. The more domains that have lag sites, the more likely one can recover from a replicated disaster. However, this would also mean increased hardware and maintenance costs.
    • An Administrator needs to decide the amount of latency to introduce. The shorter the latency, the more up-to-date and useful the data would be in the lag site. However, this would also mean that administrators must act quickly to stop replication to the lag site when a disaster occurs.

    The above list is not exhaustive, and there could be other unseen problems with deploying lag sites as a disaster recovery strategy. It has always been strongly recommended that the best way to prepare for disasters such as mass deletions, mass password changes, etc. is to back up domain controllers daily and verify these backups regularly through test restorations.

    Finally, keep in mind that testing your disaster recovery routine is vital, both before you begin to rely on it and periodically once it becomes your recovery strategy. Surprise is never good when a disaster strikes.

    Here are some links to Microsoft recommended recovery steps and practices:

    840001 How to restore deleted user accounts and their group memberships in Active Directory

    Useful shelf life of a system-state backup of Active Directory

    Managing Active Directory Backup and Restore

    Step-by-Step Guide for Windows Server 2008 AD DS Backup and Recovery

    Active Directory Backup and Restore in Windows Server 2008

    - Gary Mudgett

  • Getting a CMD prompt as SYSTEM in Windows Vista and Windows Server 2008

    Ned here again. In the course of using Windows, it is occasionally useful to be someone besides… you. Maybe you need to be an Administrator temporarily in order to fix a problem. Or maybe you need to be a different user as only they seem to have a problem. Or maybe, just maybe, you want to be the operating system itself.


    Think about it. What if you are troubleshooting a problem where an agent process like the SMS Client isn’t working? Or an anti-virus service is having issues reading the registry? If only we had some way to look at things while logged in as SYSTEM.

    What is SYSTEM and why is Vista/2008 special?

    SYSTEM is actually an account; in fact, it’s a real honest-to-goodness user. Its real name is “NT Authority\Local System” and it has a well-known SID of S-1-5-18. All Windows computers have this account and they always have the same SID. It’s there for user-mode processes that will be executed as the OS itself.

    This is a bit tricky in Windows Vista and Windows Server 2008, though. In previous operating systems you could simply schedule a CMD prompt as a task and have it interact with the desktop easily. This was construed as a security hole by some people, so in Vista/2008 it’s not possible anymore.

    So how can we take off our glasses and put on the cape with the big red S?

    Method one - PSEXEC

    An easy way to get a CMD prompt as SYSTEM is to grab PSEXEC from Microsoft Sysinternals:

    1. Download PSEXEC and unzip to some folder.

    2. Open an elevated CMD prompt as an administrator.

    3. Navigate to the folder where you unzipped PSEXEC.EXE

    4. Run:

         PSEXEC -i -s -d CMD

    5. You will have a new CMD prompt open, as though by magic.

    6. Type the following in the new CMD prompt to prove who you are:

         WHOAMI /USER


    There you go – anything that happens in that CMD prompt or is spawned from that prompt will be running as SYSTEM. You could run regedit from here, start explorer, or whatever you need to troubleshoot as that account.
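    If you’d rather check programmatically than eyeball the output, the well-known SID mentioned earlier is the thing to test for. A sketch (the parsing is mine, not an official API):

```python
SYSTEM_SID = 'S-1-5-18'  # well-known SID for the SYSTEM account

def is_system(whoami_output):
    """True if the output of WHOAMI /USER contains the SYSTEM SID as a token."""
    return SYSTEM_SID in whoami_output.split()
```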

    That was pretty easy – why do I have some more methods below? Unfortunately, in several previous versions of the PSEXEC tool the -s (system) switch has not worked. As of version 1.94 it does work again, but that is no guarantee for the future. This brings us to a more iron-clad technique:

    Method two - REMOTE

    We can use the REMOTE.EXE tool which comes as part of the Windows Debugger. While it’s a bit more cumbersome, it will always work:

    1. Download the Windows Debugger (x86 or x64) and install it anywhere (we just need its copy of REMOTE.EXE, so feel free to copy that file elsewhere and uninstall the debugger when done; in the example below I installed to “c:\debuggers”).

    2. Open an elevated CMD prompt as an administrator.

    3. Run:

      AT <one minute from now> c:\debuggers\remote.exe /s cmd SYSCMD

    Where you use 24-hour clock notation (aka ‘military time’). For example, right now it is 3:57PM, so I type:

      AT 15:58 c:\debuggers\REMOTE.EXE /s cmd SYSCMD

    4. Then once 15:58 (3:58PM) is reached, you can run:

      C:\debuggers\REMOTE.EXE /c <your computer> SYSCMD

    Where you are typing your computer’s own NetBIOS name. So for example:

      C:\debuggers\remote.exe /c nedpyle04 SYSCMD


    Neato. I used REMOTE to connect to REMOTE on the same computer. This is a good example of a client-server RPC application. The SYSCMD option I keep using is just a marker that identifies the remote session. Technically you could have lots of these going at once, each with a different marker.
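    The only fiddly part of the AT approach is computing “one minute from now” in 24-hour notation. A sketch:

```python
from datetime import datetime, timedelta

def at_start_time(now=None):
    """Return HH:MM ('military time') one minute from now, for use with AT."""
    now = now or datetime.now()
    return (now + timedelta(minutes=1)).strftime('%H:%M')
```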

    If I then use WHOAMI /USER again, the proof:


    To leave just type EXIT


    Method two and a half – REMOTE and the Task Scheduler

    Maybe you want to have REMOTE ready to go at a moment’s notice (you plan to do this a lot, eh)? Or what if you want to use one of the other SYSTEM-type accounts, like “Local Service” and “Network Service”? PSEXEC can’t do that and neither can the old AT command.

    Here’s some XML and commands you can use to make the server portion of REMOTE be ready at an instant for various accounts. This time we’ll use the newer, slicker SCHTASKS tool:

    1. Copy the following sample into notepad and save as <something>.xml (in my sample below, I save to “c:\temp\RaS.xml”)

    <?xml version="1.0" encoding="UTF-16"?>
    <Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
      <Triggers />
      <Principals>
        <Principal id="Author">
          <UserId>NT AUTHORITY\SYSTEM</UserId>
          <RunLevel>HighestAvailable</RunLevel>
        </Principal>
      </Principals>
      <Actions Context="Author">
        <Exec>
          <Command>c:\debuggers\remote.exe</Command>
          <Arguments>/s cmd SYSCMD</Arguments>
        </Exec>
      </Actions>
    </Task>

    Note the elements above. You will need to make sure that the REMOTE.EXE path matches where the file is located on your machine. Also, the UserId can be set to anything you like, including “nt authority\local service” or “nt authority\network service”.

    2. Open an elevated CMD prompt as an administrator.

    3. Run:

       SCHTASKS /create /tn <some task name> /xml <path to xml file>

    Where you provide a real task name and XML file. For example:

       SCHTASKS /create /tn RemoteAsSystem /xml c:\temp\RaS.xml


    4. This created a scheduled task with all the REMOTE info filled out.

    5. Now we can run the REMOTE server piece anytime we want, as often as we want with:

       SCHTASKS /run /tn RemoteAsSystem


    6. Now we can connect just like we did back in method two:


    That’s it. Hopefully you find this useful someday (or maybe I should hope you never have to find it useful). Got a comment, or another way to do this? Let us know.

    - Ned “Nubbin” Pyle

  • Port Exhaustion and You (or, why the Netstat tool is your friend)

    Hi, David here. Today I wanted to talk about something that we see all the time here in Directory Services, but that doesn’t usually get a lot of press. It’s a condition we call port exhaustion, and it’s a problem that will cause TCP and UDP communications with other machines over the network to fail.

    Port exhaustion can cause all kinds of problems for your servers. Here’s a list of some symptoms:

    - Users won’t be able to connect to file shares on a remote server
    - DNS name registration might fail
    - Authentication might fail
    - Trust operations might fail between domain controllers
    - Replication might fail between domain controllers
    - MMC consoles won’t work or won’t be able to connect to remote servers.

    That’s just a sample of the most common symptoms that we see. But here’s the big one: You reboot the server(s) involved, and the problem goes away - temporarily. A few hours or a few days later, it comes back.

    So what is port exhaustion? You might think that it’s where the ports on the computer get tired and just start responding slower over time – but, well, computers aren’t human, and they certainly aren’t supposed to get tired. The truth is much more insidious. What port exhaustion really means is that we don’t have any more ports available for communication.

    Now, some administrators out there are going to suspect a memory leak of some kind when this problem happens, and it’s true that memory leaks can cause the same type of issues (I’ll explain why in a moment). But usually we find that most of the time, memory isn’t the issue, and you can end up trying to troubleshoot memory problems that aren’t there.

    In order to understand port exhaustion, you need to first understand that everything I listed above requires servers to be able to initiate outbound connections to other servers. It’s the word outbound that’s important. We usually think of network connectivity requirements in inbound terms – our clients need to connect to a server on a specific TCP or UDP port, like port 80 for web browsing or port 445 for file shares (SMB). But we very rarely think about the other side of that, which is that the communication has to have a source port available to use.

    As you might know, there are 65,535 ports available for TCP and UDP connections in TCP/IP. The first 1024 of those are reserved for specific services and protocols to use as senders or listeners. For example, DHCP requests always come from port 68 on a client, and the DHCP service (the server component) always listens on port 67. That means these services listen on well-known ports for inbound communications. Beyond that, ports get dynamically assigned to services and applications for either inbound or outbound use as needed. A port can normally only do one thing – we can either use it to listen for connections from other machines on the network, or we can use it to initiate connections to other machines on the network, but we usually can’t do both (some services cheat and use ports bi-directionally, but this is relatively rare).

    So 65535–1024 is still 64511 ports. That’s a lot! We should almost never run out, right? You’d think so, but there’s another limitation here that you might not be aware of, and that limitation is that we don’t actually use the full range of ports for any dynamic communications. Dynamic communication is any sort of network communication that doesn’t already have a port specifically reserved for sending or receiving it – in other words, the vast majority of network traffic that a Windows computer generates.

    By default in the Windows operating system, we only have a limited number of ports available for outbound communications. We sometimes call these user ports, because user-mode processes are what we really expect to be using these things most often. For example, when you connect to a file server to access a file, you’re connecting to (usually) either port 445 or port 139 on the other side to retrieve that file. However, in order to negotiate the session, you need a client port on your computer to use for this, and so the application making the connection (Windows Explorer, in the case of browsing files) gets a dynamically-assigned port to use.

    Since we only have a limited number of ports available by default, you can run out of them – and when you run out, you’re no longer able to make new outbound connections from your computer to other computers on the network. This can cause an awful lot of communication to break down – including the communication that’s needed to authenticate users with Kerberos.

    In Windows XP/2003 (and earlier) the dynamic port range that we use for this was 1025-5000 by default. So, you had a little less than 4000 ports available for outbound network communication. Ports above that range were generally reserved for application listeners. In Windows Vista and 2008, we changed that range to be more in line with IANA recommendations. If you’re curious, you can read the KB article here. The upshot of the changes is that we actually have a larger default dynamic range in Vista and 2008, but we also messed up everyone who’s ever configured internal firewalls to block high ports (which, by the way, is something we don’t recommend doing on an internal network). Either way, the end result is that you’ve got a few more ports available to use by default in Vista and 2008.
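    The arithmetic behind those defaults is easy to check. A sketch using the ranges mentioned above:

```python
# Default dynamic (outbound) port ranges, per the KB mentioned above.
XP_2003_RANGE = range(1025, 5001)       # 1025-5000 inclusive
VISTA_2008_RANGE = range(49152, 65536)  # 49152-65535 inclusive (IANA-aligned)

def dynamic_port_count(port_range):
    """Number of ports available for outbound connections in a range."""
    return len(port_range)
```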

    Even so, it’s still possible to run out of ports. And when this happens, communication starts to break down. We run into this scenario a lot more often than you might think, and it causes the types of issues I detailed above. 99% of the time when someone has this problem, it happens because an application has been grabbing those ports and not releasing them properly. So, over time, it uses up more and more ports from the dynamic range until we run out.

    In most networks there are potentially dozens, if not hundreds, of different applications that might be communicating with other servers over the network – security tools, management and monitoring tools, line of business applications, internal server processes, and so on. So when you have a problem like this, narrowing down which application is causing the problem can be a challenge. Fortunately, there are a couple of tools that make this easier, and the best part is, they come with the operating system.

    The first tool is NETSTAT. Netstat queries the network stack and shows you the state of your network connection, including the ports you’re using. Netstat can tell you which ports are in use, where the communication is going, and what application has the port open.

    Another cool tool is Port Reporter. Port Reporter does everything that Netstat does, but it runs in real-time rather than just a point-in-time snapshot like Netstat does. Netstat is included in Windows, but you can download Port Reporter for free from our website. (All my examples in this blog will use Netstat).

    So, if you suspect that you might have a port exhaustion problem, then you’d want to run this command:

    netstat -anob > netstat.txt

    This runs Netstat and dumps the output to a text file. You’d want to use a text file since trying to look at the output inside a command prompt is a quick way to give yourself a migraine. Once you’ve done this, you can examine the text file output, and you’ll be able to see what processes are using up ports. What you want to look for is entries where the same process is using a lot of ports on the machine. That is the most likely culprit.

    Here’s an example of what you get with netstat (I’ve snipped it for length)

    C:\Windows\System32\netstat -ano


    Notice that you can see the port you’re using locally, the one you’re talking to remotely, and what the state of the connection is. You can also get the process ID (that’s the -o switch in the netstat command), and you can even have netstat try to grab the name of the process (use netstat -anob).

    C:\Windows\System32\netstat -anob


    What you’re looking for in the output is a single process that is using up a large number of ports locally. So for example, on my machine above we can see that PID 608 is using several ports. Usually what will happen when you run into port exhaustion is that you will see that one (or two) processes are using 90-95% of the dynamic range. The other piece of information to look at is where they’re talking to remotely, and what the state of the connection is. So, if you see a process that’s using up a lot of ports, talking to a single remote address or several remote addresses, and the state of the connection is something like TIME_WAIT, that’s usually a dead giveaway that this process is having a problem and not releasing those ports properly.
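    Tallying ports per process by hand gets old quickly, so a script helps. A rough sketch that counts local ports per PID in saved netstat -ano output (the parsing is mine and assumes the standard column layout):

```python
import re
from collections import Counter

# Proto, local addr:port, remote addr, optional state (TCP only), PID
LINE = re.compile(r'\s*(TCP|UDP)\s+\S+:(\d+)\s+\S+(\s+\S+)?\s+(\d+)\s*$')

def ports_per_pid(netstat_text):
    """Count how many local ports each PID holds in `netstat -ano` output."""
    counts = Counter()
    for line in netstat_text.splitlines():
        m = LINE.match(line)
        if m:
            counts[int(m.group(4))] += 1
    return counts
```

A process holding 90-95% of the dynamic range will stand out immediately in `counts.most_common()`.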

    Once you have this information, you can usually get things working again by turning off the offending process – but that’s only a temporary fix. Odds are, whatever was causing the problem was a legitimate piece of software that you want to have running. Usually when you get to this stage we recommend contacting the vendor of that application, or taking a look at whatever other servers the application might be communicating with, in order to get a permanent fix.

    I mentioned above that memory leaks can cause this behavior too – why is that exactly? What happens is that in order to get a port to use for an outbound connection, processes need to acquire a handle to that port. That handle comes out of non-paged pool memory. So, if you have a memory leak, and you run out of non-paged pool, processes that need to talk to other machines on the network won’t be able to get the handle, and therefore won’t be able to get the port they need. So if you’re looking at that Netstat output and you’re just not seeing anything useful, you might still have a memory issue on the server.

    At this point you really should be contacting us, since finding and fixing it is going to require some debugging. Cases that get this far are rare however, and most of the time, the Netstat output is going to give you the smoking gun you need to find the offending piece of software.

    - David “Fallout 3 Rules” Beach

  • Preventing large time offset problems

    Greetings, Todd here and I wanted to take a few moments to talk to you about an issue that arises from time to time. I will start this time-related issue exploration with a worst case scenario.

    The Primary Domain Controller Emulator (also known as the PDCe) in your forest root has a hardware issue which requires the replacement of the motherboard or even the replacement of the machine due to theft, fire, water based fire suppression system damage, etc, etc… The motherboard or machine is replaced and the machine is started.

    Sometime shortly after the motherboard or system replacement, it is noted that AD replication is failing, everywhere. The error you are receiving is:

    Event Type: Error
    Event Source: NTDS Replication
    Event Category: Replication Event
    ID: 2042
    Date: 12/01/2008 Time: 1:13:15 AM
    Computer: DALTX-DC00

    Description: It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.

    The reason that replication is not allowed to continue is that the two machine's views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.

    Time of last successful replication:

    2004-10-27 08:59:52

    Invocation ID of source:

    154ef845-f894-054e-88fc-a205dcbff605

    Name of source:

    Tombstone lifetime (days): 60

    So here is the breakdown of our time related AD replication failure: the motherboard that was replaced never had its time set in the BIOS, so the time the OS referenced was from 2003. It would probably be helpful for you to know that the OS, at startup, will read the BIOS CMOS clock and set the time within the OS to this value. When the machine booted up it read the BIOS time and proceeded to set the current system time to the new setting even though it was 5 years in the past.

    This machine being the PDCe for the forest means that the other DC’s in the root and the other PDCe’s in the other domains in the forest will sync from it. So we have just managed to propagate a bogus time setting to all the other DC’s in the forest.

    For a detailed explanation of how the Windows Time hierarchy works please review the TechNet documentation

    The PDCe at the root of the forest then syncs from its local or Internet time source and sets the time properly, and this new time setting is then propagated throughout the environment.

    All the DC’s replicated while the date was set back to 2003, so their last-replication timestamps were recorded as 2003. After the current time was restored, each DC checks when it last replicated before replicating again. The last replication timestamp shows 2003, so as far as the machine is concerned it has not replicated for 5 years, which just slightly exceeds the 60-180 day tombstone lifetime.
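    The check the DC performs here boils down to a date comparison. A sketch (the names are mine):

```python
from datetime import datetime

def replication_quarantined(last_success, now, tombstone_days=60):
    """True when event 2042 would fire: the last successful replication
    with the partner is older than the tombstone lifetime."""
    return (now - last_success).days > tombstone_days
```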

    Recovering from this AD replication error state can be ugly and time consuming, though we have methods to ultimately resolve it. Besides the initial AD replication failure – due to the replication quarantine for replication partners that have not replicated for a period greater than the tombstone lifetime – the chances are very high that you will experience lingering objects. So the operative word here is prevention.

    What could we have done to prevent or limit the DC’s from experiencing a large time offset? Here are some ideas:

    1. Set the motherboard BIOS time to the current date and time before booting the operating system by powering the machine up and selecting the BIOS or system configuration settings.

    2. Transfer or even seize the PDCe FSMO role to a different machine before reintroducing the PDCe with the replaced motherboard to the environment.

    3. Implement KB 884776 - How to configure the Windows Time service against a large time offset. This will effectively prevent a machine from correcting its time offset beyond the hard upper and lower limits. If this is in place on the DC’s, servers, and clients, we would not see this scenario as a big problem.
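    The effect of the KB 884776 settings can be sketched as a simple gate on how large a correction the time service will accept (this is illustrative, not the actual service logic; the 48-hour limits shown are just an example):

```python
def correction_allowed(offset_seconds, max_pos=172800, max_neg=172800):
    """Gate a time correction the way MaxPosPhaseCorrection /
    MaxNegPhaseCorrection do; limits are in seconds (48h here)."""
    if offset_seconds >= 0:
        return offset_seconds <= max_pos
    return -offset_seconds <= max_neg
```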

    One final comment concerning the circumstances of this issue: anything that can change the time on a DC can cause it. The BIOS on the motherboard being reset, the BIOS battery going bad, a poorly patched DC getting a virus which flips the time, a router or hardware-based time solution being used as the central Network Time Protocol (NTP) time source, etc…

    Hopefully you will never experience this type of issue since with a little forethought and configuration you will be able to completely prevent an otherwise difficult situation.

    - Todd Maxey

  • SSL/TLS Record Fragmentation Support

    This is Jonathan Stephens from the Directory Services team, and I wanted to share with you a recent interoperability issue I encountered. An admin had set up an Apache web server with the OpenSSL mod for SSL/TLS support. Users were able to connect to the secure web site using Firefox, but when they tried to use Internet Explorer the connection failed with the following error: The page cannot be displayed. We were asked to investigate what was happening and fix it if possible.

    When connecting to an SSL-enabled web site with Internet Explorer, the client and server must negotiate an SSL session during a process called the SSL (or TLS) Handshake. The client and server exchange what are called records, each record containing information relevant to a step in the negotiation process. Describing the entire Handshake process is beyond the scope of this post, but you can find more information here.

    Note: SSL 3.0 is a proprietary protocol developed by Netscape Communications. TLS 1.0 is an Internet Standard (RFC 2246) based upon that proprietary protocol. Functionally, there is little difference between SSL 3.0 and TLS 1.0, and for the purposes of this discussion the two are identical.

    As part of the handshake process, the server sends its list of trusted root certificates to the client in the form of a non-encrypted record. This is done so that if the server requires that the client have a digital certificate for authentication, the client is able to select one that will chain up to a root certificate trusted by the server. While there is no defined limit on the number of root certificates that can be in this list, there is a limitation on the size of the records exchanged between the client and the server. This limit is defined in RFC 2246 as 16,384 bytes.

    So how does the Handshake protocol handle those scenarios where the list of trusted root certificates exceeds 16,384 bytes? RFC 2246 describes a process called record fragmentation, where any data that would exceed the 16KB record limit is split across multiple fragments. These fragments must be merged into one record by the client in order to retrieve the data.
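    The fragmentation rule is simple to illustrate. A sketch that splits a handshake payload into record-sized pieces, plus the merge a supporting client must perform (the sizes in the test are the ones seen in the network trace later in this post):

```python
MAX_RECORD = 16384  # maximum record fragment length per RFC 2246

def fragment(payload, limit=MAX_RECORD):
    """Split handshake data into record-sized fragments."""
    return [payload[i:i + limit] for i in range(0, len(payload), limit)]

def reassemble(fragments):
    """What a fragmentation-aware client must do: merge the pieces."""
    return b''.join(fragments)
```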

    Let’s set that aside for a moment and talk briefly about SSL/TLS in Windows. The SSL/TLS protocol is implemented as a security package in Windows; this package is called SChannel, and the associated library is schannel.dll. A Windows application that needs to support SSL/TLS as either a client or a server can use Windows-integrated authentication to leverage the capabilities of the SChannel security package. Two such applications are Internet Explorer (IE) and Internet Information Services (IIS), the Windows web server. Other non-Microsoft products may have their own implementations of SSL/TLS and so would not use SChannel.

    This is precisely what our admin discovered while he was investigating this issue. He found that while users were unable to connect to the web site with IE, they could connect successfully with a third party browser – Firefox.

    To understand exactly what was happening, we took a network trace between IE and the Apache server. In that trace, we could clearly see that the list of root certificates sent to the client by the server was split across two records. The first was 16,384 bytes and the second was 153 bytes.

    The problem here is that SChannel does not support record fragmentation. When it receives data split across multiple records, SChannel cannot merge them, so the handshake fails and the connection fails with it. On the server side (for example, IIS), SChannel truncates any data above 16,384 bytes so that it fits into one record. Other implementations of SSL/TLS, such as OpenSSL and the one used by Firefox, do support record fragmentation, which explains why this problem wasn’t seen when Firefox was used.

    In the vast majority of cases, this does not present a problem. Most of the record data exchanged during the handshake process is considerably smaller than the 16KB limit defined in the RFC. The potential exception is the trusted root certificate list record. If a server trusts more than approximately 100 root certificates, the root certificate list can exceed the 16KB limit. Please note the use of the word “approximately”. The actual number of root certificates varies from environment to environment and should be determined by testing; Microsoft cannot provide a precise number because the limitation is based on the total size of the data in the record rather than on the number of entries, which vary in length.
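    As a rough illustration of where that "approximately 100" figure comes from: the server advertises each trusted root as a length-prefixed, DER-encoded Distinguished Name, so the count that fits depends entirely on how long those names are. The average DN size below is an assumption for illustration, not a measured value:

```python
RECORD_LIMIT = 16384        # maximum TLS record payload (RFC 2246)
AVG_DN_BYTES = 160          # assumed average encoded DN size (illustrative)
PER_ENTRY_OVERHEAD = 2      # 2-byte length prefix per DN

# How many roots fit into a single record, under these assumptions
roots_before_overflow = RECORD_LIMIT // (AVG_DN_BYTES + PER_ENTRY_OVERHEAD)
print(roots_before_overflow)  # 101 with these assumed sizes
```

    Longer DNs push that number down and shorter ones push it up, which is why only testing in your own environment gives a reliable figure.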

    In the case of IIS, where SChannel is leveraged for the server side of the Handshake, SChannel will truncate the list of trusted root certificates as I mentioned above. This behavior is described in the following KB article:

    933430 Clients cannot make connections if you require client certificates on a Web site or if you use IAS in Windows Server 2003

    The above article describes a 12,288 byte limit for the root certificate list. The hotfix described in that article simply increased that limit to the full 16,384 byte limit defined by the RFC. In those cases, however, where the root certificate list exceeds 16KB, the list will still be truncated by SChannel before the record is sent from the server to the client.

    When using IIS, the above article describes some specific steps an administrator can take to work around this limitation in SChannel. In cases such as this one, where the web server supports fragmentation but the client does not, the only option is to reduce the number of trusted root certificates to get the size under the 16KB limit for a single record.

    In some environments, the lack of support for record fragmentation in SChannel can lead to interoperability problems – failed connections, invalid client certificates, etc. Identifying problems associated with fragmentation is pretty simple; analyzing a brief network trace is usually sufficient to pinpoint instances of fragmentation. As I stated earlier, we usually see this problem in relation to the number of root certificates that are trusted by the server, and currently, the only way we have to resolve this issue is to remove unneeded roots from the server side. We hope to eliminate this problem completely in a future version of Windows.
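    Spotting fragmentation in a capture is mechanical: walk the 5-byte record headers and look for a handshake record at exactly 16,384 bytes followed by another. A minimal sketch over raw, unencrypted record bytes (the record layout is from RFC 2246; the helper names are my own):

```python
import struct

HANDSHAKE = 22  # TLS record content type for handshake messages

def handshake_record_lengths(stream: bytes) -> list:
    """Walk raw TLS records (5-byte header: type, version, length) and
    collect the payload lengths of the handshake records."""
    lengths = []
    off = 0
    while off + 5 <= len(stream):
        ctype, _ver, length = struct.unpack("!BHH", stream[off:off + 5])
        if ctype == HANDSHAKE:
            lengths.append(length)
        off += 5 + length
    return lengths

def looks_fragmented(stream: bytes) -> bool:
    """A handshake record at the 16,384-byte maximum followed by another
    record is the signature seen in the trace described above."""
    lengths = handshake_record_lengths(stream)
    return any(n == 16384 for n in lengths[:-1])

def rec(length: int) -> bytes:
    """Build a dummy handshake record of the given payload size."""
    return struct.pack("!BHH", HANDSHAKE, 0x0301, length) + b"\x00" * length

# The two records from the trace above: 16,384 bytes then 153 bytes
assert looks_fragmented(rec(16384) + rec(153))
assert not looks_fragmented(rec(512))
```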

    UPDATE 8/25/2010: Someone pointed out that I should update this blog post to make clear that the "future version of Windows" referenced above is Windows 7. Sort of. In order to support interoperability with other implementations of SSL/TLS, Windows 7 and Windows Server 2008 R2 both support coalescing fragmented SSL/TLS records on the receiving side, but Windows does not support fragmenting records on the sending side. Any outbound record that exceeds 16KB will still be truncated as described above.

    - Jonathan Stephens

  • File Server Migration Toolkit (FSMT) 1.1 Released

    Ned here. The Remote File System developer team wanted us to let you know about the release of FSMT 1.1. Here's their 'press release'. :-)

    Microsoft is glad to announce the release of Microsoft File Server Migration Toolkit 1.1. With this version you will be able to migrate and consolidate shared folders from servers running Windows NT Server 4.0, the Windows 2000 Server family, the Windows Server 2003 family, Windows Server 2008, or Windows Storage Server 2008 to a server running Windows Server 2003, Windows Storage Server 2003, Windows Server 2008 or Windows Storage Server 2008.

    This new version adds support for Windows Server 2008 in addition to Windows Server 2003, on both x86 and x64 systems, and it's available in 5 languages (English, French, German, Japanese and Spanish).

    FSMT 1.1 can be downloaded from the Microsoft Download Center site.

    For more information about FSMT, please visit the Microsoft File Server Migration Toolkit Web Site.

    Read the updated FSMT Whitepaper, which includes FSMT 1.1 information.

    Make sure you stop by their blog if you have questions or comments.

    - Ned Pyle

  • Troubleshooting KCC Event Log Errors

    My name is David Everett and I’m a Support Escalation Engineer on the Directory Services Support team.

    I’m going to discuss a recent trend I’ve seen where Active Directory replication appears to be fine, but a single DC in one or more sites begins logging Knowledge Consistency Checker (KCC) Warning and Error events in the Directory Service event log. Sample events are included below.

    For those not familiar with the KCC, it is a distributed application that runs on every domain controller and is responsible for creating the connections between domain controllers that collectively form the replication topology. The KCC uses Active Directory data to determine where (from which source domain controller to which destination domain controller) to create these connections.

    In some cases these errors are logged constantly; in others they are logged at regular intervals, clear on their own, and then reappear like clockwork. Typically the other DCs in the same site(s), perhaps even in the whole forest, report no KCC errors at all. In some cases the DC logging these errors has a small number of connection objects compared with its peer DCs in the same site:

    Event Type: Warning
    Event Source: NTDS KCC
    Event Category: (1)
    Event ID: 1566
    Date: 5/14/2008
    Time: 1:51:23 PM
    Computer: DC1X
    All domain controllers in the following site that can replicate the
    directory partition over this transport are currently unavailable.

    Directory partition:
    CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=contoso,DC=com


    Event Type: Error
    Event Source: NTDS KCC
    Event Category: (1)
    Event ID: 1311
    Date: 5/14/2008
    Time: 1:51:23 PM
    Computer: DC1X
    The Knowledge Consistency Checker (KCC) has detected problems with the
    following directory partition.

    Directory partition:

    There is insufficient site connectivity information in Active Directory
    Sites and Services for the KCC to create a spanning tree replication topology.
    Or, one or more domain controllers with this directory partition are unable
    to replicate the directory partition information. This is probably due to
    inaccessible domain controllers.

    User Action
    Use Active Directory Sites and Services to perform one of the following
    - Publish sufficient site connectivity information so that the KCC can
    determine a route by which this directory partition can reach this site. This is
    the preferred option.
    - Add a Connection object to a domain controller that contains the directory
    partition in this site from a domain controller that contains the same
    directory partition in another site.

    If neither of the Active Directory Sites and Services tasks correct this
    condition, see previous events logged by the KCC that identify the
    inaccessible domain controllers.

    In some cases this event is also seen; it suggests name resolution is working but a network port is blocked:

    Event Type: Warning
    Event Source: NTDS KCC
    Event Category: (1)
    Event ID: 1865
    Date: 5/14/2008
    Time: 1:51:23 PM
    Computer: DC1X
    The Knowledge Consistency Checker (KCC) was unable to form a complete
    spanning tree network topology. As a result, the following list of sites
    cannot be reached from the local site.


    If you encounter this issue, it could be that the DC logging the errors hosts the Intersite Topology Generator (ISTG) role for its site. This role is responsible for maintaining all of the inter-site connection objects for the site. The ISTG polls each DC in its site for connection objects that have failed, and if the peer DCs report failures, the ISTG logs these events to indicate that something is wrong with connectivity.

    For those wondering what these events mean here is a quick rundown:

    • The 1311 event indicates the KCC couldn't connect up all the sites.
    • The 1566 event indicates the DC could not replicate from any server in the site identified in the event description.
    • When logged, the 1865 event contains secondary information about the failure to connect the sites and tells which sites are disconnected from the site where the KCC errors are occurring.

    OK, I’ll get to the point and explain how to identify the root cause and correct this. These errors point to a topology or connectivity issue: either there are not enough site links to connect all the sites or, more likely, network connectivity is failing for some reason.

    If your network is not fully routed (fully routed meaning any DC in the forest can perform an RPC bind to every other DC in the forest), make certain Bridge All Site Links (BASL) is unchecked. If BASL is unchecked, Site Links and/or Site Link Bridges must be configured; they provide the KCC with the information it needs to build connections over existing network routes. If the network is fully routed and BASL is checked, no further topology configuration is needed.

    Even where the network routes exist, the ports that Active Directory replication requires must not be blocked.

    This post assumes these errors continue to be logged even though the site listed in the 1566 event has been added to a site link object and the AD topology is correctly configured.

    To locate the source of the KCC events and identify the root cause, you need to execute the following commands while the KCC events are being logged.

    1) Identify the ISTG covering each site by running this command:

    repadmin /istg

    The output will list all sites in the forest and the ISTG for each site:

    repadmin running command /istg against server localhost

    Gathering topology from site Default-First-Site-Name (

    Site                 ISTG
    =====                =====
    SiteX                DC1X
    SiteY                DC1Y

    NOTE: Determine from the output if the DC logging these events (DC1X) is the ISTG or not.

    2) If the DC logging the events is the ISTG, any of the DCs in the same site as the ISTG could have connectivity issues to the site identified in the 1566 event. You can identify which DC(s) are failing to replicate from that site by running the following command, which targets all DCs in the site where the ISTG resides. For example, DC1X is logging the events and is the ISTG for siteX; to identify which DCs in siteX are failing to replicate from siteY, run:

    repadmin /failcache site:siteX >siteX-failcache.txt

    The failcache output shows two DCs in siteX:

    repadmin running command /failcache against server

    ==== KCC CONNECTION FAILURES ============================
    (none)

    ==== KCC LINK FAILURES ==================================
    SiteY\DC1Y
        DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
        No Failures.

    repadmin running command /failcache against server

    ==== KCC CONNECTION FAILURES ============================
    (none)

    ==== KCC LINK FAILURES ==================================
    SiteY\DC1Y
        DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
        46 consecutive failures since 2008-08-12 22:14:39.
    SiteZ\DC1Z
        DC object GUID: fh3h8bde-a928-466a-97b0-39a507acbe54
        No Failures.

    The output above identifies the Destination DC (DC2X) in siteX that is failing to inbound replicate from siteY. In some cases the DC name is not resolved and shows only as a GUID. If the DC name is not resolved, determine the hostname of the Destination DC by pinging its fully qualified CNAME.


    NOTE: DC2X may or may not be logging Error events in its Directory Services event log the way DC1X, the ISTG, is.
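    If you save failcache reports from several sites, pulling out just the failing partners can be scripted. A small sketch that assumes the "Site\DC ... N consecutive failures" layout shown above (the function name is my own):

```python
import re

def failing_partners(failcache_text):
    """Yield (source DC, failure count) pairs from repadmin /failcache
    output, relying on the Site\\DC lines and the
    'N consecutive failures' lines in the layout above."""
    current = None
    for line in failcache_text.splitlines():
        dc = re.search(r"(\S+\\\S+)", line)      # e.g. SiteY\DC1Y
        if dc:
            current = dc.group(1)
        hit = re.search(r"(\d+) consecutive failures", line)
        if hit and current:
            yield current, int(hit.group(1))

sample = (
    "==== KCC LINK FAILURES ====\n"
    "SiteY\\DC1Y\n"
    "    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473\n"
    "    46 consecutive failures since 2008-08-12 22:14:39.\n"
    "SiteZ\\DC1Z\n"
    "    No Failures.\n"
)
print(list(failing_partners(sample)))  # [('SiteY\\DC1Y', 46)]
```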

    3) Log on to the Destination DC identified in the previous step and determine whether RPC connectivity from the Destination DC to the Source DC (DC1Y) is working:

    repadmin /bind DC1Y

    • If “repadmin /bind DC1Y” from the Destination DC succeeds:

    Run “repadmin /showrepl <Destination DC>” and examine the output to determine if Active Directory Replication is blocked. The reason for replication failure should be identified in the output. Take the appropriate corrective action to get replication working.

    • If “repadmin /bind DC1Y” from the Destination DC fails:

    Verify firewall rules are not interfering with connectivity between the Destination DC and the Source DC. If the port blockage between the Destination DC and the Source DC cannot be resolved, configure the other DCs in the site where the errors are logged as Preferred Bridgeheads and force the KCC to build new connection objects with the Preferred Bridgeheads only.

    NOTE: Running "repadmin /bind DC1Y" from the ISTG logging the KCC errors may reveal no connectivity issues to DC1Y in the remote site. As noted earlier, the ISTG is responsible for maintaining inter-site connectivity and may not be the DC having the problem. For this reason the command must be run from the Destination DC that repadmin /failcache identified as failing to inbound replicate.

    A successful bind looks similar to this:

    C:\>repadmin /bind DC1Y
    Bind to DC1Y succeeded.
    NTDSAPI V1 BindState, printing extended members.
        bindAddr: DC1Y
    Extensions supported (cb=48):
        BASE                             : Yes
        ASYNCREPL                        : Yes
        REMOVEAPI                        : Yes
        MOVEREQ_V2                       : Yes
        GETCHG_COMPRESS                  : Yes
        DCINFO_V1                        : Yes
        RESTORE_USN_OPTIMIZATION         : Yes
        KCC_EXECUTE                      : Yes
        ADDENTRY_V2                      : Yes
        LINKED_VALUE_REPLICATION         : Yes
        DCINFO_V2                        : Yes
        CRYPTO_BIND                      : Yes
        GET_REPL_INFO                    : Yes
        STRONG_ENCRYPTION                : Yes
        DCINFO_VFFFFFFFF                 : Yes
        TRANSITIVE_MEMBERSHIP            : Yes
        ADD_SID_HISTORY                  : Yes
        POST_BETA3                       : Yes
        GET_MEMBERSHIPS2                 : Yes
        NONDOMAIN_NCS                    : Yes
        GETCHGREQ_V8 (WHISTLER BETA 1)   : Yes
        XPRESS_COMPRESSION               : Yes
        DRS_EXT_ADAM                     : No
    Site GUID: stn45bf5-f33f-4d53-9b1b-e7a0371f9a3d
    Repl epoch: 0
    Forest GUID: idk4734-eeca-11d2-a5d8-00805f9f21f5
    Security information on the binding is as follows:
        SPN Requested:  LDAP/DC1Y
        Authn Service:  9
        Authn Level:  6
        Authz Service:  0

    4) If these events occur at specific periods of the day or week and then resolve on their own, verify that DNS scavenging is not set too aggressively. Scavenging can be so aggressive that SRV, A, CNAME and other valid records are purged from DNS, causing name resolution between DCs to fail. If this is the behavior you are seeing, verify the scavenging settings on these DNS zones:

    • Scavenging settings need to be checked on child domains if the Source or Destination DCs are in child domains.

    Example: if scavenging is set this way, the outage will recur every 24 hours:

    Non-refresh period: 8 hours
    Refresh period: 8 hours
    Scavenging period: 8 hours

    To correct this, change the Refresh and Non-refresh periods to 1 day each and set the scavenging period to 3 days. See "Managing the aging and scavenging of server data" on TechNet to configure these settings for the DNS server and/or zones.
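    The 24-hour figure follows directly from the three intervals: a record becomes eligible for scavenging only after the non-refresh plus refresh windows have both elapsed, and the scavenging sweep then runs once per scavenging period, so with 8/8/8 the worst case adds up to a day. A trivial check of that arithmetic (the function name is my own):

```python
def worst_case_purge_hours(non_refresh, refresh, scavenging_period):
    """Hours from a record's last refresh until it could be deleted:
    the record becomes stale after non_refresh + refresh hours, and
    the scavenging sweep then runs once per scavenging_period hours."""
    return non_refresh + refresh + scavenging_period

assert worst_case_purge_hours(8, 8, 8) == 24        # the problem case above
assert worst_case_purge_hours(24, 24, 72) == 120    # recommended 1d/1d/3d
```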

    Hopefully this clears up the mysterious KCC errors on that one DC.

    - David Everett