Microsoft's official enterprise support blog for AD DS and more
Ned here again. Today I’m going to talk about a couple of scenarios we run into with the ConflictAndDeleted folder in DFSR. These are real quick and dirty, but they may save you a call to us someday.
Scenario 1: We need to empty out the ConflictAndDeleted folder in a controlled manner as part of regular administration (for example, we just lowered quota and we want to reclaim that space).
Scenario 2: The ConflictAndDeleted folder quota is not being honored due to an error condition and the folder is filling the drive.
Let’s walk through these now.
Emptying the folder normally
It’s possible to clean up the ConflictAndDeleted folder through the DFSMGMT.MSC and SERVICES.MSC snap-ins, but it’s disruptive and kind of gross (you could lower the quota, wait for AD replication, wait for DFSR polling, and then restart the DFSR service). A much faster and slicker way is to call the WMI method CleanupConflictDirectory from the command-line or a script:
1. Open a CMD prompt as an administrator on the DFSR server.
2. Get the GUID of the Replicated Folder you want to clean:
WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderconfig get replicatedfolderguid,replicatedfoldername
(This is all one line, wrapped)
Example output:
3. Then call the CleanupConflictDirectory method:
WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='<RF GUID>'" call cleanupconflictdirectory
Example output with a sample GUID:
WMIC.EXE /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='70bebd41-d5ae-4524-b7df-4eadb89e511e'" call cleanupconflictdirectory
4. At this point the ConflictAndDeleted folder will be empty and the ConflictAndDeletedManifest.xml will be deleted.
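If you need to repeat step 3 for several replicated folders, the command line can be assembled by a script. The following is a minimal Python sketch, not part of the original procedure: the function names are made up for illustration, and all it does is validate the GUID and reproduce the WMIC command shown above.

```python
import subprocess
import uuid

# WMI namespace used by the WMIC commands in this post.
NAMESPACE = r"\\root\microsoftdfs"


def build_cleanup_command(rf_guid: str) -> str:
    """Return the WMIC command line that calls CleanupConflictDirectory.

    The GUID is parsed with uuid.UUID first, so a malformed value raises
    ValueError instead of ending up inside the WHERE clause.
    """
    guid = str(uuid.UUID(rf_guid))  # normalizes to lowercase, hyphenated form
    return (
        f"WMIC.EXE /namespace:{NAMESPACE} path dfsrreplicatedfolderinfo "
        f"where \"replicatedfolderguid='{guid}'\" call cleanupconflictdirectory"
    )


def cleanup_conflict_directory(rf_guid: str) -> int:
    """Run the command (hypothetical helper; needs an elevated prompt on the DFSR server)."""
    return subprocess.run(build_cleanup_command(rf_guid), check=False).returncode
```

Calling build_cleanup_command with the sample GUID from step 3 returns the same command line shown there, so you can print and review it before running anything.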
Emptying the ConflictAndDeleted folder when in an error state
We’ve also seen a few cases where the ConflictAndDeleted quota was not being honored at all. In every single one of those cases, the customer had recently had hardware problems (specifically with their disk system) where files had become corrupt and the disk was unstable – even after repairing the disk (at least to the best of their knowledge), the ConflictAndDeleted folder quota was not being honored by DFSR.
Here’s where quota is set:
Usually when we see this problem, the ConflictAndDeletedManifest.XML file has grown to hundreds of MB in size. When you try to open the file in an XML parser or in Internet Explorer, you will receive an error like “The XML page cannot be displayed” or that there is an error at line X. This is because the file is invalid at some section (with a damaged element, scrambled data, etc).
To fix this issue:
For a bit more info on conflict and deletion handling in DFSR, take a look at:
Staging folders and Conflict and Deleted folders (TechNet)
DfsrConflictInfo Class (MSDN)
Until next time...
- Ned "Unhealthy love for DFSR" Pyle
My name is David Everett and I’m a Support Escalation Engineer on the Directory Services Support team.
I’m going to discuss a recent trend I’ve seen where Active Directory replication appears to be fine, but a single DC in one (or more) sites begins logging Knowledge Consistency Checker (KCC) Warning and Error events in the Directory Service event log. I included sample events below.
For those not familiar with the KCC, it is a distributed application that runs on every domain controller. The KCC is responsible for creating the connections between domain controllers and collectively forms the replication topology. The KCC uses Active Directory data to determine where (from what source domain controller to what destination domain controller) to create these connections.
In some cases these errors are logged all the time; in others they are logged at regular intervals, clear on their own, and then reappear like clockwork. Typically other DCs in the same site(s), perhaps even in the whole forest, report no KCC errors at all. In some cases the DC logging these errors has a small number of connection objects compared with its peer DCs in the same site:
Event Type: Warning
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1566
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description: All domain controllers in the following site that can replicate the directory partition over this transport are currently unavailable.
Site: CN=SITEY,CN=Sites,CN=Configuration,DC=contoso,DC=com
Directory partition: CN=Configuration,DC=contoso,DC=com
Transport: CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=contoso,DC=com
-AND-
Event Type: Error
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1311
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description: The Knowledge Consistency Checker (KCC) has detected problems with the following directory partition.
Directory partition: CN=Configuration,DC=contoso,DC=com
There is insufficient site connectivity information in Active Directory Sites and Services for the KCC to create a spanning tree replication topology. Or, one or more domain controllers with this directory partition are unable to replicate the directory partition information. This is probably due to inaccessible domain controllers.
User Action
Use Active Directory Sites and Services to perform one of the following actions:
- Publish sufficient site connectivity information so that the KCC can determine a route by which this directory partition can reach this site. This is the preferred option.
- Add a Connection object to a domain controller that contains the directory partition in this site from a domain controller that contains the same directory partition in another site.
If neither of the Active Directory Sites and Services tasks correct this condition, see previous events logged by the KCC that identify the inaccessible domain controllers.
In some cases this event is also seen; it suggests name resolution is working but a network port is blocked:
Event Type: Warning
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1865
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description: The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.
Sites: CN=SITEY,CN=Sites,CN=Configuration,DC=contoso,DC=com
If you encounter this issue it could be the DC logging the errors is hosting the Intersite Topology Generator (ISTG) role for its site. This role is responsible for maintaining all of the Inter-site connection objects for the site. This role polls each DC in its site for connection objects that have failed and if failures are reported by the peer DCs the ISTG logs these events indicating something is not right with connectivity.
For those wondering what these events mean here is a quick rundown:
Ok, I’ll get to the point and explain how to identify the root cause and correct this. These errors are pointing to a topology or a connectivity issue. Either there are not enough site links to connect all the sites or more likely network connectivity is failing for a number of reasons.
If your network is not fully routed (the ability for any DC in the forest to perform an RPC bind to every other DC in the forest) make certain Bridge All Sites Links (BASL) is unchecked. If BASL is unchecked Site Links and/or Site Link Bridges must be configured. Site Links and Site Link Bridges provide the KCC with the information it needs to build connections over existing network routes. If the network is fully routed and you have BASL checked, fine.
While the network routes may exist the ports needed for Active Directory to replicate must not be restricted.
The assumption of this blog is these errors continue to be logged even though the site listed in the 1566 event has been added to a site link object and AD topology is correctly configured.
To locate the source of the KCC events and identify the root cause, you need to execute the following commands while the KCC events are being logged.
1) Identify the ISTG covering each site by running this command:
repadmin /istg
The output will list all sites in the forest and the ISTG for each site:
repadmin running command /istg against server localhost
Gathering topology from site Default-First-Site-Name (DC1.contoso.com):
Site                ISTG
==================  =================
SiteX               DC1X
SiteY               DC1Y
NOTE: Determine from the output if the DC logging these events (DC1X) is the ISTG or not.
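If you collect this output from a script, the site-to-ISTG mapping is easy to pull out. The following Python sketch is only an illustration: it assumes the two-column layout shown above (rows after a line of = characters) and site names without spaces, and the function name is hypothetical.

```python
def parse_istg_table(output: str) -> dict[str, str]:
    """Map each site name to its ISTG, given the text printed by `repadmin /istg`."""
    site_to_istg = {}
    in_table = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("==="):
            in_table = True  # data rows follow the ===== separator line
            continue
        if in_table and stripped:
            parts = stripped.split()
            if len(parts) >= 2:
                # Assumption: first column is the site, second is the ISTG.
                site_to_istg[parts[0]] = parts[1]
    return site_to_istg


# Sample text copied from the example output in this post.
SAMPLE_ISTG_OUTPUT = """\
repadmin running command /istg against server localhost
Gathering topology from site Default-First-Site-Name (DC1.contoso.com):
Site                ISTG
==================  =================
SiteX               DC1X
SiteY               DC1Y
"""
```

With the sample above, parse_istg_table(SAMPLE_ISTG_OUTPUT) maps SiteX to DC1X, which confirms that the DC logging the events is the ISTG for its site.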
2) If the DC logging the events is the ISTG any one of the DCs in the same site as this ISTG could have connectivity issues to the site identified in the 1566 event. You can identify which DC(s) are failing to replicate from the site identified in the 1566 event by running this command which targets all DCs in the site that the ISTG logging the errors resides in. For example, DC1X is logging the events and it is the ISTG for siteX. To identify which DCs in siteX are failing to replicate from siteY run this command:
repadmin /failcache site:siteX >siteX-failcache.txt
The failcache output shows two DCs in siteX:
repadmin running command /failcache against server DC1X._msdcs.contoso.com
==== KCC CONNECTION FAILURES ===========================
(none)
==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    No Failures.
repadmin running command /failcache against server DC2X._msdcs.contoso.com
==== KCC CONNECTION FAILURES ===========================
(none)
==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    46 consecutive failures since 2008-08-12 22:14:39.
SiteZ\DC1Z
    DC object GUID: fh3h8bde-a928-466a-97b0-39a507acbe54
    No Failures.
The output above identifies DC2X in siteX as the Destination DC that is failing to inbound replicate from siteY. In some cases the DC name is not resolved and shows as a GUID (s9hr423d-a477-4285-adc5-2644b5a170f0._msdcs.contoso.com). If the DC name is not resolved, determine the hostname of the Destination DC by pinging the fully qualified CNAME:
ping s9hr423d-a477-4285-adc5-2644b5a170f0._msdcs.contoso.com
NOTE: DC2X may or may not be logging Error events in its Directory Service event log the way DC1X, the ISTG, is.
3) Log on to the Destination DC identified in the previous step and determine if RPC connectivity from the Destination DC to the Source DC (DC1Y) is working.
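In a site with many DCs the failcache text file gets long, so it can help to filter it for the entries that actually report failures. Here is a rough Python sketch of that idea. It is not a repadmin feature: the function name and regular expressions are my own, and they assume the output layout shown in the example above (with each item on its own line) and site names without spaces.

```python
import re

# Patterns based on the sample output in this post (assumptions, not a spec).
SERVER_RE = re.compile(r"against server (\S+)")
SOURCE_RE = re.compile(r"^[^\\\s:]+\\\S+$")  # e.g. SiteY\DC1Y
FAILURE_RE = re.compile(r"^(\d+) consecutive failures since (.+?)\.?$")


def find_link_failures(failcache_text: str) -> list[tuple[str, str, int, str]]:
    """Return (destination DC, source Site\\DC, failure count, since) for each failing entry."""
    failures = []
    server = None  # DC the current block of output was collected from
    source = None  # most recent Site\DC line seen in that block
    for raw_line in failcache_text.splitlines():
        line = raw_line.strip()
        match = SERVER_RE.search(line)
        if match:
            server, source = match.group(1), None
            continue
        if SOURCE_RE.match(line):
            source = line
            continue
        match = FAILURE_RE.match(line)
        if match and server and source:
            failures.append((server, source, int(match.group(1)), match.group(2)))
    return failures


# Sample text copied from the example output in this post.
SAMPLE_FAILCACHE = r"""
repadmin running command /failcache against server DC1X._msdcs.contoso.com
==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    No Failures.
repadmin running command /failcache against server DC2X._msdcs.contoso.com
==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    46 consecutive failures since 2008-08-12 22:14:39.
SiteZ\DC1Z
    DC object GUID: fh3h8bde-a928-466a-97b0-39a507acbe54
    No Failures.
"""
```

Run against the sample, this reports a single entry: DC2X failing to replicate from SiteY\DC1Y, which is the Destination DC to investigate in the next step.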
repadmin /bind DC1Y.contoso.com
Run “repadmin /showrepl <Destination DC>” and examine the output to determine if Active Directory Replication is blocked. The reason for replication failure should be identified in the output. Take the appropriate corrective action to get replication working.
Verify firewall rules are not interfering with connectivity between the Destination DC and the Source DC. If the port blockage between the Destination DC and the Source DC cannot be resolved, configure the other DCs in the site where the errors are logged to be Preferred Bridgeheads and force KCC to build new connection objects with the Preferred Bridgeheads only.
NOTE: Running “repadmin /bind DC1Y” from the ISTG logging the KCC errors may reveal no connectivity issues to DC1Y in the remote site. As noted earlier, the ISTG is responsible for maintaining inter-site connectivity and may not be the DC having the problem. For this reason the command must be run from the Destination DC that repadmin /failcache identified as failing to inbound replicate.
A successful bind looks similar to this:
C:\>repadmin /bind DC1Y
Bind to DC1Y succeeded.
NTDSAPI V1 BindState, printing extended members.
    bindAddr: DC1Y
Extensions supported (cb=48):
    BASE : Yes
    ASYNCREPL : Yes
    REMOVEAPI : Yes
    MOVEREQ_V2 : Yes
    GETCHG_COMPRESS : Yes
    DCINFO_V1 : Yes
    RESTORE_USN_OPTIMIZATION : Yes
    KCC_EXECUTE : Yes
    ADDENTRY_V2 : Yes
    LINKED_VALUE_REPLICATION : Yes
    DCINFO_V2 : Yes
    INSTANCE_TYPE_NOT_REQ_ON_MOD : Yes
    CRYPTO_BIND : Yes
    GET_REPL_INFO : Yes
    STRONG_ENCRYPTION : Yes
    DCINFO_VFFFFFFFF : Yes
    TRANSITIVE_MEMBERSHIP : Yes
    ADD_SID_HISTORY : Yes
    POST_BETA3 : Yes
    GET_MEMBERSHIPS2 : Yes
    GETCHGREQ_V6 (WHISTLER PREVIEW) : Yes
    NONDOMAIN_NCS : Yes
    GETCHGREQ_V8 (WHISTLER BETA 1) : Yes
    GETCHGREPLY_V5 (WHISTLER BETA 2) : Yes
    GETCHGREPLY_V6 (WHISTLER BETA 2) : Yes
    ADDENTRYREPLY_V3 (WHISTLER BETA 3): Yes
    GETCHGREPLY_V7 (WHISTLER BETA 3) : Yes
    VERIFY_OBJECT (WHISTLER BETA 3) : Yes
    XPRESS_COMPRESSION : Yes
    DRS_EXT_ADAM : No
Site GUID: stn45bf5-f33f-4d53-9b1b-e7a0371f9a3d
Repl epoch: 0
Forest GUID: idk4734-eeca-11d2-a5d8-00805f9f21f5
Security information on the binding is as follows:
    SPN Requested: LDAP/DC1Y
    Authn Service: 9
    Authn Level: 6
    Authz Service: 0
4) If these events occur at specific periods of the day or week and then they resolve on their own, verify DNS Scavenging is not set too aggressively. It could be DNS Scavenging is so aggressive that SRV, A, CNAME and other valid records are purged from DNS causing name resolution between DCs to fail. If this is the behavior you are seeing, verify scavenging settings on these DNS zones:
Example: if Scavenging is set this way the outage will occur every 24 hours:
Non-refresh period: 8 hours
Refresh period: 8 hours
Scavenging period: 8 hours
To correct this, change the Refresh and Non-refresh periods to 1 day each and set scavenging to 3 days. See Managing the aging and scavenging of server data on TechNet to configure these settings for the DNS Server and/or zones.
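To see where the 24 hours in the example comes from, here is one way to read the arithmetic, written as a small Python sketch. It rests on a simplified model that I am assuming for illustration: a record becomes eligible for scavenging once the non-refresh and refresh periods have both elapsed since its last timestamp, and the next scavenging pass arrives at most one scavenging period later. The function name is hypothetical.

```python
from datetime import timedelta


def scavenging_window(no_refresh: timedelta, refresh: timedelta,
                      scavenging_period: timedelta) -> tuple[timedelta, timedelta]:
    """Return (earliest, latest) age at which an un-refreshed record can be removed.

    Simplified model: eligible after no_refresh + refresh, and removed by the
    next scavenging pass, which is at most one scavenging period away.
    """
    earliest = no_refresh + refresh
    latest = earliest + scavenging_period
    return earliest, latest


# The aggressive settings from the example: 8 + 8 + 8 hours.
aggressive = scavenging_window(timedelta(hours=8), timedelta(hours=8), timedelta(hours=8))

# The suggested settings: 1 day + 1 day, scavenging every 3 days.
suggested = scavenging_window(timedelta(days=1), timedelta(days=1), timedelta(days=3))
```

Under that model the aggressive settings can remove a record that has gone 16 to 24 hours without a refresh, while the suggested settings leave at least 2 days of headroom.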
Hopefully this clears up the mysterious KCC errors on that one DC.
- David Everett
Ned here again. In the course of using Windows, it is occasionally useful to be someone besides… you. Maybe you need to be an Administrator temporarily in order to fix a problem. Or maybe you need to be a different user as only they seem to have a problem. Or maybe, just maybe, you want to be the operating system itself.
Ehhh-whhhaaaaa?
Think about it. What if you are troubleshooting a problem where an agent process like the SMS Client isn’t working? Or an anti-virus service is having issues reading the registry? If only we had some way to look at things while logged in as SYSTEM.
What is SYSTEM and why is Vista/2008 special?
SYSTEM is actually an account; in fact, it’s a real honest-to-goodness user. Its real name is “NT Authority\Local System” and it has a well-known SID of S-1-5-18. All Windows computers have this account and they always have the same SID. It’s there for user-mode processes that will be executed as the OS itself.
This is a bit tricky in Windows Vista and Windows Server 2008 though. In previous operating systems you could simply start a scheduled task CMD prompt and have it interact with the desktop easily. This was construed as a security hole to some people, so in Vista/2008 it’s not possible anymore.
So how can we take off our glasses and put on the cape with the big red S?
Method one - PSEXEC
An easy way to get a CMD prompt as SYSTEM is to grab PSEXEC from Microsoft Sysinternals:
1. Download PSEXEC and unzip to some folder.
2. Open an elevated CMD prompt as an administrator.
3. Navigate to the folder where you unzipped PSEXEC.EXE.
4. Run:
PSEXEC -i -s -d CMD
5. You will have a new CMD prompt open, as though by magic.
6. Type the following in the new CMD prompt to prove who you are:
WHOAMI /USER
There you go – anything that happens in that CMD prompt or is spawned from that prompt will be running as SYSTEM. You could run regedit from here, start explorer, or whatever you need to troubleshoot as that account.
That was pretty easy – why do I have some more methods below? Unfortunately, in several previous versions of the PSEXEC tool the –s (system) switch has not worked. As of version 1.94 it does work again, but that is no guarantee for the future. This brings us to a more iron-clad technique:
Method two - REMOTE
We can use the REMOTE.EXE tool which comes as part of the Windows Debugger. While it’s a bit more cumbersome, it will always work:
1. Download the Windows Debugger (x86 or x64) and install it anywhere (we just need its copy of REMOTE.EXE, so feel free to copy that file elsewhere and uninstall the debugger when done; in the example below I installed to “c:\debuggers”).
2. Open an elevated CMD prompt as an administrator.
3. Run:
AT <one minute from now> c:\debuggers\remote.exe /s cmd SYSCMD
Where you use 24-hour clock notation (aka ‘military time’). For example, right now it is 3:57PM, so I type:
AT 15:58 c:\debuggers\REMOTE.EXE /s cmd SYSCMD
4. Then once 15:58 (3:58PM) is reached, you can run:
C:\debuggers\REMOTE.EXE /c <your computer> SYSCMD
Where you are typing your computer’s own NetBIOS name. So for example:
C:\debuggers\remote.exe /c nedpyle04 SYSCMD
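If working out "one minute from now" in 24-hour notation by hand gets old, a script can do it. This Python sketch only builds the AT command line from step 3 as a string; the function names are made up, and the default path and session marker are simply the ones used in this example.

```python
from datetime import datetime, timedelta


def at_start_time(now: datetime) -> str:
    """Return the time one minute after `now` in the HH:MM 24-hour form AT expects."""
    return (now + timedelta(minutes=1)).strftime("%H:%M")


def build_at_command(now: datetime,
                     remote_path: str = r"c:\debuggers\remote.exe",
                     session: str = "SYSCMD") -> str:
    """Build the AT command that starts the REMOTE server one minute from `now`."""
    return f"AT {at_start_time(now)} {remote_path} /s cmd {session}"
```

For example, build_attribute-free usage is just build_at_command(datetime.now()); at 3:57PM that yields the same "AT 15:58 ..." line shown in step 3, and the midnight rollover from 23:59 to 00:00 is handled by the datetime arithmetic.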
Neato. I used REMOTE to connect to REMOTE on the same computer. This is a good example of a client-server RPC application. The SYSCMD option I keep using is just a marker that identifies the remote session. Technically you could have lots of these going at once, each with a different marker.
If I then use WHOAMI /USER again, the proof:
To leave, just type EXIT.
Method two and a half – REMOTE and the Task Scheduler
Maybe you want to have REMOTE ready to go at a moment’s notice (you plan to do this a lot, eh)? Or what if you want to use one of the other SYSTEM-type accounts, like “Local Service” and “Network Service”? PSEXEC can’t do that and neither can the old AT command.
Here’s some XML and commands you can use to make the server portion of REMOTE be ready at an instant for various accounts. This time we’ll use the newer, slicker SCHTASKS tool:
1. Copy the following sample into notepad and save as <something>.xml (in my sample below, I save to “c:\temp\RaS.xml”)
<?xml version="1.0" encoding="UTF-16"?>
<Task version="1.2" xmlns="http://schemas.microsoft.com/windows/2004/02/mit/task">
  <RegistrationInfo>
    <Date>2008-03-26T16:40:47.4520087</Date>
    <Author>CONTOSO\Administrator</Author>
  </RegistrationInfo>
  <Triggers />
  <Principals>
    <Principal id="Author">
      <UserId>SYSTEM</UserId>
      <RunLevel>HighestAvailable</RunLevel>
    </Principal>
  </Principals>
  <Settings>
    <IdleSettings>
      <Duration>PT10M</Duration>
      <WaitTimeout>PT1H</WaitTimeout>
      <StopOnIdleEnd>true</StopOnIdleEnd>
      <RestartOnIdle>false</RestartOnIdle>
    </IdleSettings>
    <MultipleInstancesPolicy>IgnoreNew</MultipleInstancesPolicy>
    <DisallowStartIfOnBatteries>false</DisallowStartIfOnBatteries>
    <StopIfGoingOnBatteries>true</StopIfGoingOnBatteries>
    <AllowHardTerminate>false</AllowHardTerminate>
    <StartWhenAvailable>false</StartWhenAvailable>
    <RunOnlyIfNetworkAvailable>false</RunOnlyIfNetworkAvailable>
    <AllowStartOnDemand>true</AllowStartOnDemand>
    <Enabled>true</Enabled>
    <Hidden>false</Hidden>
    <RunOnlyIfIdle>false</RunOnlyIfIdle>
    <WakeToRun>false</WakeToRun>
    <ExecutionTimeLimit>PT0S</ExecutionTimeLimit>
    <Priority>7</Priority>
  </Settings>
  <Actions Context="Author">
    <Exec>
      <Command>"C:\debuggers\remote.exe"</Command>
      <Arguments>/s cmd SYSCMD</Arguments>
      <WorkingDirectory>C:\debuggers</WorkingDirectory>
    </Exec>
  </Actions>
</Task>
Note the UserId, Command, and WorkingDirectory elements above (highlighted in the original post). You will need to make sure that the Command and WorkingDirectory paths match where REMOTE.EXE is located. Also, the UserId can be set to anything you like, including “nt authority\local service” or “nt authority\network service”.
2. Open an elevated CMD prompt as an administrator.
3. Run:
SCHTASKS /create /tn <some task name> /xml <path to xml file>
Where you provide a real task name and XML file. For example:
SCHTASKS /create /tn RemoteAsSystem /xml c:\temp\RaS.xml
4. This created a scheduled task with all the REMOTE info filled out.
5. Now we can run the REMOTE server piece anytime we want, as often as we want with:
SCHTASKS /run /tn RemoteAsSystem
6. Now we can connect just like we did back in method two:
REMOTE /c %COMPUTERNAME% SYSCMD
That’s it. Hopefully you find this useful someday (or maybe I should hope you never have to find it useful). Got a comment, or another way to do this? Let us know.
- Ned “Nubbin” Pyle
Hi, David here. Today I wanted to talk about something that we see all the time here in Directory Services, but that doesn’t usually get a lot of press. It’s a condition we call port exhaustion, and it’s a problem that will cause TCP and UDP communications with other machines over the network to fail.
Port exhaustion can cause all kinds of problems for your servers. Here’s a list of some symptoms:
- Users won’t be able to connect to file shares on a remote server
- DNS name registration might fail
- Authentication might fail
- Trust operations might fail between domain controllers
- Replication might fail between domain controllers
- MMC consoles won’t work or won’t be able to connect to remote servers.
That’s just a sample of the most common symptoms that we see. But here’s the big one: You reboot the server(s) involved, and the problem goes away - temporarily. A few hours or a few days later, it comes back.
So what is port exhaustion? You might think that it’s where the ports on the computer get tired and just start responding slower over time – but, well, computers aren’t human, and they certainly aren’t supposed to get tired. The truth is much more insidious. What port exhaustion really means is that we don’t have any more ports available for communication.
Now, some administrators out there are going to suspect a memory leak of some kind when this problem happens, and it’s true that memory leaks can cause the same type of issues (I’ll explain why in a moment). But most of the time we find that memory isn’t the issue, and you can end up trying to troubleshoot memory problems that aren’t there.
In order to understand port exhaustion, you need to first understand that everything I listed above requires servers to be able to initiate outbound connections to other servers. It’s the word outbound that’s important. We usually think of network connectivity requirements in inbound terms – our clients need to connect to a server on a specific TCP or UDP port, like port 80 for web browsing or port 445 for file shares (SMB). But we very rarely think about the other side of that, which is that the communication has to have a source port available to use.
As you might know, there are 65,535 ports available for TCP and UDP connections in TCP/IP. The first 1024 of those are reserved for specific services and protocols to use as senders or listeners. For example, DHCP requests will always come from port 68 on a client, and the DHCP service (the server component) always listens on port 67. That means that they listen on these ports for inbound communications. Beyond that, ports get dynamically assigned to services and applications for either inbound or outbound use as needed. A port can normally only do one thing – we can either use it to listen for connections from other machines on the network, or we can use it to initiate connections to other machines on the network, but we usually can’t do both (some services cheat and use ports bi-directionally, but this is relatively rare).
So 65535–1024 is still 64511 ports. That’s a lot! We should almost never run out, right? You’d think so, but there’s another limitation here that you might not be aware of, and that limitation is that we don’t actually use the full range of ports for any dynamic communications. Dynamic communication is any sort of network communication that doesn’t already have a port specifically reserved for sending or receiving it – in other words, the vast majority of network traffic that a Windows computer generates.
By default in the Windows operating system, we only have a limited number of ports available for outbound communications. We sometimes call these user ports, because user-mode processes are what we really expect to be using these things most often. For example, when you connect to a file server to access a file, you’re connecting to (usually) either port 445 or port 139 on the other side to retrieve that file. However, in order to negotiate the session, you need a client port on your computer to use for this, and so the application making the connection (Windows Explorer, in the case of browsing files) gets a dynamically-assigned port to use.
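You can watch a dynamically-assigned client port show up with a few lines of code. The Python sketch below is just an illustration that stays on the loopback address: the listener stands in for a server port such as 445 (here it also asks the OS for any free port, which is an assumption made so the example runs anywhere), and the function name is made up.

```python
import socket


def show_ephemeral_port() -> tuple[int, int, int]:
    """Connect to a local listener and report the ports involved.

    Returns (server port, client source port, client port as seen by the server).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
        listener.listen(1)
        server_port = listener.getsockname()[1]

        # The client never asks for a source port; the OS hands one out
        # from the dynamic range when the connection is initiated.
        client = socket.create_connection(("127.0.0.1", server_port))
        try:
            conn, peer = listener.accept()
            conn.close()
            return server_port, client.getsockname()[1], peer[1]
        finally:
            client.close()
    finally:
        listener.close()
```

Every outbound connection a process opens consumes one of these source ports until the connection is fully closed, which is why a process that keeps opening connections can drain the range.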
Since we only have a limited number of ports available by default, you can run out of them – and when you run out, you’re no longer able to make new outbound connections from your computer to other computers on the network. This can cause an awful lot of communication to break down – including the communication that’s needed to authenticate users with Kerberos.
In Windows XP/2003 (and earlier) the dynamic port range that we use for this was 1024-5000 by default. So, you had a little less than 4000 ports available for outbound network communication. Ports above that range were generally reserved for application listeners. In Windows Vista and 2008, we changed that range to be more in line with IANA recommendations. If you’re curious, you can read the KB article here. The upshot of the changes is that we actually have a larger default dynamic range in Vista and 2008, but we also messed up everyone who’s ever configured internal firewalls to block high ports (which by the way is something we don’t recommend doing on an internal network). Either way, the end result is that you’ve got a few more ports available to use by default in Vista and 2008.
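To put numbers on the two defaults, here is the arithmetic as a tiny Python sketch. The 1024-5000 range comes from the paragraph above; the 49152-65535 range is the IANA-recommended dynamic range that Vista and 2008 adopted by default, which I am stating from the KB rather than from this post. Your own servers may be configured differently (netsh int ipv4 show dynamicport tcp shows the current setting on Vista/2008).

```python
def port_count(first: int, last: int) -> int:
    """Number of ports in an inclusive range."""
    return last - first + 1


XP_2003_DEFAULT = (1024, 5000)        # from the paragraph above
VISTA_2008_DEFAULT = (49152, 65535)   # IANA-style default range (assumption noted in the lead-in)

xp_ports = port_count(*XP_2003_DEFAULT)
vista_ports = port_count(*VISTA_2008_DEFAULT)
```

That works out to 3,977 dynamic ports on XP/2003 and 16,384 on Vista/2008 with default settings.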
Even so, it’s still possible to run out of ports. And when this happens, communication starts to break down. We run into this scenario a lot more often than you might think, and it causes the types of issues I detailed above. 99% of the time when someone has this problem, it happens because an application has been grabbing those ports and not releasing them properly. So, over time, it uses up more and more ports from the dynamic range until we run out.
In most networks there are potentially dozens, if not hundreds, of different applications that might be communicating with other servers over the network – security tools, management and monitoring tools, line of business applications, internal server processes, and so on. So when you have a problem like this, narrowing down which application is causing the problem can be a challenge. Fortunately, there are a couple of tools that make this easier, and the best part is, they come with the operating system.
The first tool is NETSTAT. Netstat queries the network stack and shows you the state of your network connection, including the ports you’re using. Netstat can tell you which ports are in use, where the communication is going, and what application has the port open.
Another cool tool is Port Reporter. Port Reporter does everything that Netstat does, but it runs in real-time rather than just a point-in-time snapshot like Netstat does. Netstat is included in Windows, but you can download Port Reporter for free from our website. (All my examples in this blog will use Netstat).
So, if you suspect that you might have a port exhaustion problem, then you’d want to run this command:
netstat -anob > netstat.txt
This runs Netstat and dumps the output to a text file. You’d want to use a text file since trying to look at the output inside a command prompt is a quick way to give yourself a migraine. Once you’ve done this, you can examine the text file output, and you’ll be able to see what processes are using up ports. What you want to look for is entries where the same process is using a lot of ports on the machine. That is the most likely culprit.
Here’s an example of what you get with netstat (I’ve snipped it for length):
C:\Windows\System32\netstat -ano
Notice that you can see the port you’re using locally, the one you’re talking to remotely, and what the state of the connection is. You can also get the process ID (that’s the o switch in the netstat command), and you can even have netstat try to grab the name of the process (use netstat -anob).
C:\Windows\System32\netstat -anob
What you’re looking for in the output is a single process that is using up a large number of ports locally. So for example, on my machine above we can see that PID 608 is using several ports. Usually what will happen when you run into port exhaustion is that you will see that one (or two) processes are using 90-95% of the dynamic range. The other piece of information to look at is where they’re talking to remotely, and what the state of the connection is. So, if you see a process that’s using up a lot of ports, talking to a single remote address or several remote addresses, and the state of the connection is something like TIME_WAIT, that’s usually a dead giveaway that this process is having a problem and not releasing those ports properly.
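Counting by eye works for a short listing, but on a busy server a few lines of script make the heavy hitter obvious. The following Python sketch is only one way to do it: it assumes the five-column TCP rows that netstat -ano prints (protocol, local address, foreign address, state, PID), skips everything else, and uses made-up sample rows rather than real output.

```python
from collections import Counter


def count_tcp_ports_by_pid(netstat_text: str) -> tuple[Counter, Counter]:
    """Count non-listening TCP connections per PID and per (PID, state)."""
    per_pid = Counter()
    per_pid_state = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Keep only rows shaped like: TCP <local> <foreign> <state> <pid>
        if len(fields) != 5 or fields[0] != "TCP":
            continue
        _proto, _local, _foreign, state, pid = fields
        if state == "LISTENING":
            continue  # listeners are not outbound connections
        per_pid[pid] += 1
        per_pid_state[(pid, state)] += 1
    return per_pid, per_pid_state


# Hypothetical rows for illustration only (not captured from a real server).
SAMPLE_NETSTAT = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:445            0.0.0.0:0              LISTENING       4
  TCP    10.0.0.5:49201         10.0.0.9:389           TIME_WAIT       608
  TCP    10.0.0.5:49202         10.0.0.9:389           TIME_WAIT       608
  TCP    10.0.0.5:49203         10.0.0.9:389           ESTABLISHED     608
  TCP    10.0.0.5:49300         10.0.0.7:445           ESTABLISHED     1234
"""
```

To use it on a real capture, read the text file you saved earlier and look at per_pid.most_common(5); a single PID holding most of the dynamic range is the process to investigate.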
Once you have this information, you can usually get things working again by turning off the offending process – but that’s only a temporary fix. Odds are, whatever was causing the problem was a legitimate piece of software that you want to have running. Usually when you get to this stage we recommend contacting the vendor of that application, or taking a look at whatever other servers the application might be communicating with, in order to get a permanent fix.
I mentioned above that memory leaks can cause this behavior too – why is that exactly? What happens is that in order to get a port to use for an outbound connection, processes need to acquire a handle to that port. That handle comes out of non-paged pool memory. So, if you have a memory leak, and you run out of non-paged pool, processes that need to talk to other machines on the network won’t be able to get the handle, and therefore won’t be able to get the port they need. So if you’re looking at that Netstat output and you’re just not seeing anything useful, you might still have a memory issue on the server.
At this point you really should be contacting us, since finding and fixing it is going to require some debugging. Cases that get this far are rare however, and most of the time, the Netstat output is going to give you the smoking gun you need to find the offending piece of software.
- David “Fallout 3 Rules” Beach
Hi, Gary from Directory Services here and I’m going to talk today about the concept of “lag sites” or “hot sites” as a recovery strategy. I recently had a case where the customer asked if the replication interval for a site link could be set higher than 10,080 minutes (7 days). The quick answer was that Active Directory only supports values from 15 up to 10,080 minutes and the schedule is based on a week. If the replinterval attribute on the site link is manually set to something lower than 15 it will use the default of 15. If it is set to something higher than 10,080, it will be ignored and 10,080 will be used.
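To make the limits concrete, here is a tiny Python sketch of the behavior described above. It is only a model of the rule (values below 15 become 15, values above 10,080 become 10,080), not how Active Directory implements it, and the function name is made up for illustration.

```python
MIN_REPL_INTERVAL = 15       # minutes, the lowest supported value
MAX_REPL_INTERVAL = 10080    # minutes, i.e. 7 days


def effective_repl_interval(configured_minutes):
    """Return the interval that would be used for a configured value."""
    # Clamp the configured value into the supported range.
    return max(MIN_REPL_INTERVAL, min(configured_minutes, MAX_REPL_INTERVAL))


for value in (5, 180, 20000):
    print(value, "->", effective_repl_interval(value))
```

So a site link set to 20,000 minutes behaves exactly like one set to 10,080; there is no way to stretch the lag beyond a week this way.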
But the underlying question kept coming back to the recommendation of a latent “lag site”.
First let me give a quick definition of a lag site or hot site and its general intended purpose. A lag site is just an Active Directory site that is configured with a replication schedule of one, two or maybe three days out of the week. That way it will have data that would be intentionally out-of-date as of the last successful inbound replication. It is sometimes used as a quick way to recover accidentally deleted objects without having to resort to finding the most recent successful backup within the tombstone lifetime of the domain that has the data.
This sounds like a decent idea, in theory. However, Microsoft Support does not recommend a lag site as a disaster recovery strategy. Servicing products such as hotfixes and service packs do not recognize the quasi-offline DC state, and monitoring software may also detect the state of a lag site DC as malfunctioning and attempt to re-enable it (or tell an unwitting administrator to do so). Microsoft makes no guarantees that the servicing and monitoring products would not re-enable Netlogon and KDC services in a lag site. In addition, other Microsoft products, such as Exchange Server, are not designed to operate in a lag site and they may not function properly with lag site DCs.
The following lists some reasons why lag sites should not be relied upon as a disaster recovery strategy, especially in lieu of proper Active Directory System State backups:
Lag sites are not guaranteed to be intact in a disaster:
Replicating from a lag site might have unrecoverable consequences:
Lag sites pose security threats to the corporate environment:
Careful consideration must be put in configuring and deploying lag sites:
The above list is not exhaustive, and there could be other unseen problems with deploying lag sites as a disaster recovery strategy. It has always been strongly recommended that the best way to prepare for disasters such as mass deletions, mass password changes, etc. is to back up domain controllers daily and verify these backups regularly through test restorations.
Finally, keep in mind that testing your disaster recovery routine is vital, both before you begin to rely on it and on an ongoing basis once it becomes your recovery strategy. Surprise is never good when a disaster strikes.
Here are some links to Microsoft recommended recovery steps and practices:
840001 How to restore deleted user accounts and their group memberships in Active Directory - http://support.microsoft.com/kb/840001
Useful shelf life of a system-state backup of Active Directory - http://support.microsoft.com/kb/216993
Managing Active Directory Backup and Restore - http://technet2.microsoft.com/windowsserver/en/library/5d683eeb-e76c-46e9-92f4-fcb2a10f955f1033.mspx
Step-by-Step Guide for Windows Server 2008 AD DS Backup and Recovery - http://technet.microsoft.com/en-us/library/cc771290.aspx
Active Directory Backup and Restore in Windows Server 2008 - http://technet.microsoft.com/en-us/magazine/cc462796.aspx
- Gary Mudgett
Ned here. Our developer team colleagues at the File Cabinet have posted an interesting article on the DFSDIAG tool. Introduced with Windows Server 2008, this utility is excellent for testing, documenting, and troubleshooting your DFS Namespaces environment. Make sure you give the article a read.
What Does DFSDIAG Do? (FileCabinet Blog)
PS: not to be confused with the DFSRDIAG tool, which is used with DFSR. Don't worry, I do it all the time myself. :-)
- Ned Pyle
Greetings, Todd here and I wanted to take a few moments to talk to you about an issue that arises from time to time. I will start this time-related issue exploration with a worst case scenario.
The Primary Domain Controller Emulator (also known as the PDCe) in your forest root has a hardware issue which requires the replacement of the motherboard or even the replacement of the machine due to theft, fire, water based fire suppression system damage, etc, etc… The motherboard or machine is replaced and the machine is started.
Sometime shortly after the motherboard or system replacement it is noted that AD replication is failing, everywhere. The error you are receiving is:
Event Type: Error
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2042
Date: 12/01/2008
Time: 1:13:153 AM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DALTX-DC00
Description: It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.
The reason that replication is not allowed to continue is that the two machine's views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.
Time of last successful replication:
2004-10-27 08:59:52
Invocation ID of source:
154ef845-f894-054e-88fc-a205dcbff605
Name of source:
12345678-9abc-def1-1234-56789abcdef1._msdcs.domain.com
Tombstone lifetime (days): 60
So here is the breakdown of our time related AD replication failure: the motherboard that was replaced never had its time set in the BIOS, so the time the OS referenced was from 2003. It would probably be helpful for you to know that the OS, at startup, will read the BIOS CMOS clock and set the time within the OS to this value. When the machine booted up it read the BIOS time and proceeded to set the current system time to the new setting even though it was 5 years in the past.
This machine being the PDCe for the forest means that the other DC’s in the root and the other PDCe’s in the other domains in the forest will sync from it. So we have just managed to propagate a bogus time setting to all the other DC’s in the forest.
For a detailed explanation of how the Windows Time hierarchy works please review the TechNet documentation http://technet.microsoft.com/en-us/library/cc784800.aspx
The PDCe at the root of the forest then syncs from its local or Internet time source and sets the time properly, and this new time setting is then propagated throughout the environment.
All the DCs that replicated while the date was set back to 2003 recorded that date as their last replication time. Once they receive the current time, and before replicating again, they check when they last replicated. The last replication time stamp shows 2003, so as far as each machine is concerned it has not replicated for 5 years, which just slightly exceeds the 60-180 day tombstone lifetime.
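The check itself is simple date arithmetic. Here is a rough Python sketch of the comparison, using the values from the event above (last successful replication 2004-10-27 08:59:52, tombstone lifetime 60 days). The function name is made up, and I am assuming the event date 12/01/2008 means December 1, 2008; this only illustrates the idea and is not the actual replication engine logic.

```python
from datetime import datetime


def exceeds_tombstone_lifetime(last_success, now, tombstone_days):
    """True if the partner has not replicated within the tombstone lifetime."""
    # Whole days elapsed since the last recorded successful replication.
    days_since = (now - last_success).days
    return days_since > tombstone_days


# Values taken from the event text; the "now" date is an assumption (see above).
last_success = datetime(2004, 10, 27, 8, 59, 52)
now = datetime(2008, 12, 1)

if exceeds_tombstone_lifetime(last_success, now, tombstone_days=60):
    print("Replication with this source would be stopped (Event ID 2042).")
```

A DC whose recorded last replication time is years in the past fails this test no matter how healthy it really is, which is why a bad clock is enough to quarantine the whole forest.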
Recovering from this AD replication error state can be ugly and time consuming, though we have methods to ultimately resolve it. Besides the initial AD replication failure (replication is quarantined for partners that have not replicated for longer than the tombstone lifetime), the chances are very high that you will experience lingering objects. So the operative word here is prevention.
What could we have done to prevent or limit the DCs from experiencing a large time offset? Here are some ideas:
1. Set the motherboard BIOS time to the current date and time before booting the operating system by powering the machine up and selecting the BIOS or system configuration settings.
2. Transfer or even seize the PDCe FSMO role to a different machine before reintroducing the PDCe with the replaced motherboard to the environment.
3. Implement KB 884776 - How to configure the Windows Time service against a large time offset. This will effectively prevent a machine from correcting its time offset beyond the hard upper and lower limits. If this is in place on the DCs, servers, and clients, we would not see this scenario as a big problem.
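To show why the third option helps, here is a small Python sketch of the idea behind those limits. The Windows Time service settings involved are MaxPosPhaseCorrection and MaxNegPhaseCorrection; everything else here (the function name, the 48-hour limit, the sample offsets) is an assumption for illustration, and this models the decision only, not the service's real code.

```python
SECONDS_PER_HOUR = 3600

# Illustrative limit only: 48 hours in each direction (an assumption).
MAX_POS_PHASE_CORRECTION = 48 * SECONDS_PER_HOUR
MAX_NEG_PHASE_CORRECTION = 48 * SECONDS_PER_HOUR


def accept_time_correction(offset_seconds, max_pos, max_neg):
    """Decide whether a time sample from the source may be applied.

    offset_seconds is (source time - local time): positive means the
    local clock is behind, negative means the source is in the past.
    """
    if offset_seconds >= 0:
        return offset_seconds <= max_pos
    return -offset_seconds <= max_neg


# A source whose clock is roughly five years in the past, as in the story above.
five_years_back = -5 * 365 * 24 * SECONDS_PER_HOUR
print(accept_time_correction(five_years_back,
                             MAX_POS_PHASE_CORRECTION,
                             MAX_NEG_PHASE_CORRECTION))
```

With limits like these in place, a DC offered a five-year jump backwards would refuse the sample and keep its current time, while ordinary drift of a few seconds or minutes is still corrected.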
One final comment concerning the circumstances of this issue occurring; anything that can change the time on a DC can cause this issue. BIOS on the motherboard being reset, the BIOS battery going bad, a poorly patched DC getting a virus which flips the time, a router or hardware based time solution being used as the central Network Time Protocol (NTP) time source, etc…
Hopefully you will never experience this type of issue since with a little forethought and configuration you will be able to completely prevent an otherwise difficult situation.
- Todd Maxey
Ned here with a quick heads up. The IE 8 dev blog has posted some news about Group Policy changes in Internet Explorer 8. It's definitely worth a read: Group Policy Support Updated in IE8
The article is mainly a tickler, but it links to the extremely interesting:
Internet Explorer 8 Deployment Guide
That goes into insane detail on all the new GP options for IE8. 1300 new policy settings in fact! This reference goes into far more detail about IE8 itself as well and is worth a save in the Favorites folder.
- Ned "Posting this from IE8 Beta 2" Pyle