Microsoft's official enterprise support blog for AD DS and more
Hi, Ned here. Today I’m going to talk about troubleshooting Domain Controllers that are responding poorly due to high LSASS CPU utilization. I’ve split this article into two parts because there are actually two major forks that happen in this scenario:· You find that the problem is coming from the network and affecting the DC remotely.· You find that the problem is coming from the DC itself.LSASS is the Local Security Authority Subsystem Service. It provides an interface for managing local security, domain authentication, and Active Directory processes. A domain controller’s main purpose in life is to leverage LSASS to provide services to principals in your Active Directory forest. So when LSASS isn’t happy, the DC isn’t happy.
The first step to any kind of high LSASS CPU troubleshooting is to identify what ‘high’ really means. For us in Microsoft DS Support, we typically consider sustained and repeated CPU utilization at 80% or higher to be trouble if there’s no baseline of comparison. Periodic spikes that last a few seconds aren’t consequential (after all, you want your money’s worth of that new Quad Core), but if it lasts for ten to fifteen minutes straight and repeats constantly you may start seeing other problems: slower or failing logons, replication failures, etc. For you as an administrator of an AD environment, ‘high’ may mean something else – for example, if you are baselining your systems with MOM 2005 or SCOM 2007you may have already determined that normal CPU load on your DC’s is 20%. Then when all the DC’s start showing 50% CPU, this is aberrant behavior and you want to find out why. So it’s not necessary for the utilization to reach some magic number, just for it to become abnormal compared to what you know it typically baselines.
The next step is determine the scope – is this happening to all DC’s or just ones in a particular physical or logical AD site? Is it just the PDC Emulator? This helps us focus our troubleshooting and data collection. If it’s just the PDCE can we temporarily move the role to another server (be careful about this if using external domain trusts that rely on static LMHOSTS entries)? If the utilization follows, the problem is potentially an older application using legacy domain API’s that were designed for NT4. Perhaps this application can be turned off, modified or updated. There could also be down-level legacy OS’s in the environment, such as NT4 workstations and they are overloading the PDCE. There are also components within AD that focus their attention on the PDCE as a matter of convenience (password chaining, DFS querying, etc). If you are only seeing the issue on the PDCE, examine Summary of "Piling On" Scenarios in Active Directory Domains.
The next step is to identify if the issue is coming from the network or on the DC itself. I’ll be frank here – 99.999% of the time the issue is going to come off box. If you temporarily pull the network cable from the DC and wait fifteen minutes, LSASS is nearly guaranteed to drop back down to ~1% (why 15 minutes? That’s above the connection timeout for most components and after that time the DC should have given up on trying to service any more requests that were already queued). If it doesn’t drop, we know the problem is local to this DC. So here’s where this blog forks:
You find that the problem is coming from the network and affecting the DC remotely. It’s not just the PDCE being affected.
We can take a layered approach to troubleshooting high LSASS coming from the network. This involves a couple of tools and methods:
· Windows Server 2003 Performance Advisor (SPA)
· Wireshark (formerly Ethereal) network analysis utility (note: Wireshark is not affiliated with or supported by Microsoft in any way. Wireshark is used by Microsoft Product Support for network trace analysis in certain scenarios like this one. We also use Netmon 3.1, NMCAP, Netcap, and the built-in Netmon 2.x Lite tools. This is in no way an endorsement of Wireshark.).
Server Performance Analyzer (SPA) can be useful for seeing snapshots of Domain Controller performance and getting some hints on what’s causing high CPU. While it’s easier to analyze than a network trace, it’s also more limited in what it can understand. It runs on the affected server, so if the CPU is so busy that the server isn’t really responsive it won’t be helpful. Let’s take a look at a machine which is seeing fairly high CPU, but where it’s still usable:
We’ve installed SPA, started it up and selected Active Directory for our Data Collector, like so:
We then execute the data collection which then runs for 15 minutes. This reads performance counters and other mechanisms in order to create us some reports.
We open that current report and are greeted with some summary information. We can see here that overall CPU is at 61% and that most of the CPU time is against LSASS. We also see that it’s mainly LDAP requests eating up the processor, and that one particular remote machine is accounting for an abnormally large amount of it.
So we drill a little deeper into the details from SPA and look in the unique LDAP searches area. The two machines below are sending a deep query (i.e. it searches all subtrees from the base of our domain naming context) using some filters based on attributes of ‘whenCreated’ and ‘whenChanged’. Odd.
We’re still not convinced though – after all, SPA takes short snapshots and it really focuses on LDAP communication. What if it happened to capture a behavioral red herring? Let’s get some confirmation.
We start by getting some 100MB full frame network captures. We can use the built-in Netmon Lite tool, use NETCAPfrom the Windows Support Tools, or anything you feel comfortable with. Doing less than 100MB means our sample will be too small; doing more than 100MB means that the trace filtering becomes unwieldy. Getting more than one is advisable.
So we have our CAP file and we open it up in Wireshark. We click ‘Statistics’ then ‘Conversations’. This executes a built in filter that generates a TSV-formatted output (which you can throw into Excel and graph if you want to be fancy for upper management). Hit the IPv4 tab and we see:
Whoa, very interesting. 10.80.0.13 and 10.70.0.11 seem to be involved in two massive conversations with our DC, and everything else looks pretty quiet. Looking back at our SPA we see that .13 address listed and if we NSLOOKUP the 11 address we find it’s the XPPRO11A machine. I think we’re on to something here.
We set a filter for the 10.80.0.13 machine in our CAP file and set it to only care about LDAP traffic, like so:
We can see that the 10.80.0.13 machine is making constant LDAP requests against our DC. That unto itself isn’t very normal – a Windows machine doesn’t typically send a barrage of queries all day, it sends small spurts to do specific things like logon, lookup a group membership, or process group policy. What exactly is this thing doing? Let’s look at the details of one of these requests in Wireshark:
Well, it’s definitely the same thing we saw in SPA. We’re connecting to the base of the domain Litewareinc.com, and we’re searching for every single object’s create time stamp. We know that we have 100,000 users, 100,000 groups, and 100,000 computers in this domain, and using such a wide open query is going to be expensive for LSASS to process. But it still seems that we should be able to handle this? What makes this attribute cost us so much processing?
We run regsvr32 schmmgmt.dll in order to gain access to the Schema Management snap-in, then run MMC.EXE and add Active Directory Schema. Under the Attributes node we poke around and find createTimeStamp.
Well isn’t that a kick in the teeth – this attribute isn’t indexed by default! No wonder it’s running so painfully. Luckily DC’s cache frequently used queries or we’d be in even worse shape with disk IO. We don’t want to just go willy-nilly adding indexes in Active Directory as that can have its own set of memory implications. So we have a quick chat with the owner of those machines and he admits that they recently changed their custom LDAP application yesterday (when the problem started!). It was supposed to be getting back some specific information about user account creation but it had a bug and it was asking about every single object in Active Directory. They change their app and everything returns to normal – high fives for the good guys in Server Administration.
So today we learned about troubleshooting high LSASS CPU processing from a remote source. Next time we will diagnose a machine that’s having problems even after we pull it off the network. Stay tuned.
Read this excellent post on SPAfor more info on that tool.
Read this write-upon query inefficiency. This is what you give to that LDAP developer that was beating up your DC’s!
LSASS memory utilization is an entirely different story. The JET database used by Domain Controllers is highly optimized for read operations, and consequently LSASS tries to allocate as much virtual memory through caching as it possibly can to make queries fast (and deallocates that memory if requested by other applications). This is why we recommend that whenever possible, a DC’s role should only be a DC – not also a file server, a SQL server, and Exchange server, an ISA box, and the rest. While this isn’t always possible, it’s our best practice advice. For more, read: Memory usage by the Lsass.exe process on domain controllers that are running Windows Server 2003 or Windows 2000 Server
For Windows 2000 we have an older SPA-like tool called ADPERF, but it’s only available if you open a support case with us.
For part 2, go here.
- Ned Pyle
Hi, Ned here. I’m a Technical Lead in Directory Services out of Charlotte, NC. Today I’m going to talk a little bit about a common customer question: how do I leverage group policy to deploy custom registry settings? I’ll be showing two ways to do this… the easier versus the harder. Why would you ever want to do the harder? Read on!
You’re administering thousands of Vista workstations and their applications, and you spend a lot of your day connecting to them for troubleshooting and maintenance. You’ve found that you’re using Windows Calculator all the time to convert hex to decimal and reverse; it’s the best way to search for error codes online after all. After the hundredth time that you’ve had to set the calculator from Standard to Scientific mode, you’ve decided to make it default to Scientific. So let’s learn about how to actually figure out where values get set, then how we can control them.
Figuring out the registry entry
It stands to reason that Vista’s Calculator has to store which mode it’s going to start in somewhere, and that this somewhere is probably the registry. So let’s download Process Monitor and use it for some light reverse engineering. We’re guessing that CALC.EXE will read and write the values, and that it will be registry related. So we start ProcMon.exe, then set a filter for a process of calc.exe and an operation of RegSetValue, like so:
We then start the calculator, and we switch it over to scientific mode. The filtered results are pretty short, and we see:
It’s doubtful the cryptography entries are anything but chaff, so let’s focus on this setting change for HKCU\Software\Calc\Layout. We right-click that line and choose ‘Jump to…’
This takes us into the registry editor, where we see what actually got changed. Pretty slick!
It looks like the DWORD value name ‘layout’ is our guy. We confirm by setting it to 1 and restarting calculator. It’s back to Standard mode. We restart calculator with the value set to 0 and now it’s Scientific again. So I think we’ve got what we need to do some group policy work.
The Easier Way
We’re just making a simple registry value change here, so why not use REGEDIT.EXE in silent mode to set it? To do this we:
1. Export this registry value to a file called SciCalc.reg
Windows Registry Editor Version 5.00[HKEY_CURRENT_USER\Software\Microsoft\Calc]"layout"=dword:00000000
Windows Registry Editor Version 5.00[HKEY_CURRENT_USER\Software\Microsoft\Calc]"layout"=dword:00000000
2. We create a new Group Policy object and link it to the OU we have configured for all the administrative users in the domain (ourselves and our super powerful colleagues).
3. We open it up and edit “User Configuration | Windows Settings | Scripts (Logon/Logoff).
4. Under the Logon node, we add our settings so that regedit.exe calls our SciCalc.reg file silently (with the /s switch):
5. We click Show Files and drop our SciCalc.reg into SYSVOL.
6. Now we’re all set. After this policy replicates around the Domain Controllers and we logon to the various Vista workstations in the domain, Windows Calculator will always start in scientific mode. Neat.
The Harder Way
The logon script method is pretty down and dirty. While it works, it’s not very elegant. It also means that we have some settings that are done without really leveraging Group Policy’s registry editing extensions. It’s pretty opaque to other administrators as well, since all you can tell about a logon script applying is that it ran – not much else about if it was successful, what it was really doing, why it exists, etc. So what if we make a new custom ADM template file and apply it that way?
ADM files are the building blocks of registry-based policy customization. They use a controlled format to make changes. They can also be used to set boundaries of values – for the Calculator example there are only two possible good values: 0 or 1, as a DWORD (32-bit) value. Using an ADM lets us control what we can choose, and also gives good explanation of what we’re accomplishing. Plus it’s really cool.
So taking what we know from our registry work, let’s dissect an ADM file that will do the same thing:
<ADM Starts Here>
CLASS USERCATEGORY Windows_Calculator_(CustomADM) POLICY Mode EXPLAIN !!CalcHelp KEYNAME Software\Microsoft\Calc PART !!Calc_Configure DROPDOWNLIST REQUIRED VALUENAME "layout" ITEMLIST NAME !!Scientific VALUE NUMERIC 0 DEFAULT NAME !!Standard VALUE NUMERIC 1 END ITEMLIST END PART END POLICYEND CATEGORY
[strings]WindowsCalculatorCustomADM="Windows Calculator Settings"Calc_Configure="Set the Windows Calculator to: "Scientific="Scientific mode"Standard="Standard mode"
; explainsCalcHelp="You can set the Windows Calculator's behavior to default to Scientific or Standard. Users can still change it but it will revert when group policy refreshes. This sample ADM was created by Ned Pyle, MSFT."
</ADM Ends Here>
· CLASS describes User versus Computer policy.· CATEGORY describes the node we will see in the Group Policy Editor.· POLICY describes what we see to actually edit.· EXPLAIN describes where we can look up the ‘help’ for this policy.· KEYNAME is the actual registry key structure we’re touching.· PART is used if we have multiple settings to choose, and how they will be displayed· VALUENAME is the registry value we’re editing· NAME describes the friendly and literal data to be written
So here we have a policy setting which will be called “Windows_Calculator_(CustomADM)” that will expose one entry called ‘Mode’. Mode will be a dropdown that can be used to select Standard or Scientific. Pretty simple… so how do we get this working?
1. We save our settings as an ADM file.
2. We load up GPMC, then create and link our new policy to that Admins OU.
3. We open our new policy and under User Configuration we right-click Administrative Templates and select Add/Remove Administrative Templates.
4. We find our ADM and highlight it, then select Add. It will be copied into the policy in SYSVOL automagically.
5. Now we highlight Administrative Templates and select View | Filtering. Uncheck "Only show policy settings that can be fully managed" (i.e. any custom policy). It will look like this:
6. Now if we navigate to your policy, we get this (see the cool explanation too? No one can say they don’t know what this policy is about!):
7. If we drill into the Mode setting, we have this:
And you’re done. A bit more work, but pretty rewarding and certainly much easier for your colleagues to work with, especially if you have delegated out group policy administration to a staff of less experienced admins.
My little examples above with Calculator only work on Windows Vista and Windows Server 2008. Prior to those versions we used the WIN.INI to set calculator settings – D’oh! Now you have a very compelling reason to upgrade... ;-)
These sorts of custom policy settings are not managed like built-in group policies – this means that simply removing them does not remove their settings. If you want to back out their changes, you need to create a new policy that removes their settings directly.
ADMX’s can also be used on Vista/2008, but I’m saving those for a later posting as they make ADM’s look trivial.
This is just a taste of custom ADM file usage. If you want to get really into this, I highly suggest checking out:
Using Administrative Template Files with Registry-Based Group Policy - http://www.microsoft.com/downloads/details.aspx?familyid=e7d72fa1-62fe-4358-8360-8774ea8db847&displaylang=en
Administrative Template File Format - http://msdn2.microsoft.com/en-us/library/aa372405.aspx
Hi, Dave here. I’m a Support Escalation Engineer in Directory Services out of Charlotte, NC. Recently one of our consultants in the field deployed a Windows Server 2008 Beta 3 domain controller at a branch office to test management scenarios. After doing this, they discovered that the server was not replicating with domain controllers at the main datacenter. After running some network captures, they discovered that the local firewall was blocking replication traffic.
The firewall in question was configured properly to allow Windows Server 2003 domain controllers to replicate, but the Win2008 domain controller was blocked. This is not because we changed the way replication works per se. Replication is still accomplished by server to server RPC calls, the same as in Win2003. But we did change the underlying mechanism that the network stack uses to determine which ports those RPC calls use.
By default, the dynamic port range in Windows Server 2003 was 1024-5000 for both TCP and UDP.
In Windows Server 2008 (and Windows Vista), the dynamic port range is 49152-65535, for both TCP and UDP.
What this means is that any server-to-server RPC traffic (including AD replication traffic) is suddenly using an entirely new port range over the wire. We made this change in order to comply with IANA recommendations about port usage. Therefore, if you start deploying Windows Server 2008 on your network, and are using firewalls to restrict traffic on your internal network you will need to update the configuration of those firewalls to compensate for the new port range.
It doesn’t stop at RPC traffic though. The dynamic port range is used for any and all outbound requests from your computer that don’t use a specific source port. This means that if you fire up Internet Explorer and browse to a web page, the network traffic is going to source from a port higher than 49152 on Vista or 2008. This means that potentially, any application that connects to other machines via the network could be impacted by a firewall that’s not configured for this change. In Directory Services support here at Microsoft, we really care mostly about Active Directory related traffic, but this is something that everyone should watch out for. So for example, look at this snippet of a NETSTAT command run on a Vista machine where we are simply connected to a web site with IE7:
C:\Windows\system32>netstat -bnActive ConnectionsProto Local Address Foreign Address StateTCP 10.10.0.10:53556 220.127.116.11:80 TIME_WAITTCP 10.10.0.10:53572 18.104.22.168:80 TIME_WAIT[iexplore.exe]
In Vista and 2008, most administration of things at the network stack level is handled via NETSH. Using NETSH, it’s possible to see what your dynamic port range is set to on a per server basis:
>netsh int ipv4 show dynamicport tcp>netsh int ipv4 show dynamicport udp>netsh int ipv6 show dynamicport tcp>netsh int ipv6 show dynamicport udp
These commands will output the dynamic port range currently in use. Kind of a neat fact is that you can have different ranges for TCP and UDP, or for IPv4 and IPv6, although they all start off the same.
In Windows Server 2003 the range always defaults to starting with TCP port 1024, and that is hard-coded. But in Vista/2008, you can move the starting point of the range around. So if you needed to, you could tell your servers to use ports 5000 through 15000 for dynamic port allocations, or any contiguous range of ports you wanted. To do this, you use NETSH again:
>netsh int ipv4 set dynamicport tcp start=10000 num=1000 >netsh int ipv4 set dynamicport udp start=10000 num=1000 >netsh int ipv6 set dynamicport tcp start=10000 num=1000 >netsh int ipv4 set dynamicport udp start=10000 num=1000
The examples above would set your dynamic port range to start at port 10000 and go through port 11000 (1000 ports).
A few important things to know about the port range:
· The smallest range of ports you can set is 255.· The lowest starting port that you can set is 1025.· The highest end port (based on the range you set) cannot exceed 65535.
For more information on this, check out KB 929851.
At this point you’re probably wondering what our recommendation is for configuring firewalls for AD replication with Windows Server 2008. Generally speaking, we don’t recommend that you restrict traffic between servers on your internal network. If you must deploy firewalls between servers, you should use IPSEC or VPN tunnels to allow all traffic between those servers to pass through, regardless of source or destination ports. However, experience has taught us that some customers are going to want to restrict traffic, which is why it is possible to configure this range and control the ports that will be used.
Here are two FAQs that have come up internally around this change:
Q: How do the changes to the dynamic port ranges affect AD replication?
A: AD replication relies on dynamically allocated ports for both sides of the replication connection. This means that by default, replication traffic will now use ports higher than 49152 on both domain controllers involved in the transaction.
Q: Can the port that replication traffic uses be controlled?
A: It is still possible to restrict replication traffic to a specific port using the registry values documented in KB 224196.
- Dave Beach
I am going to start off the technical topics with a fairly light yet very confusing topic- once it’s explained though it’s very simple.
Terminal Server Licensing is probably among the easiest for us to troubleshoot, however, there are so many different scenarios it gets confusing FAST!
The story on Terminal Server Licensing changes dramatically from Windows 2000 to Windows Server 2003. Here I’d like to see if I can explain the Server 2003 scenarios. If you have specific questions about 2000 just ask!
A client access licensed is issued to every type of client that will access the Windows Server 2003 Terminal Server, here is a link to the legal part of this if you need it (here I am sticking to the technical facts):
This includes Windows Server 2003 client connections, Windows XP (all versions), Thin clients etc.
The other big change is there are now two types of licenses: Per User and Per Device. Built-in Licenses still exist so that the Windows Server 2003 Terminal Server Licensing Server can support/ issue licenses to Windows 2000 Terminal Servers.
This type of license is not managed right now. What this means is when you have your terminal server configured in a PER USER Licensing mode in Terminal Server Configuration console the Terminal Server must be able to discover an activated terminal server license server. As long as it can do that a user will never be denied a connection to the terminal server based on licensing. You will never see the number of available licenses decremented in the Terminal Server Licensing snap in either.
When the Windows Server 2003 Terminal Server is configured to use PER DEVICE license mode it will behave just like Windows 2000 used to. A computer will connect to the terminal server and get a temporary license, then connect again and get a permanent license. This license will expire at 90 days. Some point before it expires it will renew, if the client doesn’t connect in the time period before the license expires the license will go back into the available pool of licenses the Terminal Server Licensing console.
This type of license still exists on Windows Server 2003 for backward compatibility for Windows 2000 Terminal Servers.
Here is an excellent resource for additional information regarding terminal server licensing and how you can troubleshoot many issues related to licensing such as license server discovery, only temporary licensing being issued etc.:
With all that being said- one last important piece of advice- When to call the Licensing Activation Team (also known as the Clearinghouse) and when to call support.
CALL THE CLEARINGHOUSE IF…..
Basically- if you need to activate licenses, change the type of activation, reactivate a license server or license pack, or to reclaim lost licenses before their expiration date that all has to be done through the clearinghouse. They can be contacted via:
• In the United States, call (800) 426-9400 or visit the Microsoft Licensing Program Reseller Web site.
• In Canada, call the Microsoft Resource Centre at (877) 568-2495.
• If you are outside the United States or Canada, please review the Worldwide Microsoft Licensing Web sites or contact your local Microsoft subsidiary on the Microsoft Worldwide Home Web site. ]
CALL SUPPORT WHEN……..
When you get errors in the event logs about not being able to find the license server, in the license manager snap-in, or on the client workstation machines trying to connect to the terminal server that is when you follow the link above if that doesn’t work you may need support J
This just begins to scratch the surface- more posts later on this topic but if you are having terminal server licensing problems be sure to go to this link before you get too frustrated J
Welcome to the Enterprise Platform Support Directory Services Team blog.
We are a team of folks who support Active Directory for Microsoft. As our name implies, we support Active Directory and its associated components - Group Policies, Certificate Services, DFS and Kerberos, to name but a few - all of these technologies are our bread and butter.
Our goal is twofold - first, we want to help educate our readers about Active Directory - whether they are our direct customers, admins who are new to AD, or anyone else who is just interested in AD. We'll also provide troubleshooting tips, tricks and information on common issues.
In our environment we troubleshoot what is already broken- that is what we are good at. We hope to provide more detail around how we do that, but more importantly help to educate our readers on the technical details of why we did what we did and how things work.
If you have a question, any suggestions on a topic you would like to see or any other general feedback please share! We want to be your link to the rest of the community and help what problems you do come across get solved faster with a greater understanding of why it occurred to begin with.
Keep watching- our first few technical posts are coming soon!
Last time I discussed troubleshooting the most common high CPU scenario within LSASS, which is the server being beaten up by a remote machine. Let’s talk now about the much less common but still possible:
You find that the problem is coming from the DC itself.
As I said in the previous post, this is a super rare situation these days. If you are on Windows 2000 Server SP4 or Windows Server 2003 SP1/SP2, we really don’t have any known issues where we can simply hand you a hotfix and send you on your way. The most likely cause is something foreign to the operating system – an add-on security package, a custom password synchronizer, a service running something security-related, etc. A very down and dirty way to check these is:
Examine this registry key on your Windows Server 2003 (it will be slightly different on Win2000) machine being affected:
Anything else in here on Windows Server 2003 may be suspect, as something or someone has injected non-standard libraries into LSASS. It may be intentional and the DLL is simply malfunctioning or misconfigured. It may be malicious. Find the file (it will nearly always be a DLL in the %windir%\system32 directory) and take a look at its properties:
Once you think you have a handle on it, get a backup of your server and this registry key and remove the entry, then restart the server in a change control window when users are least affected. Does the high CPU utilization come back? It almost never does, trust me…
If there was nothing of interest there, another good technique is to use MSCONFIG to identify and potentially disable applications that have been added on to the server.
By checking the ‘Hide All Microsoft Services’ box you can see System Services that were added to the machine what did not ship with the operating system (technically speaking, you may see some services that are from us, such as Exchange). You can then temporarily set them to ‘Disabled’ and restart the server to test for the performance problem. The same can be done with the ‘Startup’ section, for apps that live in the RUN key of the registry. By using the ‘divide by half’ rule (where you disable half and test, disable the other half and test, then narrow down by halves until you find your culprit), you can usually get to the bad guy pretty quickly.
You can see all of this info using the Microsoft Product Support Reporting Tools (MPSRPT_DirSvc.exe) as well.
This blog post is not about debugging – yes, some of the techniques I use above can be replaced with attaching WINDBG to LSASS, syncing symbols, and going to town to see what’s specifically wrong under the covers. The posting is for folks looking for remediation, not code-level root cause. And let’s be honest – we debug things like this every day on customer request. After all the work is done (and the billing against the customer’s contract – ouch!), we still have the same answer: please contact your vendor about this malfunctioning code, as only they can fix it. If’ you’d like a quick primer on seeing what modules may be loaded into LSASS by using a debugger and that might be suspect, please let us know and we’ll blog it up.