Microsoft's official enterprise support blog for AD DS and more
Hi, Ned here. Today I’m going to talk about troubleshooting Domain Controllers that are responding poorly due to high LSASS CPU utilization. I’ve split this article into two parts because there are actually two major forks that happen in this scenario:· You find that the problem is coming from the network and affecting the DC remotely.· You find that the problem is coming from the DC itself.LSASS is the Local Security Authority Subsystem Service. It provides an interface for managing local security, domain authentication, and Active Directory processes. A domain controller’s main purpose in life is to leverage LSASS to provide services to principals in your Active Directory forest. So when LSASS isn’t happy, the DC isn’t happy.
The first step to any kind of high LSASS CPU troubleshooting is to identify what ‘high’ really means. For us in Microsoft DS Support, we typically consider sustained and repeated CPU utilization at 80% or higher to be trouble if there’s no baseline of comparison. Periodic spikes that last a few seconds aren’t consequential (after all, you want your money’s worth of that new Quad Core), but if it lasts for ten to fifteen minutes straight and repeats constantly you may start seeing other problems: slower or failing logons, replication failures, etc. For you as an administrator of an AD environment, ‘high’ may mean something else – for example, if you are baselining your systems with MOM 2005 or SCOM 2007you may have already determined that normal CPU load on your DC’s is 20%. Then when all the DC’s start showing 50% CPU, this is aberrant behavior and you want to find out why. So it’s not necessary for the utilization to reach some magic number, just for it to become abnormal compared to what you know it typically baselines.
The next step is determine the scope – is this happening to all DC’s or just ones in a particular physical or logical AD site? Is it just the PDC Emulator? This helps us focus our troubleshooting and data collection. If it’s just the PDCE can we temporarily move the role to another server (be careful about this if using external domain trusts that rely on static LMHOSTS entries)? If the utilization follows, the problem is potentially an older application using legacy domain API’s that were designed for NT4. Perhaps this application can be turned off, modified or updated. There could also be down-level legacy OS’s in the environment, such as NT4 workstations and they are overloading the PDCE. There are also components within AD that focus their attention on the PDCE as a matter of convenience (password chaining, DFS querying, etc). If you are only seeing the issue on the PDCE, examine Summary of "Piling On" Scenarios in Active Directory Domains.
The next step is to identify if the issue is coming from the network or on the DC itself. I’ll be frank here – 99.999% of the time the issue is going to come off box. If you temporarily pull the network cable from the DC and wait fifteen minutes, LSASS is nearly guaranteed to drop back down to ~1% (why 15 minutes? That’s above the connection timeout for most components and after that time the DC should have given up on trying to service any more requests that were already queued). If it doesn’t drop, we know the problem is local to this DC. So here’s where this blog forks:
You find that the problem is coming from the network and affecting the DC remotely. It’s not just the PDCE being affected.
We can take a layered approach to troubleshooting high LSASS coming from the network. This involves a couple of tools and methods:
· Windows Server 2003 Performance Advisor (SPA)
· Wireshark (formerly Ethereal) network analysis utility (note: Wireshark is not affiliated with or supported by Microsoft in any way. Wireshark is used by Microsoft Product Support for network trace analysis in certain scenarios like this one. We also use Netmon 3.1, NMCAP, Netcap, and the built-in Netmon 2.x Lite tools. This is in no way an endorsement of Wireshark.).
Server Performance Analyzer (SPA) can be useful for seeing snapshots of Domain Controller performance and getting some hints on what’s causing high CPU. While it’s easier to analyze than a network trace, it’s also more limited in what it can understand. It runs on the affected server, so if the CPU is so busy that the server isn’t really responsive it won’t be helpful. Let’s take a look at a machine which is seeing fairly high CPU, but where it’s still usable:
We’ve installed SPA, started it up and selected Active Directory for our Data Collector, like so:
We then execute the data collection which then runs for 15 minutes. This reads performance counters and other mechanisms in order to create us some reports.
We open that current report and are greeted with some summary information. We can see here that overall CPU is at 61% and that most of the CPU time is against LSASS. We also see that it’s mainly LDAP requests eating up the processor, and that one particular remote machine is accounting for an abnormally large amount of it.
So we drill a little deeper into the details from SPA and look in the unique LDAP searches area. The two machines below are sending a deep query (i.e. it searches all subtrees from the base of our domain naming context) using some filters based on attributes of ‘whenCreated’ and ‘whenChanged’. Odd.
We’re still not convinced though – after all, SPA takes short snapshots and it really focuses on LDAP communication. What if it happened to capture a behavioral red herring? Let’s get some confirmation.
We start by getting some 100MB full frame network captures. We can use the built-in Netmon Lite tool, use NETCAPfrom the Windows Support Tools, or anything you feel comfortable with. Doing less than 100MB means our sample will be too small; doing more than 100MB means that the trace filtering becomes unwieldy. Getting more than one is advisable.
So we have our CAP file and we open it up in Wireshark. We click ‘Statistics’ then ‘Conversations’. This executes a built in filter that generates a TSV-formatted output (which you can throw into Excel and graph if you want to be fancy for upper management). Hit the IPv4 tab and we see:
Whoa, very interesting. 10.80.0.13 and 10.70.0.11 seem to be involved in two massive conversations with our DC, and everything else looks pretty quiet. Looking back at our SPA we see that .13 address listed and if we NSLOOKUP the 11 address we find it’s the XPPRO11A machine. I think we’re on to something here.
We set a filter for the 10.80.0.13 machine in our CAP file and set it to only care about LDAP traffic, like so:
We can see that the 10.80.0.13 machine is making constant LDAP requests against our DC. That unto itself isn’t very normal – a Windows machine doesn’t typically send a barrage of queries all day, it sends small spurts to do specific things like logon, lookup a group membership, or process group policy. What exactly is this thing doing? Let’s look at the details of one of these requests in Wireshark:
Well, it’s definitely the same thing we saw in SPA. We’re connecting to the base of the domain Litewareinc.com, and we’re searching for every single object’s create time stamp. We know that we have 100,000 users, 100,000 groups, and 100,000 computers in this domain, and using such a wide open query is going to be expensive for LSASS to process. But it still seems that we should be able to handle this? What makes this attribute cost us so much processing?
We run regsvr32 schmmgmt.dll in order to gain access to the Schema Management snap-in, then run MMC.EXE and add Active Directory Schema. Under the Attributes node we poke around and find createTimeStamp.
Well isn’t that a kick in the teeth – this attribute isn’t indexed by default! No wonder it’s running so painfully. Luckily DC’s cache frequently used queries or we’d be in even worse shape with disk IO. We don’t want to just go willy-nilly adding indexes in Active Directory as that can have its own set of memory implications. So we have a quick chat with the owner of those machines and he admits that they recently changed their custom LDAP application yesterday (when the problem started!). It was supposed to be getting back some specific information about user account creation but it had a bug and it was asking about every single object in Active Directory. They change their app and everything returns to normal – high fives for the good guys in Server Administration.
So today we learned about troubleshooting high LSASS CPU processing from a remote source. Next time we will diagnose a machine that’s having problems even after we pull it off the network. Stay tuned.
Read this excellent post on SPAfor more info on that tool.
Read this write-upon query inefficiency. This is what you give to that LDAP developer that was beating up your DC’s!
LSASS memory utilization is an entirely different story. The JET database used by Domain Controllers is highly optimized for read operations, and consequently LSASS tries to allocate as much virtual memory through caching as it possibly can to make queries fast (and deallocates that memory if requested by other applications). This is why we recommend that whenever possible, a DC’s role should only be a DC – not also a file server, a SQL server, and Exchange server, an ISA box, and the rest. While this isn’t always possible, it’s our best practice advice. For more, read: Memory usage by the Lsass.exe process on domain controllers that are running Windows Server 2003 or Windows 2000 Server
For Windows 2000 we have an older SPA-like tool called ADPERF, but it’s only available if you open a support case with us.
For part 2, go here.
- Ned Pyle
Hi, Ned here. I’m a Technical Lead in Directory Services out of Charlotte, NC. Today I’m going to talk a little bit about a common customer question: how do I leverage group policy to deploy custom registry settings? I’ll be showing two ways to do this… the easier versus the harder. Why would you ever want to do the harder? Read on!
You’re administering thousands of Vista workstations and their applications, and you spend a lot of your day connecting to them for troubleshooting and maintenance. You’ve found that you’re using Windows Calculator all the time to convert hex to decimal and reverse; it’s the best way to search for error codes online after all. After the hundredth time that you’ve had to set the calculator from Standard to Scientific mode, you’ve decided to make it default to Scientific. So let’s learn about how to actually figure out where values get set, then how we can control them.
Figuring out the registry entry
It stands to reason that Vista’s Calculator has to store which mode it’s going to start in somewhere, and that this somewhere is probably the registry. So let’s download Process Monitor and use it for some light reverse engineering. We’re guessing that CALC.EXE will read and write the values, and that it will be registry related. So we start ProcMon.exe, then set a filter for a process of calc.exe and an operation of RegSetValue, like so:
We then start the calculator, and we switch it over to scientific mode. The filtered results are pretty short, and we see:
It’s doubtful the cryptography entries are anything but chaff, so let’s focus on this setting change for HKCU\Software\Calc\Layout. We right-click that line and choose ‘Jump to…’
This takes us into the registry editor, where we see what actually got changed. Pretty slick!
It looks like the DWORD value name ‘layout’ is our guy. We confirm by setting it to 1 and restarting calculator. It’s back to Standard mode. We restart calculator with the value set to 0 and now it’s Scientific again. So I think we’ve got what we need to do some group policy work.
The Easier Way
We’re just making a simple registry value change here, so why not use REGEDIT.EXE in silent mode to set it? To do this we:
1. Export this registry value to a file called SciCalc.reg
Windows Registry Editor Version 5.00[HKEY_CURRENT_USER\Software\Microsoft\Calc]"layout"=dword:00000000
Windows Registry Editor Version 5.00[HKEY_CURRENT_USER\Software\Microsoft\Calc]"layout"=dword:00000000
2. We create a new Group Policy object and link it to the OU we have configured for all the administrative users in the domain (ourselves and our super powerful colleagues).
3. We open it up and edit “User Configuration | Windows Settings | Scripts (Logon/Logoff).
4. Under the Logon node, we add our settings so that regedit.exe calls our SciCalc.reg file silently (with the /s switch):
5. We click Show Files and drop our SciCalc.reg into SYSVOL.
6. Now we’re all set. After this policy replicates around the Domain Controllers and we logon to the various Vista workstations in the domain, Windows Calculator will always start in scientific mode. Neat.
The Harder Way
The logon script method is pretty down and dirty. While it works, it’s not very elegant. It also means that we have some settings that are done without really leveraging Group Policy’s registry editing extensions. It’s pretty opaque to other administrators as well, since all you can tell about a logon script applying is that it ran – not much else about if it was successful, what it was really doing, why it exists, etc. So what if we make a new custom ADM template file and apply it that way?
ADM files are the building blocks of registry-based policy customization. They use a controlled format to make changes. They can also be used to set boundaries of values – for the Calculator example there are only two possible good values: 0 or 1, as a DWORD (32-bit) value. Using an ADM lets us control what we can choose, and also gives good explanation of what we’re accomplishing. Plus it’s really cool.
So taking what we know from our registry work, let’s dissect an ADM file that will do the same thing:
<ADM Starts Here>
CLASS USERCATEGORY Windows_Calculator_(CustomADM) POLICY Mode EXPLAIN !!CalcHelp KEYNAME Software\Microsoft\Calc PART !!Calc_Configure DROPDOWNLIST REQUIRED VALUENAME "layout" ITEMLIST NAME !!Scientific VALUE NUMERIC 0 DEFAULT NAME !!Standard VALUE NUMERIC 1 END ITEMLIST END PART END POLICYEND CATEGORY
[strings]WindowsCalculatorCustomADM="Windows Calculator Settings"Calc_Configure="Set the Windows Calculator to: "Scientific="Scientific mode"Standard="Standard mode"
; explainsCalcHelp="You can set the Windows Calculator's behavior to default to Scientific or Standard. Users can still change it but it will revert when group policy refreshes. This sample ADM was created by Ned Pyle, MSFT."
</ADM Ends Here>
· CLASS describes User versus Computer policy.· CATEGORY describes the node we will see in the Group Policy Editor.· POLICY describes what we see to actually edit.· EXPLAIN describes where we can look up the ‘help’ for this policy.· KEYNAME is the actual registry key structure we’re touching.· PART is used if we have multiple settings to choose, and how they will be displayed· VALUENAME is the registry value we’re editing· NAME describes the friendly and literal data to be written
So here we have a policy setting which will be called “Windows_Calculator_(CustomADM)” that will expose one entry called ‘Mode’. Mode will be a dropdown that can be used to select Standard or Scientific. Pretty simple… so how do we get this working?
1. We save our settings as an ADM file.
2. We load up GPMC, then create and link our new policy to that Admins OU.
3. We open our new policy and under User Configuration we right-click Administrative Templates and select Add/Remove Administrative Templates.
4. We find our ADM and highlight it, then select Add. It will be copied into the policy in SYSVOL automagically.
5. Now we highlight Administrative Templates and select View | Filtering. Uncheck "Only show policy settings that can be fully managed" (i.e. any custom policy). It will look like this:
6. Now if we navigate to your policy, we get this (see the cool explanation too? No one can say they don’t know what this policy is about!):
7. If we drill into the Mode setting, we have this:
And you’re done. A bit more work, but pretty rewarding and certainly much easier for your colleagues to work with, especially if you have delegated out group policy administration to a staff of less experienced admins.
My little examples above with Calculator only work on Windows Vista and Windows Server 2008. Prior to those versions we used the WIN.INI to set calculator settings – D’oh! Now you have a very compelling reason to upgrade... ;-)
These sorts of custom policy settings are not managed like built-in group policies – this means that simply removing them does not remove their settings. If you want to back out their changes, you need to create a new policy that removes their settings directly.
ADMX’s can also be used on Vista/2008, but I’m saving those for a later posting as they make ADM’s look trivial.
This is just a taste of custom ADM file usage. If you want to get really into this, I highly suggest checking out:
Using Administrative Template Files with Registry-Based Group Policy - http://www.microsoft.com/downloads/details.aspx?familyid=e7d72fa1-62fe-4358-8360-8774ea8db847&displaylang=en
Administrative Template File Format - http://msdn2.microsoft.com/en-us/library/aa372405.aspx
I am going to start off the technical topics with a fairly light yet very confusing topic- once it’s explained though it’s very simple.
Terminal Server Licensing is probably among the easiest for us to troubleshoot, however, there are so many different scenarios it gets confusing FAST!
The story on Terminal Server Licensing changes dramatically from Windows 2000 to Windows Server 2003. Here I’d like to see if I can explain the Server 2003 scenarios. If you have specific questions about 2000 just ask!
A client access licensed is issued to every type of client that will access the Windows Server 2003 Terminal Server, here is a link to the legal part of this if you need it (here I am sticking to the technical facts):
This includes Windows Server 2003 client connections, Windows XP (all versions), Thin clients etc.
The other big change is there are now two types of licenses: Per User and Per Device. Built-in Licenses still exist so that the Windows Server 2003 Terminal Server Licensing Server can support/ issue licenses to Windows 2000 Terminal Servers.
This type of license is not managed right now. What this means is when you have your terminal server configured in a PER USER Licensing mode in Terminal Server Configuration console the Terminal Server must be able to discover an activated terminal server license server. As long as it can do that a user will never be denied a connection to the terminal server based on licensing. You will never see the number of available licenses decremented in the Terminal Server Licensing snap in either.
When the Windows Server 2003 Terminal Server is configured to use PER DEVICE license mode it will behave just like Windows 2000 used to. A computer will connect to the terminal server and get a temporary license, then connect again and get a permanent license. This license will expire at 90 days. Some point before it expires it will renew, if the client doesn’t connect in the time period before the license expires the license will go back into the available pool of licenses the Terminal Server Licensing console.
This type of license still exists on Windows Server 2003 for backward compatibility for Windows 2000 Terminal Servers.
Here is an excellent resource for additional information regarding terminal server licensing and how you can troubleshoot many issues related to licensing such as license server discovery, only temporary licensing being issued etc.:
With all that being said- one last important piece of advice- When to call the Licensing Activation Team (also known as the Clearinghouse) and when to call support.
CALL THE CLEARINGHOUSE IF…..
Basically- if you need to activate licenses, change the type of activation, reactivate a license server or license pack, or to reclaim lost licenses before their expiration date that all has to be done through the clearinghouse. They can be contacted via:
• In the United States, call (800) 426-9400 or visit the Microsoft Licensing Program Reseller Web site.
• In Canada, call the Microsoft Resource Centre at (877) 568-2495.
• If you are outside the United States or Canada, please review the Worldwide Microsoft Licensing Web sites or contact your local Microsoft subsidiary on the Microsoft Worldwide Home Web site. ]
CALL SUPPORT WHEN……..
When you get errors in the event logs about not being able to find the license server, in the license manager snap-in, or on the client workstation machines trying to connect to the terminal server that is when you follow the link above if that doesn’t work you may need support J
This just begins to scratch the surface- more posts later on this topic but if you are having terminal server licensing problems be sure to go to this link before you get too frustrated J
Welcome to the Enterprise Platform Support Directory Services Team blog.
We are a team of folks who support Active Directory for Microsoft. As our name implies, we support Active Directory and its associated components - Group Policies, Certificate Services, DFS and Kerberos, to name but a few - all of these technologies are our bread and butter.
Our goal is twofold - first, we want to help educate our readers about Active Directory - whether they are our direct customers, admins who are new to AD, or anyone else who is just interested in AD. We'll also provide troubleshooting tips, tricks and information on common issues.
In our environment we troubleshoot what is already broken- that is what we are good at. We hope to provide more detail around how we do that, but more importantly help to educate our readers on the technical details of why we did what we did and how things work.
If you have a question, any suggestions on a topic you would like to see or any other general feedback please share! We want to be your link to the rest of the community and help what problems you do come across get solved faster with a greater understanding of why it occurred to begin with.
Keep watching- our first few technical posts are coming soon!