- Important Security Bulletin
-
I wanted to do a quick post on an important security bulletin. It’s Microsoft Security Bulletin MS09-018 – Critical. This security update is to address a vulnerability in Active Directory. I’m pasting the Executive Summary below, but I highly recommend that you read the entire bulletin and apply the updates.
Executive Summary
This security update resolves two privately reported vulnerabilities in implementations of Active Directory on Microsoft Windows 2000 Server and Windows Server 2003, and Active Directory Application Mode (ADAM) when installed on Windows XP Professional and Windows Server 2003. The more severe vulnerability could allow remote code execution. An attacker who successfully exploited this vulnerability could take complete control of an affected system remotely. An attacker could then install programs; view, change, or delete data; or create new accounts with full user rights. Firewall best practices and standard default firewall configurations can help protect networks from attacks that originate outside the enterprise perimeter. Best practices recommend that systems that are connected to the Internet have a minimal number of ports exposed.
This security update is rated Critical for all supported editions of Microsoft Windows 2000 Server, and rated Important for supported versions of Windows XP Professional and Windows Server 2003. For more information, see the subsection, Affected and Non-Affected Software, in this section.
The security update addresses the vulnerability by correcting the way that the LDAP service allocates and frees memory while processing specially crafted LDAP or LDAPS requests.
Recommendation. The majority of customers have automatic updating enabled and will not need to take any action because this security update will be downloaded and installed automatically. Customers who have not enabled automatic updating need to check for updates and install this update manually. For information about specific configuration options in automatic updating, see Microsoft Knowledge Base Article 294871.
For administrators and enterprise installations, or end users who want to install this security update manually, Microsoft recommends that customers apply the update immediately using update management software, or by checking for updates using the Microsoft Update service.
Please apply this update to your Windows 2000 and Server 2003 domain controllers at your earliest opportunity.
- Thoughts on Single Sign On and Credential Providers
-
We use the term single sign on (SSO) to describe a variety of behaviors in Windows and other applications where the result is simply to prevent the user from being prompted to provide their credentials again and again; to ideally enter their credentials only once at initial logon. Active Directory and the integrated authentication which it provides does this very well, and can be extended to other Microsoft applications like SQL, SharePoint, Exchange and others from other companies as well.
There are times, though, when someone needs to create a specific single sign on behavior. This can derive from the need to use a different credential type (smartcards, for example), a need to interact with another directory service or application, a need to use one time passwords, or any of a wide variety of things. In those cases you have the option to create and install a credential provider for your client computers and servers running Windows Vista, Server 2008 and later versions. This option is a great thing for developers since programming customized experiences for prior versions of Windows could be more challenging.
We have detailed information on how to develop credential providers available starting with a great MSDN entry. Another great article on that is “Create Custom Login Experiences With Credential Providers For Windows Vista”.
Let’s go over a support scenario which underscores that credential providers alter the default behavior and also gives a technique on how to identify whether an additional credential provider may be involved or not.
In the scenario there was an administrator seeing users unexpectedly receiving the credential prompt when opening a terminal services session to a remote 2008 server from Windows clients. There are multiple reasons this prompt can appear, including broken secure channel on client or server, network problems between client and server, or even problems at the domain controller which is being used to provide the authentication.
So what do you do when you are seeing credential prompts appear unexpectedly and the more common reasons for that are not there?
- See if the issue is consistent. If it is intermittent it is less likely to be solely caused by a credential provider.
- Try removing any added credential providers to see if that makes a difference.
- Remember that in a client-server relationship the server side may be the culprit as well so check whether a credential provider is installed there.
How can you tell what credential providers are present? There are a couple of things you can easily do. First, you can look in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Authentication\Credential Providers and see what entries are present.
This alone may not tell the entire story though since what really matters is whether the credential provider is loaded or not.
Here’s another way to determine what DLLs are loaded and running code in the LogonUI.exe process. My steps were done on a Windows 7 laptop but would remain the same for Windows Vista or Server 2008.
- First, disable UAC temporarily from the Control Panel User Accounts applet.
- Create a folder named SC (for example).
- Create a text file and put the following text into it: tasklist /m /fi "imagename eq logonui.exe" >c:\sc\result.txt
- Rename the file extension of your text file to .CMD
- Next, go to Start-->Run and type “gpedit.msc” (without the quotes) and press enter.
- Go to User Configuration-->Windows Settings-->Scripts and add your CMD script as a Logon script.
- Log off and then log back on.
- Open your result.txt. Keep in mind that the results will list every instance of LogonUI.exe if more than one is present. Below are the results from my test, which show the default DLLs and no additional credential providers.
Image Name PID Modules
========================= ======== ============================================
LogonUI.exe 5528 ntdll.dll, kernel32.dll, KERNELBASE.dll,
msvcrt.dll, ole32.dll, GDI32.dll,
USER32.dll, LPK.dll, USP10.dll, RPCRT4.dll,
IMM32.DLL, MSCTF.dll, CRYPTBASE.dll,
CLBCatQ.DLL, ADVAPI32.dll, OLEAUT32.dll,
authui.dll, COMCTL32.dll, SHLWAPI.dll,
DUI70.dll, sechost.dll, UxTheme.dll,
gdiplus.dll, DUser.dll, SndVolSSO.DLL,
HID.DLL, MMDevApi.dll, SETUPAPI.dll,
CFGMGR32.dll, DEVOBJ.dll, dwmapi.dll,
xmllite.dll, WindowsCodecs.dll,
WINBRAND.dll, VaultCredProvider.dll,
RpcRtRemote.dll,
SmartcardCredentialProvider.dll,
OLEACC.dll, UIAutomationCore.dll,
PSAPI.DLL, BioCredProv.dll, Secur32.dll,
SSPICLI.DLL, winbio.dll, CRYPT32.dll,
MSASN1.dll, credui.dll, VAULTCLI.dll,
NETAPI32.dll, netutils.dll, srvcli.dll,
wkscli.dll, SAMCLI.DLL,
certCredProvider.dll, CRYPTSP.dll,
rasplap.dll, RASAPI32.dll, rasman.dll,
WS2_32.dll, NSI.dll, rtutils.dll,
rsaenh.dll, SXS.DLL, WTSAPI32.dll,
WINSTA.dll, WinSCard.dll
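If you capture the same result.txt on a known-good machine, comparing the two module lists is a quick way to spot anything extra. Here is a minimal Python sketch of that comparison. The function names are my own, and it assumes only that module names in the tasklist output end in .dll. It cannot tell you whether an extra DLL is actually a credential provider, only that it was not loaded on the baseline machine.

```python
import re

def loaded_modules(tasklist_output):
    """Return the set of lower-cased DLL names found in `tasklist /m` output."""
    names = re.findall(r"[\w.\-]+\.dll", tasklist_output, flags=re.IGNORECASE)
    return {name.lower() for name in names}

def unexpected_modules(current_output, baseline_output):
    """List modules loaded now that were not loaded on the baseline machine."""
    extra = loaded_modules(current_output) - loaded_modules(baseline_output)
    return sorted(extra)
```

Feed it the text of the two result.txt files; whatever comes back is the list to investigate first.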
Where do you go from here if you have noticed that an additional provider is present in the Tasklist.exe result? Try preventing it from being loaded in order to see if that alters the behavior in any positive way. You can prevent the provider from loading typically by backing up (saving) the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Authentication\Credential Providers key and then removing the entries related to that provider from it. Alternatively the provider may have a registered installation and removal in the Programs Control Panel applet.
If the provider removal makes a difference where the problem no longer occurs then contacting the manufacturer of that provider is the next logical step since they may already be aware of the behavior that is being seen.
If removing the provider does not help then you can resume troubleshooting this using other methods like network captures, debug logging or other applicable actions.
Additional credential providers add great capabilities to your environment, and along with these may come a bit more needed info for your troubleshooting arsenal. Hopefully this post has added some for you so that you have it when you need it.
- DCs and Network Address Translation
-
A lot of planning goes into the features and capabilities of each Windows release. Over the years I’ve noticed that there is not a great deal of awareness out in the general public for just how much work and labor goes into a new version of Windows. We’ll most often hear someone say something like “Microsoft comes out with a new version of Windows every few years”….a statement which glosses over the concerted effort of thousands of people in planning, writing, testing and documenting each feature and capability of a new release.
What is even less often discussed is that there are times when the product is put in situations or used for needs that were not tested or planned for in pre-release. It’s a fact of life in IT, though, that business need dictates the applications to which your servers will be put and the situations they will be in. This is summed up best by the quote usually attributed to Helmuth von Moltke, “No plan of battle ever survives contact with the enemy”, though the “enemy” in this situation is really not that at all, but rather our lifeblood at Microsoft: our customers.
Placing Active Directory domain controllers (DCs) in a network segment that is separated by network address translation (NAT) from the rest of the network, and from its peer DCs, is probably one of those situations. There are quite a few distinct business reasons why this sort of network topology will be implemented. The most common is to provide an authenticating domain controller in a demilitarized network zone (DMZ). This can be used to answer authentication requests from application servers which are also placed in the DMZ, or from clients who are connecting from outside the DMZ, perhaps via the internet.
What may be the least preferable reason to place a domain controller in a “NATted” network is to use that network to secure an environment. This can be a controversial statement in light of common guidance and advice around securing perimeter networks. We’ll get into specifics a bit later in this blog post and talk about why there are challenges that make this a less preferred thing to do. A good way of thinking of this is to consider that the Active Directory code (all of those separate components that comprise Active Directory) relies on an underlying foundation of network connectivity to be present and working; a foundation of network connectivity which doesn’t take into account network address translation.
The Microsoft guidance on hosting domain controllers separated by a NATted network is summed up by the article “Active Directory functionality is not supported over a router that has Network Address Translation (NAT) enabled”. That article gives some good general guidance on what network connectivity is required for things to work well.
Now, let’s consider a simple network address translation scenario where domain controllers are involved.
In the picture above we see DC A in our corporate environment and DC B in our DMZ environment. DC A has a network address in the 192.x.x.x network and DC B has a network address in the 172.x.x.x network. In itself this is by no means a problem. The difficulty may come into the scenario when we consider how it is domain controllers are found by clients and by each other: DNS.
Each domain controller will register records in DNS which advertise that the DC can provide services. These are called SRV records. These records in turn map back to a CNAME alias record called the MSDCS record. That record is responsible for providing a resolution to the host record, or A record, which contains the IP address of that DC. The difficult part of a NAT scenario, at least for a domain controller, is that the DMZ DC must register an IP address in its host record which is reachable in some way by its peer domain controller across the network.
Network address translation works to replace IP header information so that the destination IP is actually different than that which the originator knew of. In our scenario a network device or server between the two DCs (represented as a firewall) does the network address translation.
This is an acceptable thing as long as the two servers can ultimately find each other and communicate. But without additional steps taken this won’t happen very well.
DCA = 192.x.x.x
NAT Int = 192.x.x.x
NAT Ext = 172.x.x.x
DMZ_DC = 172.x.x.x
Why is this a problem? Well, DCA can always initiate AD Replication outbound through the NAT device. However, DMZ_DC cannot by default and will fail when it tries. The problem in this scenario is that DMZ_DC resolves the name "DCA" to the internal IP address of 192.x.x.x and cannot reach the DC. That is because NAT is occurring at the network device and if DMZ_DC sends traffic to that IP it simply won’t make it.
In this situation, DMZ_DC needs to resolve the name "DCA" to the NAT Ext IP address in order for this to succeed. However, DCA is registering its name in DNS with its true IP address.
Now, if we manually edit the DNS Host record for DCA on the DMZ_DC side in DNS, it will simply be reregistered by DCA at the next dynamic refresh of DNS by that computer, and then AD Replication will occur in the one way that works consistently and then the problem will exist again.
The idea here is to tell a component on DCA to register both the true IP address AND the NAT Ext IP address (even though the NAT Ext IP address isn't bound to DCA).
We can do this with the DNS Server component. To do that we need to add a registry value on DCA.
HKLM\SYSTEM\CurrentControlSet\Services\DNS\Parameters
Registry Value: PublishAddresses
Registry Value Type: REG_MULTI_SZ
Registry Value Data:<IP addresses>
The data should contain the IP addresses we need to register separated by a line feed. (IP addresses should be entered on different lines)
Why and how will this work? Because the DNS Server component in Windows Server 2003 and Windows Server 2008 has "Netmask Ordering" enabled by default, DNS Server will return both IP addresses to the client but will list the IP on the same subnet as the client first. So the clients on the internal network should choose the correct IP address.
This should also work for the DNS requests from the DMZ_DC. DNS Server will order the list with the IP address on the same subnet as the DNS client that requested the address first, and will then choose the next closest IP subnet/class. So on both internal and external sides, the "client" should choose the correct IP address.
As an additional note, support folks at Microsoft often get asked questions around whether we support specific things or ways of doing things. Placing a DC in a DMZ or NATted area is one of those things. Of course, you can always check to see if we support products by going to our web sites on the Product LifeCycle and that team’s corresponding blog. But when it comes down to it Microsoft Customer Services and Support is here to help you use our products, so we’ll do everything we can, as long as it is commercially reasonable to do so, to help you. That doesn’t necessarily mean that we will always provide the answer you want to hear but we’ll try very hard to.
The above is a scenario that a business can arrive at and deal with in a lot of different ways. The sole intent here is to provide information that can help you use Microsoft products in the ways that you need to. Administrators and planners may need to adapt or add to this scenario as needed but hopefully this gives some knowledge to get you started on that path.
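To make the ordering idea concrete, here is a small Python sketch of it. This is my own illustration, not the DNS Server code. The function name is hypothetical, the IP addresses are made-up stand-ins for the 192.x.x.x and 172.x.x.x placeholders above, and the /24 comparison mask is an assumption. It also only models the "same subnet first" part, not the "next closest" tie-breaking.

```python
import ipaddress

def netmask_order(addresses, client_ip, prefix_len=24):
    """Return the answers with any address on the client's subnet listed first."""
    client_net = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    # False sorts before True, and sorted() is stable, so addresses that share
    # the client's subnet move to the front while the rest keep their order.
    return sorted(addresses, key=lambda a: ipaddress.ip_address(a) not in client_net)

# Hypothetical values: DCA's real address and the NAT Ext address it also publishes.
published = ["192.168.1.10", "172.16.1.5"]
internal_view = netmask_order(published, "192.168.1.50")  # an internal client
dmz_view = netmask_order(published, "172.16.1.20")        # DMZ_DC asking for DCA
```

With these sample values the internal client sees 192.168.1.10 first and DMZ_DC sees 172.16.1.5 first, which is the behavior the PublishAddresses approach depends on.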
- When Smartcard Logon Doesn't
-
Authentication is entering every facet of our lives nowadays. It is common to have multiple passwords: passwords for work, home email, and Internet websites to name a few. It’s easy to have a lot of different passwords, and equally easy to use only one and risk a widespread identity breach. Passwords are one way of guaranteeing the identity of an individual in a communication but that’s just a start. Two factor authentication is becoming more prevalent in the corporate world, and may someday soon be a part of your daily routine in your home life.
For now, two factor authentication is commonly used as a smartcard plus a user specific password in an Active Directory domain authentication context. Let’s take a high level refresher of what a smartcard is and how it’s usually used in an Active Directory domain. A smartcard is essentially a small device which contains a certificate issued to a user. The user accesses that certificate by placing the card into a card reader and then supplying a password (or personal identification number) to gain access to that certificate prior to requesting authentication. This certificate, in lieu of the traditional password string of text, is used in communication with the domain for user logon and authentication.
Smartcard logon works in part by having a certificate based on the Domain Controller template in the local computer certificate store of each authenticating domain controller. In the more straightforward scenario of an Enterprise Certificate Authority, where information regarding the installed CA is stored in the forest AD, the domain controller certificate is auto enrolled to the domain controller as a matter of course. That can make for a nice starting place for configuring smartcard logon to work in your environment.
What if you are a company that maintains a separate certificate authority (CA) from some or all forests and would like to use that CA as an issuer for your smartcard certificates? There are some clear benefits to doing things that way. Foremost would be the ability to use one CA to allow smartcard logon for the users in different forests. This can be useful when you have one unifying corporate structure which has very distinct and separately managed child companies and would also allow for some central control over authentication standards.
Better written and more technical guidance on smartcard logon for domains and how to do it is in the book Windows Server® 2008 PKI and Certificate Security, and also in the KB article Guidelines for enabling smart card logon with third-party certification authorities.
The point of this post is not to discuss the value of configuring smartcard logon and how to do it, but rather to talk about what to do when a specific problem involving smartcard logon occurs.
This is a fairly lengthy premise for a specific problem that you could see: smartcard logon failing while ‘traditional’ credential logon of username plus password succeeds.
There are a few different causes that can make this sort of thing happen but the things you want to look at in order to diagnose what is happening are all approximately the same.
First, when did it happen? This can be a useful piece of information since it can suggest what the cause was. Did the problem start after a reboot of the domain controllers? Did some or all of them seem to fail within a short span of time without a reboot?
In today’s scenario we’ll put forward that the issue occurred following a reboot of the domain controllers, and that we see some interesting events (below) in the System event log of the domain controllers which seem to relate to the problem.
Event Type: Error
Event Source: KDC
Event Category: None
Event ID: 7
Description:
The Security Account Manager failed a KDC request in an unexpected way. The error
is in the data field. The account name was host$ and lookup type
0x0.
..and…
Event Type: Error
Event Source: KDC
Event Category: None
Event ID: 19
User: N/A
Computer: <Computer Name>
Description: This event indicates an attempt was made to use smartcard logon, but
the KDC is unable to use the PKINIT protocol because it is missing a suitable
certificate.
Besides the event logs and the events above one of the most useful tools for this type of issue is Certutil.exe. Certutil.exe is the tool to use in situations where you need to look into the “health” of the certificates in a store.
For this situation you would want to run the command
certutil -verifystore my >certverify.txt
In today’s scenario we do that and see that the private key for the Domain Controller certificate doesn’t appear to be there.
================ Certificate 0 ================
Serial Number: <snip>
Issuer: CN=MS CertSrv Test Group CA, OU=Windows NT, O=Microsoft, L=Las Colinas, S=TX, C=US
Subject: CN=DC5.child2.domain1.com, CN=MS CertSrv Test Group CA, OU=Windows NT, O=Microsoft, L=Las Colinas, S=TX, C=US
Certificate Template Name: Domain Controller
Non-root Certificate
Template: Domain Controller
Cert Hash: <snip>
No key provider information
Missing stored keyset <---MISSING
Verified Issuance Policies:<snip>
Verified Application Policies:
1.3.6.1.5.5.7.3.2 Client Authentication
1.3.6.1.5.5.7.3.1 Server Authentication
Certificate is valid
CertUtil: -verifystore command completed successfully.
So the above appears to be the problem: the private key is missing. Oddly, when we run the verifystore on the other domain controllers of the domain we see the same subject reference of CN=DC5.child2.domain1.com. That in itself is a problem since every domain controller should have its own uniquely issued domain controller certificate.
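If you have to check a lot of domain controllers, reading each certverify.txt by eye gets old quickly. The Python sketch below is my own helper, not a Microsoft tool. It pulls out the two things we just looked at: whether a certificate reports "Missing stored keyset", and whether the first CN in the Subject matches the host the file came from. It assumes the output has the same layout as the sample above.

```python
import re

def summarize_store(verifystore_output):
    """Return one dict per certificate with its subject and private key status."""
    certs = []
    # Each certificate section starts with a "==== Certificate N ====" banner.
    sections = re.split(r"=+ Certificate \d+ =+", verifystore_output)[1:]
    for section in sections:
        subject = re.search(r"^\s*Subject:\s*(.+)$", section, flags=re.MULTILINE)
        certs.append({
            "subject": subject.group(1).strip() if subject else None,
            "missing_key": "Missing stored keyset" in section,
        })
    return certs

def issued_to(cert, host_fqdn):
    """True when the first RDN of the subject is CN=<host_fqdn>."""
    first_rdn = (cert["subject"] or "").split(",")[0].strip().lower()
    return first_rdn == f"cn={host_fqdn.lower()}"
```

A certificate that is flagged as missing its key, or that is not issued to the DC it was found on, is the one to chase.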
For DC5 perhaps the association to the private key is the only thing missing. In other words, perhaps the private key and rest of the certificate are both there but just not “linked” to each other for some reason. What can we do to repair that if it can be repaired?
Certutil.exe to the rescue once more. We can use that tool to repair things with the command below, using the serial number value found in the verifystore command. For the domain controller, however, we would need to do that in the DC context which is something you should be able to achieve by using the AT command, either launching a command prompt interactively if allowed or by putting the command below into a batch file and running it that way.
certutil -repairstore my "serial number"
Voilà! We reboot DC5 and suddenly it can service smartcard logon requests. The other domain controllers are another matter though. In a situation such as this the other domain controllers must go through the entire request process for their own Domain Controller certificates.
So what can we hypothesize happened in this scenario? Domain Controller template based certificates are issued to specific hosts and cannot be used on a computer other than the one they are issued to. Since the Subject field of the Domain Controller certificate on every affected domain controller contained the host name of DC5, we can guess that someone exported the certificate from DC5 and imported it to all of the domain controllers in the domain. For the DCs which were not DC5 of course that certificate would never work, and barring the little host uniqueness aspect there the default Domain Controller certificate template does not allow for private key export anyway (which is a good thing). That may also explain why the DC5 certificate had a problem associated with its own private key: the certificate was exported without it, and then that exported certificate was imported in, thus breaking the association.
More often you’ll see similar behavior if you are using a third party or non-Enterprise certificate authority and the domain controller certificates expire. That’s a problem that folks with Microsoft Enterprise CAs are not likely to see since the domain controllers will auto enroll in those certificates.
The scenario discussed here is by no means a common one. I’m passing it along, though, to lend some insight into some AD and authentication specific behavior and some troubleshooting that can be applied to a variety of similar issues. Hopefully you won’t see problems, but if you do then let’s hope this info helps you out.
- Taking Out The Trash
-
There will be times when you have to make big changes in your Active Directory. Sometimes those big changes mean deleting a lot of objects. I’ve personally needed to match customer environments by creating tens of thousands of AD objects just to have the beginnings of a matching environment. For my test forests I can leave those objects around after I’m done and not have to worry about things.
But if I have a production forest I will probably want to delete unused objects. I’ll also want to reclaim that disk space and the possible performance from indexes that might be filled with these remaining object references.
A more pointed scenario would be someone who had a maverick provisioning software that created a massive number of unwanted new user objects. These objects would replicate throughout the domain they are in as well as into the global catalogs throughout the forest and would bloat the AD database.
Such a thing could increase a 50 MB Active Directory database to a 50 GB one in pretty short order.
Whatever the chain of events that got you to this point, you are now in the position of cleanup: deleting all of the unwanted objects. Let’s take a moment to do a quick and high level run through of how the object deletion process works in AD. When you delete an object in Active Directory Users and Computers, what really happens is that all but a few attributes of that object are discarded, the object is moved to the Deleted Objects container, and it receives a time stamp showing when it was marked for deletion.
This object is retained for a length of time. That time is known as the tombstone lifetime (TSL). At the end of that length of time that object will be removed by a thread that runs on each domain controller at startup and about every 12 hours afterward. That thread is called Garbage Collection. Picture it as a dumpster carrying trash truck that pulls up to each deleted object and quickly examines them to see if the deletion time on them is greater than the TSL or not. If they are then the object goes into the dumpster (figuratively speaking) and is finally deleted and removed from the database. This process is also explained here.
In order to do that very quickly, without having to wait out the TSL of 180 or 60 days, you would have to do something we don’t recommend: alter your tombstone lifetime (TSL) to a shorter interval so that garbage collection will remove the objects sooner the next time it runs. We have a KB article that talks about the problems you can see altering this value and why it is generally a bad idea. For the sake of this article we’re going to assume you’re either a Cowboy Admin and have lowered TSL to a small length of time despite Microsoft recommendations, or you have the patience of a saint (Saint Admin? Feels like we should have a patron saint, doesn’t it?).
But once it’s complete you notice that, while the DIT has decreased a little bit, it hasn’t gotten close to the original size. What’s going on? Didn’t the garbage collection take out all the trash?
The reason that the DIT only decreased a small amount was the result of the dumpster being too small to fit all of that large set of deleted objects into it. There are simply too many deleted objects (which were deleted longer ago than the tombstone lifetime) to fit into the dumpster. Seriously.
When the garbage collection thread runs it takes a batch of 5000 objects that match the criteria of having been deleted greater than the tombstone lifetime ago. Once it has removed that batch from the database it will pause in order to let more important AD business take place. What this means in reality is that only that batch may be done during one garbage collection interval of 12 hours and then you would have to wait for the next collection to see the next 5000 get removed.
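A little arithmetic shows why this matters. The Python sketch below uses only the two figures mentioned above (5000 objects per run, one scheduled run about every 12 hours) and assumes exactly one batch per interval with no manual runs. The function names are mine, and real timings will vary.

```python
import math

BATCH_SIZE = 5000      # objects removed per garbage collection run, per the text above
INTERVAL_HOURS = 12    # time between scheduled runs, per the text above

def scheduled_runs_needed(expired_count):
    """Number of garbage collection runs needed to clear the expired tombstones."""
    return math.ceil(expired_count / BATCH_SIZE)

def approx_days(expired_count):
    """Rough elapsed days if only one batch is removed per scheduled interval."""
    return scheduled_runs_needed(expired_count) * INTERVAL_HOURS / 24
```

Under those assumptions, one million expired tombstones would need 200 runs, or roughly 100 days of scheduled collection on each domain controller.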
Is there a way to speed that process up if you need to? Yes, there is.
You can initiate garbage collection manually by using a published LDAP control. This doesn’t alter what objects are collected, nor does it alter how many go into the dumpster. It simply says to do garbage collection right then rather than waiting until the next 12 hour interval has passed.
You can use LDP.EXE to do the garbage collection control. Here are the steps:
1. In Ldp.exe, click Modify on the Browse menu and leave the Distinguished name box empty.
2. In the Edit Entry Attribute box, type "DoGarbageCollection" (without the quotation marks).
3. In the Values box, type "1" (without the quotation marks).
4. Set the Operation value to Add, click the Enter button, and then click Run.
It’s possible that the garbage collection you start using the above method could stop in favor of more important tasks like AD replication in the same way as the scheduled garbage collection does. If that happens you can simply repeat the garbage collection steps above until all of the objects are removed.
How can you tell if they are removed? We have a KB article which goes over how to view your deleted objects. Take note that you may need to alter the size limit variable if you have a large number of deleted objects.
What about all of that free space? Can we get it all back just by doing the garbage collection and removing all of the objects that qualify?
Online defrags may reclaim some of that space (an online defrag will occur as part of the garbage collection), but the best thing to do is reboot to Directory Services Restore Mode (DSRM) and run an offline defragmentation of the database.
Keep in mind that garbage collection is not replicated in any way. In other words, the routine you go through for garbage collection and database defragmentation needs to be performed on each domain controller individually. It would not necessarily be a problem to only force garbage collection on some domain controllers and not others but of course you may see performance differences between those that have had the trash taken out and those that haven’t yet.
For IT folks who are remote from their domain controllers there’s a nice little option for booting to DSRM without having to resort to using the F8 on the keyboard (which may be thousands of miles away). Just go to your Run command and type MSCONFIG and press enter. Here’s the option in Server 2003:
…and for Server 2008:
This scenario may not necessarily result from a sudden creation of a huge number of objects but could be the result of a gradual database increase over years in production. The resolution portion will not be different in either case. Follow those same steps and just take out the trash.
- Downgrade "Attack"? A little more info
-
I decided that we needed some more detail and to give a walk through scenario on this downgrade attack deal I mentioned a while back in a blog post.
As a recap, a customer called in after noticing the events below appearing intermittently but repeatedly, and always in sequence one after the other, in the System event log:
Event Type: Warning
Event Source: LSASRV
Event Category: SPNEGO (Negotiator)
Event ID: 40960
Date: 01/01/2009
Time: 8:07:01 PM
User: N/A
Computer: FS123
Description: The Security System detected an attempted downgrade attack for server cifs/dc5.sales.adatum.com. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request."
Event Type: Warning
Event Source: LSASRV
Event Category: SPNEGO (Negotiator)
Event ID: 40961
Date: 01/01/2009
Time: 8:07:01 PM
User: N/A
Computer: FS123
Description: The Security System could not establish a secured connection with the server cifs/dc5.sales.adatum.com. No authentication protocol was available.
Of course the part that was most alarming was the attempted downgrade attack text. Attack is not a very friendly sounding word and usually implies that there is a person or identity behind the attack, instigating it. Naturally this is something an administrator would want to follow up on!
Let’s start by defining what a downgrade attack can be. A downgrade attack would be where a connection to obtain a resource starts with a more secure method of authentication but for some reason must settle for a less secure method in order to authenticate and gain access to the resource. Kerberos, for example, is a more secure authentication method than NTLM and hence is preferentially selected in security negotiation in every situation where it can be.
The word “attack” though suggests that in every case where we attempt Kerberos and end up using NTLM there was a malicious entity behind that when there generally would not be. There are situations where the 40960 and 40961 event sequence will be useful in identifying actual maliciously inspired behavior but for the most part the cause will be something far less dramatic or evil sounding.
A quick search on the interwebs finds several references to these events. The most informative is here. This Technet event description does a good job of telling us that there can be multiple causes of this event and suggests that it should appear in the event reason code info. The example given is STATUS_NO_LOGON_SERVERS. This is an excellent example since it is probably the most common instigator of this series of events.
So let’s go over a scenario where the 40960 and 40961 can occur from STATUS_NO_LOGON_SERVERS. Picture our file server FS123 doing its normal business as a domain joined member server when a user on it or a service on it suddenly needs to access a file on DC5. FS123 keeps track of where domain controllers for its domain (Sales) are located by having a cache of this information which is maintained by the Netlogon service, and this cache contains information on where a responsive KDC for Sales is on the network.
So naturally when FS123 attempts to access a file on DC5 it negotiates Kerberos as the selected authentication method. That negotiation is what you will see on the network as SPNEGO information embedded in the SMB traffic from FS123 to DC5 and back. When DC5 responds in that SPNEGO response that it supports Kerberos FS123 knows that it needs to get a ticket for DC5 for the file service. In other words it needs a ticket for the service principal name of cifs/dc5.sales.adatum.com, and so FS123 sends that request out to the KDC it knows of in its cache.
But here’s where a problem comes in: the KDC it knows of is suddenly not responsive. As a result the Netlogon service returns a status saying so to the file request: STATUS_NO_LOGON_SERVERS. The file request then must be completed using another authentication method like NTLM. Our events 40960 and 40961 are then logged in order to show that we attempted the more secure authentication method but were not successful.
In our scenario above the file access and the application or user who initiated it probably succeeded in getting access to that file or files without ever noticing this transaction or a delay. But that leaves us with some questions around why that occurred in the first place? Why were we not able to use Kerberos?
The most common cause for this if the events are seen intermittently is that there is a transient network problem between the client (in our scenario FS123) and the DC it is looking to at that time for authentication. There could be many other causes making that DC less responsive, up to and including the domain controller seeing a performance “spike” and becoming too busy to respond quickly to the Kerberos ticket request from FS123.
From the FS123 side of things the Netlogon service will actually locate a new, more responsive DC when these things occur but there will be a short interval where things like this may happen. That’s the window where occasional events from our topic occur.
So how can you use this information? This can be used as a guideline to understand whether there is a transient issue going on or perhaps an actual intrusion where someone is making the authentication method used for connections intentionally less secure in order to more easily break it. The former (transient issues resulting in our 40960 and 40961 event sequence) is not a surprising thing to see occasionally in an enterprise environment. The latter (maliciously intentional cause) is rare to say the least but a good administrator slash security person will explore each and every one of these events. To do that simply enable Netlogon debug logging on the servers or workstations that see the events and look for corresponding errors occurring at the same time as the events, or look through the event logs for other corresponding events at or around that time.
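To make that "look for corresponding errors at the same time" step a little less manual, here is a minimal sketch in Python. It assumes that Netlogon debug logging has been turned on (commonly done with nltest /dbflag:0x2080ffff) and that each line of the resulting netlogon.log starts with a month/day and time stamp followed by a bracketed category. The sample lines and the function name are hypothetical, made up purely for illustration, and the log layout can differ between Windows versions.

```python
from datetime import datetime, timedelta

def lines_near_event(log_lines, event_time, window_seconds=60):
    """Return log lines stamped within window_seconds of event_time."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in log_lines:
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        try:
            # Assumed stamp layout "MM/DD HH:MM:SS" with no year,
            # so the year is borrowed from the event being investigated.
            stamp = datetime.strptime(
                f"{event_time.year}/{parts[0]} {parts[1]}",
                "%Y/%m/%d %H:%M:%S",
            )
        except ValueError:
            continue  # not a stamped line, skip it
        if abs(stamp - event_time) <= window:
            hits.append(line.rstrip())
    return hits

# Hypothetical log lines, invented for this example.
sample_log = [
    "01/01 20:05:10 [MISC] hypothetical line well before the event",
    "01/01 20:06:58 [CRITICAL] hypothetical failure line near the event",
    "01/01 20:30:00 [MISC] hypothetical line well after the event",
]

# Time of the 40960 event shown earlier in this post.
event_40960 = datetime(2009, 1, 1, 20, 7, 1)
nearby = lines_near_event(sample_log, event_40960)
for line in nearby:
    print(line)
```

Against a real log you would pass in the lines read from the file and widen or narrow the window to taste. Events logged right around New Year would need extra care since the stamps carry no year.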
As a post script, I’ve gotten several great questions from folks via the blog over the past few weeks. I intend to respond to them but have to confess it may be delayed-my apologies for that folks.
- VSS Snapshots and You
-
I find myself doing blog posts on things that are not frequent enough for most experienced admins to be aware of since it wouldn’t come across their desk often. The reason for that is that in my role I receive the least common unresolved issues that occur from our customers. When I receive a few of them over the years I feel that there can be some value in documenting them informally on the blog.
This one is a case in point. For years we’ve used the Volume Shadow Copy Service as the foundation of the backups that we do in our product. I will not claim to be an overall expert in VSS (if you want to be you can go here) but I do want to relate to you how VSS can affect your domain controllers’ services in some cases during a backup.
Consider a scenario where you have two domain controllers for a specific AD site. These two DCs provide services to an application server and do nothing else: no workstations are in that site, no other member servers, just that one application server and the single application running on it. As part of what that application needs to do it sends a frequent, high volume of LDAP and authentication queries to the domain controllers. This keeps the DCs quite busy on a relatively constant basis, busy enough to show the beginnings of noticeable performance bottlenecks on the DCs’ disks during normal usage.
Now add to the scenario a system state backup (Windows Backup or NTBackup) taken on one of the DCs. During the start of the backup you notice that some of the application queries and authentication requests fail for a short period. Not all of them, but enough showing up in your application server’s event logs that it raises a concern. You may not even have end user complaints but it is noticed as an issue. These failures only occur during the “preparation” phase of the DC system state backups, interestingly enough, and never exceed 60 seconds.
Preparing to backup

So what is going on?
To understand what is happening we need to understand a few details about how VSS works. VSS basically takes snapshots of the disk data at the time it runs. It is advertised as a seamless backup service-meaning no interruption-because this snapshot is quickly taken and all backup writing and details take place by working with the snapshot, not the live data on disk. This allows the backup process to be seamless to the user since normal services are not being interrupted throughout the backup. The snapshot is what is happening in the ‘preparation’ phase of taking a system state backup using VSS capable backup utilities like Windows Backup.
However, while the snapshot is being taken VSS imposes a temporary halt to disk writes while still allowing disk reads. To be more precise, there is an Active Directory implementation of the VSS snapshot API which works with VSS to do this. Other applications which use ESE databases, like Exchange, have their own implementation of the snapshot code as well. For AD that means there is a short interval where no database writes can take place. This period is typically so short as to not be noticed 99.999% of the time, but there are factors which can make this period longer.
Those factors are:
· High disk utilization taking place, indicated by average disk queue lengths being long. This factor would likely be occurring at all times but would spike during the backup process.
· An application which has a low timeout threshold, client side, for its requests and no retry or failover behaviors in case of a temporary lack of response for an action from the DC.
· Lower memory conditions where more of the database is paged onto disk (page file) and would require more disk access to read in order for the snapshot to proceed.
As mentioned above, Exchange has some well documented information about this. That info can be found in the Knowledge Base and at MSDN.
So how can you tell if the behavior you are seeing is related to a scenario like this? Look in Event Viewer for the ESE (database) Freeze and Thaw events that appear in the Application event log during the preparation phase of your backups.
When the backup preparation begins you will see the ESE (remember that ESE is the type of database AD runs as) source event below:

When it ends you will see its companion event:

Note that in some cases the event 2003 above will have slightly different wording which includes the word “thaw”.
More information can be found here. The Freeze and Thaw intervals correspond to the preparation phase of the backup. The pertinent snippet from the above MSDN article is:
Shadow Copy Freeze and Thaw
The creation of every VSS shadow copy operation is bracketed by Freeze and Thaw events, which writers use to put their files in a stable state prior to shadow copy.
Having Freeze and Thaw events as part of the VSS model means:
Handling the Freeze event means that those who are developing writers must have a clearly delineated point in the backup cycle where they ensure that all write operations to the disk are stopped and that files are in a well-defined state for backup.
Handling the Thaw event provides the mechanism for writers to resume writes to the disk and clean up any temporary files or other temporary state information that were created in association with the shadow copy.
The default window between the Freeze and Thaw events is short (typically 60 seconds); therefore, actual interruption of any service that a writer provides can be minimized.
Handling of other events (such as PrepareForSnapshot) preceding and following the Freeze and Thaw events, respectively, provides the necessary flexibility to allow writers to complete complicated operations to support shadow copies.
How can you tell that this issue is affecting you? If you have application side behavior that corresponds to the events 2001 and 2003 then it’s time to do some performance logging on your domain controllers and look for performance bottlenecks. Server Performance Advisor, or the Perfmon AD Data Collector in Server 2008, run during the backup are also good tools for getting a handle on what is going on.
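The correlation itself is simple enough to sketch. The Python below assumes you have already pulled the time of the freeze event (2001), the time of the thaw event (2003), and the times of the application failures out of the respective event logs by whatever means you prefer. All of the timestamps and names here are hypothetical and only show the comparison.

```python
from datetime import datetime

def failures_in_freeze(freeze_start, thaw_end, failure_times):
    """Return the failure times that fall inside the freeze/thaw window."""
    return [t for t in failure_times if freeze_start <= t <= thaw_end]

# Hypothetical times for ESE events 2001 (freeze) and 2003 (thaw).
freeze = datetime(2009, 3, 1, 2, 0, 5)
thaw = datetime(2009, 3, 1, 2, 0, 47)

# Hypothetical failure times taken from the application server's logs.
failures = [
    datetime(2009, 3, 1, 1, 59, 50),
    datetime(2009, 3, 1, 2, 0, 20),
    datetime(2009, 3, 1, 2, 0, 45),
    datetime(2009, 3, 1, 2, 5, 0),
]

inside = failures_in_freeze(freeze, thaw, failures)
window_seconds = (thaw - freeze).total_seconds()
share = len(inside) / len(failures)
print(f"freeze window: {window_seconds:.0f}s, "
      f"{len(inside)} of {len(failures)} failures inside it ({share:.0%})")
```

If most of the failures land inside the window night after night, that is a good hint that the performance logging is worth the effort. If they are scattered all over the clock, the backup is probably not your culprit.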
What can you do if you have verified that you are seeing this unusual issue? Here’s what I would recommend:
· Alter the application behavior to better accommodate an occasional delay in server responses from DC.
· Consider moving to x64 platform for the DCs, with more RAM and augmented by more robust drives and network devices. This should make the VSS freeze and thaw intervals even less perceptible.
· Decrease the frequency of the backups for those domain controllers only as a last resort.
Hopefully this helps in another less common scenario and gives a better understanding of how things work under the hood in AD.
- Gauging Size Differences in AD Databases
-
We occasionally receive support calls which revolve around the topics of “why is the Active Directory database on DC A different in size than that on DC B?”. It’s easy to dismiss the question out of hand but there are real life scenarios where this can be an important question. And there are real life AD uses that can bring you to the point where you are asking that question.
In previous blog posts, for different reasons, I’ve occasionally touched upon the fact that AD is stored essentially on one monolithic file, the NTDS.DIT. Each domain controller in a domain and forest contains its own copy of that database that is continually updated with changes from its peers. That updating process is termed AD replication.
What causes occasional consternation is when it is noticed that the file sizes of the Active Directory database files are different from domain controller to domain controller. Or that the Active Directory database is suddenly and unexpectedly growing at a quick rate on all DCs. In itself the database size being different is not a bad thing; they will never be precisely the same size on disk. There is actually a term for excessive database growth that can be applied to extreme examples where one or more DCs are growing in database size much more quickly than others: DIT bloat. For those times when there is a larger discrepancy this blog post will give you a few techniques you can apply to get more information about what is happening and why.
Since most data in AD needs to be kept consistent from DC to DC we’ll start with AD replication. Naturally repadmin.exe is our weapon of choice here, as it is the Swiss Army knife of AD replication. Repadmin.exe is primarily useful in sizing questions when used to verify that the different replicas are actually in synch.
Consider a forest that has a few global catalogs that are noticed to have AD databases which are substantially (say fifty percent) smaller than their peers. That is certainly worth looking into. That discrepancy can occur as the result of those GCs being out of synch for an extended period or following a migration or provisioning. In other words the databases aren’t the same size because they haven’t received all of the updates that would make them larger, or have updates that should have been garbage collected but have not.
As in other instances we would simply want to use the command below to see if the GCs with larger directory sizes on disk are in synch or not with their peers:
Repadmin /showrepl * /csv >repl.csv
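Once you have repl.csv you can open it in a spreadsheet, or let a few lines of script point at the trouble spots. The Python sketch below assumes the CSV carries column headers named "Destination DSA", "Source DSA", "Naming Context" and "Number of Failures". Check the first line of your own file, since those header names are my assumption here. The sample rows and server names are hypothetical.

```python
import csv
import io

def failing_links(csv_file):
    """Return (destination, source, naming context, failures) for failing links."""
    failing = []
    for row in csv.DictReader(csv_file):
        raw = (row.get("Number of Failures") or "").strip()
        if raw.isdigit() and int(raw) > 0:
            failing.append((
                row.get("Destination DSA"),
                row.get("Source DSA"),
                row.get("Naming Context"),
                int(raw),
            ))
    return failing

# Hypothetical sample in the assumed layout; real files will have more rows.
sample_csv = io.StringIO(
    "showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,"
    "Source DSA Site,Source DSA,Transport Type,Number of Failures,"
    "Last Failure Time,Last Success Time,Last Failure Status\n"
    'showrepl_INFO,SiteA,GC1,"DC=sales,DC=adatum,DC=com",SiteA,DC5,RPC,0,0,'
    "2009-02-14 10:00:00,0\n"
    'showrepl_INFO,SiteB,GC2,"DC=sales,DC=adatum,DC=com",SiteA,DC5,RPC,12,'
    "2009-02-14 09:55:00,2009-01-02 03:00:00,1722\n"
)

problems = failing_links(sample_csv)
for dest, src, nc, count in problems:
    print(f"{dest} has failed {count} times pulling {nc} from {src}")
```

For a real file you would pass open("repl.csv", newline="") instead of the in-memory sample. A GC that shows up here with a long run of failures is a good candidate for explaining a database that is out of step in size.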
Some data in Active Directory is not replicated however. That may sound strange but is true nonetheless. Examples of data that is not replicated are the indexes used to assist database searches. Though the data which is referenced in the index is replicated the index itself will be compiled and to a certain extent unique on each domain controller replica. This is true even when an attribute is arbitrarily indexed by someone for their own business needs. Marking an attribute to be indexed, and how it can be indexed, is a change to the schema object for that attribute and is replicated via AD replication of the schema naming context. The actual index itself, once made, is never replicated but is instead compiled and maintained separately on each individual DC of the forest.
This may lend itself to a certain amount of difference in each DC’s database size on disk. In most cases the index sizes should not be seen to be enormously different and if they are it could indicate that the database needs to be checked for problems.
To do that we should use NTDSUTIL’s nice command of Files-->Integrity after booting to Directory Services Restore Mode. When run this command gives a file %systemroot%\ntds\NTDS.INTEG.RAW which may have interesting data about the database health.
In addition it can also help to use the Semantic Database Analysis “Go Fixup” command. It too puts out a little more information; that information is stored in a sequentially numbered file named DSDIT.DMP.X.
There may be a few people chafing at the bit, waiting for me to talk about another situation: when the database suddenly begins growing at such a rate as to raise concerns about filling up very large hard disks. How do you find out what is taking up so much space and what is causing that behavior?
Of course you could always monitor replication, but you already knew that. One different method would be to use the DSASTAT.EXE tool. DSASTAT has been around as long as AD has but is a tool that has a limited range of uses and is not well known. The syntax you can use is a little self evident:
C:\>dsastat -loglevel:debug -output:both
Most DSASTAT information you get from that command will not be useful for this concern but the final information which appears under the header of “DSA Diagnostics” may be. Here’s a snippet from that:
-=>>|*** DSA Diagnostics ***|<<=-
Objects per server:
Obj/Svr ADFSACCOUNT Total
builtinDomain 1 1
classStore 1 1
computer 2 2
container 82 82
dfsConfiguration 1 1
<snip>
organizationalUnit 2 2
rIDManager 1 1
rIDSet 1 1
rpcContainer 1 1
samServer 1 1
secret 5 5
user 8 8
---
Total: 201 201
. . . . . . . . . . . . . .
Bytes per object:
Object Bytes
builtinDomain 161
classStore 155
computer 1164
container 15225
<snip>
organizationalUnit 465
rIDManager 153
rIDSet 135
rpcContainer 164
samServer 153
secret 956
user 2328
. . . . . . . . . . . . . .
Bytes per server:
Server Bytes
ADFSACCOUNT 49586
Information from DSASTAT is a snapshot of the state of a domain controller at a particular time. For unfettered growth issues it’s going to be more useful to get a sequence of snapshots using DSASTAT taken over a period of time that the growth is seen to occur within. Once you have them it’s a simple matter to compare them and see which number is getting bigger progressively over time.
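Comparing the snapshots by eye works, but here is a minimal Python sketch of the same idea. It assumes you have saved the "Bytes per object" portion of each DSASTAT run as plain lines of "name number", the way the snippet above shows them. The second snapshot's numbers are hypothetical, and so are the function names.

```python
def parse_bytes_per_object(lines):
    """Turn 'objectClass bytes' lines into a dict, skipping anything else."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            sizes[fields[0]] = int(fields[1])
    return sizes

def growth(before, after):
    """List (name, bytes gained) for everything that grew, biggest first."""
    deltas = {name: after[name] - before.get(name, 0) for name in after}
    grew = [(name, delta) for name, delta in deltas.items() if delta > 0]
    return sorted(grew, key=lambda item: item[1], reverse=True)

# First snapshot: values taken from the DSASTAT snippet above.
first = parse_bytes_per_object([
    "Object Bytes",
    "computer 1164",
    "container 15225",
    "user 2328",
])

# Second snapshot: hypothetical values from a later run.
second = parse_bytes_per_object([
    "Object Bytes",
    "computer 1164",
    "container 15225",
    "user 9312",
])

growing = growth(first, second)
for name, delta in growing:
    print(f"{name} grew by {delta} bytes")
```

In this made-up case the user class is the only one growing, which would send you off to find out what is being written to user objects.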
We have another tool in our arsenal for AD sizing concerns: ESENTUTL /MS. I believe I’ve mentioned how NTDSUTIL.EXE is the AD specific version of ESENTUTL.EXE before, and that generally it’s a bad idea to use ESENTUTL rather than NTDSUTIL. This is an exception to that rule. The value the ESENTUTL /MS command gives is that you can see the size of the indexes which is not something the DSASTAT command above gives. Here’s a sample from that tool:
Microsoft(R) Windows(R) Database Utilities
Version 5.2
Copyright (C) Microsoft Corporation. All Rights Reserved.
Initiating FILE DUMP mode...
Database: c:\windows\ntds\ntds.dit
******************************** SPACE DUMP ***********************************
Name Type ObjidFDP PgnoFDP PriExt Owned Available
===============================================================================
c:\windows\ntds\ntds.di Db 1 1 256-m 1536 210
datatable Tbl 8 35 90-m 1126 45
<Long Values> LV 66 86 1-m 126 41
Ancestors_index Idx 15 42 1-m 19 0
clean_index Idx 28 46 1-s 1 0
deltime_index Idx 12 39 1-s 1 0
DRA_USN_CREATED_index Idx 14 41 1-m 13 0
DRA_USN_CRITICAL_inde Idx 30 48 1-s 1 0
DRA_USN_index Idx 29 47 1-m 13 0
INDEX_00000003 Idx 118 693 1-m 41 10
<snip>
MSysUnicodeFixupVer1 Tbl 6 33 2-s 2 0
secondary Idx 7 34 1-s 1 0
quota_rebuild_progress_ Tbl 125 911 2-s 2 1
quota_table Tbl 124 909 2-s 2 1
sdproptable Tbl 19 237 2-m 6 1
clientid_index Idx 21 241 1-s 1 0
trim_index Idx 20 238 1-s 1 0
sd_table Tbl 22 243 2-m 36 11
<Long Values> LV 123 713 1-m 18 5
sd_hash_index Idx 23 244 1-s 1 0
-------------------------------------------------------------------------------
463
Operation completed successfully in 1.372 seconds.
More data than you were likely hoping for appears in the above result but the key takeaway is to look for the growing numbers in the Owned columns and perhaps decreasing numbers in the Available columns. Owned relates to the total number of pages in the database for that index or table that contain data. Available is the amount of space left for growth. Similar to the DSASTAT we can run the ESENTUTL /MS commands sequentially over a period of database growth to see which Owned column is increasing over time.
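Here is a minimal Python sketch of that comparison. It assumes the space dump rows look like the sample above (name, type, then five more columns ending in Owned and Available) and reads each row from the right so that names containing spaces, like the Long Values rows, survive. It also assumes 8 KB database pages, which is the page size the AD database uses, in order to turn page counts into bytes. The second set of numbers is hypothetical.

```python
PAGE_BYTES = 8192  # assumption: 8 KB pages, as used by the AD database
ROW_TYPES = {"Db", "Tbl", "Idx", "LV"}

def parse_space_dump(lines):
    """Map (name, ObjidFDP) to Owned pages for each table or index row."""
    owned = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 7 or fields[-6] not in ROW_TYPES:
            continue
        if not all(fields[i].isdigit() for i in (-1, -2, -5)):
            continue
        name = " ".join(fields[:-6])  # everything left of the Type column
        owned[(name, int(fields[-5]))] = int(fields[-2])
    return owned

def fastest_growing(earlier, later):
    """List (name, pages gained) for rows whose Owned count rose."""
    deltas = [(key[0], later[key] - earlier.get(key, 0)) for key in later]
    return sorted((d for d in deltas if d[1] > 0),
                  key=lambda d: d[1], reverse=True)

# Rows copied from the sample output above.
earlier = parse_space_dump([
    "Name Type ObjidFDP PgnoFDP PriExt Owned Available",
    r"c:\windows\ntds\ntds.di Db 1 1 256-m 1536 210",
    "datatable Tbl 8 35 90-m 1126 45",
    "<Long Values> LV 66 86 1-m 126 41",
    "INDEX_00000003 Idx 118 693 1-m 41 10",
])

# Hypothetical later run of the same command.
later = parse_space_dump([
    r"c:\windows\ntds\ntds.di Db 1 1 256-m 1900 150",
    "datatable Tbl 8 35 90-m 1400 20",
    "<Long Values> LV 66 86 1-m 126 41",
    "INDEX_00000003 Idx 118 693 1-m 60 2",
])

growth = fastest_growing(earlier, later)
for name, pages in growth:
    print(f"{name}: +{pages} pages (about {pages * PAGE_BYTES} bytes)")
```

As a sanity check on the arithmetic, the 1536 Owned pages on the database row of the sample work out to 1536 x 8192 = 12,582,912 bytes, or 12 MB.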
A veritable flood of ESE data can be found here. Be careful not to overdose on database specific information when reading that article.
In this post we’ve gone over a few different things which can easily be done to get a handle on or better understand your Active Directory database. It’s important to keep in mind that how you use your database is the most relevant piece of information you can apply to any database concern you see. The ‘classic’ example is placing photos of the user into the user object in AD. This is sure to increase the size of the object itself, and greatly increase each replica’s NTDS.DIT.
Until the next post, take care out there, and a belated Happy Valentine’s Day!
- Tabula Rasa
-
I was well and truly stumped a few months ago. I joke that once a year I am flat out wrong, and rarely do I have nothing to say on a subject. The 'once a year I may be flat out wrong' statement may be true simply because after 15 years in the IT industry I’ve learned to avoid letting broad definitive statements out of my mouth unless I am certain. I also rarely say something is impossible.
Too frequently in the past I’ve been proven wrong after such proclamations. Oh the embarrassment in the IT world if you are not accurate!
So after reviewing the issue below I was stumped, out of ideas, stymied, at a loss, and bewildered. Here’s the deal.
Our customer had recently rolled out her first Server 2008 domain controllers. They didn’t use DCPROMO “answer files”, custom builds or images of the operating system, and the DCs were not installed as Core or Read Only DCs. Shortly after they were promoted though the DCs would respond oddly. For one thing it was noticed that workstations were not gaining services from these DCs for the most part (file access worked well, but LDAP binds would fail). The big symptom that was being seen was that the recently created objects in the Active Directory were not being received by the new DCs via AD replication. These odd circumstances were only occurring with the 2008 DCs, and didn’t seem to happen immediately following promotion.
You can imagine the disappointment of the users whose accounts those were. There were also DNS Server events (yes, this server also hosted DNS as do many DCs):
Log Name: DNS Server
Source: Microsoft-Windows-DNS-Server-Service
Date: 9/16/2008 3:41:01 PM
Event ID: 4000
Task Category: None
Level: Error
Computer: dc21.child2.forest1.com
Description:
The DNS server was unable to open Active Directory. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.
Log Name: DNS Server
Source: Microsoft-Windows-DNS-Server-Service
Date: 9/16/2008 3:41:01 PM
Event ID: 4015
Task Category: None
Level: Error
Computer: dc21.child2.forest1.com
Description:
The DNS server has encountered a critical error from the Active Directory. Check that the Active Directory is functioning properly. The extended error debug information (which may be empty) is "000004DC: LdapErr: DSID-0C0906DD, comment: In order to perform this operation a successful bind must be completed on the connection., data 0, v1771". The event data contains the error.
Rather than speculate needlessly I asked the customer to run the Microsoft Support Diagnostic Tool (MSDT) against the problematic DCs. Here were some of the results. I’m including some of the things that were successful but should have been failures as well if there was some catastrophic thing going on for these DCs.
Testing server: Columbia\DC21
Starting test: Connectivity
* Active Directory LDAP Services Check
Determining IP4 connectivity
Determining IP6 connectivity
* Active Directory RPC Services Check
......................... DC21 passed test Connectivity
Starting test: Advertising
The DC DC21 is advertising itself as a DC and having a DS.
The DC DC21 is advertising as an LDAP server
The DC DC21 is advertising as having a writeable directory
The DC DC21 is advertising as a Key Distribution Center
The DC DC21 is advertising as a time server
The DS DC21 is advertising as a GC.
......................... DC21 passed test Advertising
So the above tests and errors showed that the DNS Server service couldn’t start because the Active Directory wasn’t running successfully, but the normal tests which show whether the Active Directory is working were claiming it was fine.
So I remotely connected and attempted to do an LDAP bind to the local DC using LDP.EXE but got this response:
Error: An LDAP lookup operation failed with the following error:
LDAP Error 49(0x31): Invalid Credentials
Server Win32 Error 2148074252(0x8009030c): The logon attempt failed
Extended Information: 8009030C: LdapErr: DSID-0C0904D1, comment: AcceptSecurityContext error, data 52e, v1771
Now that was really odd given that the DCDIAG tests above were successful. In most situations where an error is seen binding that error is repeated in diagnostic tests. But not in this one.
By stopping the Kerberos Key Distribution Center service and flushing the Kerberos tickets we were able to see in a network trace that DC21 was requesting a Kerberos ticket for the service LDAP/local. No DC registers this service principal name for itself, and when I checked in AD I was able to confirm that there was no SPN registered by that name in DC21’s servicePrincipalName attribute.
My debugging skills were enough to tell that the request and subsequent failure for a Kerberos service ticket using the SPN LDAP/local was probably the problem. But the question was why was it doing that and how could we make it stop? No configuration of the network interfaces had anything like that “local” thing, and there was no record in DNS of that unlikely name.
I was stumped. So I asked our Global Escalation Services folks to apply their stronger debugging skills to this issue. Joey Seifert from that team obliged (kudos to him for shedding light on this).
You’ll never guess what it was….or at least I didn’t.
This was caused by an entry in the %systemroot%\system32\drivers\etc\hosts file. That entry was “127.0.0.1 local localhost”. The 2008 DCs which were failing all had this entry in a HOSTS file which was munging the Kerberos SPN which would be used in the ticket request. An expected, working entry would be slightly different: "127.0.0.1 localhost". As a result, the ticket request was unsuccessful and the DC could not allow its local service to bind to AD via LDAP since that ticket wasn't there.
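If you want to sweep a handful of servers for this sort of thing, a check along these lines would do. This is a minimal Python sketch of my own, not something that was used on the case. It treats any 127.0.0.1 entry that is not the plain "localhost" one as worth a second look, which is a stricter rule than the single bad entry we found. The function name is hypothetical.

```python
def suspicious_loopback_entries(hosts_text):
    """Return 127.0.0.1 entries whose names are anything but just localhost."""
    flagged = []
    for raw in hosts_text.splitlines():
        # Drop comments, then split into the address and its names.
        entry = raw.split("#", 1)[0].split()
        if len(entry) < 2 or entry[0] != "127.0.0.1":
            continue
        names = [name.lower() for name in entry[1:]]
        if names != ["localhost"]:
            flagged.append(" ".join(entry))
    return flagged

# The failing entry from this case, followed by the expected one.
bad_hosts = "# sample hosts file\n127.0.0.1 local localhost\n"
good_hosts = "# sample hosts file\n127.0.0.1 localhost\n"

print(suspicious_loopback_entries(bad_hosts))
print(suspicious_loopback_entries(good_hosts))
```

To run it against a live server you would read the text of the hosts file from the path above and pass it in.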
I've mentioned in the prior blog posts that DNS is important to Kerberos authentication. Here's a real life example.
The moral of this story? Never assume you know everything. Once you do you’ll never succeed or learn anything more.
- Fooling the DC Locator
-
There are an ever increasing number of scenarios out there in the business world where two different companies, or company divisions, may be using Active Directory for their directory service but may not be setting up an actual trust between the two. A more common reason for that is the different company idea-two different companies need to be able to have users access resources in each other’s environments but must minimize that access as much as possible.
A hypothetical business scenario for that would be a washing machine manufacturer needing access to its part suppliers in order to communicate needs and feedback for parts-two disparate organizations which must cooperate to a certain extent as part of their business model. If that example was a company that made its own parts for the washing machines then it could have an intraforest domain for the parts division(s), but in the case of an external parts supplier it would be an external or cross forest Trust (the capital T will be used throughout this blog post to indicate the Active Directory Trust-not the state of mind) which goes well beyond the actual trust (meaning the feeling, not the IT word) you have in that supplier.
I’ve been involved in many scenarios over the years that could have been avoided by limiting the trust based on the following rule: The guideline for whether to establish a Trust, and which Trust to do, with another organization is that it should never exceed the practical trust you have in who you are Trusting.
Though not the topic of today’s blog post I want to point out that we have an awesome product that fills this gap perfectly: Active Directory Federation Services (ADFS).
So what do you do if you haven’t had the opportunity yet to roll out ADFS but still need access to another organization’s resources, or need to allow access to yours? Though not recommended, you could use “pass through” authentication and create user accounts of the same name and password in the two environments and use them to allow access. This would not work when Kerberos is the authentication mechanism but it’s unlikely that Kerberos will work anyway barring a cross forest trust. And certainly there are other methods.
There’s an additional aspect to configuring Trusts, though, that is normally not thought of until you notice its absence. That is that accessing those resources across a Trust is typically a pretty quick action. Much of that flows as a result of the requirements Active Directory Trusts have for DNS name resolution.
Picture a series of actions that kick off when a client in organization A tries to access a resource in organization B with a network topology that meets somewhere but without a Trust or Federation. We’ll use the example of connecting to the opposite domain and running a scripted LDAP query (maybe to create objects, perhaps to query for attribute values of objects) which passes in a URL like LDAP://domain1.forestb.com from our organization A client. We need to find the IP address of a server which answers to that moniker, don’t we?
So, in the absence of an AD Trust, and its DNS requirement, what’s there to keep an access from one organization from taking a longer time than necessary as the query to locate domain1.forestb.com goes hither and yon throughout the network-and perhaps even out to the Interwebs-to get resolved for this client?
Consider the delay that may ensue as the client attempts to find the remote DC’s IP.
Now let’s consider a way that we may be able to make the answer to that DNS query quicker, or successful, depending on your situation.
A basic requirement for network access of the two organizations is that they must have some network location where the two organizations mutually connect. To get things to work quicker, start by having a DNS server that both organizations mutually forward to, perhaps in that network location or DMZ where they coincide.
Well, that will get it to work but that would probably not be as quick as connecting to a resource in your own organization. To speed that up you can do a little trick that we’ll call “fooling the DC Locator”.
I’ve talked about the DC Locator code a bit in other blog posts, mostly in discussions of other behaviors and issues. The DC Locator runs in the Netlogon service of every Windows computer and is used to keep a running tab of where the closest domain controllers are and whether they are responsive.
A lot more detail on the DC Locator process is here. I encourage any person who is tasked with administering Active Directory to spend time reading the above link. It may occasionally come in handy as a reference as well.
For the purposes of our scenario the portion below is most relevant (taken from the Technet link above):
Domain Controller Location in the Closest Site
During a search for a domain controller, the Locator attempts to find a domain controller in the site closest to the client. When the domain that is being sought is a Windows Server 2003 domain, the domain controller uses the information stored in Active Directory to determine the closest site. When the domain being sought is a Windows NT 4.0 domain, domain controller discovery occurs when the client starts and uses the first domain controller that it finds.
Each Windows Server 2003–based domain controller registers DNS records that indicate the site where the domain controller is located. The site name (the relative distinguished name of the site object in Active Directory) is registered in several records so that the various roles the domain controller might perform (for example, global catalog server or Kerberos server) can be associated with the domain controller’s site. When DNS is used, the Locator searches first for a site-specific DNS record before it begins to search for a DNS record that is not site-specific (thereby preferentially locating a domain controller in that site).
A client computer stores its own site information in the registry, but the computer is not necessarily located physically in the site associated with its IP address. For example, a portable computer that was moved to a new location could contact a domain controller in its home site, which is not the site to which the computer is currently connected. In this situation, the domain controller looks up the client site on the basis of the client IP address by comparing the address to the sites that are identified in Active Directory, and then returns the name of the site that is closest to the client. The client then updates the information in the registry.
Why is that so important? Because of the importance of finding a domain controller in an AD site which is close to the client. That seems like a non sequitur, since if the two organizations are not in the same forest then they couldn’t possibly have a common Active Directory site or DCs within that same site, but the client has a routine behavior that requires it to check.
So, if you were to create a site in organization A which matches the name of the site in AD for organization B (where our target server5.domain1.forestb.com resides) then the DNS query the client will use to locate that server will be answered much more quickly.
Here are the reasons for that:
1) The DNS server to answer the query is present (even if just by a forwarder).
2) The client constructs its DNS query exactly as if it were a client in organization, or forest, B.
Why does the second reason occur? Because the client first tries a DNS query to find a DC within the same AD site as the client. Under the hood, the initial query to find a domain controller to respond to the client’s needs consists of two basic parts: the destination domain for the resource (supplied in our LDAP URL), and the AD site the client resides in.
These two parts are combined into a DNS query for a site-specific SRV record of the form _ldap._tcp.SiteName._sites.dc._msdcs.DnsDomainName. In our case, if the site we created in both organizations was named Tacoma, then the client would query for _ldap._tcp.Tacoma._sites.dc._msdcs.domain1.forestb.com. Since this is an actual record that exists on the DNS server (or one that we forward to), the query will be answered much more quickly than one which asks for an AD site that doesn’t exist at all in the destination domain.
Active Directory was not designed or tested to be used in this way, but it can sometimes work well for that need. The best thing to do when you are in a situation where you must give limited access to a partner organization is to decide whether you trust that other organization enough for a Trust, or to set up and use ADFS to fill your needs. Either resolution is more robust (and tested!) and will work far better than fooling your DNS topology to speed up your queries.
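To make the lookup order concrete, here is a minimal Python sketch of how those SRV names are put together. The function name is my own invention for illustration, and this only builds the strings; it is not the Locator code itself and it does no DNS lookups.

```python
def dc_locator_srv_names(dns_domain, site_name=None):
    """Return the LDAP SRV names to try, site-specific name first.

    Illustrative helper only: it mirrors the record formats discussed
    in the text and performs no DNS queries.
    """
    names = []
    if site_name:
        # Site-specific record, tried first when the client knows its site.
        names.append(f"_ldap._tcp.{site_name}._sites.dc._msdcs.{dns_domain}")
    # Record that is not site-specific, used as the fallback.
    names.append(f"_ldap._tcp.dc._msdcs.{dns_domain}")
    return names


for name in dc_locator_srv_names("domain1.forestb.com", "Tacoma"):
    print(name)
```

If the site name from organization A does not exist in organization B, only the second, fallback name has any chance of resolving, which is where the extra delay comes from.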
- Too Much of a Good Thing
-
A while back I wrote a blog post about setting up Kerberos constrained delegation. As a bit of a re-introduction, a lot of the value of the Kerberos authentication protocol is that it allows an application or service to impersonate a user in order to get resources on that user’s behalf. This impersonation is also called delegation, and is most commonly seen in the “trusted for delegation” settings.
Active Directory in Server 2003 introduced new Kerberos capabilities. One of those was constrained delegation. Constrained delegation assumes that same scenario, where an application or service is impersonating a user in order to get resources on that user’s behalf, but limits or constrains that impersonation so that it can only go to specific remote resources. The term “double hop”, though a bit overused in Kerberos discussions, comes in handy in this explanation because we can simply say that the application or service that is impersonating our users (the first hop) can only do its second hop to specific computer(s) and get specific resource(s) on that remote host.
This feature is sometimes referred to by the acronym KCD, or Kerberos Constrained Delegation. As you know, nothing worthwhile should be without an acronym.
This is an added layer of security. Consider one of your impersonated users to be the head of your accounting department. If this user is impersonated at a web server, for example, but that web service account is constrained to delegate only to a SQL server and service, then that web service account couldn’t connect to another remote computer, or even to that SQL server, and access an Excel spreadsheet as the head of accounting. That is because the SPNs for the SQL service use the MSSQLSvc service class, while access to remote files across the network uses the CIFS service and SPN. That scenario assumes that the web service code or identity has already been compromised in some manner, of course, or else it would not be trying to do something naughty like that in the first place.
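Here is a toy Python model of that idea, just to show why the service class in the SPN matters. The host name, port and function name are hypothetical, and the real check is made by the domain controller when the ticket is requested, not by anything resembling this code.

```python
def may_delegate(target_spn, allowed_spns):
    """Toy check: delegation is allowed only to SPNs on the account's list.

    Comparison is case-insensitive. This is a conceptual model of
    constrained delegation, not the actual KDC logic.
    """
    wanted = target_spn.lower()
    return any(wanted == allowed.lower() for allowed in allowed_spns)


# Hypothetical list for a web service account constrained to one SQL service.
allowed = ["MSSQLSvc/sql01.adatum.com:1433"]

sql_ok = may_delegate("mssqlsvc/sql01.adatum.com:1433", allowed)
file_ok = may_delegate("cifs/sql01.adatum.com", allowed)
print(sql_ok, file_ok)
```

The second call is for the same server but a different service class, so it is refused in this model, which is the Excel spreadsheet case from the paragraph above.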
So KCD is about limits; limiting where an identity can impersonate a user. The logical decision in implementing this is to make the number of places that the service can impersonate the user a small number. This would limit the security exposure of that impersonated user. This is a logical limit from a security perspective since if you plan to impersonate users to a large number of “second hop” hosts and services then why use constrained delegation at all? Why not use “traditional” delegation without constraint?
I’m revisiting constrained delegation in this blog post because I learned something new about it that I wanted to pass along. I learned that there is a practical limit to constrained delegation as well.
Here’s how this came up. This customer had a scenario where they were required to use constrained delegation but were uncertain of the full list of services which they needed to add to the Delegation tab for the service account. They knew that there was a set list of remote servers which they would need to be trusted for delegation to (a bit less than 20 different servers) but just didn’t know which services to specify. In the user interface for this in Active Directory Users and Computers (dsa.msc), you’ll recall, the object picker presents you with a choice of all of the services that a server hosts once you select that server in the Delegation tab setting for Kerberos. This list is drawn from the servicePrincipalName (SPN) attribute of that server.
When in doubt, select them all if you need to have things working in a hurry. That’s just what this customer did. However, when they were adding the last server to that list they clicked OK but got the error below:
The second part of this message, “The administrative limit of this request was exceeded”, is one that you may recognize as an LDAP-type response. The big question was why we were running into this message when trying to add more constrained delegation entries.
To answer that we need to discuss what actually happens when adding entries for constrained delegation in the Delegation folder tab of a service account. When you select a computer to delegate to and choose the services for it as well what really happens is that the AD attribute of msDs-AllowedToDelegateTo (also known as A2D2) for that service account gets those servicePrincipalNames added to its list.
Any attribute in AD has a schema definition which defines what properties and behaviors that attribute will have. For msDs-AllowedToDelegateTo it is here. Here’s a paste of the part of that schema definition which is relevant to this discussion:
CN ms-DS-Allowed-To-Delegate-To
Ldap-Display-Name msDS-AllowedToDelegateTo
Size 0 to 64K
Update Privilege -
Update Frequency Infrequently
Attribute-Id 1.2.840.113556.1.4.1787
System-Id-Guid 800d94d7-b7a1-42a1-b14d-7cae1423d07f
Syntax String(Unicode) .
I’ve put a portion above into bold text: the size information. Not all attributes define a maximum size in the schema, but here we see one. So that tells us that the maximum sized set of data we can put into that attribute is 64K of Unicode data. How many entries does that really equate to?
Below is the approximate amount I was able to add in my test environment before I also reached the size limit related error, dumped from my test service account using LDP.EXE.
Expanding base 'CN=SvcAccount,CN=Users,DC=adatum,DC=com'...
Result <0>: (null)
Matched DNs:
Getting 1 entries:
>> Dn: CN=SvcAccount,CN=Users,DC=adatum,DC=com
5> objectClass: top; person; organizationalPerson; user; computer;
1> cn: SvcAccount;
1> distinguishedName: CN=SvcAccount,CN=Users,DC=adatum,DC=com;
<snip>
1> sAMAccountName: SVCACCOUNT$;
1> sAMAccountType: 805306369;
1> objectCategory: CN=Computer,CN=Schema,CN=Configuration,DC=adatum,DC=com;
1> isCriticalSystemObject: FALSE;
2> dSCorePropagationData: 07/30/2008 09:58:24 Central Standard Time Central Daylight Time; 30650/29691/8424 21052:37:2544 UNC;
1189> msDS-AllowedToDelegateTo: msiserver/testcomputer106.adatum.com; msiserver/testcomputer107.adatum.com; msdtc/testcomputer103.adatum.com; msdtc/testcomputer104.adatum.com; msdtc/testcomputer105.adatum.com; msdtc/testcomputer106.adatum.com; msdtc/testcomputer107.adatum.com; messenger/testcomputer103.adatum.com; messenger/testcomputer104.adatum.com; messenger/testcomputer105.adatum.com; messenger/testcomputer106.adatum.com; messenger/testcomputer107.adatum.com; mcsvc/testcomputer103.adatum.com; mcsvc/testcomputer104.adatum.com; mcsvc/testcomputer105.adatum.com; mcsvc/testcomputer106.adatum.com; mcsvc/testcomputer107.adatum.com; iisadmin/testcomputer103.adatum.com; iisadmin/testcomputer104.adatum.com; iisadmin/testcomputer105.adatum.com; iisadmin/testcomputer106.adatum.com; iisadmin/testcomputer107.adatum.com; ias/testcomputer103.adatum.com; ias/testcomputer104.adatum.com; ias/testcomputer105.adatum.com; ias/testcomputer106.adatum.com; ias/testcomputer107.adatum.com; http/testcomputer103.adatum.com; http/testcomputer104.adatum.com; http/testcomputer105.adatum.com; http/testcomputer106.adatum.com; http/testcomputer107.adatum.com; host/testcomputer103.adatum.com; host/testcomputer104.adatum.com; host/testcomputer105.adatum.com; host/testcomputer106.adatum.com; host/testcomputer107.adatum.com; fax/testcomputer103.adatum.com; fax/testcomputer104.adatum.com; fax/testcomputer105.adatum.com; fax/testcomputer106.adatum.com; fax/testcomputer107.adatum.com; eventsystem/testcomputer103.adatum.com; eventsystem/testcomputer104.adatum.com; eventsystem/testcomputer105.adatum.com; eventsystem/testcomputer106.adatum.com; eventsystem/testcomputer107.adatum.com; eventlog/testcomputer103.adatum.com; eventlog/testcomputer104.adatum.com; eventlog/testcomputer105.adatum.com; eventlog/testcomputer106.adatum.com; eventlog/testcomputer107.adatum.com; dnscache/testcomputer103.adatum.com; dnscache/testcomputer104.adatum.com; dnscache/testcomputer105.adatum.com; 
dnscache/testcomputer106.adatum.com; dnscache/testcomputer107.adatum.com; dns/...
-----------
Using LDP to display an object is handy for this since it shows us the number of entries present in a multi-valued attribute. In mine, above, I have successfully added 1189, but when I try to add a few more I get the size limit error. That tells us that the limit is around 1,200 or so entries of this length, but is there a better way to see this?
Some folks like the Support Tool DSAStat.exe for that. It is not my favorite tool ever, mainly since it can be cryptic to read. Using it to view this attribute with the command below is no different.
C:\>dsastat -b:CN=SvcAccount,CN=Users,DC=adatum,DC=com -gcattrs:msDS-AllowedToDelegateTo >dsastat.txt
Most of the results are not applicable to this concern, but here’s the part that helps here:
Bytes per object:
Object Bytes
NULL-OBJECTCLASS 30575
. . . . . . . . . . . . . .
Bytes per server:
Server Bytes
ADFSACCOUNT 30575
For Unicode attributes we take approximately twice that amount for storage (about 61,150 bytes here), so that takes us close to our 64K maximum. So in order to add more entries once we’ve reached that point we need to remove some entries from that list; we’ve reached the limit of what the directory will allow into that attribute per its schema definition.
This raises a further question of whether so very many entries should be present in A2D2 to begin with. The logical answer for that is ‘no’. Adding a large number of entries into the A2D2 or msDs-AllowedToDelegateTo list ultimately works around the reason to use constrained delegation in the first place: limiting where the trusted identity can impersonate the user. To reduce this attack surface area the alternative idea is to use different service accounts rather than a single one, or simply try not to add services into the list which you don’t need. We’ll call that latter idea the “Don’t Select All” idea.
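The arithmetic is simple enough to sketch in Python. The 30575 figure is the DSAStat number above and 64K is taken from the schema excerpt; the helper that estimates a list's size is my own rough approximation (two bytes per character) and ignores whatever overhead the directory adds per value.

```python
SCHEMA_MAX_BYTES = 64 * 1024   # "Size 0 to 64K" from the schema excerpt
DSASTAT_BYTES = 30575          # value reported by DSAStat above

# Unicode storage is roughly double the reported figure.
doubled = DSASTAT_BYTES * 2
headroom = SCHEMA_MAX_BYTES - doubled


def estimated_unicode_bytes(spns):
    """Rough size of an SPN list stored as UTF-16 (approximation only)."""
    return sum(len(spn.encode("utf-16-le")) for spn in spns)


def fits_in_attribute(spns, limit=SCHEMA_MAX_BYTES):
    """True if the estimated size stays within the schema maximum."""
    return estimated_unicode_bytes(spns) <= limit


print(doubled, headroom)
```

With only a few thousand bytes of estimated headroom left, it makes sense that adding "a few more" entries of 30 or so characters each tips the attribute over its limit.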
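As a rough illustration of “Don’t Select All”, this Python sketch compares the two approaches for a handful of servers. The service class names are the ones visible in my LDP dump above; the choice of "http" as the one service actually needed is purely hypothetical, as is the helper name.

```python
def delegation_entries(hosts, service_classes):
    """Build the SPN strings that would land in msDS-AllowedToDelegateTo."""
    return [f"{svc}/{host}" for host in hosts for svc in service_classes]


hosts = [f"testcomputer{n}.adatum.com" for n in range(103, 108)]

# Service classes seen in the LDP output above ("select them all").
everything = ["msiserver", "msdtc", "messenger", "mcsvc", "iisadmin", "ias",
              "http", "host", "fax", "eventsystem", "eventlog", "dnscache"]

# Hypothetical: the application only ever needs the web service.
needed = ["http"]

select_all = delegation_entries(hosts, everything)
select_needed = delegation_entries(hosts, needed)
print(len(select_all), len(select_needed))
```

Every entry in the longer list that is not actually needed is a place the service account could impersonate users without any business reason to.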
For those that will be out of the office over the next week I wish you an early Happy Holidays! I hope you have a safe, warm and relaxing time.
- Scary Sounding Errors
-
We have a temporary role in CSS where support folks will help out in supporting prerelease (also known as beta) software. I’ve worked a couple of Windows betas, and it’s a great experience. I mention this since I remember a few years ago during the beta of a prior Windows release where there was an initiative to make the errors that would appear in Windows less scary sounding.
Most people can recall the infamous problem descriptions like “fatal error”. Fatal sounds pretty final, doesn’t it? Tough to recover from “fatal”.
But really that’s the point of this blog post. Though we try very hard to make things intuitive and not frightening sounding, not all of our errors or problem descriptions end up sounding like they really should.
Here’s a case in point. One of my colleagues asked me to check into some events that a customer had some concern over. They weren’t seeing an actual failure of service, they simply wanted to make sure that these events (which would occur in pairs intermittently) didn’t mean something really bad. Because they sounded pretty scary.
12/1/2008 10:43:33 AM
LSASRV Warning SPNEGO (Negotiator)
40961 N/A Server21
The Security System could not establish a secured connection with the server cifs/Server12.child.adatum.com. No authentication protocol was available.
12/1/2008 10:43:33 AM
LSASRV Warning SPNEGO (Negotiator)
40960 N/A Server21
The Security System detected an attempted downgrade attack for server cifs//Server12.child.adatum.com. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request.
The scary part is the phrase “downgrade attack”. Let’s face it-the only time an attack isn’t a bad thing necessarily is when it’s a Big Mac Attack. Hmmmm. Big Mac.
Anyway, let’s stay focused here. What is a “downgrade attack”? And why were these events occurring?
This error occurred as a result of the computer Server21 trying to access a file resource on the remote server Server12.child.adatum.com. The negotiated authentication protocol for that connection was Kerberos. Unfortunately, after that negotiation between the client and that remote server Server12.child.adatum.com occurred, the client (Server21) was unable to request the needed Kerberos ticket cifs/Server12.child.adatum.com because there were no DCs available at that moment to send the request to.
Why was that? Don't know. It could have been a temporary or transient issue on the DC side, the network between client and DC, a trust issue, a DNS issue, or even a performance issue on the DC. The general idea is that the DC locator running in the Netlogon service of the client in this case was aware it couldn’t find a DC for the child.adatum.com domain at that time.
So why is this termed an “attack”? It probably shouldn’t be. Whatever client-side code on Server21 instigated the need to access some file on Server12.child.adatum.com probably just tried once to access the file, relying on the security subsystem to negotiate the authentication to that resource. When that failed it didn’t necessarily fail over to NTLM automatically, unless the application side allowed it to. It’s possible that the connection did or would succeed using NTLM (which is less secure) since NTLM doesn’t require the client to contact a domain controller itself. Yes, access to that file resource may actually have succeeded seamlessly, other than the above events. So this isn’t an attack, but it certainly is not a preferable thing to have happen. Maybe we should call it an obstruction or failure.
To troubleshoot this scenario, if I were seeing it intermittently or even very consistently, I would start by enabling Netlogon debug logging on the client (in this case Server21) and look to see what behavior or error codes are shown in that log for the timestamps which correspond to the above events.
That’s not all that scary, now, is it?
- Name Hijacked, Bystander DC Hangs
-
I learn more about AD and other things every day, which is part of the fun of this job we do-learning about how things work. This story does a good job of lending some understanding to something that can be tough to understand-trust secure channels.
This story begins with a customer contacting us regarding a problem they were suddenly confronted with where some of the domain controllers of one domain would hang and become unresponsive. It was impossible to tell when this would happen, but when it did happen that domain controller would no longer provide any client services until it was rebooted. The result would be that Windows clients would have errors communicating with that DC for about 15 minutes (for things like group policy processing if that was their time for that, user logon, or authentication requests) and then they would find another DC to service their requests. For applications which were specifically set to look to the failing DC, however, the failure would continue until the server would be rebooted. A quick glance in Task Manager for the affected domain controller would show that it did not seem very busy.
Additional symptoms were the events below from the clients who would otherwise be looking to that DC. These events can occur from a wide variety of causes and as a result were definitely interesting but did not narrow the playing field significantly enough to figure out the problem.
Event ID : 5783
Raw Event ID : 5783
Category : The operation completed successfully.
Source : NETLOGON
Type : Error
Generated : 3/1/2008
Written : 3/1/2008
Machine : MemberSrvr9104
Message : The session setup to the Windows NT or Windows 2000 Domain Controller \\DC1.child1.haybuvtoys.com for the domain Haybuvchild1 is not responsive. The current RPC call from Netlogon on \\DC1 to \\DC1.child1.haybuvtoys.com has been cancelled.
Event ID : 5719
Raw Event ID : 5719
Category : The operation completed successfully.
Source : NETLOGON
Type : Error
Generated : 3/1/2008
Written : 3/1/2008
Machine : MemberSrvr9104
Message : This computer was not able to set up a secure session with a domain controller in domain Haybuvchild1 due to the following:
%%1722
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.
ADDITIONAL INFO
If this computer is a domain controller for the specified domain, it sets up the secure session to the primary domain controller emulator in the specified domain. Otherwise, this computer sets up the secure session to any domain controller in the specified domain.
NOTE: The above events have a plethora of different possible causes.
For many of these scenarios we start by asking whether the issue develops over time, which would suggest a resource bottleneck of some kind. But in this case there was no consistent time frame in which it would occur, and the fact that it could occur on any DC made that much less likely anyway. Due diligence with performance data taken over time on these DCs (link to some info on PERFWIZ) ruled that out.
Our next considered step was the sometimes dreaded Memory Dump. We won’t be going into extreme detail about examining what is in memory from a Windows server here, or the various techniques and tools we use for that, but we do need to mention a little bit. A performance guru (which I am not) can use a full memory dump, usually in conjunction with Perfmon performance data gathered leading up to the memory dump, to look for bottlenecks, memory leaks and other resource contention type issues. That was not what our goal was in getting a memory dump of this issue.
A not unheard of thing a Directory Services person may do is simply to get a user mode dump of lsass.exe as the server is hanging, if we can, and see what the various thread stacks are doing. This really is not the most common thing we will do, since we have a specialty team which is tasked with, and much more expert in, in-depth debugging, but it can be useful for a high-level idea of what is happening. A simple way to get that Lsass.exe dump is by using ADPlus.vbs, which is included with the Debugging Tools for Windows (a free download). Here’s an article on how to use ADPlus.vbs. You can open your .DMP file in Windbg.exe, which is included in the Debugging Tools mentioned above.
Now, in my last post about my wife’s impromptu game interruption I mentioned how to look at a stack in memory using Process Explorer. That is a much more simple way to do the same thing we did in this scenario. It simply doesn’t let you compare the different threads at that time for trends, which might be useful in this type of situation. In other words, it doesn’t give you a comprehensive snapshot of all of the user mode threads at the time of the problem.
So in this case we dumped Lsass.exe several times and were able to see that in every case where there was a hang on a domain controller there was a Netlogon thread doing the same thing. Software running in memory moves very quickly, and a memory dump is similar to taking a quick picture of something in motion, so when you see the same thing in several “snapshots” it becomes much more interesting.
Incidentally, the simple command for viewing all of a user mode dumped process’ stacks is “~*k” (without the quotes). The only problem was that the thing it was doing seemed pretty harmless-just updating the list of trusted domains.
Having gleaned this from the .DMP file it was immediately apparent that there was a less difficult way to see this same thing: Netlogon debug logging on the affected domain controller, the one that has the hang condition. Sometimes we go very deep only to discover that we don’t have to think quite so hard to achieve the same results. Case in point: entries similar to those below.
03/17 10:54:41 [CRITICAL] ACCOUNTING: NlDiscoverDc: Cannot find DC.
03/17 10:54:41 [CRITICAL] ACCOUNTING: NlSessionSetup: Session setup: cannot pick trusted DC
03/17 10:54:41 [MISC] Eventlog: 5719 (1) "ACCOUNTING" 0xc000005e c000005e ^...
03/17 10:54:41 [MISC] Didn't log event since it was already logged.
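A Netlogon.log can get large, so a few lines of Python can help pull out entries like these. This is only a sketch that assumes each line follows the "MM/DD HH:MM:SS [TAG] message" layout shown in the excerpt; other Windows versions may format lines differently, and the function names are mine.

```python
import re

# Assumed layout: "MM/DD HH:MM:SS [TAG] message", as in the excerpt above.
LINE_RE = re.compile(r"^(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)$")


def parse_line(line):
    """Split one log line into its parts, or return None if it doesn't match."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    date, time, tag, message = match.groups()
    return {"date": date, "time": time, "tag": tag, "message": message}


def critical_entries(lines):
    """Keep only the parsed entries tagged [CRITICAL]."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["tag"] == "CRITICAL"]


# Lines copied from the excerpt above.
SAMPLE = [
    "03/17 10:54:41 [CRITICAL] ACCOUNTING: NlDiscoverDc: Cannot find DC.",
    "03/17 10:54:41 [CRITICAL] ACCOUNTING: NlSessionSetup: Session setup: cannot pick trusted DC",
    "03/17 10:54:41 [MISC] Didn't log event since it was already logged.",
]

for entry in critical_entries(SAMPLE):
    print(entry["time"], entry["message"])
```

To run it against a real log you would pass the open file object in place of SAMPLE, and then compare the timestamps with the times of the hangs.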
So why are the above Netlogon.log entries interesting here? To understand that we need to consider how Netlogon works. I’m going to crib the following excellent explanation from someone who has a deeper understanding of it and laid it out very well.
The Netlogon service maintains a list of "server sessions" - each one represents a secure channel to the DC. The server sessions are identified by NetBIOS name of the client machine. Every member machine in a domain will have a secure channel with one DC in its domain, and all domain controllers will have a secure channel with the PDC Emulator , as well as to a DC in each (directly) trusted domain. These are all stored in the netlogon service as "client sessions", identified by the target DC NetBIOS name.
The problem which can occur is that a member server and a domain controller in different domains could have the same NetBIOS name. On local network segments this is quickly detected, and everywhere it is possible in Windows we have code which will prohibit duplicate names, if noticed, from being used for different domain members. In the extremely rare case where a domain controller does run into this problem, it sets up a secure channel to a DC of the trusted domain ACCOUNTING (let’s call that DC BIGDADDY). That entry may later be unexpectedly “hijacked” by a secure channel update from a server, perhaps in yet another domain in the forest, which has the same name of BIGDADDY. That update then creates a problem in the Netlogon service while it tries to update the list of secure channel partners for the other domains it trusts. The most confusing aspect is that all of this problem behavior revolves around BIGDADDY, but the problem took place on DC1 as DC1 tried to keep track of its secure channel partners, one of which was the DC BIGDADDY in a trusted domain.
Clear as mud now, right?
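If it helps, here is a deliberately tiny Python model of the name collision. It is only a toy: the class, its methods and the domain name OTHERDOMAIN are made up for illustration, and Netlogon's real session tracking is far more involved than a dictionary. The one idea it borrows from the explanation above is that entries are identified by NetBIOS name alone.

```python
class SessionTable:
    """Toy table of secure channel partners, keyed by NetBIOS name only."""

    def __init__(self):
        self._by_name = {}

    def update(self, netbios_name, domain, role):
        """Record a partner; return whatever entry it replaced, if any."""
        key = netbios_name.upper()
        previous = self._by_name.get(key)
        self._by_name[key] = (domain, role)
        return previous

    def lookup(self, netbios_name):
        return self._by_name.get(netbios_name.upper())


table = SessionTable()
# DC1 tracks its secure channel to the DC BIGDADDY in the trusted domain.
table.update("BIGDADDY", "ACCOUNTING", "domain controller")
# Later, an update arrives from a different machine with the same name.
clobbered = table.update("bigdaddy", "OTHERDOMAIN", "member server")

print("replaced:", clobbered)
print("now:", table.lookup("BIGDADDY"))
```

Because the key is just the name, the second update silently replaces the first entry, and anything that still expects BIGDADDY to be the ACCOUNTING domain controller is now working from bad information.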
This is a pretty unusual occurrence, so I don’t expect folks to be reading this and then crying “Eureka!” as they see the solution to a problem they are experiencing. The value in discussing it here lies in the understanding of the troubleshooting process, the tools and techniques which can be used, and maybe a good glimpse into how trusts work, from DC to DC. So we went from looking at event logs, to analyzing performance data from Perfmon on a hanging DC, then to taking memory dumps, and finally to reviewing Netlogon debug logs to get a good understanding of the problem.
The solution in this case? Well, if you’re in this state simply search through your forest-and any external trusted domains and forests-for duplicate names and rename the servers which have the same name as a domain controller elsewhere in the environment. Better yet, maintain a company-wide process control on computer naming and re-use and you should never be in this unlikely place.
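The duplicate search itself is easy to script once you have an inventory of computer names. This Python sketch assumes you have already exported a list of (NetBIOS name, domain, is it a DC) tuples by whatever means you prefer; the function name and the sample inventory are hypothetical.

```python
from collections import defaultdict


def names_colliding_with_dcs(inventory):
    """Find names used in more than one domain where at least one is a DC.

    inventory: iterable of (netbios_name, domain, is_dc) tuples.
    Returns {NAME: [domains...]} with names upper-cased and domains sorted.
    """
    owners_by_name = defaultdict(list)
    for name, domain, is_dc in inventory:
        owners_by_name[name.upper()].append((domain, is_dc))

    collisions = {}
    for name, owners in owners_by_name.items():
        domains = {domain for domain, _ in owners}
        if len(domains) > 1 and any(is_dc for _, is_dc in owners):
            collisions[name] = sorted(domains)
    return collisions


# Hypothetical inventory for illustration.
inventory = [
    ("BIGDADDY", "ACCOUNTING", True),
    ("bigdaddy", "HAYBUVCHILD1", False),
    ("DC1", "HAYBUVCHILD1", True),
]
duplicates = names_colliding_with_dcs(inventory)
print(duplicates)
```

Any name that shows up in the result is a candidate for renaming the non-DC machine.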
- Rumpo Venatus
-
The five or six people who have read my little bio snippet on Technet read that I like to play video games-specifically Xbox 360 games. I was doing just that the other night-playing Fallout 3-when my wife walked into the study to ask for help with all of the viruses which had just been detected on her laptop.
The gaming was interrupted. A wise married man makes sure his wife’s computer is working. Marital sanity is preserved using this and other fundamentals.
In this blog post we’re going to talk about what was going on in my wife’s laptop and how I fixed it.
Now, most of my blog posts revolve around server specific issues or information in the Microsoft Directory Services world (you know, AD, and the like). But this little diversion highlights how you use a really cool tool in a practical way that can be applied to many different troubleshooting situations. It may also be useful from the viewpoint that your average Directory Services specialist is typically also a security specialist at his or her office.
To sum up the issue, the laptop was infected with malware. I don’t know the specific vector that did it (probably some socially engineered link with a picture that looked exactly like something legitimate she usually clicks on) but it was obvious as soon as I looked at it that there was malware involved. Several Internet Explorer (IE) windows were open, several of which mimicked anti-virus detection software user interfaces. They had big warning messages saying that viruses had been detected and encouraging the user to click on buttons, buttons which were in fact hyperlinks to other places.
Rather than delve into how it got onto the machine I needed to move quickly to remove the malware from her laptop so that I could reconvene my adventure into the post-apocalyptic world of Fallout 3.
To do that I had to figure out how the IE browser was getting “hijacked”. It was apparent that-even with a new logon session or as another user on the same computer-all I had to do was open the web browser to the home page and after a few moments these uninvited pages would open on their own. So, nothing the user did at that point instigated the problem…other than simply opening IE. Clearly, something was happening in IE.
Let me take a quick moment to point out that this malware concern was seen on Windows XP (with the latest service pack and security fixes), not Vista. I seriously doubt this could occur with Vista’s built-in security, contrary to what slick Apple marketing says.
To find out more about what was happening within IE I downloaded Process Explorer. Some of the less commonly used things which Process Explorer can do are to show the threads for a process and their call stacks (think of this as information about what code has recently “fired off”), the open handles a process has, and what DLLs a process has loaded. These were exactly what was going to come in handy here.
Since I knew IE was there as a process I went to that in Process Explorer’s left hand pane and selected that process’ properties. I then clicked on the Threads folder tab so it was in front. What you will normally see when looking at this sort of thing is an “image name” and memory addresses. The image name really amounts to being the file name. Handy info, that, since we wanted to see if there were any questionable looking image names in any of the IE process threads’ call stacks.
And there were.
Clicking on the stack button after highlighting a suspect thread doesn’t show a great deal more but it is interesting.
So it looks like the malware is running from code stored in the file named “__c00817F6.dat”. It’s a safe bet that this file name changes frequently as an effort to elude anti-virus definition files catching it when they scan the computer after a definition file update.
A search in Process Explorer shows that winlogon loads this as a DLL and so has a handle to this file, suggesting that it is loaded when winlogon.exe starts. Since winlogon.exe is a core user session file (basically) that means this file will load anytime someone logs on with an interactive session.
Note: You’ll notice an Explorer.exe handle to this suspect file as well. That’s because I searched in Windows before I searched in Process Explorer.
So we have a suspect file. Now what? Well, it’s time to find where the file lives since we’re ultimately going to want to delete it.
So the file has an innocuous name and resides in a place where legitimate Windows files are: c:\Windows\system32. OK, delete that bad boy!
In this case, attempts to delete that file will fail. There are three obvious reasons this can fail: permissions on the file, an open handle to the file, or the file being loaded as a DLL and so in use.
From the above Process Explorer search we can tell that there is both a handle open to the file and it is loaded into memory as a DLL, so those two things are going against us for deleting the file. A quick check of the permissions on the file shows that we should have sufficient rights to delete it.
So now we have to figure out how to keep the file from loading at logon. For that we have to consider how things are called to be loaded in the first place. That is typically done by looking to the registry entries for that process and seeing what files are listed for it there.
Here’s where we look for Winlogon.exe’s relevant entries, and our suspect is present.
Key Name: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\Notify\__c00817F6
This registry entry has five values:
Name: Asynchronous
Type: REG_DWORD
Data: 0x1
Name: DllName
Type: REG_SZ
Data: C:\WINDOWS\system32\__c00817F6.dat
Name: Impersonate
Type: REG_DWORD
Data: 0x0
Name: Startup
Type: REG_SZ
Data: B
Name: Logon
Type: REG_SZ
Data: B
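As a rough illustration of how an entry like this could be spotted, here is a minimal Python sketch. It assumes the Notify subkeys have already been exported into a dictionary (the `example_package` entry is hypothetical), and it relies on a heuristic that is an assumption for this example, not a rule: a notification package whose DllName does not end in .dll deserves a second look.

```python
def suspicious_notify_entries(notify_subkeys):
    """Return the Notify subkey names whose DllName does not end in .dll."""
    flagged = []
    for subkey, values in notify_subkeys.items():
        dll_name = values.get("DllName", "")
        # Assumed heuristic: legitimate notification packages point at a .dll
        if not dll_name.lower().endswith(".dll"):
            flagged.append(subkey)
    return sorted(flagged)


# Values copied from the registry entry above, plus one hypothetical clean entry
notify_subkeys = {
    "__c00817F6": {
        "Asynchronous": 1,
        "DllName": r"C:\WINDOWS\system32\__c00817F6.dat",
        "Impersonate": 0,
        "Startup": "B",
        "Logon": "B",
    },
    "example_package": {"DllName": "example.dll"},  # hypothetical
}

flagged = suspicious_notify_entries(notify_subkeys)
print(flagged)
```

A hit from a check like this is only a starting point for the Process Explorer investigation described above, not proof of infection.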
The first thought is “delete the registry entries!”. Unfortunately, the malware developer thought that someone might want to do that, too, and so they apparently have code to add the registry entry and its values back.
However, we can be clever here. In this case I simply change the DllName registry value data for the path of the malware to have a different name, and then quickly place registry permission Deny entries for everyone on that registry entry. That way the bad guy can’t change it back.
Then reboot, log on and voilà! We can delete the __c00817F6.dat file. Now my wife can surf the internet tubes in peace.
Back to Fallout 3 at last! Game on!
Let’s file this under security, though the general Process Explorer technique above can be used for a lot of useful situations.
- Troubleshooting a Memory Leak in Lsass.exe
-
Although we have a team of engineers who are dedicated to troubleshooting general server performance related problems, Microsoft Directory Services specialists are expected to be the “go to” people for Active Directory and domain controller related performance issues. This is especially true when the Lsass.exe process is noticed to be using more resources than would be expected.
This is in part due to the fact that the Lsass.exe process is seen as a big black box by many. In reality, it is a process like any other which simply takes care of many core aspects of the operating system on any computer, and the additional roles that a domain controller has on a DC. Some of the things which are done in that "black box" are peer domain controller location, authentication, authorization and Active Directory replication.
What about Server Performance Advisor (SPA), you ask? I’ve mentioned using the SPA tool in my blog post “A Day at the SPA” but the SPA AD Data Collector is not the tool for every performance related problem which exists.
For example, what if your domain controller(s) appear to grow less responsive to clients over a period of time and then they eventually crash? The SPA AD data collector runs for 5 minutes by default, and for an issue which develops over time in that way that would be an insufficient interval of data gathering for any meaningful reporting on what the problem is. Even worse, the SPA report might not show a problem even if you extend the data gathering to a much larger interval.
Let me reiterate the “…and then they eventually crash” part of the above example. Although it may not occur in every memory leak instance, it is something that a normal performance degradation issue will not produce…whereas a memory leak eventually would. The crashing behavior is not something you will see with a simple load-based issue on a DC where it is busy for some reason. This blog post is intended to give you a top level overview of what to do when you see just this occurrence.
The reason SPA falls short here is that it was not made to give effective diagnosis for memory leak or similar gradual performance degradation issues. It is a tool for establishing an understanding of the performance baseline of a server and for identifying immediate resource bottlenecks. This is a subtle but important detail.
What is a memory leak? Well, to answer that we have to understand a little aspect of programming. Programming languages require that memory be allocated, or set aside, for use in storing values that will be worked with, and then deallocated when the code is finished working with them. More detail on that is here.
Why would this be a concern to an Active Directory administrator? This is a concern because we don’t always have full control over all of the code which runs in our environment. It’s sad to say but AD people are rarely the Kings or Queens of All They Survey. In other words, in real life business needs introduce applications into the environment which may not be entirely bug free. A shocking revelation, I know. Sometimes these applications have the specific problem of not being able to deallocate their memory usage when running on or against a domain controller, resulting in a memory leak. There can be memory leaks in either kernel or user mode but application derived memory leaks are by nature user mode leaks.
In a memory leak situation more memory is allocated to new code executing over time but never deallocated for reuse. So the amount of memory in use by a process is always increasing. Over time the amount of memory needed for further code execution exceeds the amount of memory available for further use, or allocation. This is a recipe for disaster on a server and is the central aspect of this which results in the crash behavior that memory leaks often produce.
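To make the pattern concrete, here is a small Python analogy. Python frees memory automatically, so this is only a sketch of the idea: references that are kept forever stand in for allocations that are never deallocated. The function names and the 1 KB payload are made up for illustration.

```python
retained = []  # module-level list that is never cleared


def handle_request_leaky(payload):
    """Process a request but keep a reference to its buffer forever."""
    buffer = bytearray(payload)
    retained.append(buffer)  # nothing ever removes this entry: the "leak"
    return len(buffer)


def handle_request_clean(payload):
    """Process a request and let the buffer be released on return."""
    buffer = bytearray(payload)
    return len(buffer)


for _ in range(1000):
    handle_request_leaky(b"x" * 1024)
    handle_request_clean(b"x" * 1024)

# Memory still held on behalf of requests that finished long ago
leaked_bytes = sum(len(b) for b in retained)
print(leaked_bytes)
```

Both handlers do the same work per request, but only the leaky one leaves the process holding more memory after every call.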
Now that we have some background on what the issue is let’s talk about identifying whether something is a memory leak.
There are multiple methods to track this down when you need to. Many of these tools require that you enable some tracking of resources in memory so that proper reporting of that memory usage can be done. Put very basically, once you know you have a problem with a particular process (which this article assumes you do…and it is Lsass.exe) you need to find out what resource is leaking and/or what function(s) are doing it. To find out which resource is not being deallocated properly you will need to tag it, so to speak, so that those tags can be counted. Adding a tag in itself utilizes memory and other resources and as a result it’s not something that is enabled by default. Instead the Gflags.exe tool allows you to enable these tags to track the resource usage on “objects”. A better explanation than I can give is available at the Gflags MSDN link above.
We’re going to go over some common tools and methods used followed by a new one that can give a nice readable report, in a similar fashion to what SPA does.
Two of the “traditional” methods are to use Performance Monitor (also known as perfmon) or the User Mode Dump Heap (UMDH) tool to identify the leak. Memory usage in these tools is referred to in bytes and typically tracked by seeing an increase in the number of private bytes used by a process. Remember, for the purposes of this troubleshooting discussion the process in question is Lsass.exe, which runs your Active Directory code (put simply). These two tools are discussed in good detail here. We won’t be going into step by step detail on how to use Perfmon or UMDH to troubleshoot a memory leak since the MSDN article does a good job of that.
Here’s a really good excerpt from the above MSDN article on this:
The Private Bytes counter indicates the total amount of memory that a process has allocated, not including memory shared with other processes. The Virtual Bytes counter indicates the current size of the virtual address space that the process is using.
Some memory leaks appear in the data file as an increase in private bytes allocated.
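As a sketch of what “an increase in private bytes” might look like once you have counter samples in hand, the Python below compares a steadily climbing series with a merely busy one. The sample values (in MB) and the 0.8 threshold are assumptions chosen for illustration, not numbers from Perfmon or the MSDN article.

```python
def looks_like_leak(samples, min_rising_fraction=0.8):
    """Return True if the series ends higher than it began and most intervals rise."""
    if len(samples) < 3:
        return False  # too few points to call it a trend
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    rising = sum(1 for d in deltas if d > 0)
    return samples[-1] > samples[0] and rising / len(deltas) >= min_rising_fraction


# Hypothetical Private Bytes samples for Lsass.exe, in MB
leaking = [500, 520, 545, 540, 580, 610, 650]  # climbs with one small dip
busy = [500, 620, 510, 640, 505, 630, 500]     # spikes under load, then recovers

print(looks_like_leak(leaking))
print(looks_like_leak(busy))
```

The point of the second series is that a busy server can use a lot of memory without leaking; what matters is whether usage ever comes back down.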
Another method is to use Poolmon. Poolmon is useful in that it can display outputs of Gflags.exe-enabled tagged memory and is often used for finding memory leaks.
There are two Poolmon output samples below. Examine the Diff (allocations minus frees) and Bytes (number of bytes allocated minus number of bytes freed) values for each tag, and note any that continually increase.
=== Wed 06/11/2008 07:39:17 ComputerName=DC1 FreePTEs=9,202 ===
SystemUpTime(hours)=20.71; ProcessTotalHandleCount=100,644; SystemThreads=880; SystemProcesses=74
Memory: 3997176K Avail: 1458620K PageFlts:33958221 InRam Krnl: 2544K P:248068K
Commit:1993880K Limit:5938828K Peak:2032972K Pool N: 42,708K P:249,248K
Tag Type Allocs Frees Diff Bytes Per Alloc Mapped_Driver
Toke Paged 3926552 3907851 18701 183714352 9823 [nt!se - Token objects]
And now notice the change in the Toke (token) object paged number...
=== Wed 06/11/2008 07:40:18 ComputerName=DC1 FreePTEs=9,202 ===
SystemUpTime(hours)=20.73; ProcessTotalHandleCount=100,562; SystemThreads=885; SystemProcesses=74
Memory: 3997176K Avail: 1459012K PageFlts:33978385 InRam Krnl: 2544K P:248020K
Commit:1993684K Limit:5938828K Peak:2032972K Pool N: 42,708K P:249,208K
Tag Type Allocs Frees Diff Bytes Per Alloc Mapped_Driver
Toke Paged 3931182 3912503 18679 183890888 9844 [nt!se - Token objects]
Although only one minute separates the two samples, you can clearly see an increase in the bytes for token objects. This suggests that some code is not deallocating the memory used to store a token (that is, making it available for reuse) once that code has finished. This is a guideline, however, and you would want to watch the Diff and Bytes values over a longer period of time to truly ascertain whether there is a gradual and consistent leak. There are a variety of indicators to look for, with many ins and outs, but the numbers cannot lie to you when they continually increase.
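Comparing snapshots by eye gets tedious, so here is a minimal Python sketch that parses the two Toke lines above and reports the change. The parsing assumes the whitespace-separated column layout shown in these samples; the helper names are made up for this example.

```python
from collections import namedtuple

PoolTag = namedtuple(
    "PoolTag", "tag pool_type allocs frees diff bytes_in_use per_alloc"
)


def parse_tag_line(line):
    """Parse one Poolmon tag line laid out like the samples above."""
    fields = line.split(None, 7)  # the 8th field is the driver description
    return PoolTag(fields[0], fields[1], *(int(x) for x in fields[2:7]))


def change_between(before, after):
    """Return the change in Diff and Bytes between two snapshots of a tag."""
    return {
        "diff": after.diff - before.diff,
        "bytes": after.bytes_in_use - before.bytes_in_use,
    }


first = parse_tag_line(
    "Toke Paged 3926552 3907851 18701 183714352 9823 [nt!se - Token objects]"
)
second = parse_tag_line(
    "Toke Paged 3931182 3912503 18679 183890888 9844 [nt!se - Token objects]"
)

change = change_between(first, second)
print(change)  # {'diff': -22, 'bytes': 176536}
```

Note that in this one-minute window Bytes grew while Diff actually dropped slightly, which is a good reminder of why a longer observation period is needed before calling it a leak.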
Finally, I mentioned a cool tool that provides a nice report. That tool is called the Debug Diagnostic Tool which can be downloaded from here. This tool is commonly referred to as simply DebugDiag. DebugDiag was created to troubleshoot IIS related concerns with custom code running in application pools and the like. It is a great tool overall, though, and is one that we can occasionally use for troubleshooting Directory Services issues. In this scenario it is useful to use the tool to gather sequential memory dumps and then have DebugDiag generate a report from them which will tell you about any perceived leaks.
So, first you have to have it run and gather the dumps of Lsass.exe while the issue is occurring. When you do that keep in mind that this is not something to do lightly: it is invasive and can degrade the performance of your system in itself. So only do it when you must in order to track down a problem. Here are the steps.
1. Click Start, point to Programs, point to IIS Diagnostics (32 bit), point to Debug Diagnostics Tool, and then click Debug Diagnostics Tool.
2. Select Memory and Handle Leak Rule, and then click Next.
3. Select LSASS.EXE in the Select Target dialog and then click Next.
4. In Configure Leak Rule dialog you can specify a warm-up time. However, in most cases we should instead click the Configure button under “Userdump Generation”.
5. In the Configure Userdumps for Leak Rule dialog which appears, make sure that “Auto-create a crash rule to get userdump on unexpected process exit” is selected.
6. Click on the radio button for “Generate a userdump when private bytes reach” to select it. The default is 800 MB. Let’s change that to 900 MB, and select to do additional dumps every 50 MB thereafter.
7. Click Save & Close.
8. Click “Auto-unload LeakTrack…” to add a check mark there.
9. Click Next, and then Next again.
10. Click Finish on the Select Dump Location And Rule Name window. The Userdump Location can be changed here. Note: the status is now active. The Userdump count will increase every time that a dump file is created. The default dump file location is C:\Program Files\IIS Resources\DebugDiag\Logs.
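To get a feel for how many dump files the settings above could produce, here is a small Python sketch of the arithmetic. It reflects one reading of the 900 MB and 50 MB settings (a dump at 900 MB, then one at each further 50 MB), which is an assumption and not DebugDiag’s documented trigger logic; the function name is made up.

```python
def thresholds_crossed(private_mb, first_mb=900, step_mb=50):
    """List the dump thresholds (in MB) at or below the given private bytes value."""
    if private_mb < first_mb:
        return []  # the first dump has not been triggered yet
    return list(range(first_mb, private_mb + 1, step_mb))


# Hypothetical Lsass.exe private bytes readings, in MB
print(thresholds_crossed(899))
print(thresholds_crossed(900))
print(thresholds_crossed(1010))
```

Each dump of a process that size takes real disk space, so it is worth checking the free space at the Userdump Location before leaving the rule active.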
Next you need to generate the report. To do that simply open DebugDiag, add the files you gathered above using the Add Files button, choose “Memory Pressure Analyzers” and click the Start Analysis button. Once that analysis script is complete you will have a handy, albeit very detailed, report outlining what code appears to be leaking.
This entire scenario came about because you are seeing your domain controllers’ performance gradually decrease over time, followed most likely by their crashing, rebooting and starting the cycle all over again. The goal here is to give you the tools and know-how to at least understand the issue, but at best be able to track down this issue and find out what the actual problem is.
And then you can fix it. Fixing it may mean updating some application, or perhaps reconfiguring or uninstalling the offending code. It may even mean that you will need to contact the company that makes that software and see what remedies they know of for that problem. In any case you’ll have a handle on what is happening, and a leg up on getting past the problem.