If you are like every other IT person I know, you are doing a lot of technical support for family and friends over the holiday season.  I thought of that last week and decided to write a short blog post detailing one of the least hyped and most useful items in all of the current Microsoft product offerings.

I’m talking about Microsoft Security Essentials (MSE).  MSE is a free anti-virus application that will run on Windows XP, Windows Vista and Windows 7.

Did I mention it is free?

I’ve been using this on several of my computers for months and it does a good job.  I’ve also installed it on friends’ and relatives’ computers and, in the cases where there were Trojans or other malware already present, it quickly found and removed the malicious code.

There are quite a few easily found reviews available if you want to read up on MSE before trying it.  This is a quick, easy to install and free (did I mention that it is free?) solution.

If you are working in a support or engineering role with Microsoft platform products like the various Windows versions, one of the biggest struggles you can have is understanding what to expect in code and on the network when Windows computers communicate with each other and with other platforms.  Documentation at that level of depth is scarce and what is available can be a tough read.

But not so much anymore.  As part of the Microsoft Communications Protocol Program (MCPP), documentation for the various protocols is now available.  The documentation is searchable online, or you can download all of the Windows Communications Protocols in PDF format at once by clicking here.

It’s important to understand that these are not just reference level documents.  They have the information you need to understand how our products work in detail.  If you need to make one New Year’s Resolution it should be to read these documents in 2010.

Here are some of the more useful documents included in that download that I recommend you review:

MS-GLOS glossary of terms which has excellent short, concise definitions of technical terms used in the various Windows Communications Protocols.

MS-AUTHSO Windows Authentication Services Overview.  This document details how all of the various specifications fit together.  Good for putting a framework in place for your other knowledge.

MS-KILE Kerberos Protocol Extensions.  This document covers how Windows Kerberos works with Active Directory.  Service integration, PKI, encryption, transport mechanisms, group membership, interactive logon information and delegation are all included in this PDF.  This is a must-read document if you need to understand Kerberos in a Windows environment.

MS-SFU Kerberos Services for User.  Crucial for understanding how Kerberos is expected to work generally and how the Microsoft Kerberos implementation preserves identity and maintains security.

MS-PKCA Public Key Cryptography for Initial Authentication (PKINIT) in Kerberos Protocol Specification.  This document details how public key cryptography is used in Kerberos for the initial ticket exchange.  If you use or plan to use smartcard logon or another PKINIT-capable certificate for user logon, this document is useful for understanding what the general requirements are and how PKINIT will appear in a Kerberos AS exchange on the network.

MS-PAC Privilege Attribute Certificate Data Structure.  This is all about the user or principal token on the wire.  This PDF includes information about how the PAC is laid out and what it contains.  This is more useful if you are in a situation where you are debugging an application or access, but it is good reference information for general knowledge as well.

MS-SPNG SPNEGO authentication negotiation.  Useful in understanding what you are seeing for authentication negotiation in network captures.

MS-NLMP NT LAN Manager (NTLM) Authentication Protocol Specification.  This covers NTLM with definitions, protocol examples, messages and more.

MS-CIFS Common Internet File System Protocol.  The PDF contains details of how the file transfer communication works.  Particularly useful if you need to understand how file and print services work over the network from client to server.

MS-SMB Server Message Block (SMB) Protocol Specification.  SMB is an extension of CIFS, and this document defines what those extensions are and how they work.  This is the stuff you see when you filter a network capture for SMB.

MS-DFSC Distributed File System (DFS). Have you ever needed to try to figure out what went wrong or why something unexpected happened with a DFS referral?  This is the document for you since it covers how DFS communication works and contains protocol examples.  This does not cover DFS replication (DFSR).

MS-DFSNM Distributed File System Namespace Management Protocol Spec.  This specification document contains information on how DFS management works on the wire using Remote Procedure Call (RPC) network traffic.

MS-FSSO File Access Services System Overview.  Has one of your users ever complained that they can’t get access to a file on a share and normal troubleshooting for permissions didn’t reveal the answer? Read this document and reviewing a network capture of the activity should be much easier.

MS-GPSO Group Policy System Overview.  This document goes over in detail how group policy is obtained by the client from the server.  If you are an administrator who manages Group Policy you should read this.  It contains a level of detail previously unseen outside of Microsoft training.

The download contains many more PDF files that may prove useful to you depending on your daily routine.  If you are someone who wants to take your knowledge of Windows to the next level, way beyond what certifications require, this is the stuff for you.  Consider it Microsoft’s holiday gift to you.  Enjoy!

In this blog post we’re going to go over a few techniques that are a bit old school but will come in handy for understanding how things work, even if you ultimately use a great monitoring suite like MOM. Now, there are great articles here and here that describe good general ways to start checking your AD replication, and the information in those articles still applies. In this post we’re going to go a bit past and to the side of them though.

Before we go further we need to go over USN Highwater-marks and Up to Dateness vectors and how they are used. In my experience these are the two data points in tracking updates that are the most confusing in Active Directory replication.

Of course, USNs are Update Sequence Numbers: an ever-increasing counter of numbers assigned to updates, unique per domain controller. As updates are received from peer replicas, or as updates originate at that domain controller itself, the next USN in the series is used to signify that update. In other words USNs are local numbers on each DC. However, those local USNs are monitored by peer domain controllers, which look at the most recent and highest USN in order to help decide whether or not some of those updates need to be replicated in. If they are not needed then they can be discarded…which is what propagation dampening is.

A recent supportability article had excellent explanations of up-to-dateness vector and high water mark which I’m pasting below:

For each directory partition that a destination domain controller stores, USNs are used to track the latest originating update that a domain controller has received from each source replication partner, as well as the status of every other domain controller that stores a replica of the directory partition. When a domain controller is restored after a failure, it queries its replication partners for changes with USNs that are greater than the USN of the last change that the domain controller received from each partner before the time of the backup.

The following two replication process values contain USNs. Source and destination domain controllers use them to filter updates that the destination domain controller requires.

  1. Up-to-dateness vector: A value that the destination domain controller maintains for tracking the originating updates that are received from all source domain controllers. When a destination domain controller requests changes for a directory partition, it provides its up-to-dateness vector to the source domain controller. The source domain controller then uses this value to reduce the set of attributes that it sends to the destination domain controller. The source domain controller sends its up-to-dateness vector to the destination at the completion of a successful replication cycle.
  2. High water mark: Also known as the direct up-to-dateness vector. A value that the destination domain controller maintains to keep track of the most recent changes that it has received from a specific source domain controller for an object in a specific partition. The high water mark prevents the source domain controller from sending out changes that are already recorded by the destination domain controller.
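To make those two values a little more concrete, here is a minimal Python sketch of the filtering idea. It is only an illustration of the description above, not the actual replication code: the `Update` fields, the `updates_to_send` function and the sample numbers are all hypothetical names and values I made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    originating_dc: str   # DC where the change was first made (hypothetical field)
    originating_usn: int  # USN that DC assigned when it made the change
    local_usn: int        # USN the source DC assigned when it stored the change
    attribute: str


def updates_to_send(source_updates, high_water_mark, utd_vector):
    """Return the updates a source DC would still need to send to one destination.

    high_water_mark: highest source-local USN the destination has already seen.
    utd_vector: originating DC name -> highest originating USN the destination
                already holds from that DC, directly or via another partner.
    """
    to_send = []
    for update in sorted(source_updates, key=lambda u: u.local_usn):
        # High water mark: skip anything the destination already pulled from us.
        if update.local_usn <= high_water_mark:
            continue
        # Up-to-dateness vector: skip changes the destination already received
        # by another path. This is the propagation dampening step.
        if update.originating_usn <= utd_vector.get(update.originating_dc, 0):
            continue
        to_send.append(update)
    return to_send


source_updates = [
    Update("server17", 100, 500, "displayName"),
    Update("server12", 40, 501, "sAMAccountName"),
    Update("server17", 101, 502, "description"),
]
utd_vector = {"server17": 100, "server12": 40, "server15": 900}

result = updates_to_send(source_updates, high_water_mark=499, utd_vector=utd_vector)
print([u.attribute for u in result])
```

In this made-up data only the `description` change gets sent. The other two are newer than the high water mark but the destination's up-to-dateness vector shows it already has them.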

Let’s dig in with a scenario where you are the admin and you have noticed that there is a replication backlog at some AD sites. In this situation we have anecdotal complaints from our help desk that users are created in New York but it is hours or even occasionally days before we see those users on DCs in the Los Angeles site. Although it’s sometimes wise to take help desk reports with a grain of salt, this isn’t something you want to ignore.

We have three sites (Los Angeles, Kansas City and New York) and we have DCs in each site. For the question at hand we need to figure out whether there is, in fact, a replication backlog and if so how big it is. Repadmin.exe, since it is the Swiss Army knife of AD replication tools, would be the first tool to use (repadmin /showrepl * /csv that is). However, it is entirely possible to have a backlog of updates between two replicas and not see constant or even intermittent errors from them if they are replicating, albeit slowly.

Now let’s see why the USNHighwater-mark and Up-to-Dateness Vectors are important in tracking updates by using the command “repadmin /showutdvec <hostname> <distinguished name of naming context>”. To understand what is happening between the three DCs Server15 in LA, Server17 in KC, and Server12 in NY we will need to run the showutdvec command once on each server and then examine the results.

Ran on or against Server15:

LosAngeles\server15 @ USN 16531174 @ Time 2009-09-21 13:54:45

KansasCity\server17 @ USN 35282103 @ Time 2009-09-17 12:51:15

NewYork\server12 @ USN 1581572 @ Time 2009-09-21 13:54:39

Ran on or against Server17:

LosAngeles\server15 @ USN 16531174 @ Time 2009-09-21 13:54:45

KansasCity\server17 @ USN 36483665 @ Time 2009-09-21 10:54:41

NewYork\server12 @ USN 1581572 @ Time 2009-09-21 13:54:39

Ran on or against Server12:

LosAngeles\server15 @ USN 16531174 @ Time 2009-09-21 13:54:45

KansasCity\server17 @ USN 35295102 @ Time 2009-09-18 07:03:08

NewYork\server12 @ USN 1581572 @ Time 2009-09-21 13:54:39

Let’s take KC and LA and compare them:

KC LOCALLY: server17 @ USN 36483665

LOS ANGELES: server17 @ USN 35282103

Now subtract what LA knows of KC having from what KC has as its high water mark:

36483665 minus 35282103 = 1201562

So there is a difference of 1,201,562 between what the Kansas City server named Server17 has and what its peers think it has. This tells us that Server17 has received (from some other DC not listed above) or originated approximately 1.2 million updates and that the LA and New York servers have not processed those updates yet. This also tells us that the KC DC Server17 is receiving inbound updates from the other two sites just fine.

That suggests a replication backlog, since the up-to-dateness vector entry (that USN number above) for Server17 which the LA and NY servers have retained for tracking locally is lower than the USN Highwater-mark which is actually on the KC server itself.
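If you have more than a handful of DCs, doing that subtraction by hand gets old fast. Here is a rough Python sketch that does the same comparison from the showutdvec lines shown above. The parsing assumes the output looks exactly like the `Site\server @ USN n @ Time ...` lines in this post; `parse_utdvec` and `lag_behind` are names I invented for the example.

```python
import re

# Matches lines such as: KansasCity\server17 @ USN 36483665 @ Time 2009-09-21 10:54:41
LINE = re.compile(r"^(?P<site>[^\\]+)\\(?P<dc>\S+)\s+@ USN\s+(?P<usn>\d+)\s+@ Time")


def parse_utdvec(text):
    """Turn showutdvec-style output into {dc_name: usn}."""
    vector = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            vector[match.group("dc")] = int(match.group("usn"))
    return vector


def lag_behind(own_view, peer_view, dc):
    """How many USNs the peer's knowledge of `dc` trails what `dc` reports itself."""
    return own_view[dc] - peer_view[dc]


server15 = parse_utdvec(r"""
LosAngeles\server15 @ USN 16531174 @ Time 2009-09-21 13:54:45
KansasCity\server17 @ USN 35282103 @ Time 2009-09-17 12:51:15
NewYork\server12 @ USN 1581572 @ Time 2009-09-21 13:54:39
""")
server17 = parse_utdvec(r"""
LosAngeles\server15 @ USN 16531174 @ Time 2009-09-21 13:54:45
KansasCity\server17 @ USN 36483665 @ Time 2009-09-21 10:54:41
NewYork\server12 @ USN 1581572 @ Time 2009-09-21 13:54:39
""")
server12 = parse_utdvec(r"""
LosAngeles\server15 @ USN 16531174 @ Time 2009-09-21 13:54:45
KansasCity\server17 @ USN 35295102 @ Time 2009-09-18 07:03:08
NewYork\server12 @ USN 1581572 @ Time 2009-09-21 13:54:39
""")

print("server15 behind server17 by", lag_behind(server17, server15, "server17"))
print("server12 behind server17 by", lag_behind(server17, server12, "server17"))
print("server17 behind server15 by", lag_behind(server15, server17, "server15"))
```

The first number is the 1,201,562 worked out above. The zero in the last line is the same observation made earlier: Server17 is fully caught up on what the other sites originate.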

Are all of these updates ones that the NY and LA servers actually need? Perhaps not; it simply depends on the nature of the updates. More than likely propagation dampening will occur as the replicas try to process the updates from KC. Propagation dampening is the routine which assesses whether a received update is needed by the local domain controller or not.  If the update is not needed then it is discarded.  For those unneeded updates you would see an event like the one below, following a similar event ID 1240, if you have your NTDS diagnostic logging for Replication events turned up:

9/20/2009 10:35:30 AM Replication 1239 Servername

Internal event: The attribute of the following object was not sent to the following directory service because its up-to-dateness vector indicates that the change is redundant.

Attribute:

9030e (samaccountname)

Object:

<distinguishedname of object>

Object GUID:

d8frg570-73f1-4781-9b82-f4345255b68u

directory service GUID:

9fbfdgdf66-3e75-4542-b3e7-2akjkj776b

That leads us to the question of how to find out more about what those updates are.

To do that we can issue an LDAP query against KC’s DC Server17 for all of the objects that have a recent WhenChanged attribute. First we get the USNHighwatermark for the given partition from our showutdvec command above and subtract a number from it in order to display the most recent updates against that DC. In our scenario that would be 36483665, and we will subtract 1000 in order to query for the most recent 1000 updates.

1. Open LDP.EXE.

2. From the Connection menu select Connect and then press OK in the Connect dialogue that appears.

3. From the Connection menu select Bind and then press OK in the Bind dialogue that appears.

4. Next, click on the Browse menu and select Search. 

5. Enter the partition’s distinguished name in the BaseDN field (DC=<partname>,DC=com).

6. Paste the following in the filter field: (usnchanged>=36482665)

7. Select Subtree search.

8. Click on Options and change the size limit to 5000.

9. Still in Options add the following to the Attributes list (each entry separated by semicolon) to those already present:  usnchanged;whenchanged

10. Then click Run.
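The only arithmetic in the steps above is the filter value in step 6. If you want to generate it rather than work it out each time, a tiny Python helper along these lines would do. The function name and the `window` parameter are my own invention, not anything from LDP or repadmin.

```python
def recent_changes_filter(high_water_mark, window=1000):
    """Build an LDAP filter matching roughly the last `window` USNs on a DC."""
    if window < 0 or window > high_water_mark:
        raise ValueError("window must be between 0 and the high water mark")
    return f"(usnchanged>={high_water_mark - window})"


ldap_filter = recent_changes_filter(36483665)
print(ldap_filter)
```

With the Server17 value from this scenario that prints the same filter pasted in step 6.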

And here is a sample of our result set:

>> Dn: CN=Test134417,OU=Accounting,DC=treyresearch,DC=com

4> objectClass: top; person; organizationalPerson; user;

1> cn: Test134417;

1> distinguishedName: CN=Test134417,OU=Accounting,DC=treyresearch,DC=com;

1> whenChanged: 09/13/2009 15:11:26 Central Standard Time;

1> uSNChanged: 36483650;

1> name: Test134417;

1> canonicalName: treyresearch.com/Accounting/Test134417;

>> Dn: CN=Test134418,OU=Accounting,DC=treyresearch,DC=com

4> objectClass: top; person; organizationalPerson; user;

1> cn: Test134418;

1> distinguishedName: CN=Test134418,OU=Accounting,DC=treyresearch,DC=com;

1> whenChanged: 09/13/2009 15:11:26 Central Standard Time;

1> uSNChanged: 36483649;

1> name: Test134418;

1> canonicalName: treyresearch.com/Accounting/Test134418;

In this case, after a large sampling of all of the most recent updates to occur on the KC DC, we see that someone or something is creating users named Test<number> in the Accounting OU. Is it some provisioning software that the accounting department uses? A migration from another directory? What if the objects were of some other type, something unique enough to be immediately understood? These are all questions that you can apply to a concern like this once you have an idea about those updates you were looking for.
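Eyeballing a couple of entries works for a sample this small, but with a few thousand results it helps to tally them. Here is a rough Python sketch that groups LDP output by naming pattern and parent container. It assumes the result text looks like the sample above (lines starting with `>> Dn:`), and it splits the DN on the first comma, which is a simplification that would break on names containing escaped commas. `summarize_dns` is a made-up name.

```python
import re
from collections import Counter


def summarize_dns(ldp_output):
    """Count results per (name pattern, parent container)."""
    counts = Counter()
    for line in ldp_output.splitlines():
        line = line.strip()
        if not line.startswith(">> Dn:"):
            continue
        dn = line[len(">> Dn:"):].strip()
        # Naive split: first RDN versus everything else (assumes no escaped commas).
        rdn, _, parent = dn.partition(",")
        # Collapse digit runs so Test134417 and Test134418 count as one pattern.
        pattern = re.sub(r"\d+", "<n>", rdn)
        counts[(pattern, parent)] += 1
    return counts


sample = """
>> Dn: CN=Test134417,OU=Accounting,DC=treyresearch,DC=com
1> uSNChanged: 36483650;
>> Dn: CN=Test134418,OU=Accounting,DC=treyresearch,DC=com
1> uSNChanged: 36483649;
"""

summary = summarize_dns(sample)
for (pattern, parent), count in summary.most_common():
    print(count, pattern, "in", parent)
```

A single pattern dominating the count, as it does here, is usually the quickest hint about which process is generating the updates.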

Eight years ago today people had started their day at the World Trade Center in New York, the Pentagon in Washington DC, and on a few planes. Some were commuting to work destinations and others were already at work. These were people that were working to feed their families, pay their bills, improve their lives. It is certain that, as you and I often do, they would pause in their work or travel to think of their loved ones and when they would see them next.

Among those thousands was an IT administrator that worked at a telecommunications company in one of the Trade Center towers. He had a support case with us that I and my colleagues had been working with him on. This was a case that we were never to resolve.

In 2001 I was working a “4x10” schedule of four 10 hour days, a Thursday through Sunday schedule. Tuesday morning was my weekend and so I was surprised to be woken by a family member calling and telling me to turn the TV on. I did that a few moments before the second plane hit.  That was where I was when it happened.

Following this we at Microsoft gave free support to anyone who was a victim or associated with them. To that end I spoke to various social workers, military personnel and businesses who were recovering their personal or business IT infrastructure over the following weeks. This was a minute fraction of the practical impact of the terrible personal tragedy that was multiplied thousands of times over in the lives of the survivors and families affected by the terrorist attack.

The practical portion of assisting people with their information technology was soon over but that personal impact remained and remains.

Let us take a moment of silence for the admin I was working with and the many others who were victims or affected by September 11th.

This week I’ve had the need to do some testing around ADAM (also known by its shiny new name of Active Directory Lightweight Directory Services or AD LDS).  The tests themselves are not directly relevant to this blog post, but in order for the tests to have some validity the ADAM instance needed to be larger than the default install.

In the absence of a nice bulky backup or other directory instance to take previously created objects from I needed a quick method to bulk that bad boy up.  This can be useful in creating objects and monitoring as they replicate among your instance replicas, testing for scaling of your ADAM back end solution given your hardware and network topology. 

When I looked around for some preexisting script which could do what I needed I found plenty of scripts…but none that could do the trick in a small number of iterations.

The best script repositories for ADAM that I could find are the original “Madam, I’m ADAM” article and The Script Center.  But none of these were made with the intention of creating a high volume of arbitrarily named objects.  So a little bit of thought was required (unfortunately).

There are different ways to provide named object creation in that way, but the only ones I could think of offhand used a “for” loop.  Rather than providing an “answer” file of names to pipe into the script I tried to keep it simple and put the for loop within the script and instantiated a variable i to do the job.

Here’s the script, which is set to create 100,000 user class objects.

'**************************************************
' This script adds i = n users to a specified OU or CN. 
'  Alter as needed.
'**************************************************

' If the application NC DN is "ou=adamou,c=us", the server is "adamhost" and the port is 389,
' then the LDAP URL passed to GetObject below should be: "LDAP://adamhost:389/ou=adamou,c=us"

Set ou = GetObject("LDAP://localhost:389/OU=Users,DC=treyresearch,DC=com")

For i = 1 To 100000

set usr = ou.Create("user", "cn=" & "Test" & i)

usr.Put "displayName", "Test" & i
usr.Put "userPrincipalName", "Test" & i & "@treyresearch.com"

usr.SetInfo

Next

wscript.echo "100000 users created successfully"

Just call the script from the command line using cscript.exe.

image

After the script runs you should have a container or organizational unit with stuff in it…

image

and consequently a larger ADAM database file (in the case below about a million user objects added)…

image

A few ‘scripting for ADAM’ caveats came up whilst doing this.  Some are documented in various places, others may not be, so I’m putting them here for all to see.

  • Each object creation (each iteration of the for loop in the script) is an LDAP Create against the directory.  It is possible to run this script from a remote host against your ADAM instance if you alter your LDAP URL accordingly.  You could also run multiple, simultaneous instances of this script against your ADAM instance.  Just watch out that you don’t bog your ADAM server down.
  • You cannot bind successfully to the RootDSE in an ADAM directory instance to do things so don’t try it in your scripts.  Instead pass in the complete DN path with your LDAP URL. 
  • If you intend to use this script multiple times, even against different destination organizational units, keep in mind that there is uniqueness required in these objects.  Since the objects are created named Testn you will need to make sure that if you run the script more than once your i value covers a new range of numbers.  For example, if you start with 1 To 100000 then the next time you run the script choose 100001 To 200000.
  • There is some value in not choosing the biggest number range you can think of and placing that into the script.  The reason for this depends in part on hardware and on whatever methods you will use to review that information (provided that is your intention).  To give context to this idea, consider that ADAM-ADSIEdit will only display a certain number of records in a container before simply displaying an error.  You can always get around this by using LDP.EXE and altering your query size limit, but it’s just something to keep in mind.  Additionally, on a not-so-beefy 32-bit (x86) Windows Server 2003 computer, having the script create 700,000 objects in a single OU took more than an hour, though the time it takes depends on hardware and preexisting load on the server.
  • Make sure to create your instance with a distinguished name (DN) path that initially contains DC or CN rather than something else.  The reason is not well documented in the help file, but starting the path with something else will basically prevent you from creating OUs or other child objects successfully.
  • This is something that is a good thing for testing.  If you ran a script or scripts like this against a live, production ADAM instance you might provide yourself with an inadvertent denial of service attack.  That’s the IT equivalent to shooting yourself in the foot.
  • Finally, you could alter the script to create other types of objects.  It all depends on what you need, and what attributes are mandatory for that class object when it is created.  Just alter the user parameter

set usr = ou.Create("user", "cn=" & "Test" & i)

to the class object you need, alter the usr.Put verb to match, and make sure you pass in the info that different class needs.
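On the uniqueness caveat above: if you plan several runs (or several simultaneous copies of the script), it is easy to work out the non-overlapping For loop bounds ahead of time. This is a small Python sketch of that bookkeeping only; it does not talk to ADAM at all, and `batch_ranges` is a name I made up.

```python
def batch_ranges(start, total, batch_size):
    """Yield inclusive (first, last) ranges that cover `total` numbers from `start`."""
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    first = start
    end = start + total - 1
    while first <= end:
        last = min(first + batch_size - 1, end)
        yield first, last
        first = last + 1


ranges = list(batch_ranges(start=1, total=250000, batch_size=100000))
for first, last in ranges:
    # Each pair becomes the "For i = first To last" line of one script run.
    print(f"For i = {first} To {last}")
```

Each printed line can be dropped into a copy of the script, so no two runs ever try to create the same Testn name.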

Hopefully this will help out people who are in a “proof of concept” or other testing phase and need to bulk that directory up in a hurry.

There are a whole host of issues that are simply never seen unless you have a large distributed environment. I know that sounds startling but here’s a hypothetical example. Imagine that you are an online retailer and for every identity that you are transacting business with, an object in an AD LDS/ADAM database is created or updated.

If you have many business transactions (generally a good thing from a business perspective) then the number of client connections and updates swells accordingly. In an ideal world, the IT staff has an opportunity to create a test environment and scale it out as a proof of concept, verifying that the solution won’t run into any surprise problems.

In reality sudden business success or adoption can often outpace prior testing results. This is when you discover any potential scaling issues or performance bottlenecks. Of course the moral of this introduction is to do everything you can to scale out your test environments and place them under the most severe load you can conceive of prior to going “into production”.

Let’s talk about one scenario that we’ve seen a time or two.

I have a distributed application that has to read, write, update and remove many very large (hundreds of megabytes each) files from a set of file servers. When I say distributed I generally mean that the client application could be running on many different computers. Think hundreds. In contrast, the file servers which are providing the server side of this scenario are much smaller in number. There is no hard and fast rule about this and what the threshold for the issue may be, but let’s say that in our scenario we are seeing a ratio of 20:1, twenty clients to one server.

The issue occurs during peak hours (Monday at 9 AM), when everyone in the main office arrives at work and begins to use the application to do whatever it is they are doing.

What happens is that, after connecting successfully to a particular file in order to update it, Joe User notices that he cannot update the file anymore and instead gets a “file not found” error. Joe dutifully calls the helpdesk and reports said error.

The helpdesk folks notice several calls at about the same time from other users. All of the calls mention the same file, around the same time, with the same “not found” behavior. When they examine logs they notice that the client connections are going to the same server. What is even more interesting is that those same clients are able to successfully update other files on the same server at that same time. Even files in the same destination directory on that server.

After a few days of seeing this issue being reported they see that it consistently goes away after a few minutes without any user or administrative action. In other words the workaround is to wait a few minutes and try accessing the file again, in which case the access would then succeed.  To add to the confusion the file could be seen to be present in the directory at that time if viewed in a local session on the server.

So what do you do in this situation?  Network captures are always a good idea in order to get a thorough understanding of what is happening in the client to server communication.  In this case though you simply see the “not found” message in its primeval form on the wire:

1681 0.509984 {SMB:190, SMBOverTCP:181, TCP:27, IPv4:26} 192.168.1.8 192.168.1.9 SMB SMB:C; Transact2, Query Path Info, Query File Basic Info, Pattern = \repository\ilovescotch\singlemalts.lnot

1682 0.510083 {SMB:190, SMBOverTCP:181, TCP:27, IPv4:26} 192.168.1.9 192.168.1.8 SMB SMB:R; Transact2, Query Path Info - NT Status: System - Error, Code = (52) STATUS_OBJECT_NAME_NOT_FOUND

Some folks may have thoughts of exclusive handle locks or problems with opportunistic locking bouncing around in their heads right now.  Sadly, in this type of scenario neither is the culprit (Process Monitor’s handle functionality can reveal the former and the capture should reveal the latter).

The best thing to do in this situation is to gather Perfmon data from the server side starting prior to the issue occurring and ending after the issue has resolved itself.  What should you gather?  The usual suspects: the disk, memory, process, processor and network objects, with all counters, are a good start.

Disk bottlenecks are always worth looking out for, but you rarely see “not found” type messages as a result of them.  Here is an instance where you can.  Split IO can result in this behavior when a large number of hefty files are accessed simultaneously, ultimately leading the server side redirector (a victim as much as the client in this case) to pass along a STATUS_OBJECT_NAME_NOT_FOUND error since it truly couldn’t find that file, though it was there.

MSDN describes this counter thus:

Split IO/Sec
Shows the rate at which I/O requests to the disk were split into multiple requests. A split I/O may result from requesting data in a size that is too large to fit into a single I/O or from a fragmented disk on single-disk systems.

How much is bad?  Well, zero is the optimal number. So if you see a number other than that (example below) this should raise bottleneck concerns.

image

In its natural environment you will spot split IO as it appears to sympathetically rise with performance spikes in disk queue length, writes and reads such as below.  Notice that the split IO (thick and scary red looking line) rises when disk queue length is at its highest.  This is a bad sign, though barring additional symptoms like the “not found” errors it would not come to someone's attention as something in dire need of fixing.  And the “not found” scenario is much more likely in the case of multiple large single file requests than it would be if the files themselves contained less data each.

image

image
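If you would rather sift a Perfmon log than squint at the graph, something like the following Python sketch can pull out the samples where split IO is non-zero and show the disk queue length alongside. Everything about the input is an assumption on my part: the column names are simplified stand-ins (a real Perfmon CSV export uses full counter paths as headers) and the sample values are invented.

```python
import csv
import io

# Hypothetical, simplified export; real Perfmon CSV headers are full counter paths.
SAMPLE = """time,split_io_per_sec,avg_disk_queue_length
09:00:00,0,0.4
09:00:15,12,3.1
09:00:30,85,9.7
09:00:45,0,0.6
"""


def split_io_samples(csv_text):
    """Return (time, split IO/sec, queue length) for samples with any split IO."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        split_io = float(row["split_io_per_sec"])
        if split_io > 0:
            queue = float(row["avg_disk_queue_length"])
            flagged.append((row["time"], split_io, queue))
    return flagged


flagged = split_io_samples(SAMPLE)
for time, split_io, queue in flagged:
    print(f"{time}: Split IO/sec={split_io:g}, avg disk queue length={queue:g}")
```

Lining those timestamps up against the helpdesk call times is a quick way to check whether the “not found” reports coincide with the split IO spikes.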

In this scenario-if you are running into it-you have one of three different things occurring.  First, if you are lucky, you simply need to defragment the hosting volumes for the data on that server.

Or it could be that you have reached a milestone in your business growth by needing to scale into a storage area network (SAN) or get a better performing RAID configuration or hardware set.  The key to keep in mind in this choice is that the I in RAID stands for Inexpensive.  The even less costly alternatives are to somehow distribute or lighten the load on the servers in order to crawl along beneath the performance threshold the hardware is imposing.

Why am I blogging about something like this, where there wasn’t even an “access denied” error or AD replication mention in the entire post?  Because Directory Services is the Bermuda Triangle of difficult technical issues at Microsoft.  We are presented with the unexplained phenomena of the IT world.  It’s what we do.

Let’s christen a new category and file this in it: Unexplained Phenomena.

Weeks ago I blogged about how single sign on and credential providers work and a scenario you can run into with them. One reader faced a slightly different scenario but was able to apply that topic toward getting his issue resolved.

He had installed a credential provider for testing purposes. Unfortunately, once the credential provider was installed he was unable to log on at all. In his case he knew what the problem generally was (the provider he was testing) but initially wasn’t certain how to remove it since he couldn’t successfully log on to the computer. Some folks are probably thinking ‘hey, what about Safe Mode?’. Unfortunately, Safe Mode will not allow uninstalls since the Installer Service is not running.

You can disable a 3rd-party credential provider in the registry without deleting the credential provider key. Each provider on the system is specified by a subkey whose name is the provider's CLSID under HKLM\Software\Microsoft\Windows\CurrentVersion\Authentication\Credential Providers.

To disable the provider add a REG_DWORD value "Disabled"=1 to that provider’s CLSID subkey. The provider will be disabled on the next session creation (sessions are created when you log off, switch users, or reboot).

The above can be done by remotely connecting to the target computer’s registry using the Regedit.exe File > Connect Network Registry option. Alternatively, you can reboot into Safe Mode to do your registry editing. As mentioned above, in Safe Mode LogonUI will only load the built-in username/password provider.
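If you have to do this on more than one machine, or want something you can review before touching the registry, you could generate a .reg file for the change instead of typing it in by hand. The Python sketch below only builds the text of such a file; it does not modify any registry. The all-zero CLSID is a placeholder, not a real provider, and `disable_provider_reg` is a name I invented.

```python
KEY = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion"
       r"\Authentication\Credential Providers")


def disable_provider_reg(clsid):
    """Return .reg file text that sets Disabled=1 (REG_DWORD) for one provider."""
    clsid = clsid.strip()
    if not (clsid.startswith("{") and clsid.endswith("}")):
        raise ValueError("CLSID should be in {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx} form")
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}\\{clsid}]",
        '"Disabled"=dword:00000001',
        "",
    ])


# Placeholder CLSID for illustration only; substitute the provider's real CLSID.
reg_text = disable_provider_reg("{00000000-0000-0000-0000-000000000000}")
print(reg_text)
```

Save the output with a .reg extension and import it on the affected computer, then log off or reboot so a new session is created.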

Special thanks to Don Woeltje from the College of St. Catherine for relaying his experience.

It’s easy to forget that when we say “Directory Services” we are really talking about multiple technologies. I remember when the idea that what we support is so much more than simply a user account repository first hit me. It happened when I first read the Windows 2000 Distributed Systems Guide from the Windows 2000 Resource Kit. This was, and in many ways still is, the “go to” reference and starting point for DS knowledge, and it made the idea that we support a collection of disparate but often interdependent technologies really hit home to me.

One of those technologies is Lightweight Directory Access Protocol (LDAP). LDAP is not truly a separate technology but rather a network protocol for data retrieval and modification. Active Directory is accessible using LDAP, and how AD replicates is conceptually and practically built on the LDAP actions of Add, Modify, Move and Delete (for more info on that read the Distributed Systems Guide on originating updates, pages 315-317). This is important to keep in mind since it gives a deeper understanding of how AD works under the hood.

LDAP is used in other DS components as well. If you were to capture the network traffic from your domain-joined computer while it was processing group policy you would see a series of LDAP queries to the domain as the list of group policies which apply to the user (or computer) are assessed for applicability.

The true value in LDAP is its use as a standard for applications to interact with AD. The actions of Add, Modify, Move and Delete I mentioned above are exposed in code that a developer can write into their applications in order to store and retrieve data in AD. This is one of the many things that make Microsoft Active Directory and Lightweight Directory Service (also known as ADAM) such great platforms for writing business solutions. LDAP is simply one of many exposed entry points for data access in AD, otherwise known as Active Directory Service Interfaces. One stop shopping (code samples and references, a wide array of service interfaces, reliability and tough integrated security) is just how Microsoft AD rolls.

If you’ve read my blog posts before you know to expect that there is probably a specific scenario that brought all these thoughts together and instigated this blog post.

This scenario started with a customer who was developing an “in house” application which would query the Active Directory for selected attribute values for a list of users who belonged to a specific security group as part of the many things it did to get its intended purpose accomplished. This group was either domain local or universal in scope and as such contained members from each of the 4 domains in their forest.

The problem they saw was that the queries-which were directed to a domain controller in the root domain-would only return root domain users in the results response despite the fact that users from the other three domains were also members of that security group. This AzMan article outlines what an LDAP query intended to gather a list of users in a group should generally look like.

So the first order of business in troubleshooting LDAP is to see the LDAP query for ourselves rather than relying on what we expect or assume it may be. There are several ways to do that easily. First, you can take a network capture of the query using Netmon or Wireshark and simply filter for LDAP (provided the LDAP session is not SSL encrypted).   This technique of viewing the network traffic in order to see the bind type can be useful since we know that an unauthenticated LDAP bind will not receive an LDAP referral given the default Active Directory security in later versions.  More information on LDAP authentication and signing can be found here.

Or we can configure the registry values below on the domain controller receiving the query to log the details in the Directory Service event log. The caveat is that you must know which domain controller will be servicing the query.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters\Diagnostics

15 Field Engineering = 5

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

(add dword values)

Expensive Search Results Threshold = 1

Inefficient Search Results Threshold = 1

Which in our case gives us this result:

Internal event: A client issued a search operation with the following options.
Client:192.168.2.4
Starting node:
DC=tspring,DC=com
Filter: (memberOf=CN=coolpeople,OU=DallasUsers,DC=tspring,DC=com) 
Search scope: subtree
Attribute selection:displayName,objectSid,proxyAddresses,cn,sAMAccountName,distinguishedName,groupType
Server controls:
Visited entries:5
Returned entries:1
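When you are collecting many of these events, it helps to pull the fields into something you can sort and compare. Below is a rough Python sketch that parses the text of one such event into a dictionary. The field list is taken from the event shown above; I am assuming other events of this type use the same labels.

```python
# Field labels as they appear in the event text above.
FIELDS = (
    "Client", "Starting node", "Filter", "Search scope",
    "Attribute selection", "Server controls",
    "Visited entries", "Returned entries",
)


def parse_search_event(text: str) -> dict:
    """Parse the 'client issued a search operation' event text into a dict."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        for name in FIELDS:
            if line.startswith(name + ":"):
                current = name
                result[name] = line[len(name) + 1:].strip()
                break
        else:
            # Not a labelled line: treat it as the value (or continuation)
            # of the most recent field, e.g. the DN under "Starting node:".
            if current is not None:
                result[current] = (result[current] + " " + line).strip()
    return result


SAMPLE = """Internal event: A client issued a search operation with the following options.
Client:192.168.2.4
Starting node:
DC=tspring,DC=com
Filter: (memberOf=CN=coolpeople,OU=DallasUsers,DC=tspring,DC=com)
Search scope: subtree
Attribute selection:displayName,objectSid,proxyAddresses,cn,sAMAccountName,distinguishedName,groupType
Server controls:
Visited entries:5
Returned entries:1
"""

if __name__ == "__main__":
    event = parse_search_event(SAMPLE)
    print(event["Starting node"], event["Visited entries"], event["Returned entries"])
```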

Some folks who are reading this have already discerned a possible problem or two. What port was the query directed to? Or, was the option for “chase referrals” passed in with the query?

Let’s take a moment to talk about that. The way LDAP compliant servers work is by simply providing an answer to a query that is passed to them by a client. All servers, Microsoft or otherwise, must comply to the public RFCs which define how LDAP will behave.

Getting back to our concern.  What if the specific domain controller doesn’t have all of the data that the query is asking for? Such as when it resides in another domain? When that happens the DC will issue an LDAP network response called a referral.

For Active Directory the basic rules are that you will receive a referral when the client indicates in the query that it will chase them, or when the DC answering the query notices that the domain name which all or part of the query is targeted at is not one which it recognizes as part of the forest. It is able to recognize internal domains by reading the list of crossRef objects in its copy of the forest’s Configuration container, which essentially lists the domains the forest knows of, both internal to the forest and external to it (trusted or trusting).

Crossref objects are good things to consider when reviewing LDAP referral behaviors.

We have a KB article which describes the process of reviewing the crossrefs and determining whether to issue a referral.  Here is the most pertinent portion of that, which assumes that the LDAP query option of “chase referrals” was not passed in the initial query:

  • If a crossRef object that matches the search criteria is found and the crossRef object corresponds to an NC that is on the domain controller, the search is performed locally.
  • If a crossRef object that matches the search criteria is found and it refers to an NC that is held elsewhere, the domain controller generates a referral based on the dnsRoot attribute of the crossRef object.
  • If a crossRef object that matches the search criteria is not found, the domain controller determines whether a superiorDNSRoot attribute exists for the crossRef object in the forest root domain. If it does exist, the domain controller generates a referral to that location. If it does not exist, the domain controller tries to use the DC naming convention to generate a DNS name for the client referral.
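The three rules above can be sketched as a small decision function. This is only a simplified model of what the KB excerpt describes, not the domain controller's actual code. The dictionary layout, the function names, and the "child" domain in the usage example are all made up for illustration. Matching is done by the longest naming context that the search base falls under, which is my reading of "matches the search criteria".

```python
def dn_to_dns(dn: str) -> str:
    """Build a DNS name from the DC= components of a DN (naive comma split)."""
    parts = []
    for rdn in dn.split(","):
        key, _, value = rdn.strip().partition("=")
        if key.upper() == "DC":
            parts.append(value)
    return ".".join(parts)


def decide(search_base, cross_refs, local_ncs, superior_dns_root=None):
    """Return ("local", None) or ("referral", dns_name) for a search base DN.

    cross_refs: list of dicts with "nCName" and "dnsRoot" keys.
    local_ncs:  naming contexts held by the DC answering the query.
    """
    base = search_base.lower()
    matches = [
        c for c in cross_refs
        if base == c["nCName"].lower() or base.endswith("," + c["nCName"].lower())
    ]
    if matches:
        # The most specific (longest) naming context wins.
        best = max(matches, key=lambda c: len(c["nCName"]))
        if best["nCName"].lower() in {n.lower() for n in local_ncs}:
            return ("local", None)          # rule 1: NC is on this DC
        return ("referral", best["dnsRoot"])  # rule 2: NC is held elsewhere
    if superior_dns_root:
        return ("referral", superior_dns_root)  # rule 3a: superiorDNSRoot is set
    return ("referral", dn_to_dns(search_base))  # rule 3b: derive name from DC= parts


if __name__ == "__main__":
    refs = [
        {"nCName": "DC=tspring,DC=com", "dnsRoot": "tspring.com"},
        {"nCName": "DC=child,DC=tspring,DC=com", "dnsRoot": "child.tspring.com"},
    ]
    local = ["DC=tspring,DC=com"]
    print(decide("OU=DallasUsers,DC=tspring,DC=com", refs, local))
    print(decide("OU=Users,DC=child,DC=tspring,DC=com", refs, local))
```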
    In a situation where we are seeing a different or incomplete result being returned for a query like this, our initial goal should be to see if the same results happen when we use LDP.EXE (a Support Tool in Server 2003 but included during role install in 2008 and later).

    In LDP.EXE, after we bind and connect to our directory, we can issue a query by selecting Search from the Browse menu.


    To set “chase referrals” in the query, select Options and check the “chase referrals” box there.

    Where’s the value in emulating the LDAP query using LDP.EXE?  The value lies in ruling out any behavior that is inherent in the customized code being used to issue the query.  If the above query without the referral chasing option selected gives the exact same results as the failing query, and with the option selected we get the correct results (every user who is a member of that group), then we have a firm idea of what is happening.

    In this case, since the users are all in domains local to the forest, the initial query must contain the chase referrals option in order to receive all users who are members of that group throughout the forest.

    However, that is not the best option.  Consider the act of asking for some information and then getting only some of the results, then being referred elsewhere for the rest.  In itself that’s not a huge performance loss but if you repeat that action many times in sequence or simultaneously then there will be a performance loss both on the network and on the domain controller servicing the requests.

    This is where your Global Catalogs come in handy.  The global catalog contains information from each domain in the forest.  All we need to do in order to service a request like the one discussed in this blog post is direct the query to the global catalog, rather than directing it to a specific domain and ending up chasing referrals all over the network.

    How do we direct the query to the global catalog?  Rather than relying on the default connection port of 389, simply send the query to TCP port 3268 (or 3269 if SSL encrypted) explicitly in your connection.  Here's an MSDN link which goes over this in detail.
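To make the port choice concrete, here is a hedged Python sketch. The two helper functions are pure and just encode the port numbers and the memberOf filter from this post. The last function shows how the query might be sent with the third-party ldap3 package, which is my choice for illustration and not what the customer's application used. The host, account and attribute list are whatever you pass in.

```python
# RFC 4515 characters that must be escaped inside a filter value.
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def ldap_port(use_gc: bool, use_ssl: bool) -> int:
    """389/636 for a domain query, 3268/3269 for the global catalog."""
    if use_gc:
        return 3269 if use_ssl else 3268
    return 636 if use_ssl else 389


def escape_filter_value(value: str) -> str:
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def member_of_filter(group_dn: str) -> str:
    """Filter for objects whose memberOf contains the given group DN."""
    return f"(memberOf={escape_filter_value(group_dn)})"


def search_group_members(host, user, password, base_dn, group_dn, use_ssl=False):
    """Query the global catalog on `host` for members of `group_dn`.

    Requires the third-party ldap3 package (imported here so the helpers
    above can be used without it).
    """
    from ldap3 import Server, Connection, SUBTREE

    server = Server(host, port=ldap_port(True, use_ssl), use_ssl=use_ssl)
    conn = Connection(server, user=user, password=password, auto_bind=True)
    conn.search(
        search_base=base_dn,
        search_filter=member_of_filter(group_dn),
        search_scope=SUBTREE,
        attributes=["sAMAccountName", "distinguishedName"],
    )
    return [entry.entry_dn for entry in conn.entries]


if __name__ == "__main__":
    print(ldap_port(use_gc=True, use_ssl=False))
    print(member_of_filter("CN=coolpeople,OU=DallasUsers,DC=tspring,DC=com"))
```

The only difference from the failing query is the port: the filter and search base stay the same, but the connection goes to the global catalog instead of the domain's default LDAP port.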

    There are times when the initial answer isn’t the best one.  In this case, the answer to the question of ‘why is the query not giving us all of the group members?’ was ‘because the query didn’t specify to chase referrals’.  But the best option was to never query the local directory at all and instead direct the query to the global catalog.

    I wanted to do a quick post on an important security bulletin.  It’s Microsoft Security Bulletin MS09-018 – Critical.  This security update is to address a vulnerability in Active Directory.  I’m pasting the Executive Summary below, but I highly recommend that you read the entire bulletin and apply the updates.

    Executive Summary

    This security update resolves two privately reported vulnerabilities in implementations of Active Directory on Microsoft Windows 2000 Server and Windows Server 2003, and Active Directory Application Mode (ADAM) when installed on Windows XP Professional and Windows Server 2003. The more severe vulnerability could allow remote code execution. An attacker who successfully exploited this vulnerability could take complete control of an affected system remotely. An attacker could then install programs; view, change, or delete data; or create new accounts with full user rights. Firewall best practices and standard default firewall configurations can help protect networks from attacks that originate outside the enterprise perimeter. Best practices recommend that systems that are connected to the Internet have a minimal number of ports exposed.

    This security update is rated Critical for all supported editions of Microsoft Windows 2000 Server, and rated Important for supported versions of Windows XP Professional and Windows Server 2003. For more information, see the subsection, Affected and Non-Affected Software, in this section.

    The security update addresses the vulnerability by correcting the way that the LDAP service allocates and frees memory while processing specially crafted LDAP or LDAPS requests.

    Recommendation. The majority of customers have automatic updating enabled and will not need to take any action because this security update will be downloaded and installed automatically. Customers who have not enabled automatic updating need to check for updates and install this update manually. For information about specific configuration options in automatic updating, see Microsoft Knowledge Base Article 294871.

    For administrators and enterprise installations, or end users who want to install this security update manually, Microsoft recommends that customers apply the update immediately using update management software, or by checking for updates using the Microsoft Update service.

    Please apply this update to your Windows 2000 and Server 2003 domain controllers at your earliest opportunity.

    We use the term single sign on (SSO) to describe a variety of behaviors in Windows and other applications where the result is simply to prevent the user from being prompted to provide their credentials again and again; to ideally enter their credentials only once at initial logon. Active Directory and the integrated authentication which it provides does this very well, and can be extended to other Microsoft applications like SQL, SharePoint, Exchange and others from other companies as well.

    There are times, though, when someone needs to create a specific single sign on behavior. This can derive from the need to use a different credential type (smartcards, for example), a need to interact with another directory service or application, a need to use one time passwords, or any of a wide variety of things. In those cases you have the option to create and install a credential provider for your client computers and servers for Windows Vista, Server 2008 and later versions. This option is a great thing for developers since programming customized experiences for prior versions of Windows could be more challenging.

    We have detailed information on how to develop credential providers available starting with a great MSDN entry. Another great article on that is “Create Custom Login Experiences With Credential Providers For Windows Vista”.

    Let’s go over a support scenario which underscores that credential providers alter the default behavior and also gives a technique on how to identify whether an additional credential provider may be involved or not.

    In the scenario there was an administrator seeing users unexpectedly receiving the credential prompt when opening a terminal services session to a remote 2008 server from Windows clients. There are multiple reasons this prompt can appear, including broken secure channel on client or server, network problems between client and server, or even problems at the domain controller which is being used to provide the authentication.

    So what do you do when you are seeing credential prompts appear unexpectedly and the more common reasons for that are not there?

    • See if the issue is consistent. If it is intermittent it is less likely to be solely caused by a credential provider.
    • Try removing any added credential providers to see if that makes a difference.
    • Remember that in a client-server relationship the server side may be the culprit as well so check whether a credential provider is installed there.

    How can you tell what credential providers are present? There are a couple of things you can easily do. First, you can look in the registry under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Authentication\Credential Providers and see what entries are present.

    This alone may not tell the entire story though since what really matters is whether the credential provider is loaded or not.

    Here’s another way to determine what DLLs are loaded and running code in the LogonUI.exe process. My steps were done on a Windows 7 laptop but would remain the same for Windows Vista or Server 2008.

    1. First, disable UAC temporarily from the Control Panel User Accounts applet.
    2. Create a folder named SC (for example).
    3. Create a text file and put the following text into it: tasklist /m /fi "imagename eq logonui.exe" >c:\sc\result.txt
    4. Rename the file extension of your text file to .CMD
    5. Next, go to Start-->Run and type “gpedit.msc” (without the quotes) and press enter.
    6. Go to User Configuration-->Windows Settings-->Scripts and add your CMD script as a Logon script.


    7. Log off and then log back on.
    8. Open your result.txt. Keep in mind that the output lists every instance of LogonUI.exe if more than one is present. Below are the results from my test, which show the default DLLs and no additional credential providers.

    Image Name                     PID Modules
    ========================= ======== ============================================
    LogonUI.exe                   5528 ntdll.dll, kernel32.dll, KERNELBASE.dll,
                                       msvcrt.dll, ole32.dll, GDI32.dll,
                                       USER32.dll, LPK.dll, USP10.dll, RPCRT4.dll,
                                       IMM32.DLL, MSCTF.dll, CRYPTBASE.dll,
                                       CLBCatQ.DLL, ADVAPI32.dll, OLEAUT32.dll,
                                       authui.dll, COMCTL32.dll, SHLWAPI.dll,
                                       DUI70.dll, sechost.dll, UxTheme.dll,
                                       gdiplus.dll, DUser.dll, SndVolSSO.DLL,
                                       HID.DLL, MMDevApi.dll, SETUPAPI.dll,
                                       CFGMGR32.dll, DEVOBJ.dll, dwmapi.dll,
                                       xmllite.dll, WindowsCodecs.dll,
                                       WINBRAND.dll, VaultCredProvider.dll,
                                       RpcRtRemote.dll,
                                       SmartcardCredentialProvider.dll,
                                       OLEACC.dll, UIAutomationCore.dll,
                                       PSAPI.DLL, BioCredProv.dll, Secur32.dll,
                                       SSPICLI.DLL, winbio.dll, CRYPT32.dll,
                                       MSASN1.dll, credui.dll, VAULTCLI.dll,
                                       NETAPI32.dll, netutils.dll, srvcli.dll,
                                       wkscli.dll, SAMCLI.DLL,
                                       certCredProvider.dll, CRYPTSP.dll,
                                       rasplap.dll, RASAPI32.dll, rasman.dll,
                                       WS2_32.dll, NSI.dll, rtutils.dll,
                                       rsaenh.dll, SXS.DLL, WTSAPI32.dll,
                                       WINSTA.dll, WinSCard.dll
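Comparing a module list like this by eye against a known-good machine gets tedious. One way to do it, sketched here in Python, is to pull every DLL name out of each result.txt and report what the suspect machine loads that the baseline does not. The function names are my own, and ContosoCredProv.dll in the usage example is an invented name standing in for a third-party provider.

```python
import re

# Matches DLL file names such as "authui.dll" or "IMM32.DLL".
_DLL = re.compile(r"[A-Za-z0-9_.\-]+\.dll\b", re.IGNORECASE)


def modules_in(tasklist_output: str) -> set:
    """Return the lower-cased DLL names found in `tasklist /m` output."""
    return {name.lower() for name in _DLL.findall(tasklist_output)}


def extra_modules(suspect_output: str, baseline_output: str) -> list:
    """DLLs loaded on the suspect machine but absent from the baseline."""
    return sorted(modules_in(suspect_output) - modules_in(baseline_output))


if __name__ == "__main__":
    baseline = """LogonUI.exe 5528 ntdll.dll, kernel32.dll, authui.dll,
        SmartcardCredentialProvider.dll, certCredProvider.dll"""
    suspect = """LogonUI.exe 1234 ntdll.dll, kernel32.dll, authui.dll,
        SmartcardCredentialProvider.dll, certCredProvider.dll,
        ContosoCredProv.dll"""
    print(extra_modules(suspect, baseline))
```

Anything the comparison reports is only a candidate. A DLL that differs between machines may have nothing to do with credential providers, so check it against the registry entries mentioned earlier.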

    Where do you go from here if you have noticed that an additional provider is present in the Tasklist.exe result?  Try preventing it from being loaded in order to see if that alters the behavior in any positive way.  You can typically prevent the provider from loading by backing up (saving) the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Authentication\Credential Providers key and then removing the entries related to that provider from it.  Alternatively, the provider may have a registered installation that can be removed from the Programs Control Panel applet.

    If removing the provider makes the problem go away, then contacting the manufacturer of that provider is the next logical step since they may already be aware of the behavior that is being seen.

    If removing the provider does not help then you can resume troubleshooting this using other methods like network captures, debug logging or other applicable actions.

    Additional credential providers add great capabilities to your environment, and along with these may come a bit more needed info for your troubleshooting arsenal.  Hopefully this post has added some for you so that you have it when you need it.

    A lot of planning goes into the features and capabilities of each Windows release. Over the years I’ve noticed that there is not a great deal of awareness out in the general public of just how much work goes into a new version of Windows. We’ll most often hear someone say something like “Microsoft comes out with a new version of Windows every few years,” a statement which glosses over the concerted effort of thousands of people in planning, writing, testing and documenting each feature and capability of a new release.

    What is even less often discussed is that there are times when the product is put in situations or used for needs that were not tested or planned for in pre-release. It’s a fact of life in IT, though, that business need dictates the uses to which your servers will be put and the situations they will be in. This is summed up best by the quote usually attributed to Helmuth von Moltke the Elder, “No plan of battle ever survives contact with the enemy,” though the “enemy” in this situation is really not that at all, but rather our lifeblood at Microsoft: our customers.

    Placing Active Directory domain controllers (DCs) in a network segment that is separated by network address translation (NAT) from the rest of the network, and from its peer DCs, is probably one of those situations. There are quite a few distinct business reasons why this sort of network topology will be implemented. The most common is to provide an authenticating domain controller in a demilitarized network zone (DMZ). This can be used to answer authentication requests from application servers which are also placed in the DMZ, or from clients who are connecting from outside the DMZ, perhaps via the internet.

    What may be the least preferable reason to place a domain controller in a “NATted” network is to use that network to secure an environment. This can be a controversial statement in light of common guidance and advice around securing perimeter networks. We’ll get into specifics a bit later in this blog post and talk about why there are challenges that make this a less preferred thing to do. A good way of thinking of this is to consider that the Active Directory code-all of those separate components that comprise Active Directory-rely on an underlying foundation of network connectivity to be present and working; a foundation of network connectivity which doesn’t take into account network address translation.

    The Microsoft guidance on hosting domain controllers separated by a NATted network is summed up by the article “Active Directory functionality is not supported over a router that has Network Address Translation (NAT) enabled”. That article gives some good general guidance on what network connectivity is required for things to work well.

    Now, let’s consider a simple network address translation scenario where domain controllers are involved.

    [Diagram: DC A on the corporate network and DC B in the DMZ, separated by a firewall performing NAT]

    In the picture above we see DC A in our corporate environment and DC B in our DMZ environment. DC A has a network address in the 192.x.x.x network and DC B has a network address in the 172.x.x.x network. In itself this is by no means a problem. The difficulty comes into the scenario when we consider how domain controllers are found by clients and by each other: DNS.

    Each domain controller will register records in DNS which advertise that the DC can provide services. These are called SRV records. These records in turn map back to a CNAME alias record called the MSDCS record. That record is responsible for providing a resolution to the host record, or A record, which contains the IP address of that DC. The difficult part of a NAT scenario, at least for a domain controller, is that the DMZ DC must register an IP address in its host record which is reachable in some way by its peer domain controller across the network.

    Network address translation works to replace IP header information so that the destination IP is actually different than that which the originator knew of. In our scenario a network device or server between the two DCs (represented as a firewall) does the network address translation.

    This is an acceptable thing as long as the two servers can ultimately find each other and communicate. But without additional steps taken this won’t happen very well.

    DCA = 192.x.x.x

    NAT Int = 192.x.x.x

    NAT Ext = 172.x.x.x

    DMZ_DC = 172.x.x.x

    Why is this a problem? Well, DCA can always initiate AD Replication outbound through the NAT device. However, DMZ_DC cannot by default and will fail when it tries. The problem in this scenario is that DMZ_DC resolves the name "DCA" to the internal IP address of 192.x.x.x and cannot reach the DC. That is because NAT is occurring at the network device and if DMZ_DC sends traffic to that IP it simply won’t make it.

    In this situation, DMZ_DC needs to resolve the name "DCA" to the NAT Ext IP address in order for this to succeed. However, DCA is registering its name in DNS with its true IP address.

    Now, if we manually edit the DNS Host record for DCA on the DMZ_DC side in DNS, it will simply be reregistered by DCA at the next dynamic refresh of DNS by that computer, and then AD Replication will occur in the one way that works consistently and then the problem will exist again.

    The idea here is to tell a component on DCA to register both the true IP address AND the NAT Ext IP address (even though the NAT Ext IP address isn't bound to DCA).

    We can do this with the DNS Server component. To do that we need to add a registry value on DCA.

    HKLM\SYSTEM\CurrentControlSet\Services\DNS\Parameters

    Registry Value: PublishAddresses

    Registry Value Type: REG_MULTI_SZ

    Registry Value Data:<IP addresses>

    The data should contain the IP addresses we need to register separated by a line feed. (IP addresses should be entered on different lines)

    Why and how will this work? Because the DNS Server component in Windows Server 2003 and Windows Server 2008 has "Netmask Ordering" enabled by default, DNS Server will return both IP addresses to the client but will list the IP on the same subnet as the client first. So the clients on the internal network should choose the correct IP address.

    This should also work for the DNS requests from the DMZ_DC. DNS Server will order the list with the IP address on the same subnet as the DNS client that requested the address first, and will then choose the next closest IP subnet/class. So on both internal and external sides, the "client" should choose the correct IP address.
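To picture what netmask ordering does to the answer list, here is a small Python model of it. It is an illustration of the idea only, not the DNS Server's implementation. I am assuming a /24 comparison, the "next closest" tie-breaking mentioned above is not modeled, and the addresses in the example are made up.

```python
import ipaddress


def netmask_order(client_ip: str, addresses: list, prefix_len: int = 24) -> list:
    """Return `addresses` with those on the client's subnet moved to the front.

    The sort is stable, so addresses keep their original relative order
    within the "same subnet" and "other subnet" groups.
    """
    client_net = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return sorted(addresses, key=lambda a: ipaddress.ip_address(a) not in client_net)


if __name__ == "__main__":
    # Hypothetical values: DCA's real address and the NAT Ext address.
    published = ["192.168.2.10", "172.16.5.10"]
    print(netmask_order("192.168.2.50", published))  # internal client
    print(netmask_order("172.16.5.20", published))   # DMZ_DC
```

Both callers get both addresses back. What changes is which one is listed first, and that is the address a client will normally try.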

    As an additional note, support folks at Microsoft often get asked whether we support specific things or ways of doing things. Placing a DC in a DMZ or NATted area is one of those things. Of course, you can always check to see if we support products by going to our web sites on the Product LifeCycle and that team’s corresponding blog. But when it comes down to it, Microsoft Customer Services and Support is here to help you use our products, so we’ll do everything we can to help you, as long as it is commercially reasonable to do so. That doesn’t necessarily mean that we will always provide the answer you want to hear, but we’ll try very hard to.

    The above is a scenario that a business can arrive at and deal with in a lot of different ways. The sole intent here is to provide information that can help you use Microsoft products in the ways that you need to. Administrators and planners may need to adapt or add to this scenario, but hopefully this gives some knowledge to get you started on that path.

    Authentication is entering every facet of our lives nowadays.  It is common to have multiple passwords: passwords for work, home email, and Internet  websites to name a few.  It’s easy to have a lot of different passwords, and equally easy to use only one and risk a widespread identity breach.   Passwords are one way of guaranteeing the identity of an individual in a communication but that’s just a start.  Two factor authentication is becoming more prevalent in the corporate world, and may someday soon be a part of your daily routine in your home life.

    For now, two factor authentication is commonly used as a smartcard plus a user specific password in an Active Directory domain authentication context.  Let’s take a high level refresher on what a smartcard is and how it’s usually used in an Active Directory domain.  A smartcard is essentially a small device which contains a certificate issued to a user.  The user accesses that certificate by placing the card into a card reader and then supplying a password (or personal identification number) to gain access to that certificate prior to requesting authentication.  This certificate, in lieu of the traditional password string of text, is used in communication with the domain for user logon and authentication.

    Smartcard logon works in part by having a certificate based on the Domain Controller template in the local computer certificate store of each authenticating domain controller.  In the more straightforward scenario of an Enterprise Certificate Authority, where information regarding the installed CA is stored in the forest AD, the domain controller certificate is auto enrolled to the domain controller as a matter of course.  That can make for a nice starting place for configuring smartcard logon to work in your environment.

    What if you are a company that maintains a separate certificate authority (CA) from some or all forests and would like to use that CA as an issuer for your smartcard certificates?  There are some clear benefits to doing things that way.  Foremost would be the ability to use one CA to allow smartcard logon for the users in different forests.  This can be useful when you have one unifying corporate structure which has very distinct and separately managed child companies and would also allow for some central control over authentication standards.

    Better written and more technical guidance on smartcard logon for domains and how to do it is in the book Windows Server® 2008 PKI and Certificate Security, and also in the KB article Guidelines for enabling smart card logon with third-party certification authorities.

    The point of this post is not to discuss the value of configuring smartcard logon and how to do it, but rather to talk about what to do when a specific problem involving smartcard logon occurs.

    This is a fairly lengthy premise for a specific problem that you could see: smartcard logon failing while ‘traditional’ credential logon of username plus password succeeds. 

    There are a few different causes that can make this sort of thing happen but the things you want to look at in order to diagnose what is happening are all approximately the same. 

    First, when did it happen?  This can be a useful piece of information since it can point to the cause. Did the problem start after a reboot of the domain controllers?  Did some or all of them seem to fail within a short span of time without a reboot?

    In today’s scenario we’ll put forward that the issue occurred following a reboot of the domain controllers, and that we see some interesting events (below) in the System event log of the domain controllers which seem to relate to the problem.

    Event Type: Error
    Event Source: KDC
    Event Category: None
    Event ID: 7
    Description:
    The Security Account Manager failed a KDC request in an unexpected way. The error is in the data field. The account name was host$ and lookup type 0x0.

    ...and...

    Event Type: Error
    Event Source: KDC
    Event Category: None
    Event ID: 19
    User: N/A
    Computer: <Computer Name>
    Description: This event indicates an attempt was made to use smartcard logon, but the KDC is unable to use the PKINIT protocol because it is missing a suitable certificate.

    Besides the event logs and the events above one of the most useful tools for this type of issue is Certutil.exe.  Certutil.exe is the tool to use in situations where you need to look into the “health” of the certificates in a store.

    For this situation you would want to run the command

    certutil -verifystore my >certverify.txt

    In today’s scenario we do that and see that the private key for the Domain Controller certificate doesn’t appear to be there.

    ================ Certificate 0 ================
    Serial Number: <snip>
    Issuer: CN=MS CertSrv Test Group CA, OU=Windows NT, O=Microsoft, L=Las Colinas, S=TX, C=US
    Subject: CN=DC5.child2.domain1.com, CN=MS CertSrv Test Group CA, OU=Windows NT, O=Microsoft, L=Las Colinas, S=TX, C=US
    Certificate Template Name: Domain Controller
    Non-root Certificate
    Template: Domain Controller
    Cert Hash: <snip>
    No key provider information
    Missing stored keyset <---MISSING
    Verified Issuance Policies:<snip>
    Verified Application Policies:
        1.3.6.1.5.5.7.3.2 Client Authentication
        1.3.6.1.5.5.7.3.1 Server Authentication
    Certificate is valid
    CertUtil: -verifystore command completed successfully.

    So the above appears to be the problem: the private key is missing.  Oddly, when we run the verifystore on the other domain controllers of the domain we see the same subject reference of CN=DC5.child2.domain1.com.  That in itself is a problem since every domain controller should have its own uniquely issued domain controller certificate.
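If you have saved verifystore output from several domain controllers, a short script can flag the certificates with a missing key for you. The Python sketch below keys off the "Certificate N" header, "Subject:" and "Missing stored keyset" lines exactly as they appear in the output above. I am assuming other certutil versions print the same labels, so treat it as a starting point.

```python
import re

# Header line that starts each certificate in the dump.
_HEADER = re.compile(r"^=+ Certificate (\d+) =+$")


def find_missing_keys(output: str) -> list:
    """Return a list of {"index", "subject"} dicts for certs missing a keyset."""
    certs = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = _HEADER.match(line)
        if match:
            current = {"index": int(match.group(1)), "subject": None, "missing": False}
            certs.append(current)
        elif current is None:
            continue  # text before the first certificate header
        elif line.startswith("Subject:"):
            current["subject"] = line[len("Subject:"):].strip()
        elif line.startswith("Missing stored keyset"):
            current["missing"] = True
    return [
        {"index": c["index"], "subject": c["subject"]}
        for c in certs if c["missing"]
    ]


if __name__ == "__main__":
    with open("certverify.txt", encoding="utf-8", errors="replace") as handle:
        for cert in find_missing_keys(handle.read()):
            print(cert["index"], cert["subject"])
```

Running the same check against each DC's output also makes a repeated Subject easy to spot, which is the second symptom described above.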

    For DC5 perhaps the association to the private key is the only thing missing.  In other words, perhaps the private key and rest of the certificate are both there but just not “linked” to each other for some reason.  What can we do to repair that if it can be repaired?

    Certutil.exe to the rescue once more.  We can use that tool to repair things with the command below, using the serial number value found in the verifystore command.  For the domain controller, however, we would need to do that in the DC context which is something you should be able to achieve by using the AT command, either launching a command prompt interactively if allowed or by putting the command below into a batch file and running it that way.

    certutil -repairstore my "<serial number>"

    Voilà! We reboot DC5 and suddenly it can service smartcard logon requests.  The other domain controllers are another matter though.  In a situation such as this the other domain controllers must go through the entire request process for their own Domain Controller certificates.

    So what can we hypothesize happened in this scenario?  Domain Controller template based certificates are issued to specific hosts and cannot be used on a computer other than the one they are issued to.  Since the Subject field of the Domain Controller certificate on every affected domain controller contained the host name of DC5, we can guess that someone exported the certificate from DC5 and imported it to all of the domain controllers in the domain.  For the DCs which were not DC5 that certificate would of course never work, and host uniqueness aside, the default Domain Controller certificate template does not allow for private key export anyway (which is a good thing).  That may also explain why the DC5 certificate had a problem associating with its own private key: the certificate was exported without it, and then that exported certificate was imported back in, thus breaking the association.

    More often you’ll see similar behavior if you are using a third party or non-Enterprise certificate authority and the domain controller certificates expire.   That’s a problem that folks with Microsoft Enterprise CAs are not likely to see since the domain controllers will auto enroll in those certificates.

    The scenario discussed here is by no means a common one.  I’m passing it along, though, to lend some insight into some AD and authentication specific behavior and some troubleshooting that can be applied to a variety of similar issues.  Hopefully you won’t see problems, but if you do then let’s hope this info helps you out.

    There will be times when you have to make big changes in your Active Directory. Sometimes those big changes mean deleting a lot of objects. I’ve personally needed to match customer environments by creating tens of thousands of AD objects just to have the beginnings of a matching environment. For my test forests I can leave those objects around after I’m done and not have to worry about things.

    But if I have a production forest I will probably want to delete unused objects. I’ll also want to reclaim that disk space and the possible performance from indexes that might be filled with these remaining object references.

    A more pointed scenario would be someone who had a maverick provisioning software that created a massive number of unwanted new user objects. These objects would replicate throughout the domain they are in as well as into the global catalogs throughout the forest and would bloat the AD database.

    Such a thing could grow a 50 MB Active Directory database into a 50 GB one in pretty short order.

    Whatever the chain of events that got you to this point, you are now in the position of cleanup: deleting all of the unwanted objects. Let’s take a moment to do a quick, high level run through of how the object deletion process works in AD. When you delete an object in Active Directory Users and Computers, what really happens is that all but a few attributes of that object are discarded, the object is moved to the Deleted Objects container, and it receives a time stamp showing when it was marked for deletion.

    This object is retained for a length of time. That time is known as the tombstone lifetime (TSL). At the end of that length of time the object will be removed by a thread that runs on each domain controller at startup and about every 12 hours afterward. That thread is called Garbage Collection. Picture it as a trash truck carrying a dumpster that pulls up to each deleted object and quickly examines it to see whether it was deleted longer ago than the TSL. If it was, the object goes into the dumpster (figuratively speaking) and is finally deleted and removed from the database. This process is also explained here.
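The check the trash truck makes at each deleted object can be sketched in a few lines of Python. This is only an illustration of the comparison described above, not how the directory service implements it; the function name and the sample dates are made up for the example.

```python
from datetime import datetime, timedelta


def is_past_tombstone_lifetime(deleted_at, now, tsl_days):
    """Return True when the object was deleted longer ago than the TSL."""
    return now - deleted_at > timedelta(days=tsl_days)


# Hypothetical sample values, chosen only to show both outcomes.
now = datetime(2010, 1, 1)
old_tombstone = datetime(2009, 1, 1)      # deleted 365 days before "now"
recent_tombstone = datetime(2009, 12, 1)  # deleted 31 days before "now"

for label, deleted_at in (("old", old_tombstone), ("recent", recent_tombstone)):
    eligible = is_past_tombstone_lifetime(deleted_at, now, tsl_days=180)
    print(label, "goes into the dumpster" if eligible else "stays for now")
```

With a 180 day TSL only the first object qualifies; the second one stays in the database until its own lifetime runs out.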

    In order to do that very quickly (not having to wait out the TSL of 180 or 60 days) you would have to do something we don’t recommend: alter your tombstone lifetime (TSL) to a shorter interval so that garbage collection removes the objects sooner the next time it runs. We have a KB article that talks about the problems you can see when altering this value and why it is generally a bad idea. For the sake of this article we’re going to assume you’re either a Cowboy Admin and have lowered TSL to a small length of time despite Microsoft recommendations, or you have the patience of a saint (Saint Admin? Feels like we should have a patron saint, doesn’t it?).

    But once it’s complete you notice that, while the DIT has decreased a little bit, it hasn’t gotten close to the original size. What’s going on? Didn’t the garbage collection take out all the trash?

    The reason that the DIT only decreased a small amount was the result of the dumpster being too small to fit all of that large set of deleted objects into it. There are simply too many deleted objects (which were deleted longer ago than the tombstone lifetime) to fit into the dumpster. Seriously.

    When the garbage collection thread runs it takes a batch of 5000 objects that match the criteria of having been deleted longer ago than the tombstone lifetime. Once it has removed that batch from the database it will pause in order to let more important AD business take place. What this means in reality is that only that one batch may be done during a garbage collection interval of 12 hours, and then you would have to wait for the next collection to see the next 5000 get removed.
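To get a feel for what that means with a large backlog, here is a rough back-of-the-envelope sketch in Python. It assumes exactly one batch of 5000 per 12 hour interval, as described above; the function names and the one million object figure are only examples.

```python
import math


def collection_runs_needed(expired_tombstones, batch_size=5000):
    """Number of garbage collection passes needed to clear the backlog."""
    return math.ceil(expired_tombstones / batch_size)


def days_to_clear(expired_tombstones, batch_size=5000, interval_hours=12):
    """Rough elapsed time if only one batch is removed per scheduled interval."""
    runs = collection_runs_needed(expired_tombstones, batch_size)
    return runs * interval_hours / 24


backlog = 1_000_000  # hypothetical number of expired deleted objects
print(collection_runs_needed(backlog), "runs, about", days_to_clear(backlog), "days")
```

Under those assumptions a backlog of a million expired objects would take 200 scheduled passes, or roughly 100 days, which is why waiting for the schedule alone may not be practical.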

    Is there a way to speed that process up if you need to? Yes, there is.

    You can initiate garbage collection manually by using a published LDAP operation (a modify against the rootDSE). This doesn’t alter which objects are collected, nor does it alter how many go into the dumpster. It simply says to do garbage collection right then rather than waiting until the next 12 hour interval has passed.

    You can use LDP.EXE to do the garbage collection control. Here are the steps:

    1. In Ldp.exe, when you click Modify on the Browse menu, leave the Distinguished name box empty.

    2. In the Edit Entry Attribute box, type "DoGarbageCollection" (without the quotation marks).

    3. In the Values box, type "1" (without the quotation marks).

    4. Set the Operation value to Add, click the Enter button, and then click Run.
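If you would rather not click through LDP each time, the same modify operation can be written as a small LDIF file. The Python sketch below only generates that text; the file name is made up, and importing the file with something like `ldifde -i -f` on the domain controller is my assumption about how you would apply it, so test it in a lab first.

```python
def garbage_collection_ldif():
    """Build LDIF text for the rootDSE modify described in the LDP steps above."""
    lines = [
        "dn:",                       # empty DN targets the rootDSE
        "changetype: modify",
        "add: doGarbageCollection",  # same Add operation as in LDP
        "doGarbageCollection: 1",
        "-",
        "",
    ]
    return "\n".join(lines)


# Hypothetical file name; any path will do.
with open("do_garbage_collection.ldf", "w") as handle:
    handle.write(garbage_collection_ldif())

print(garbage_collection_ldif())
```

Keeping the request in a file makes it easy to repeat when a manual collection pass stops early.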

    It’s possible that the garbage collection you start using the above method could stop in favor of more important tasks like AD replication in the same way as the scheduled garbage collection does. If that happens you can simply repeat the garbage collection steps above until all of the objects are removed.

    How can you tell if they are removed? We have a KB article which goes over how to view your deleted objects. Take note that you may need to alter the size limit variable if you have a large number of deleted objects.

    What about all of that free space? Can we get it all back just by doing the garbage collection and removing all of the objects that qualify?

    Online defrags may reclaim some of that space (an online defrag will occur as part of the garbage collection), but the best thing to do is reboot to Directory Services Restore Mode (DSRM) and run an offline defragmentation of the database.

    Keep in mind that garbage collection is not replicated in any way.  In other words, the routine you go through for garbage collection and database defragmentation needs to be performed on each domain controller individually.  It would not necessarily be a problem to only force garbage collection on some domain controllers and not others but of course you may see performance differences between those that have had the trash taken out and those that haven’t yet.

    For IT folks who are remote from their domain controllers there’s a nice little option for booting to DSRM without having to resort to using the F8 on the keyboard (which may be thousands of miles away). Just go to your Run command and type MSCONFIG and press enter. Here’s the option in Server 2003:

    [Image: msconfig in Server 2003]

    …and for Server 2008:

    [Image: msconfig in Server 2008]

    This scenario may not necessarily result from a sudden creation of a huge number of objects but could be the result of a gradual database increase over years in production. The resolution will not be different in either case. Follow those same steps and just take out the trash.

    I decided that we needed some more detail and to give a walk through scenario on this downgrade attack deal I mentioned a while back in a blog post.

    As a recap, a customer called in after noticing the events below appearing intermittently but repeatedly, and always in sequence one after the other, in the System event log:

    Event Type: Warning

    Event Source: LSASRV

    Event Category: SPNEGO (Negotiator)

    Event ID: 40960

    Date: 01/01/2009

    Time: 8:07:01 PM

    User: N/A

    Computer: FS123

    Description: The Security System detected an attempted downgrade attack for server cifs/dc5.sales.adatum.com. The failure code from authentication protocol Kerberos was "There are currently no logon servers available to service the logon request."

    Event Type: Warning

    Event Source: LSASRV

    Event Category: SPNEGO (Negotiator)

    Event ID: 40961

    Date: 01/01/2009

    Time: 8:07:01 PM

    User: N/A

    Computer: FS123

    Description: The Security System could not establish a secured connection with the server cifs/dc5.sales.adatum.com. No authentication protocol was available.

    Of course the part that was most alarming was the attempted downgrade attack text. Attack is not a very friendly sounding word and usually implies that there is a person or identity behind the attack, instigating it. Naturally this is something an administrator would want to follow up on!

    Let’s start by defining what a downgrade attack can be. A downgrade attack would be where a connection to obtain a resource starts with a more secure method of authentication but for some reason must settle for a less secure method in order to authenticate and gain access to the resource. Kerberos, for example, is a more secure authentication method than NTLM and hence is preferred, and in fact is preferentially selected in security negotiation in every situation where it can be.

    The word “attack” though suggests that in every case where we attempt Kerberos and end up using NTLM there was a malicious entity behind that when there generally would not be. There are situations where the 40960 and 40961 event sequence will be useful in identifying actual maliciously inspired behavior but for the most part the cause will be something far less dramatic or evil sounding.

    A quick search on the interwebs finds several references to these events. The most informative is here. This Technet event description does a good job of telling us that there can be multiple causes of this event and suggests that it should appear in the event reason code info. The example given is STATUS_NO_LOGON_SERVERS. This is an excellent example since it is probably the most common instigator of this series of events.

    So let’s go over a scenario where the 40960 and 40961 events can occur from STATUS_NO_LOGON_SERVERS. Picture our file server FS123 going about its normal business as a domain joined member server when a user or a service on it suddenly needs to access a file on DC5. FS123 keeps track of where domain controllers for its domain (Sales) are located by having a cache of this information which is maintained by the Netlogon service, and this cache contains information on where a responsive KDC for Sales is on the network.

    So naturally when FS123 attempts to access a file on DC5 it negotiates Kerberos as the selected authentication method. That negotiation is what you will see on the network as SPNEGO information embedded in the SMB traffic from FS123 to DC5 and back. When DC5 responds in that SPNEGO response that it supports Kerberos FS123 knows that it needs to get a ticket for DC5 for the file service. In other words it needs a ticket for the service principal name of cifs/dc5.sales.adatum.com, and so FS123 sends that request out to the KDC it knows of in its cache.

    But here’s where a problem comes in: the KDC it knows of is suddenly not responsive. As a result the Netlogon service returns a status saying so to the file request: STATUS_NO_LOGON_SERVERS. The file request then must be completed using another authentication method like NTLM. Our events 40960 and 40961 are then logged in order to show that we attempted the more secure authentication method but were not successful.

    In our scenario above the file access and the application or user who initiated it probably succeeded in getting access to that file or files without ever noticing this transaction or a delay. But that leaves us with some questions about why it occurred in the first place. Why were we not able to use Kerberos?

    The most common cause for this if the events are seen intermittently is that there is a transient network problem between the client (in our scenario FS123) and the DC it is looking to at that time for authentication. There could be many other causes making that DC less responsive, up to and including the domain controller seeing a performance “spike” and becoming too busy to respond quickly to the Kerberos ticket request from FS123.

    From the FS123 side of things the Netlogon service will actually locate a new, more responsive DC when these things occur but there will be a short interval where things like this may happen. That’s the window where occasional events from our topic occur.

    So how can you use this information? This can be used as a guideline to understand whether there is a transient issue going on or perhaps an actual intrusion where someone is intentionally making the authentication method used for connections less secure in order to more easily break it. The former (transient issues resulting in our 40960 and 40961 event sequence) is not a surprising thing to see occasionally in an enterprise environment. The latter (maliciously intentional cause) is rare to say the least, but a good administrator slash security person will explore each and every one of these events. To do that simply enable netlogon debug logging on the servers or workstations that see the events and look for corresponding errors occurring at the same time as the events, or look through the event logs for other corresponding events at or around that time.
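Lining up log entries with the event times by hand gets tedious, so here is a small Python sketch of the idea. It assumes you have already parsed the log into (timestamp, text) pairs; the parsing itself, the 30 second window, and the placeholder log text are all assumptions for the example, not real netlogon log output.

```python
from datetime import datetime, timedelta


def entries_near(event_time, log_entries, window_seconds=30):
    """Return the log texts stamped within the window around an event."""
    window = timedelta(seconds=window_seconds)
    return [text for stamp, text in log_entries if abs(stamp - event_time) <= window]


# Time of the 40960/40961 pair from the events shown earlier.
event_time = datetime(2009, 1, 1, 20, 7, 1)

# Placeholder entries standing in for parsed log lines.
log_entries = [
    (datetime(2009, 1, 1, 20, 6, 50), "placeholder entry A"),
    (datetime(2009, 1, 1, 20, 7, 5), "placeholder entry B"),
    (datetime(2009, 1, 1, 21, 0, 0), "placeholder entry C"),
]

print(entries_near(event_time, log_entries))
```

Anything that shows up inside the window is a candidate explanation for the failed Kerberos attempt; entries far outside it are probably unrelated.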

    As a postscript, I’ve gotten several great questions from folks via the blog over the past few weeks. I intend to respond to them but have to confess it may be delayed. My apologies for that, folks.

    I find myself doing blog posts on things that are not frequent enough for most experienced admins to be aware of since it wouldn’t come across their desk often. The reason for that is that in my role I receive the least common unresolved issues that occur from our customers. When I receive a few of them over the years I feel that there can be some value in documenting them informally on the blog.

    This one is a case in point. For years we’ve used the Volume Shadow Copy Service as the foundation of the backups that we do in our product. I will not claim to be an overall expert in VSS (if you want to be you can go here) but I do want to relate to you how VSS can affect your domain controllers services in some cases during a backup.

    Consider a scenario where you have two domain controllers for a specific AD site. These two DCs provide services to an application server and do nothing else: no workstations are in that site, no other member servers, just that one application server and the single application running on it. As part of what that application needs to do it sends a frequent, high volume of LDAP and authentication queries to the domain controllers. This keeps the DCs quite busy indeed on a relatively constant basis, busy enough to show the beginnings of noticeable performance bottlenecks on the DCs’ disks during normal usage.

    Now add to the scenario a system state backup (Windows Backup or NTBackup) taken on one of the DCs. During the start of the backup you notice that some of the application queries and authentication requests fail for a short period. Not all of them, but enough show up in your application server’s event logs that it raises a concern. You may not even have end user complaints but it is noticed as an issue. These failures only occur during the “preparation” phase of the DC system state backups, interestingly enough, and never exceed 60 seconds.

    [Image: Preparing to backup]

    So what is going on?

    To understand what is happening we need to understand a few details about how VSS works. VSS basically takes snapshots of the disk data at the time it runs. It is advertised as a seamless backup service (meaning no interruption) because this snapshot is quickly taken and all backup writing and details take place by working with the snapshot, not the live data on disk. This allows the backup process to be seamless to the user since normal services are not being interrupted throughout the backup. The snapshot is what is happening in the ‘preparation’ phase of taking a system state backup using VSS capable backup utilities like Windows Backup.

    However, while the snapshot is being taken VSS imposes a temporary halt to disk writes while still allowing disk reads. To be more precise, there is an Active Directory implementation of the VSS snapshot API which works with VSS to do this. Other applications which use ESE databases, like Exchange, have their own implementation of the snapshot code as well. For AD that means there is a short interval where no database writes can take place. This period is typically so short as to not be noticed 99.999% of the time, but there are factors which can make this period longer.

    Those factors are:

    · High disk utilization taking place, indicated by average disk queue lengths being long. This factor would likely be occurring at all times but would spike during the backup process.

    · An application which has a low timeout threshold, client side, for its requests and no retry or failover behaviors in case of a temporary lack of response for an action from the DC.

    · Lower memory conditions where more of the database is paged onto disk (page file) and would require more disk access to read in order for the snapshot to proceed.

    As mentioned above, Exchange has some well documented information about this. That info can be found in the Knowledge Base and at MSDN.

    So how can you tell if the behavior you are seeing is related to a scenario like this? We can look in Event Viewer for the ESE (database) Freeze and Thaw events in the Application event log during the preparation phase of your backups.

    When the backup preparation begins you will see the ESE source event below (remember that ESE is the type of database AD runs on):

    [Image: ESE event logged when the backup preparation begins]

    When it ends you will see its companion event:

    [Image: ESE event 2003, logged when the backup preparation ends]

    Note that in some cases the event 2003 above will have slightly different wording which includes the word “thaw”.

    More information can be found here.  The Freeze and Thaw intervals correspond to the preparation phase of the backup. The pertinent snippet from the above MSDN article is:

    Shadow Copy Freeze and Thaw

    The creation of every VSS shadow copy operation is bracketed by Freeze and Thaw events, which writers use to put their files in a stable state prior to shadow copy.

    Having Freeze and Thaw events as part of the VSS model means:

    Handling the Freeze event means that those who are developing writers must have a clearly delineated point in the backup cycle where they ensure that all write operations to the disk are stopped and that files are in a well-defined state for backup.

    Handling the Thaw event provides the mechanism for writers to resume writes to the disk and clean up any temporary files or other temporary state information that were created in association with the shadow copy.

    The default window between the Freeze and Thaw events is short (typically 60 seconds); therefore, actual interruption of any service that a writer provides can be minimized.

    Handling of other events (such as PrepareForSnapshot) preceding and following the Freeze and Thaw events, respectively, provides the necessary flexibility to allow writers to complete complicated operations to support shadow copies.

    How can you tell that this issue is affecting you? If you have application side behavior that corresponds to the events 2001 and 2003 then it’s time to do some performance logging on your domain controllers and look for performance bottlenecks. Server Performance Advisor or the Perfmon AD Data Collector in Server 2008, run during the backup, are also good tools for getting a handle on what is going on.
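One way to check that correspondence is to pair each event 2001 with the event 2003 that follows it and see whether your application failures land inside those windows. The Python sketch below assumes you have already exported the event times from the Application log; the function names and all of the sample timestamps are made up for illustration.

```python
from datetime import datetime

FREEZE_EVENT_ID = 2001  # logged when the backup preparation begins
THAW_EVENT_ID = 2003    # logged when the backup preparation ends


def freeze_windows(events):
    """Pair each freeze event with the next thaw event; events must be time ordered."""
    windows = []
    start = None
    for stamp, event_id in events:
        if event_id == FREEZE_EVENT_ID:
            start = stamp
        elif event_id == THAW_EVENT_ID and start is not None:
            windows.append((start, stamp))
            start = None
    return windows


def failures_in_windows(failure_times, windows):
    """Return the failures that fall between a freeze and its thaw."""
    return [t for t in failure_times if any(s <= t <= e for s, e in windows)]


# Hypothetical exported data.
ese_events = [
    (datetime(2009, 6, 1, 2, 0, 0), FREEZE_EVENT_ID),
    (datetime(2009, 6, 1, 2, 0, 40), THAW_EVENT_ID),
]
app_failures = [
    datetime(2009, 6, 1, 2, 0, 15),  # during the freeze
    datetime(2009, 6, 1, 9, 30, 0),  # hours later, unrelated
]

windows = freeze_windows(ese_events)
print(failures_in_windows(app_failures, windows))
```

If most of the failures fall inside the windows, the backup preparation phase is a good suspect; if they do not, look elsewhere.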

     

    What can you do if you have verified that you are seeing this unusual issue? Here’s what I would recommend:

    · Alter the application behavior to better accommodate an occasional delay in server responses from DC.

    · Consider moving to x64 platform for the DCs, with more RAM and augmented by more robust drives and network devices. This should make the VSS freeze and thaw intervals even less perceptible.

    · Decrease the frequency of the backups for those domain controllers only as a last resort.
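For the first recommendation, the change on the application side can be as simple as retrying after a short wait instead of failing on the first timeout. Here is a minimal Python sketch of that idea; the function name, the attempt count, and the delays are assumptions for illustration, and a real application would wrap its own LDAP or authentication call in place of `operation`.

```python
import time


def call_with_retry(operation, attempts=4, first_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on TimeoutError with a doubling delay."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise  # out of retries, let the caller see the failure
            sleep(delay)
            delay *= 2


# Stand-in for a DC query that times out twice and then succeeds.
state = {"calls": 0}


def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("no response from DC")
    return "ok"


print(call_with_retry(flaky_query, sleep=lambda seconds: None))
```

With delays of 1, 2 and 4 seconds the retries span only a few seconds, so for a freeze that runs closer to the 60 second mark you would want a longer first delay or more attempts.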

     

    Hopefully this helps in another less common scenario and gives a better understanding of how things work under the hood in AD.
