The PDCe with too much to do

Hi. Mark again. As part of my role in Premier Field Engineering, I’m sometimes called upon to visit customers when a critical issue being worked by CTS needs another set of eyes. For today’s discussion, I’m going to talk you through one such visit.

It was a dark and stormy night …

Well not really – it was mid-afternoon but these sorts of things always have that sense of drama.

The Problem

Custom applications were hard coded to use the PDC Emulator (PDCe) for authentication – a strategy the customer later abandoned to eliminate a single point of failure. The issue was hot because the PDCe was not processing authentication requests after a reboot.

The customer had noticed lsass.exe consuming a lot of CPU and this is where CTS were focusing their efforts.

The Investigation

Starting with the Directory Service event logs, I noticed the following:

Event Type:          Information

Event Source:        NTDS Replication

Event Category:      Replication

Event ID:            1555

Date:                <Date>

Time:                <Time>

User:                NT AUTHORITY\ANONYMOUS LOGON

Computer:            <Name of PDCe>

Description:

The local domain controller will not be advertised by the domain controller locator service as an available domain controller until it has completed an initial synchronization of each writeable directory partition that it holds. At this time, these initial synchronizations have not been completed.

The synchronizations will continue.

also:

Event Type:          Warning

Event Source:        NTDS Replication

Event Category:      Replication

Event ID:            2094

Date:                <Date>

Time:                <Time>

User:                NT AUTHORITY\ANONYMOUS LOGON

Computer:            <Name of PDCe>

Description:

Performance warning: replication was delayed while applying changes to the following object. If this message occurs frequently, it indicates that the replication is occurring slowly and that the server may have difficulty keeping up with changes.

Object DN: CN=<ClientName>,OU=Workstations,OU=Machine Accounts,DC=<Domain Name>,DC=com

Object GUID: <GUID>

Partition DN: DC=<Domain Name>,DC=com

Server: <_msdcs DNS record of replication partner>

Elapsed Time (secs): 440

User Action

A common reason for seeing this delay is that this object is especially large, either in the size of its values, or in the number of values. You should first consider whether the application can be changed to reduce the amount of data stored on the object, or the number of values. If this is a large group or distribution list, you might consider raising the forest version to Windows Server 2003, since this will enable replication to work more efficiently. You should evaluate whether the server platform provides sufficient performance in terms of memory and processing power. Finally, you may want to consider tuning the Active Directory database by moving the database and logs to separate disk partitions.

If you wish to change the warning limit, the registry key is included below. A value of zero will disable the check.

Additional Data

Warning Limit (secs): 10

Limit Registry Key: System\CurrentControlSet\Services\NTDS\Parameters\Replicator maximum wait for update object (secs)

and:

Event Type:          Warning

Event Source:        NTDS General

Event Category:      Replication

Event ID:            1079

Date:                <Date>

Time:                <Time>

User:                <SID>

Computer:            <Name of PDCe>

Description:

Internal event: Active Directory could not allocate enough memory to process replication tasks. Replication might be affected until more memory is available.

User Action

Increase the amount of physical memory or virtual memory and restart this domain controller.

In summary, the PDCe hasn’t completed initial synchronisation after a reboot and it’s having memory allocation problems while it works on sorting it out. Initial synchronisation is discussed in:

Initial synchronization requirements for Windows 2000 Server and Windows Server 2003 operations master role holders
http://support.microsoft.com/kb/305476

With this information in hand, I had a chat with the customer hoping we’d identify a relevant change in the environment leading up to the outage. It became apparent they’d configured a policy for deploying RDP session certificates. Furthermore, they’d noticed clients receiving many of these certificates instead of the expected one.

RDP session certificates are Secure Sockets Layer (SSL) certificates issued to Remote Desktop servers. It is also possible to deploy RDP session certificates to client operating systems such as Windows Vista and Windows 7. More on this later…

The customer and I examined a sample client and found 285 certificates! In addition to this unusual behaviour, the certificates were being published to Active Directory. There were 3700 affected clients – approx. 1 million certificates published to AD!
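For anyone who wants to check the arithmetic behind that "approx. 1 million", here is a quick back-of-the-envelope sketch in Python. It assumes every affected client held as many certificates as the one sample client we examined, which is only an approximation; the function name is purely illustrative.

```python
# Back-of-the-envelope estimate using the figures quoted above.
CERTS_ON_SAMPLE_CLIENT = 285   # certificates found on the sample client
AFFECTED_CLIENTS = 3700        # affected clients in the environment


def estimated_published_certs(certs_per_client: int, clients: int) -> int:
    """Assumption: every affected client looks like the sample client."""
    return certs_per_client * clients


total = estimated_published_certs(CERTS_ON_SAMPLE_CLIENT, AFFECTED_CLIENTS)
print(f"{total:,} certificates")  # 1,054,500 certificates
```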

The Story So Far

We’ve injected huge amounts of certificate data into the userCertificate attribute of computer objects, we’ve got a replication backlog due to memory allocation issues, and the DC won’t advertise itself as a DC until it completes an initial sync.

What Happened Next Uncle Mark?!

The CTS engineer back at home base wanted to gather some debug logging of LSASS.exe. While attempting to gather such a log, the PDCe became completely unresponsive and we had to reboot.

While the PDCe rebooted, the customer disabled the policy responsible for deploying RDP session certificates.

After the reboot, the PDCe had stopped logging event 1079 (for memory allocation failures) but in addition to event 1555 and 2094, we were now seeing:

Event Type:          Warning

Event Source:        NTDS Replication

Event Category:      DS RPC Client

Event ID:            1188

Date:                <Date>

Time:                <Time>

User:                NT AUTHORITY\ANONYMOUS LOGON

Computer:            <Name of PDCe>

Description:

A thread in Active Directory is waiting for the completion of a RPC made to the following domain controller.

Domain controller: <_msdcs DNS record of replication partner>

Operation: get changes

Thread ID: <Thread ID>

Timeout period (minutes): 5

Active Directory has attempted to cancel the call and recover this thread.

User Action

If this condition continues, restart the domain controller.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

A bit more investigation with:

Repadmin.exe /showreps (or /showrepl for later versions of repadmin)

told us that all partitions were in sync except the domain partition – the partition with a million certificates attached to computer objects.

We decided to execute:

Repadmin.exe /replicate <Name of PDCe> <Closest Replication Partner> <Domain Naming Context> /force
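If you ever need to script this step, here is a minimal Python sketch that only assembles the argument list for the command above. It does not run anything. The function name is illustrative, and the server names and naming context in the usage lines are hypothetical placeholders, not values from this incident.

```python
from typing import List


def build_force_replicate_command(
    destination_dc: str, source_dc: str, naming_context: str
) -> List[str]:
    """Return the repadmin /replicate argument list in the order shown above:
    destination DC, source DC, naming context, then /force."""
    return [
        "repadmin.exe",
        "/replicate",
        destination_dc,
        source_dc,
        naming_context,
        "/force",
    ]


# Hypothetical names, for illustration only.
command = build_force_replicate_command(
    "PDCE01", "DC02", "DC=example,DC=com"
)
print(" ".join(command))
```

Building the command as a list rather than a single string means each argument can be passed to a process launcher without extra quoting.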

Next, we waited … for several hours.

While waiting, we considered:

  • Disabling initial sync with:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters]

Repl Perform Initial Synchronizations = 0

  • Increasing the RPC timeout for NTDS with:

http://support.microsoft.com/default.aspx?scid=kb;EN-US;830746

Both of these changes require a reboot. The customer was hesitant to reboot again and while they thought it over, initial sync completed.
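For reference, the first option can be captured as a .reg file. The Python sketch below only renders the file text; it does not touch the registry. It assumes the value is a REG_DWORD, in line with the initial synchronization KB article linked earlier, and the function name is illustrative. Treat it as a sketch to review, not something to import blindly on a production DC.

```python
NTDS_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
)
INITIAL_SYNC_VALUE = "Repl Perform Initial Synchronizations"


def render_reg_file(key: str, value_name: str, dword: int) -> str:
    """Render a .reg file that sets one DWORD value (assumed type: REG_DWORD)."""
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        f'"{value_name}"=dword:{dword:08x}\n'
    )


# 0 disables the initial synchronization requirement, as described above.
reg_text = render_reg_file(NTDS_PARAMETERS_KEY, INITIAL_SYNC_VALUE, 0)
print(reg_text)
```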

With the PDCe authenticating clients, I headed home to get some sleep. The customer had disabled the RDP session certificate deployment policy and was busy clearing the certificate data out of computer objects in Active Directory.

Why?

The next day, I went looking for root cause. The customer had followed some guidance to deploy the RDP session certificates. Some of the guidance noted during the investigation is posted here:

http://blogs.msdn.com/b/rds/archive/2010/04/09/configuring-remote-desktop-certificates.aspx

I set up a test environment and walked through the guidance. After doing so, I did not experience the issue. I was getting a single certificate no matter how often I would reboot or apply Group Policy. In addition, RDP session certificates were not being published in Active Directory. Publishing in Active Directory is easily explained by the “Publish certificate in Active Directory” checkbox on the certificate template.

An examination of the certificate template confirmed they had this checked.

So why were clients in the customer environment receiving multiple certificates while clients in my test environment received just one?

The Win

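I noticed the following point in the guidance being followed by the customer:

[Image: excerpt from the guidance advising that “Template display name” and “Template name” be given the same value]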

A bit of an odd recommendation. Sure enough, the customer’s template had different names for “Template display name” and “Template name”. I changed my test environment to make the same mistake and suddenly I had a repro – a new certificate on every reboot and policy refresh.

Some research revealed that this was a known issue. One of these fields checks whether an RDP session certificate exists while the other field obtains a new certificate. Giving both fields the same name works around the problem.
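To make the mechanism concrete, here is a toy Python model of that behaviour. It is emphatically not the actual enrollment code: the class, function, and template names are all made up for illustration, and which real field performs the check versus the enrollment is not something this post specifies. The only point is that when the existence check and the enrollment use different names, the check never finds the certificate it is looking for.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyClient:
    """Toy certificate store: each entry is the name a cert was enrolled under."""
    store: List[str] = field(default_factory=list)

    def refresh_policy(self, lookup_name: str, enroll_name: str) -> None:
        # Check for an existing certificate using one name...
        if lookup_name not in self.store:
            # ...but enroll (and record the new cert) under the other name.
            self.store.append(enroll_name)


def certs_after(refreshes: int, lookup_name: str, enroll_name: str) -> int:
    """Count certificates held after a number of reboots or policy refreshes."""
    client = ToyClient()
    for _ in range(refreshes):
        client.refresh_policy(lookup_name, enroll_name)
    return len(client.store)


# Mismatched names: the lookup never succeeds, so every refresh enrolls again.
print(certs_after(5, "RDP Cert", "RDPCert"))  # 5
# Matching names: the first refresh enrolls, later refreshes find the cert.
print(certs_after(5, "RDPCert", "RDPCert"))   # 1
```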

Conclusion

So in the aftermath of this incident, there are some general recommendations that anyone can follow to help avoid this kind of situation.

  • Follow our guidance carefully – even the weird stuff
  • Test before you deploy
  • Deploy the same way as you test
  • Avoid making critical servers more critical than they need to be

- Mark “Falkor” Renoden

  • Here's a bit more formal (and brief) description of the issue: support.microsoft.com/.../2531138.

  • Wow, crazy story.

    I learned to start with the documentation after deploying Group Chat in my test lab.  Despite the fact you _can_ rename the service account...there is (or was) a known issue where the account had to be named OCSChat.  The Technet article even said something light, to the effect of "we recommend naming the account OCSChat", despite the fact it should have read "It is required to be named OCSChat"

  • BTW, the symptoms are a bit weird. In my lab I have the following since May.

    1. The template name is NOT exactly “RemoteDesktopComputer”.

    2. The template display name and template CN ARE the same.

    3. The “Publish certificate in Active Directory” is NOT selected. (Why in the world do I need it for Remote Desktop certs?).

    4. The certificates are NOT renewed by themselves. In fact, there's one certificate per machine. Everything works pretty much as expected.

  • Yeah, and there's one more difference. My template is a v3 one, though the RDS team blog post suggests it should be a v2 one.

  • @Pronichkin

    That KB article was actually the result of this issue.

    The requirement isn't that the names be equal to "RemoteDesktopComputer", simply that they are the same - which you have.

    There is no good reason to publish these certificates in Active Directory. It was simply a contributing factor in this particular issue.

    - Mark

  • Hi Mark

    As the customer in question, I’d like to set the record straight on a number of points.

    We followed a TechNet article when configuring Win7 RDP cert enrolment, which makes no mention of the enrolment bug that we encountered or any workaround that should be used.  We were an early adopter of Win7, so the guidance in the blog that you suggest we should have followed did not exist yet.  

    We did test this configuration prior to production implementation - one of the first issues we noticed was that RD was not using the client/server authentication certs that machines had already successfully enrolled for (for 802.1x and SCCM). So from day one, a design limitation in the way that RDP selects local certificates created a situation where it was “normal” for a Win7 machine to enrol for more than one certificate for exactly the same purpose. I think you’ll agree that without the benefit of hindsight, it would be hard under these circumstances to spot the upcoming trouble. Indeed, it took a considerable period of time before there were sufficient Win7 machines in the environment for the problem to reach critical mass and become a problem in the AD.

    I guess we reached some different conclusions to you, notably the importance of establishing a long-term monitoring baseline for AD memory usage, and to question anomalies like having to enrol for a second cert with the same usage.  Agree fully on the point of developers hard coding apps to point at named DC’s, or DC’s that hold the PDC FSMO role, and we are working very hard to address this.

    Cheers

  • Hiya SharpDev,

    I don't believe Mark was stating that you did things wrong or didn't test. The point of this blog post is more about how to troubleshoot (the issue is already well documented and old at this point, after all). I added a few sentences to make sure that's clear.

    Every article we ever publish is, at its heart, about some anonymous customer - and we take great pains to ensure every post contains no identifiable info. Some even take a rather perverse pride in it. :)

    Thanks for the feedback.