Blog - Title

August, 2010

  • New DNS and AD DS BPAs released (or: the most accurate list of DNS recommendations you will ever find from Microsoft)

    Hi folks, Ned here again. We’ve released another wave of Best Practices Analyzer rules for Windows Server 2008 / R2, and if you care about Directory Services you care about these:

    AD DS rules update

    Info: Update for the AD DS Best Practices Analyzer rules in Windows Server 2008 R2
    Download: Rules Update for Active Directory Domain Services Best Practice Analyzer for Windows Server 2008 R2 x64 Editions (KB980360)

    This BPA update for Active Directory Domain Services includes seven rule changes and updates, some of which are well known and a few that are not.

    DNS Analyzer 2.0

    Operation Info: Best Practices Analyzer for Domain Name System – Ops
    Configuration info: Best Practices Analyzer for Domain Name System - Config
    Download: Microsoft DNS (Domain Name System) Model for Microsoft Baseline Configuration Analyzer 2.0

    Remember when – a few weeks back – I wrote about recommended DNS configuration and I promised more info? Well here it is, in all its glory. Despite what you might have heard, misheard, remembered, or argued about, this is the official recommended list, written by the Product Group and appended/vetted/munged by Support. Which includes:

    Awww yeaaaahhh… just memorize that and you’ll win any "Microsoft recommended DNS" bar bets you can imagine. That’s the cool thing about this ongoing BPA project: not only do you get a tool that will check your work in later OS versions, but the valid documentation gets centralized.

    - Ned “Arren hates cowboys” Pyle

  • Moving Your Organization from a Single Microsoft CA to a Microsoft Recommended PKI

    Hi, folks! Jonathan here again, and today I want to talk about what appears to be an increasingly common topic: migrating from a single Windows Certification Authority (CA) to a multi-tier hierarchy. I’m going to assume that you already have a basic understanding of Public Key Infrastructure (PKI) concepts, i.e., you know what a root CA is versus an issuing CA, and you understand that Microsoft CAs come in two flavors -- Standalone and Enterprise. If you don’t know those things then I recommend that you take a look at this before proceeding.

    It seems that many organizations installed a single Windows CA to support whatever major project happened to require it. Perhaps they were rolling out System Center Configuration Manager (SCCM), or wireless, or some other certificate-consuming technology, and one small line item in the project’s plan was Install a CA. Over time, though, this single CA began to see a lot of use as it was leveraged more and more for purposes other than originally conceived. Suddenly, there is a need for a proper Public Key Infrastructure (PKI) and administrators are facing some thorny questions:

    1. Can I install multiple PKIs in my forest without them interfering with each other?
    2. How do I set up my new PKI properly so that it is scalable and manageable?
    3. How do I get rid of my old CA without causing an interruption in my business?

    I’m here to tell you that you aren’t alone. There are many organizations in the same situation, and there are good answers to each of these questions. More importantly, I’m going to share those answers with you. Let’s get started, shall we?

    Important Note: This blog post does not address the private key archival scenario. Stay tuned for a future blog post on migrating archived private keys from one CA to another.

    Multiple PKIs In The Forest? Isn’t That Like Two Cats Fighting Over the Same Mouse?

    Uh….no.

    (You know, I actually considered asking Ned to find some Office clip art that showed two cats fighting over a mouse, and then thought, “What if he found it?!” I decided I didn’t really want to know and bagged the idea.)

    To be clear, there is absolutely no issue with installing multiple Windows root CAs in the same forest. You can deploy your new PKI and keep it from issuing certificates to your users or computers until you are good and ready for it to do so. And while you’re doing all this, the old CA will continue to chug along oblivious to the fact that it will soon be removed with extreme prejudice.

    Each Windows CA you install requires some objects created for it in Active Directory. If the CA is installed on a domain member these objects are created automatically. If, on the other hand, you install the CA on a workgroup computer that is disconnected from the network, you’ll have to create these objects yourself.

    Regardless, all of these objects exist under the following container in Active Directory:

    CN=Public Key Services, CN=Services, CN=Configuration, DC=<forestRootPartition>

    As you can see, these objects are located in the Configuration partition of Active Directory which explains why you have to be an Enterprise Admin in order to install a CA in the forest. The Public Key Services Container holds the following objects:

    CN=AIA Container

    AIA stands for Authority Information Access, and this container is the place where each CA will publish its own certificate for applications and services to find if needed. The AIA container holds certificationAuthority objects, one for each CA. The name of the object matches the canonical name of the CA itself.

    CN=CDP Container

    CDP stands for CRL Distribution Point (and CRL stands for Certificate Revocation List). This container is where each CA publishes its list of revoked certificates to Active Directory. In this container, you’ll find another container object whose common name matches the host name of the server on which Certificate Services is installed – one for each Windows CA in your forest. Within each server container is a cRLDistributionPoint object named for the CA itself. The actual CRL for the CA is published to this object.

    CN=Certificate Templates Container

    The Certificate Templates container holds a list of pKICertificateTemplate objects, each one representing one of the templates you see in the Certificate Templates MMC snap-in. Certificate templates are shared objects, meaning they can be used by any Enterprise CA in the forest. There is no CA-specific information stored on these objects.

    CN=Certification Authorities Container

    The Certification Authorities container holds a list of certificationAuthority objects representing each root CA trusted by the Enterprise. Any root CA certificate published here is distributed to each and every member of the forest as a trusted root. A Windows root CA installed on a domain server will publish its certificate here. If you install a root CA on a workgroup server you’ll have to publish the certificate here manually.

    CN=Enrollment Services

    The Enrollment Services container holds a list of pKIEnrollmentService objects, each one representing an Enterprise CA installed in the forest. The pKIEnrollmentService object is used by Windows clients to locate a CA capable of issuing certificates based on a particular template. When you add a certificate template to a CA via the Certification Authority snap-in, that CA’s pKIEnrollmentService object is updated to reflect the change.

    Other Container

    There are few other objects and containers in the Public Key Services container, but they are beyond the scope of this post. If you’re really interested in the nitty-gritty details, post a comment and I’ll address them in a future post.
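    If you’d like to see these objects in your own forest before we go any further, you can query the Configuration partition directly. Here is a small PowerShell sketch (it assumes the ActiveDirectory module from the Windows Server 2008 R2 RSAT is available and that you have read access to the Configuration partition):

    # Sketch: enumerate everything under CN=Public Key Services in the Configuration partition.
    Import-Module ActiveDirectory
    $configNC = (Get-ADRootDSE).configurationNamingContext
    $pkiRoot  = "CN=Public Key Services,CN=Services,$configNC"
    Get-ADObject -SearchBase $pkiRoot -SearchScope Subtree -Filter * |
        Sort-Object DistinguishedName |
        Select-Object Name, ObjectClass, DistinguishedName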

    To summarize, let’s look at a visual of each of these objects and containers and see how they fit together. I’ve diagrammed out an environment with three CAs. One is the Old And Busted CA, which has been tottering along for years ever since Bob the network admin put it up to issue certificates for wireless authentication.

    Now that Bob has moved on to new and exciting opportunities in the field of food preparation and grease trap maintenance after that unfortunate incident with the misconfigured VLANs, his successor, Mike, has decided to deploy a new, enterprise-worthy PKI.

    To that end, Mike has deployed the New Hotness Root CA, along with the More New Hotness Issuing CA. The New Hotness Root CA is an offline Standalone root, meaning it is running the Windows CA in Standalone mode on a workgroup server disconnected from the network. The New Hotness Issuing CA, however, is an online issuing CA. It’s running in Enterprise mode on a domain server.

    Let’s see what the AD objects for these CAs look like:

    clip_image001

    Figure 1: Sample PKI AD objects

    We’ve come an awfully long way to emphasize one simple point. As you can see, each PKI-related object in Active Directory is uniquely named, either for the CA itself or for the server on which the CA is installed. Because of this, you can install a (uniquely named) CA on every server in your environment and not run into the sort of conflict that some customers fear when I talk to them about this topic. You could also press your tongue against a metal pole in the dead of winter. Of course, it would hurt, and you’d look silly, but you could do it. Same concept applies here.

    So what’s the non-silly approach?

    The Non-Silly Approach

    If you need to migrate your organization from the Old And Busted CA to the New Hotness PKI, then the very first thing you should do is deploy the new PKI. This requires proper planning, of course; select your platform, locate your servers, that sort of thing. I encourage you to use a Windows Server 2008 R2 platform. WS08R2 CAs are supported with a minimum schema version of 30 which means you do not need to upgrade your Windows Server 2003 domain controllers. More details are here.

    Once your planning is complete, deploy your new PKI. Actual step-by-step guidance is beyond the scope of this blog post, but it is pretty well covered elsewhere. You should first take a look at the Best Practices for Implementing a Microsoft Windows Server 2003 Public Key Infrastructure. Yes, I realize this was for Windows Server 2003, but the concepts are identical for Windows Server 2008 and higher, and the scripts included in the Best Practices Guide are just as useful for the later platforms. It is also true that the Guide describes setting up a three tiered hierarchy, but again, you can easily adapt the prescriptive guidance to a two tiered hierarchy. If you want help with that then you should take a look at this post.

    The major benefit to using Windows Server 2008 or higher is a neat little addition to the CAPolicy.INF file. When you install a new Enterprise CA, it is preconfigured with a set of default certificate templates and is ready to start issuing certificates against them immediately. You don’t really want the CA to issue any certificates until you’re good and ready for it to do so. If the Enterprise CA weren’t configured with any templates by default, then it wouldn’t issue any certificates after the CA starts up; when you were ready to switch over to the new PKI, you’d just configure the issuing CA with the appropriate templates. As of Windows Server 2008, you can install an Enterprise issuing CA so that the default certificate templates are not automatically configured on the CA. You accomplish this by adding a line to the CAPolicy.inf file:

    LoadDefaultTemplates=False
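    For context, a minimal CAPolicy.INF carrying that setting might look like the sketch below (the file goes in %windir% on the CA before you install the role; see the Best Practices Guide for the full syntax and the other sections you will likely want):

    [Version]
    Signature="$Windows NT$"

    [certsrv_server]
    LoadDefaultTemplates=False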

    Now, if at this point you’re wondering, “What is a CAPolicy.INF file, and how is it involved in setting up a CA,” then guess what? That is your clue that you need to read the Best Practices Guide linked above. It’s all in there, including samples.

    “Oh…but the samples are for Windows Server 2003,” you say, accusingly. Relax; here’s a blog post I wrote earlier fully documenting the Windows Server 2008 R2 CAPolicy.INF syntax. Again, the concepts and broad strokes are all the same; just some minor details have changed. Use my earlier post to supplement the Best Practices Guide and you’ll be golden.

    I Have My New PKI, So Now What?

    So you have your new PKI installed and you’re ready to migrate your organization over to it. How does one do that without impacting one’s organization too severely?

    The first thing you’ll want to do is prevent the old CA from issuing any new certificates. You could just uninstall it, of course, but that could cause considerable problems. What do you think would happen if that CA’s published CRL expired and the CA wasn’t around to publish a new one? Depending on the application using those certificates, they’d all fail to validate and become useless. Wireless clients would fail to connect, smart card users would fail to authenticate, and all sorts of other bad things would occur. The goal is to prevent any career-limiting outages, so you shouldn’t just uninstall that CA.

    No, you should instead remove all the templates from the Certificate Templates folder using the Certification Authority MMC snap-in on the old CA. If an Enterprise CA isn’t configured with any templates it can’t issue any new certificates. On the other hand, it is still quite capable of refreshing its CRL, and this is exactly the behavior you want. Conversely, you’ll want to add those same templates you removed from the Old And Busted CA into the Certificate Templates folder on the New Hotness Issuing CA.
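    If you prefer the command line to the snap-in, certutil can make the same template changes. A rough sketch (the server and template names below are just the examples used in this post; substitute your own) might be:

    REM Remove the User template from the old CA
    certutil -config "OLDSERVER\Old And Busted CA" -SetCATemplates -User

    REM Add the User template to the new issuing CA
    certutil -config "NEWSERVER\New Hotness Issuing CA" -SetCATemplates +User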

    If you modify the contents of the Certificate Templates folder for a particular CA, that CA’s pKIEnrollmentService object must be updated in Active Directory. That means that you will have some latency as the changes replicate amongst your domain controllers. It is possible that some user in an outlying site will attempt to enroll for a certificate against the Old And Busted CA and that request will fail because the Old And Busted CA knows immediately that it should not issue any certificates. Given time, though, that error condition will fade as all domain controllers get the new changes. If you’re extremely sensitive to that kind of failure, however, then just add your templates to the New Hotness Issuing CA first, wait a day (or whatever your end-to-end replication latency is) and then remove those templates from the Old And Busted CA. In the long run, it won’t matter if the Old And Busted CA issues a few last minute certificates.

    At this point all certificate requests within your organization will be processed by the New Hotness Issuing CA, but what about all those certificates issued by the Old And Busted CA that are still in use? Do you have to manually go to each user and computer and request new certificates? Well…it depends on how the certificates were originally requested.

    Manually Requested

    If a certificate has been manually requested then, yes, in all likelihood you’ll need to manually update those certificates. I’m referring here to those certificates requested using the Certificates MMC snap-in, or through the Web Enrollment pages. Unfortunately, there’s no automatic management for certificates requested manually. In reality, though, refreshing these certificates probably means changing some application or service so it knows to use the new certificate. I refer here specifically to Server Authentication certificates in IIS, OCS, SCCM, etc. Not only do you need to change the certificate, but you also need to reconfigure the application so it will use the new certificate. Given this situation, it makes sense to make the necessary changes gradually. Presumably, there is already a procedure in place for updating the certificates used by these applications (among others I didn’t mention) as the current certificates expire. As time passes and each of these older, expiring certificates is replaced by a new certificate issued by the new CA, you will gradually wean your organization off of the Old And Busted CA and onto the New Hotness Issuing CA. Once that is complete you can safely decommission the old CA.

    And it isn’t as though you don’t have a deadline. As soon as the Old And Busted CA certificate itself has expired you’ll know that any certificate ever issued by that CA has also expired. The Microsoft CA enforces such validity period nesting of certificates. Hopefully, though, that means that all those certificates have already been replaced, and you can finally decommission the old CA.

    Automatically Enrolled

    Certificate Autoenrollment was introduced in Windows XP, and it allows the administrator to assign certificates based on a particular template to any number of forest users or computers. Triggered by the application of Group Policy, this component can enroll for certificates and renew them as they age. Using Autoenrollment, one can easily deploy thousands of certificates very, very quickly. Surely, then, there must be an automated way to replace all those certificates issued by the previous CA?

    As a matter of fact, there is.

    As described above, the new PKI is up and ready to start issuing digital certificates. The old CA is still up and running, but all the templates have been removed from the Certificate Templates folder so it is no longer issuing any certificates. But you still have literally thousands of automatically enrolled certificates outstanding that need to be replaced. What do you do?

    In the Certificates Templates MMC snap-in, you’ll see a list of all the templates available in your enterprise. To force all holders of a particular certificate to automatically enroll for a replacement, all you need to do is right-click on the template and select Reenroll All Certificate Holders from the context menu.

    clip_image002

    What this actually does is increment the major version number of the certificate template in question. This change is detected by the Autoenrollment component on each Windows workstation and server prompting them to enroll for the updated template, replacing any certificate they may already have. Automatically enrolled user certificates are updated in the exact same fashion.

    Now, how long it takes for each certificate holder to actually finish enrolling will depend on how many there are and how they connect to the network. For workstations that are connected directly to the network, user and computer certificates will be updated at the next Autoenrollment pulse.

    Note: For computers, the autoenrollment pulse fires at computer startup and every eight hours thereafter. For users, the autoenrollment pulse fires at user logon and every eight hours thereafter. You can manually trigger an autoenrollment pulse by running certutil -pulse from the command line. Certutil.exe is installed with the Windows Server 2003 Administrative Tools Pack on Windows XP, but it is installed by default on the other currently supported versions of Windows.

    For computers that only connect by VPN it may take longer for certificates to be updated. Unfortunately, there is no blinking light that says all the certificate holders have been reenrolled, so monitoring progress can be difficult. There are ways it could be done -- monitoring the certificates issued by the CA, using a script to check workstations and servers and verify that the certificates are issued from the new CA, etc. -- but they require some brain and brow work from the Administrator.
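    As one example of that brow work, here is a rough PowerShell sketch you could run on a workstation (or fold into your own remoting or inventory tooling) to list machine certificates still issued by the old CA; the CA name below is just the example name from this post:

    # Sketch: find computer certificates that were issued by the old CA.
    $oldCaName = 'Old And Busted CA'   # substitute your old CA's name
    Get-ChildItem Cert:\LocalMachine\My |
        Where-Object { $_.Issuer -like "*$oldCaName*" } |
        Select-Object Subject, Issuer, NotAfter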

    There is one requirement for this reenrollment strategy to work. In the group policy setting where you enable Autoenrollment, you must have the following option selected: Update certificates that use certificate templates.

    clip_image003

    If this policy option is not enabled then your autoenrolled certificates will not be automatically refreshed.

    Remember, there are two autoenrollment policies -- one for the User Configuration and one for the Computer Configuration. This option must be selected in both locations in order to allow the Administrator to force both computers and users to reenroll for an updated template.

    But I Have to Get Rid of the Old CA!

    As I’ve said earlier, once you’ve configured the Old And Busted CA so that it will no longer issue certificates, you shouldn’t need to touch it again until all the certificates issued by that CA have expired. As long as the CA continues to publish a revocation list, all the certificates issued by that CA will remain valid until they can be replaced. But what if you want to decommission the Old And Busted CA immediately? How could you make sure that your outstanding certificates remain viable until you can replace them with new certificates? Well, there is a way.

    All X.509 digital certificates have a validity period, a defined interval of time with fixed start and end dates between which the certificate is considered valid unless it has been revoked. Once the certificate has expired there is no need to check a certificate revocation list (CRL) -- the certificate is invalid regardless of its revocation status. Revocation lists also have a validity period during which each is considered an authoritative list of revoked certificates. Once the CRL has expired it can no longer be used to check revocation status; a client must retrieve a new CRL.

    You can use this to your advantage by extending the validity period of the Old And Busted CA’s CRL in the CA configuration to match (or exceed) the remaining lifetime of the CA certificate. For example, if the Old And Busted CA’s certificate will be valid for the next 4 years, 3 months, and 10 days, then you can set the publication interval for the CA’s CRL to 5 years and immediately publish it. The newly published CRL will remain valid for the next five years, and as long as you leave that CRL published in the defined CRL distribution points -- Active Directory and/or HTTP -- clients will continue to use it for checking revocation status. You no longer need the actual CA itself so you can uninstall it.
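    Because the CRL publication interval is just CA registry configuration, this is easy to script. A sketch of the commands (run on the old CA, then restart Certificate Services so the new interval takes effect before publishing) might look like this:

    certutil -setreg CA\CRLPeriodUnits 5
    certutil -setreg CA\CRLPeriod "Years"
    net stop certsvc
    net start certsvc
    certutil -crl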

    One drawback to this, however, is that you won’t be able to easily add any certificates to the revocation list. If you need to revoke a certificate after you’ve decommissioned the CA, then you’ll need to use the command line utility certutil.exe.

    Certutil.exe -resign “Old And Busted CA.crl” +<serialNumber>

    Of course, this requires that you keep the private keys associated with the CA, so you’d better back up the CA’s keys before you uninstall the role.

    Conclusion

    Wow…we’ve covered a lot of information here, so I’ll try to boil all of it down to the most important points. First, yes, you can have multiple root CAs and even multiple PKIs in a single Active Directory forest. Because of the way the objects representing those CAs are named and stored, you won’t experience a conflict unless you try to give more than one CA the same CA name.

    Second, once the new PKI is built you’ll want to configure your old CA so that it no longer issues certificates. That job will now belong to the issuing CA in your new PKI.

    Third, the ease with which you can replace all the certificates issued by the old CA with certificates issued by your new CA will depend mainly on how the certificates were first deployed. If all of your old certificates were requested manually then you will need to replace them in the same way. The easiest way to do that is to replace them gradually as they expire. On the other hand, if your old certificates were deployed via autoenrollment then you can trigger all of your autoenrollment clients to replace the old certificates with new ones from the new PKI. You can do this through the Certificate Templates MMC snap-in.

    And finally, what do you do with the old CA? Well, if you don’t need the equipment you can just keep it around until it either expires or all the old certificates have been replaced. If, however, you want to get rid of it immediately you can extend the lifetime of the old CA’s CRL to match the remaining validity period of the CA certificate. Just publish a new CRL and it’ll be good until all outstanding certificates have expired. Just keep in mind that this route will limit your ability to revoke those old certificates.

    If you think I missed something, or you want me to clarify a certain point, please feel free to post in the comments below.

    Jonathan “Man in Black” Stephens

    PS: Don’t ever challenge my Office clip art skills again, Jonathan.

    image

    - Ned

  • Forcing Afterhours User Logoffs

    Mike here, and today I want to answer a common customer request -- how to force users to log off at the end of the day. The scenario requires a bit of an explanation, so let’s get started.

    Let’s recognize the value of forcing users to log off at the end of their work day, rather than simply allowing them to lock their computers. Locking the computer leaves many processes running, and running processes keep files open. Open files can cause problems when synchronizing user data with Offline Files and home folders, and when distributing user content to other replica targets. Also, roaming user profiles are updated only at logoff (with the exception of the Windows 7 background upload of ntuser.dat, which must be turned on through policy). Allowing users to remain logged on after hours provides little benefit (aside from people like Ned, who does not sleep for fear that clowns may eat him).

    clip_image002
    Everybody floats down here…

    We force an after-hours logoff using two Group Policy Preference Scheduled Task items. We’ll configure the items from a Windows Server 2008 R2 computer. Our targeted client computers are Windows 7 and Windows Vista. The typical business work day begins around 8 am and ends between 5 and 6 pm. For this scenario, we’ll presume our workday ends at 5 pm. Our first scheduled task notifies the user that their session will be logged off in 15 minutes. The second scheduled task actually logs the user off.

    Notify the user

    We use the first scheduled task to notify the user that they will be logged off in 15 minutes. This gives the user a reasonable amount of time to save their work. Ideally, users will save their work and log off or shut down the computer within this allotted time (once they understand their computer will log them off regardless). Our Group Policy Preference items target users, so we’ll open GPMC and create a new Scheduled Task (Windows Vista or later) preference item.

    clip_image003

    We use the Update action for the Preference item and name the item DisplayLogoffMessage. The Update action creates the new scheduled task if it does not exist, or updates an existing task with the current configuration. Under the Security options, select %LogonDomain%\%LogonUser% and select Run only when user is logged on.

    clip_image004

    Next, we need to configure when the event triggers. For this scenario, we want the event to trigger daily, at 5 pm. Also, ensure the status for the task is set to Enabled. Next, we’ll configure the action that occurs when the event triggers.

    clip_image005

    Select Display a message for the action. Type Afterhours Logoff in the Title box. In the Message box, type Windows will logoff your session in 15 minutes. Please save your work. Click OK.

    Force the logoff

    We’ve notified the user. Now we need to actually force the logoff. We’ll use a new Scheduled Task (Windows Vista or later) preference item.

    clip_image006

    We’ll configure the General tab similarly to the previous preference item. We’ll use Update for the Action. The Name and Description can vary; however, understand that the name is the criterion used to determine whether the scheduled task already exists on the applying computer. The only change we’ll make in the Triggers configuration is the time: we’ll configure this preference item to start at 5:15 pm.

    clip_image008

    The Action for our new preference item is going to be Start a program. The program we’ll use is LOGOFF.EXE, which is included with Windows and resides in the System folder. We represent the System folder by using a Group Policy Preference variable. In the Program/script: box, type %SystemDir%\logoff.exe. The LOGOFF.EXE program does not require any arguments.
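    If you want to prototype this outside of Group Policy first, an equivalent local task (using the same name we give the preference item below) can be created with schtasks for testing; a sketch:

    schtasks /Create /TN "Force_afterhours_logoff" /TR "%windir%\System32\logoff.exe" /SC DAILY /ST 17:15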

    We should now have two Scheduled Task Preference items. The DisplayLogoffMessage item should be ordered first and the Force_afterhours_logoff item second. The only remaining configuration is to link the Group Policy object hosting these preference items to a point in Active Directory so it applies to user objects.

    clip_image010

    On to the client

    Users on Windows 7 computers will process the above settings without any additional configuration. However, Windows Vista computers, including those running Service Pack 1, need the latest Group Policy Preference Client Side Extension (http://support.microsoft.com/kb/974266).

    At 5 pm, the scheduled task triggers Windows to display a message to the user.

    clip_image012

    Fifteen minutes after the message, Windows will then end all the running applications and log off the user.

    clip_image014

    That is actually the hardest part of the scenario. However, there is one additional configuration we must perform on the user account to complete the solution.

    We need to configure Logon Hours for the user. The Logon Hours should be configured to prevent the user from logging on to the computer after we’ve forcefully logged them off. In this scenario, we forcefully log off the user at 5:15 pm; however, we’ve configured their user account so that their logon hours prevent them from logging on after 5 pm. Windows prevents the logon and displays a message to the user explaining that they are not allowed to log on at this time.

    clip_image016
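    You can set logon hours through Active Directory Users and Computers, or from the command line for a single account. A sketch (the account name is just an example) that allows logons only between 8 AM and 5 PM on weekdays might be:

    net user jackburton /times:M-F,8am-5pm /domain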

    Conclusion

    This scenario explains how to administratively force a user session logoff in your environment. If users are members of the local Administrators group, then all bets are off. The only way to prevent an administrator from doing something is not to make them an administrator.

    Alternatively, you can slightly modify this scenario to force a computer shutdown rather than a user logoff. Windows includes SHUTDOWN.EXE, which accepts a variety of command-line arguments. This may be the optimal form of power management because a powered-down computer uses the least amount of energy. Also, forcing shutdowns compels users to save their work before leaving, which helps ensure centralized backups have the most current and accurate user data.
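    In that variation you would keep the same two preference items, but point the second task’s Action at SHUTDOWN.EXE instead of LOGOFF.EXE. As a sketch, the action might look like this (/s shuts down, /f forces running applications to close, /t 0 means no additional delay):

    Program/script: %SystemDir%\shutdown.exe
    Arguments: /s /f /t 0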

    Mike “Nice Marmot” Stephens

  • The Case of the Enormous CA Database

    Hello, faithful readers! Jonathan here again. Today I want to talk a little about Certification Authority monitoring and maintenance. This topic was brought to my attention by a recent case that I had where a customer’s CA database had grown to rather elephantine proportions over the course of many months quite unbeknownst to the administrators. In fact, the problem didn’t come to anyone’s attention until the CA database had consumed nearly all of the 55 GB partition on which it resided. How many of you may be in this same situation and be completely unaware of it? Hmmm? Well, in this post, I’ll first go over the details of the issue and the steps we took to resolve the immediate crisis. In the second part, I’ll cover some processes and tools you can put in place to both maintain your CA database and also alert you to possible problems that may increase its size.

    The Issue

    Once upon a time, Roger contacted Microsoft Support and reported that he had a problem. His Windows Server 2003 Enterprise CA database, which had been given its own partition, had grown to over 50 GB in size, and was still growing. The partition itself was only 55 GB in size, so Roger asked if there was any way to compact the CA database before the CA failed due to lack of disk space.

    Actually, compacting the CA database is a simple process, and while this isn’t a terribly common request we’re pretty familiar with the steps. What made this case so unusual was the sheer size of the database file. Previously, the largest CA database I’d ever seen was only about 21 GB, and this one was over twice that size! But no matter. The principles are the same regardless, and so we went to it.

    Compacting the CA Database

    Compacting a CA database is essentially a two-step process. The first step is to delete any unnecessary rows from the CA database. This will leave behind what we call white space in the database file that can be reused by the CA for any new records that it adds. If we just removed the unneeded records the size of the database file would not be reduced, but we could be confident that the database file would grow no larger in size.

    If the database file were smaller, this might be an acceptable solution. In this case, the size of the database file relative to the size of the partition on which it resided mandated that we also compact the database file itself.

    If you are familiar with compacting the Active Directory database on a domain controller, then you will realize that this process is identical. A new database file is created and all the active records are copied from the old database file to the new database file, thus removing any of the white space. When finished, the old database file is deleted and the new file is renamed in place with the name of the old file. While actually performing the compaction, Certificate Services must be disabled.

    At the end of this process, we should have a significantly smaller database file, and with appropriate monitoring and maintenance in the future we can ensure that it never reaches such difficult to manage proportions again.

    What to Delete?

    What rows can we safely delete from the CA database? First, you need to have a basic understanding of what exactly is stored in the CA database. When a new certificate request is submitted to the CA, a new row is created in the database. As the request is processed by the CA, the various fields in that row are updated, and the status of the row at any given moment tells you where in the process the request is. What are the possible states for each row?

    • Pending - A pending request is basically on hold until an Administrator manually approves the request. When approved, the request is re-submitted to the CA to be processed. On a Standalone CA, all certificate requests are pended by default. On an Enterprise CA, certificate requests are pended if the option to require CA Manager approval is selected in the certificate template.
    • Failed - A failed request is one that has been denied by the CA because the request isn’t suitable per the CA’s policy, or there was an error encountered while generating the certificate. One example of such an error is if the certificate template is configured to require key archival, but no Key Recovery Agents are configured on the CA. Such a request will fail.
    • Issued - The request has been processed successfully and the certificate has been issued.
    • Revoked - The certificate request has been processed and the certificate issued, but the administrator has revoked the certificate.

    In addition, issued and revoked certificates can either be time valid or expired.

    These states, and whether or not a certificate is expired, need to be taken into account when considering which rows to delete. For example, you do not want to delete the row for a time valid, issued certificate, and in fact, you won’t be able to. You won’t be able to delete the row for a time valid, revoked certificate either because this information is necessary in order for the CA to periodically build its certificate revocation list (CRL).

    Once a certificate has expired, however, then Certificate Services will allow you to delete its row. Expired certificates are no longer valid on their face, so there is no need to retain any revocation status. On the other hand, if you’ve enabled key archival then you may have private keys stored in the database row as well, and if you delete the row you’d never be able to recover those private keys.

    That leaves failed and pending requests. These rows are just requests; there are no issued certificates associated with them. In addition, while technically a failed request can be resubmitted to the CA by the Administrator, unless the cause of the original failure is addressed there is little purpose in doing so. In practice, you can safely delete failed requests. Any pending requests should probably be examined by an Administrator before you delete them. A pending request means that someone out there has an outstanding certificate request for which they are patiently waiting on an answer. The Administrator should go through and either issue or deny any pending requests to clear that queue, rather than just deleting the records.

    In this customer’s case, we decided to delete all the failed requests. But first, we had to determine exactly why the database had grown to such huge proportions.

    Fix the Root Problems, First

    Before you start deleting the failed requests from the database, you should ensure that you have addressed any configuration issues that led to these failures to begin with. Remember, Roger reported that the database was continuing to grow in size. It would make little sense to start deleting failed requests -- a process that requires that the CA be up and running -- if there are new requests being submitted to the CA and subsequently failing. The rows you delete could just be replaced by more failed rows and you’ll have gained nothing.

    In this particular case, we found that there were indeed many request failures still being reported by the CA. These had to be addressed before we could actually do anything about the size of the CA database. When we checked the application log, we saw that Certificate Services was recording event ID 53 warnings and event ID 22 errors for multiple users. Let’s look at these events.

    Event ID 53

    Event ID 53 is a warning event indicating that the submitted request was denied, and containing information about why it was denied. This is a generic event whose detailed message takes the form of:

    Certificate Services denied request %1 because %2. The request was for %3. Additional information: %4

    Where:

    %1: Request ID
    %2: Reason request was denied
    %3: Account from which the request was submitted
    %4: Additional information

    In this particular case, the actual event looked like this:

    Event Type:   Warning

    Event Source: CertSvc

    Event Category:      None

    Event ID:     53

    Date:         <date>

    Time:         <time>

    User:         N/A

    Computer:     <CA server>

    Description:

    Certificate Services denied request 22632 because The EMail name is unavailable and cannot be added to the Subject or Subject Alternate name. 0x80094812 (-2146875374).  The request was for CORP02\jackburton.  Additional information: Denied by Policy Module

    This event means that the certificate template is configured to include the user’s email address in the Subject field, the Subject Alternative Name extension, or both, and that this particular user does not have an email address configured. When we looked at the users for which this event was being recorded, they were all either service accounts or test users. These are accounts for which there would probably be no email address configured under normal circumstances. Contributing to the problem was the fact that user autoenrollment had been enabled at the domain level by policy, and the Domain Users group had permissions to autoenroll for this particular template.

    In general, one probably shouldn’t configure autoenrollment for service accounts or test accounts without specific reasons. In this case, simple User certificates intended for “real” users certainly don’t apply to these types of accounts. The suggestion in this case would be to create a separate OU wherein user autoenrollment is disabled by policy, and then place all service and test accounts in that OU. Another option is to create a group for all service and test accounts, and then deny that group Autoenroll permissions on the template. Either way, these particular users won’t attempt to autoenroll for the certificates intended for your users which will eliminate these events.

    For information on troubleshooting other possible causes of these warning events, check out this link.

    Event ID 22

    Event ID 22 is an error event indicating that the CA was unable to process the request due to an internal failure. Fortunately, this event also tells you what the failure was. This is a generic event whose detailed message takes the form of:

    Certificate Services could not process request %1 due to an error: %2. The request was for %3. Additional information: %4

    Where:

    %1: Request ID
    %2: The internal error
    %3: Account from which the request was submitted
    %4: Additional information

    In this particular case, the actual event looked like this:

    Event Type:   Error

    Event Source: CertSvc

    Event Category:      None

    Event ID:     22

    Date:         <date>

    Time:         <time>

    User:         N/A

    Computer:     <CA server>

    Description:

    Certificate Services could not process request 22631 due to an error: Cannot archive private key.  The certification authority is not configured for key archival. 0x8009400a (-2146877430).  The request was for CORP02\david.lo.pan.  Additional information: Error Archiving Private Key

    This event means that the certificate template is configured for key archival but the CA is not. A CA will not accept the user’s encrypted private key in the request if there is no valid Key Recovery Agent (KRA) configured. The fix for this is pretty simple for our current purposes: disable key archival in the template. If you actually need to archive keys for this particular template then you should set that up before you start removing failed requests from your database. Here are some links to more information on that topic:

    Key Archival and Recovery in Windows Server 2003
    Key Archival and Recovery in Windows Server 2008 and Windows Server 2008 R2

    Template, Template, Where’s the Template?

    What’s the fastest way to determine which template is actually associated with each of these events? You can find that by looking at the failed request entry in the Certification Authority MMC snap-in (certsrv.msc). If you have more than a couple hundred failed requests, however, finding the one you actually want can be difficult. This is where filtering the view comes in handy.

    1. In the Certification Authority MMC snap-in, right-click on Failed Requests, select View, then select Filter….

    clip_image001

    2. In the Filter dialog box, click Add….

    clip_image002

    3. In the New Restriction dialog box, set the Request ID to the value that you see in the event, and click Ok.

    clip_image003

    4. In the Filter dialog box, click Ok.

    clip_image004

    5. Now you should see just the failed request designated in the event. Right-click on it, select All Tasks, and then select View Attributes/Extensions….

    clip_image005

    6. In the properties for this request, click on the Extensions tab. In the list of extensions, locate Certificate Template Information. The template name will be shown in the extension details.

    clip_image006

    This is the name of the template whose settings you should review and correct, if necessary.

    Once the root problems causing the failed requests have been resolved, monitor the Application event log to ensure that Certificate Services is not logging any more failed requests. Some failed requests in a large environment are expected. That’s just the CA doing its job. What you’re trying to eliminate are the large bulk of the failures caused by certificate template and CA misconfiguration. Once this is complete, you’re ready to start deleting rows from the database.

    Deleting the Failed Requests

    The next step in this process is to actually delete the rows using our trusty command line utility certutil.exe. The -deleterow verb, introduced in Windows Server 2003, can be used to delete rows from the CA database. You just provide it with the type of records you want deleted and a past date (if you use a date equal to the current date or later, the command will fail). Certutil.exe will then delete the rows of that type where the date the request was submitted to the CA (or the date of expiration, for issued certificates) is earlier than the date you provide. The supported types of records are:

    Name      Description                          Type of date
    Request   Failed and pending requests          Submission date
    Cert      Expired and revoked certificates     Expiration date
    Ext       Extension table                      N/A
    Attrib    Attribute table                      N/A
    CRL       CRL table                            Expiration date

    For example, if you want to delete all failed and pending requests submitted before January 22, 2001, the command is:

    C:\>Certutil -deleterow 1/22/2001 Request

    The only problem with this approach is that certutil.exe will only delete about 2,000 - 3,000 records at a time before failing due to exhaustion of the version store. Luckily, we can wrap this command in a simple batch file that runs the command over and over until all the designated records have been removed.

    @echo off
    :Top
    Certutil -deleterow 8/31/2010 Request
    If %ERRORLEVEL% EQU -939523027 goto Top

    This batch file runs certutil.exe with the -deleterow verb. If the command fails with the specific error code indicating that the version store has been exhausted, the batch file simply loops and the command is executed again. Eventually, the certutil.exe command will exit with an ERRORLEVEL value of 0, indicating success. The script will then exit.

    Every time the command executes, it will display how many records were deleted. You may therefore want to pipe the output of the command to a text file from which you can total up these values and determine how many records in total were deleted.

    In Roger’s case, the total number of deleted records came to about 7.8 million rows. Yes…that is 7.8 million failed requests. The script above ran for the better part of a week, but the CA was up and running the entire time so there was no outage. Indeed, the CA must be up and running for the certutil.exe command to work as certutil.exe communicates with the ICertAdmin COM interface of Certificate Services.

    That is not to say that one should not take precautions ahead of time. We increased the base CRL publication interval to seven days and published a new base CRL immediately before starting to delete the rows. We also disabled delta CRLs temporarily while the script was running. We did this so that even if something unexpected happened, clients would still be able to check the revocation status of certificates issued by the CA for an extended period, giving us the luxury of time to take any necessary remediation steps. As expected, however, none were required.

    And Finally, Compaction

    The final step in this process is compacting the CA database file to remove all the white space resulting from deleting the failed requests from the database. This process is identical to defragmenting and compacting Active Directory’s ntds.dit file, as Certificate Services uses the same underlying database technology as Active Directory -- the Extensible Storage Engine (ESE).

    Just as with AD, you must have free space on the partition equal to or greater than the database file size. As you’ll recall, we certainly didn’t have that in this case what with a database of 50 GB on a 55 GB partition. What do you do in this case? Move the database and log files to a partition with enough free space, of course.

    Fortunately, Roger’s backing store was on a Storage Area Network (SAN), so it was trivial to slice off a new 150 GB partition and move the database and log files to the new, larger partition. We didn’t even have to modify the CA configuration as Roger’s storage admins were able to just swap drive letters since the only thing on the original partition was the CertLog folder containing the CA database and log files. Good planning, that.

    With enough free space now available, all is ready to compact the database. Well…almost. You should first take the precaution of backing up the CA database prior to starting just in case something goes wrong. The added benefit to backing up the CA database is that you’ll truncate the database log files. In Roger’s case, after deleting 7.8 million records there were several hundred megabytes of log files. To back up just the CA database, run the following command:

    C:\>Certutil -backupDB backupDirectory

    The backup directory will be created for you if it does not already exist, but if it does exist, it must be empty. Once you have the backup, copy it somewhere safe. And now we’re finally ready to proceed.

    To compact the CA database, stop and then disable Certificate Services. The CA cannot be online during this process. Next, run the following command:

    C:\>Esentutl /d Path\CaDatabase.edb

    Esentutl.exe will take care of the rest. In the background, esentutl.exe will create a temporary database file and copy all the active records from the current database file to the new one. When the process is complete, the original database file will be deleted and the temporary file renamed to match the original. The only difference is that the database file should be much smaller.

    How much smaller? Try 2.8 GB. That’s right. By deleting 7.8 million records and compacting the database, we recovered over 47 GB of disk space. Your own mileage may vary, though, as it depends on the number of failed requests in your own database. To finish, we just copied the now much smaller database and log files to the original drive and then re-enabled and restarted Certificate Services.

    While very time consuming, simply due to the sheer number of failed requests in the database, overall the operation went off without a hitch. And everyone lived happily ever after.

    Preventative Maintenance and Monitoring

    Now that the CA database is back down to its fighting weight, how do you make sure you keep it that way? There are actually several things you can do, including regular maintenance and, if you have the capability, closer monitoring of the CA itself.

    Maintenance

    You’ll remember that it was not necessary to take the CA offline while deleting the failed requests. We did take precautions by modifying the CRL publication interval but fortunately that turned out to be unnecessary. Since no outage is required to remove failed requests from the CA database, it should be pretty simple to get approval to add it to your regular maintenance cycle. (You do have one, right?) Every quarter or so, run the script to delete the failed requests. You can do it more or less often as is appropriate for your own environment.

    You don’t have to compact the CA database each time. Remember, the white space will simply be reused by the CA for processing new requests. Over time, you may find that you reach a sort of equilibrium, especially if you also have the freedom to delete expired certificates as well (i.e., no Key Archival), where the CA database just doesn’t get any bigger. Rows are deleted and new rows are created in roughly equal numbers, and the space within the database file is reused over and over -- a state of happy homeostasis.

    If you want, you can even use scheduled tasks to automatically perform this maintenance every three months. The batch file above can be rewritten in VBScript or even PowerShell. Simply add some code to email yourself a report when the deletion process is finished; there are plenty of code samples available on the web for sending email using both VBScript and PowerShell. Bing it!
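    As a very rough starting point, a PowerShell version of that maintenance job might look like the sketch below. The cutoff date, SMTP server, and addresses are placeholders you would replace, and the error code is simply the version-store value from the batch file above (confirm it in your own environment):

    # Sketch: delete old failed/pending requests, then mail the output as a report.
    $cutoff = '8/31/2010'              # replace with an appropriate past date
    $report = @()
    do {
        $report += & certutil.exe -deleterow $cutoff Request 2>&1
    } while ($LASTEXITCODE -eq -939523027)   # version store exhausted; loop and keep going

    Send-MailMessage -SmtpServer 'smtp.corp02.example' `
        -From 'ca-maintenance@corp02.example' -To 'pki-admins@corp02.example' `
        -Subject 'CA failed request cleanup report' -Body ($report | Out-String)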

    Monitoring

    In addition to this maintenance, you can also use almost any monitoring or management software to watch for certain key events on the CA. Those key events? I already covered two of them above -- event IDs 53 and 22. For a complete list of events recorded by Certificate Services, look here.

    If you have Microsoft Operations Manager (MOM) 2005 or System Center Operations Manager (SCOM) 2007 deployed, and you have Windows Server 2008 or Windows Server 2008 R2 CAs, then you can download the appropriate management pack to assist you with your monitoring.

    MOM 2005: Windows Server 2008 Active Directory Certificate Services Management Pack for Microsoft OpsMgr 2005
    SCOM 2007 SP1: Active Directory Certificate Services Monitoring Management Pack

    The management packs encompass event monitoring and prescriptive guidance and troubleshooting steps to make managing your PKI much simpler. These management packs are only supported for CAs running on Windows Server 2008 or higher, so this is yet one more reason to upgrade those CAs.

    Conclusion

    Like any other infrastructure service in your enterprise environment, the Windows CA does require some maintenance and monitoring to maintain its viability over time. If you don’t pay attention to it, you may find yourself in a situation similar to Roger’s, not noticing the problem until it is almost too late to do anything to prevent an outage. With proper monitoring, you can become aware of any serious problems almost as soon as they begin, and with regular maintenance you prevent such problems from ever occurring. I hope you find the information in this post useful.

    Jonathan “Pork Chop Express” Stephens

  • Friday Mail Sack: Mostly Edge Case Edition

    Hello all, Ned here again with this week’s conversations between AskDS and the rest of the world.  Today we talk Security, ADWS, FSMO upgrades, USMT, and why “Web 2.0 Internet” is still a poisonous wasteland of gross.

    Let’s do it to it.

    Question

    I am getting questions from my Security/Compliance/Audit/Management folks about what security settings we should be applying on XP/2003/2008/Vista/7. Are there Microsoft recommendations? Are there templates? Are there explanations of risk versus reward? Could some settings break things if I’m not careful? Can I get documentation in whitepaper and spreadsheet form? Do you also have these for Office 2007 and Internet Explorer? Can I compare to my current settings to find differences?

    [This is another of those “10 times a week” questions, like domain upgrade – Ned]

    Answer

    Yes, yes, yes, yes, yes, yes, and yes. Download the Microsoft Security Compliance Manager. This tool has all the previously scattered Microsoft security documentation in one centralized location, and it handles all of those questions. Microsoft provides comparison baselines for “Enterprise Configuration” (less secure, more functional) and “Specialized Security-Limited Functionality” (more secure, less usable) modes, within each Operating System. Those are further distinguished by role and hardware – desktops, laptops, domain controllers, member servers, users, and the domain itself.

    image

    If you drill down into the tabs of a given setting, you see more details, explanations, and reasoning on why you might want to choose something or not.

    image  image  image

    It also has further docs and allows you to completely export the settings as GPO, DCM, SCAP, INF, or Excel.

      image  image

    It’s slick stuff. I think we got this right and the Internet’s “shotgun documentation” gets this wrong.

    Question

    Is it ok to have FSMO roles running on a mixture of operating systems? For example, a PDC Emulator on Windows Server 2003 and a Schema Master on Windows Server 2008?

    Answer

    Yes, it’s generally OK. The main issue people typically run into is that certain components use the PDCE to create special groups, and if the PDC is not at that component’s OS level, the groups will not be created.

    For example, these groups will not get created until the PDCE role moves to a Win2008 or later DC:

    • SID: S-1-5-21-<domain>-498
      Name: Enterprise Read-only Domain Controllers
      Description: A Universal group. Members of this group are Read-Only Domain Controllers in the enterprise.
    • SID: S-1-5-21-<domain>-521
      Name: Read-only Domain Controllers
      Description: A Global group. Members of this group are Read-Only Domain Controllers in the domain.
    • SID: S-1-5-32-569
      Name: BUILTIN\Cryptographic Operators
      Description: A Builtin Local group. Members are authorized to perform cryptographic operations.
    • SID: S-1-5-21-<domain>-571
      Name: Allowed RODC Password Replication Group
      Description: A Domain Local group. Members in this group can have their passwords replicated to all read-only domain controllers in the domain.
    • SID: S-1-5-21-<domain>-572
      Name: Denied RODC Password Replication Group
      Description: A Domain Local group. Members in this group cannot have their passwords replicated to any read-only domain controllers in the domain.
    • SID: S-1-5-32-573
      Name: BUILTIN\Event Log Readers
      Description: A Builtin Local group. Members of this group can read event logs from the local machine.
    • SID: S-1-5-32-574
      Name: BUILTIN\Certificate Service DCOM Access
      Description: A Builtin Local group. Members of this group are allowed to connect to Certification Authorities in the enterprise.

    And those groups not existing will prevent various Win2008/Vista/R2/7 components from being configured. From the most boring KB I ever had to re-write:

    243330  Well-known security identifiers in Windows operating systems - http://support.microsoft.com/default.aspx?scid=kb;EN-US;243330

    I hesitate to ask why you wouldn’t want to move these FSMO roles to a newer OS though.
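
    If you do decide to consolidate or move them, a quick way to see who currently holds what is netdom, and the 2008 R2 AD module can transfer a role for you. A minimal sketch – “DC08R2” is just a placeholder name for the target DC:

    netdom query fsmo

    # Windows Server 2008 R2 AD module; transfers the PDC Emulator role to the named DC
    Import-Module ActiveDirectory
    Move-ADDirectoryServerOperationMasterRole -Identity "DC08R2" -OperationMasterRole PDCEmulator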

    Question

    Every time I boot my domain controller it logs this warning:

    Log Name:      Active Directory Web Services
    Source:        ADWS
    Date:          6/26/2010 10:20:22 PM
    Event ID:      1400
    Task Category: ADWS Certificate Events
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      mydc.contoso.com
    Description:
    Active Directory Web Services could not find a server certificate with the specified certificate name. A certificate is required to use SSL/TLS connections. To use SSL/TLS connections, verify that a valid server authentication certificate from a trusted Certificate Authority (CA) is installed on the machine. 
    Certificate name: mydc.contoso.com

    It otherwise works fine and I can use ADWS just fine. Do I care about this?

    Answer

    Only if you:

    1. Think you have a valid Server Authentication certificate installed.
    2. Want to use SSL to connect to ADWS.

    By default, Windows Server 2008 R2 DCs will log this warning until they are issued a valid server certificate (which you get for free once you deploy a Microsoft Enterprise PKI, since DCs receive a Domain Controller certificate through auto-enrollment). Once that happens, the DC logs a 1401 event and you never see this warning again.

    If you think you have the right certificate (and in this case, the customer thought he did – it had the Server Authentication EKU (1.3.6.1.5.5.7.3.1), the right SAN, and chained fine), compare it to a valid DC certificate issued by a Microsoft CA. You can do all this in a test lab even if you’re not using our PKI, by creating a default PKI “next next next” style and examining an exported DC certificate. When we compared the exported certificates, we found that his third-party cert was missing a Subject entry, unlike mine. We theorized that this was the problem – a Subject is not required for a cert to be valid, but any application can decide it’s important, and ADWS likely does.
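
    If you want a quick way to eyeball what is actually in the DC’s machine store while you compare, something like this works (just a rough sketch; certutil is built in and the Cert: drive ships with PowerShell):

    certutil -store my

    Get-ChildItem Cert:\LocalMachine\My | Format-List Subject, Issuer, NotAfter

    The certutil dump shows the Subject, SAN, and EKU for every certificate in the computer’s Personal store; the PowerShell line is just a faster way to scan subjects and expiration dates.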

    Question

    Seeing this error when doing a USMT 4.0 migration:

    [0x080000] HARDLINK: cannot find distributed store for d - cee6e189-2fd2-4210-b89a-810397ab3b7f[gle=0x00000002]
    [0x0802e3] SelectTransport: OpenDevice failed with Exception: Win32Exception: HARDLINK: cannot find all distributed stores.: There are no more files. [0x00000012] void __cdecl Mig::CMediaManager::SelectTransportInternal(int,unsigned int,struct Mig::IDeviceInitializationData *,int,int,int,unsigned __int64,class Mig::CDeviceProgressAdapter *)

    We have a C: and D: drive and when we run the migration we use these steps:

    1. Scanstate with hard-link for both drives.
    2. Delete the D: drive partition and extend out C: to use up that space.
    3. Run the loadstate.

    If we don’t delete the D: partition it works fine. I thought all the data was going into the hard-link store on “C:\store”?

    Answer

    Look closer. :) When you create a hard-link store and specify the store path, each volume gets its own hard-link store. Hard-links cannot cross volumes.

    For example:

    Scanstate /hardlink c:\USMTMIG […]

    Running this command on a system that contains the operating system on the C: drive and the user data on the D: drive will generate migration stores in the following locations:

    C:\USMTMIG\
    D:\USMTMIG\

    The store on C: is called the “main store” and the one on the other drive is called the “distributed store”. If you want to know more about the physicality and limits of the hard-link stores, review: http://technet.microsoft.com/en-us/library/dd560753(WS.10).aspx.

    Now, all is not lost – here are some options to get around this:

    1. You could not delete the partition (duh).

    2. You could move all the data from the other partition to your C: drive and get rid of that partition before running scanstate.

    3. You could run the scanstate as before, then xcopy the D: drive store into the C: drive store, thereby preserving the data. For example:

    a. Scanstate with hard-link.

    b. Run:

      xcopy /s /e /h /k d:\store\* c:\store
      rd /s /q d:\store <-- this step is optional. After all, you are deleting the partition later!

    c. Delete the D: partition and extend C: like you were doing before.

    d. Run loadstate.

    There may be other issues here (after all, some application may have been pointing to files on D: and is now very angry) so make sure your plan takes that into consideration. You may need to pay a visit to <locationModify>.
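
    For reference, <locationModify> lives in a custom migration XML file that you pass to scanstate/loadstate with /i. Here is a minimal, hypothetical sketch that re-homes data that used to live on D: into a folder on C: – the paths and component name are made up for illustration, so test something like this in your lab before relying on it:

    <migration urlid="http://www.microsoft.com/migration/1.0/migxmlext/movedrive">
      <component type="Documents" context="System">
        <displayName>Move old D: data to C:</displayName>
        <role role="Data">
          <rules>
            <include>
              <objectSet>
                <pattern type="File">D:\Data\* [*]</pattern>
              </objectSet>
            </include>
            <locationModify script="MigXmlHelper.RelativeMove('D:\Data','C:\Data')">
              <objectSet>
                <pattern type="File">D:\Data\* [*]</pattern>
              </objectSet>
            </locationModify>
          </rules>
        </role>
      </component>
    </migration>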

    ===

    Finally

    The Black Hat Vegas USA 2010 folks have published their briefings and this one by Ivan Ristic from Qualys really struck me:

    State of SSL on the Internet: 2010 Survey, Results and Conclusions
    https://media.blackhat.com/bh-us-10/presentations/Ristic/BlackHat-USA-2010-Ristic-Qualys-SSL-Survey-HTTP-Rating-Guide-slides.pdf

    Some mind-blowingly disappointing (and interesting) nuggets from their survey of 867,361 certificates being used by websites:

    • Only 37% of domains responded when SSL was attempted (the rest were all totally unencrypted)
    • 30% of SSL certificates failed validation (not trusted, not chained, invalid signature)
    • 50% of the certs supported insecure SSL v2 protocol
    • 56% of servers supported weak (less than 128-bit) ciphers

    Definitely read the whole presentation, it’s worth your time. Any questions, ask Jonathan Stephens.

    image
    Wooo, fancy hat. Looking sharp, Jonathan!

     

    That’s all folks, have a nice weekend.

    - Ned “I’m gonna pay for that one” Pyle

  • Using AD Recycle Bin to restore deleted DNS zones and their contents in Windows Server 2008 R2

    Ned here again. Beginning in Windows Server 2008 R2, Active Directory supports an optional AD Recycle Bin that can be enabled forest-wide. This means that instead of requiring a System State backup and an authoritative subtree restore, a deleted DNS zone can now be recovered on the fly. However, due to how the DNS service "gracefully" deletes, recovering a DNS zone requires more steps than a normal AD recycle bin operation.

    Before you roll with this article, make sure you have gone through my article here on AD Recycle Bin:

    The AD Recycle Bin: Understanding, Implementing, Best Practices, and Troubleshooting

    Note: All PowerShell lines are wrapped; they are single lines of text in reality.

    Restoring a deleted AD integrated zone

    Below are the steps to recover a deleted zone and all of its records. In this example the deleted zone was called "ohnoes.contoso.com" and it existed in the Forest DNS Application partition of the forest “graphicdesigninstitute.com”. In your scenario you will need to identify the zone name and partition that hosted it before continuing, as you will be feeding those to PowerShell. 

    1. Start PowerShell as an AD admin with rights to all of DNS in that partition (preferably an Enterprise Admin) on a DC that hosted the zone and is authoritative for it.

    2. Load the AD modules with:

    Import-Module ActiveDirectory

    3. Validate that the deleted zone exists in the Deleted Objects container with the following sample PowerShell command:

    get-adobject -filter 'isdeleted -eq $true -and msds-lastKnownRdn -eq "..Deleted-ohnoes.contoso.com"' -includedeletedobjects -searchbase "DC=ForestDnsZones,DC=graphicdesigninstitute,DC=com" -property *

    Note: the zone name was changed by the DNS service to start with "..Deleted-", which is expected behavior. This means that when you use this command to validate the deleted zone, you will need to prepend the old zone name with this "..Deleted-" string. Also note that in this sample, the deleted zone is in the forest DNS zones partition of a completely different naming context, just to make it interesting.

    4. Restore the deleted zone with:

    get-adobject -filter 'isdeleted -eq $true -and msds-lastKnownRdn -eq "..Deleted-ohnoes.contoso.com"' -includedeletedobjects -searchbase "DC=ForestDnsZones,DC=graphicdesigninstitute,DC=com" | restore-adobject

    Note: the main changes in syntax now are removing the "-property *" argument and pipelining the output of get-adobject to restore-adobject.

    5. Restore all child "dnsNode" objects of the recovered zone with:

    get-adobject -filter 'isdeleted -eq $true -and lastKnownParent -eq "DC=..Deleted-ohnoes.contoso.com,CN=MicrosoftDNS,DC=ForestDnsZones,DC=graphicdesigninstitute,DC=com"' -includedeletedobjects -searchbase "DC=ForestDnsZones,DC=graphicdesigninstitute,DC=com" | restore-adobject

    Note: the "msds-lastKnownRdn" has now been removed and replaced by "lastKnownParent", which is now pointed to the recovered (but still mangled) version of the domain zone. All objects with that as a previous parent will be restored to their old location. Because DNS stores all of its node values as flattened leaf objects, the structure of deleted records will be perfectly recovered.

    6. Rename the recovered zone back to its old name with:

    rename-adobject "DC=..Deleted-ohnoes.contoso.com,CN=MicrosoftDNS,DC=ForestDnsZones,DC=graphicdesigninstitute,DC=com" -newname "ohnoes.contoso.com"

    Note: the rename operation here is just being told to remove the old "..Deleted-" string from the name of the zone. I’m using PowerShell to be consistent but you could just use ADSIEDIT.MSC at this point, we’re done with the fancy bits.

    7. Restart the DNS service or wait for it to figure out that the zone has been recovered (I usually had to restart the service in my repros, but once it worked by itself for some reason – maybe a timing issue; a service restart is likely your best bet). The zone will load without issues and contain all of its recovered records.
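
    If you’d rather do that restart from the same elevated PowerShell session, this one-liner works (the DNS Server service’s short name is DNS):

    Restart-Service -Name DNS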

    Special notes

    If the deleted zone was the delegated _msdcs zone (or both the primary zone and delegated _msdcs zone were deleted and you now need to get the _msdcs zone back):

    a. First restore the primary zone and all of its contents like above.

    b. Then restore the _msdcs zone like in step 4 (with no contents).

    c. Next, restore all the remaining deleted _msdcs records using the lastKnownParent DN which will now be the real un-mangled domain name of that zone. When done in this order, everything will come back together delegated and working correctly.

    d. Rename it like in step 6.

    Note: If you failed to do step c before renaming the zone because you wanted to recover only select records, the recovered zone will fail to load. The DNS snap-in will display the zone, but selecting it will report “the zone data is corrupt”. This error occurs because the “@” record is missing. If this record was not restored prior to the rename, simply rename the zone back to its “..Deleted-“ name, restore the “@” record, rename the zone once more, and restart the DNS Server service. I am intentionally not giving a PowerShell example here as I want you to try all this out in your lab; this will get you past the “copy and paste” phase of following the article. The key to the recycle bin is getting your feet wet before you have the disaster!

    A couple more points

    • If the zones were deleted outside of DNS (i.e. not using the DNS tools) then the renaming steps are unnecessary and you can just restore them normally. If that happened, someone was really being a goofball.
    • The AD Recycle Bin can only recover DNS zones that were AD-integrated; if the zones were Standard Primary and stored in the old flat file format, I cannot help you.
    • I have no idea why DNS has this mangling behavior and asking around the Networking team didn’t give me any clues. I suspect it is similar to the reasoning behind the “inProgress” zone renaming that occurs when a zone is converted from standard primary to AD Integrated, in order to somehow make the zone invalid prior to deletion, but… it’s being deleted, so who could care? Meh. If someone really desperately has to know, ping me in Comments and I’ll see about a code review at some point. Maybe.

    As always, you can also “just” run an authoritative subtree restore with your backups and ntdsutil.exe. If you think my steps looked painful, you should see those. KBs don’t get much longer.

    - Ned “let’s go back to WINS” Pyle

  • Fine-Grained Password Policy and “Urgent Replication”

    Hi folks, Ned here again. Today I discuss the so-called “urgent replication” of AD, specifically around Fine-Grained Password Policies.

    Some background

    If you’ve read the excellent guide on how AD Replication works, you have probably come across the section around so-called “urgent replication”:

    Certain important events trigger replication immediately, overriding existing change notification. Urgent replication is implemented immediately by using RPC/IP to notify replication partners that changes have occurred on a source domain controller. Urgent replication uses regular change notification between destination and source domain controller pairs that otherwise use change notification, but notification is sent immediately in response to urgent events instead of waiting the default period of 15 seconds.

    • Assigning an account lockout, which a domain controller performs to prohibit a user from logging on after a certain number of failed attempts.
    • Changing a Local Security Authority (LSA) secret, which is a secure form in which private data is stored by the LSA (for example, the password for a trust relationship).
    • Changing the password on a domain controller computer account.
    • Changing the relative identifier (known as a “RID”) master role owner, which is the single domain controller in a domain that assigns relative identifiers to all domain controllers in that domain.
    • Changing the account lockout policy.
    • Changing the domain password policy.

    So as long as the connection between the DCs had change notification enabled, changing one of these special data types “urgently” replicated that change to immediately connected partners. Ordinarily this just meant DCs in your own site, unless you had configured inter-site change notification on your site links. This is the part that confuses most folks: urgent replication isn’t so much for security as for consistency. By default, these “urgent” changes might take hours or days to transitively reach outlying DCs, but maybe you don’t care, because the end-user experience would be consistent within every AD site.
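
    For the curious, inter-site change notification is just bit 0x1 of the options attribute on the siteLink object. A rough sketch of flipping it on with the 2008 R2 AD module – the site link name and domain DN are placeholders, and you want to OR the bit in rather than clobber any existing value:

    Import-Module ActiveDirectory
    # Placeholder DN - adjust the site link name and forest root for your environment
    $dn = "CN=DEFAULTIPSITELINK,CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=contoso,DC=com"
    $link = Get-ADObject $dn -Properties options
    Set-ADObject $dn -Replace @{options = ($link.options -bor 1)}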

    Suspiciously absent from the documentation, though: Fine-Grained Password Policies. Does this mean that we didn’t update this old article, or that FGPPs don’t count for urgent replication? After all, FGPP has account and password policies out the wazoo; that’s the whole point of them.

    Figuring it out

    When I first thought about writing this article I figured I’d just look at the source code, get an answer, make a three-line blog post, and be on my way. Except that unlike me, you don’t have that source code privilege, so that’s not super helpful. Instead I’ll show you how to determine the behavior yourself; it may be helpful in other scenarios someday.

    Let’s do this.

    1. You will be making changes on the PDC Emulator DC. You will also need to pick out a DC that directly replicates inbound from the PDCE within the same AD site. Obviously, you’d better create an FGPP in the test domain you are using; it doesn’t need to be assigned to anyone. If you’re using Windows Server 2008 R2 you can load up PowerShell and quickly create a Password Settings Object with:

    import-module activedirectory

    New-ADFineGrainedPasswordPolicy -Name "DomainUsersPSO" -Precedence 500 -ComplexityEnabled $true -Description "Test Domain Users Password Policy" -DisplayName "Domain Users PSO" -LockoutDuration "0.12:00:00" -LockoutObservationWindow "0.00:15:00" -LockoutThreshold 10

    2. Turn on Active Directory Diagnostic Event Logging for replication events on that downstream partner DC.

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics
    “5 Replication Events” [REG_DWORD] = 5
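
    If you prefer to set that from a command prompt instead of regedit, something like this does it (run it on the downstream DC, and remember to set it back to 0 when you’re done):

    reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics" /v "5 Replication Events" /t REG_DWORD /d 5 /f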

    3. Pick some trivial object on the PDCE to modify (I change a user’s "Description" attribute). Use repadmin.exe /showmeta to see what its current USN is for that description attribute:

    image
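
    In case you haven’t run it before, the metadata dump looks roughly like this – the DC name and the user’s DN are placeholders, and newer repadmin versions call the switch /showobjmeta (/showmeta is the older name):

    repadmin /showobjmeta MYDC1 "CN=Test User,CN=Users,DC=contoso,DC=com"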

    4. Change the description. After 15 seconds the change replicates to the downstream partner DC:

    image

    5. If you look in the Directory Services event log on the downstream server, you can see the USN update recorded as a 1364 event, from the old USN to the new one. So in my example above, the old USN was 692689 and the new one is 692706. There is also a 1412 event; more on that later. The event timestamps also show my USN vector rising exactly 15 seconds after the originating change:

    image

    Note: I am using Hyper-V guests here so I have perfect time sync. You may not be this fortunate in your lab. :)

    6. Now you change one of the known “urgent replication” settings. For example, the account lockout threshold:

    image

    7. Neato. This time you don’t get a 1364 event. You still get a 1412 event that has the right USN (so did the Description change previously, not that it matters). But where is the 1364?

    image

    You don’t get that event because the normal change notification process was bypassed and you’re in the “urgent replication” code path. This is the key indicator that you are using urgent replication, as there is no instrumentation for it. If you choose any of the various “urgent replication” data types and try this, they will all behave the same way.

    8. So now we’re pretty confident that getting a 1364 event means normal replication and not getting one means urgent replication. So back to the original question – does FGPP follow urgent replication? Find out: change an FGPP PSO to alter its account lockout threshold setting (for example, with the cmdlet shown below).
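
    If you created the test PSO with the PowerShell line from step 1, a quick way to poke its lockout threshold is the matching Set cmdlet (sketch only; the name matches my earlier sample):

    Set-ADFineGrainedPasswordPolicy -Identity "DomainUsersPSO" -LockoutThreshold 20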

    image

    image

    As you can see, FGPP does not use urgent replication. It is treated just like any other change and showed up roughly 15 seconds later.

    But whhhhyyyyy?

    Back when Windows 2000 Active Directory was released, we were paralyzed with fear that replication traffic would overwhelm networks with massive RPC data storms and everyone would hate us. So Win2000 DCs took 5 minutes for even intra-site replication to catch up between DCs. What we found was that replication was low bandwidth already and customers weren’t changing that much data – but when they did change data, they wanted it on all DCs faster. So intra-site replication became 15 seconds in Win2003, and we started telling everyone through Support cases, MCS engagements, and PFE ADRAPs to turn on change notification inter-site as well. This so-called “urgent replication” mechanism was designed to quickly catch up servers for “more important” changes. But since everything now happens in a few seconds, it’s mostly pointless overkill and urgent replication no longer gets any new lovin’.

    So there you go.

    - Ned “don’t forget to turn that logging off when you’re done” Pyle

  • Friday Mail Sack: Scooter Edition

    Hi folks, Ned here again. It’s that time where we look back on the more interesting questions and comments the DS team got this week. Today we talk about FRS, AD Users and Computers, Load-Balancers, DFSR, DFSN, AD Schema extension, virtualization, and Scott Goad.

    Let’s ride!

    Question

    If you get a journal wrap when using FRS, there is an event 13568 like so:

    Event Type: Warning
    Event Source: NtFrs
    Event Category: None
    Event ID: 13568
    Date: 12/12/2001
    Time: 2:03:32 PM
    User: N/A
    Computer: DC-01
    Description:
    The File Replication Service has detected that the replica set " 1 " is in JRNL_WRAP_ERROR.
    <snipped out>
    Setting the "Enable Journal Wrap Automatic Restore" registry parameter to 1 will cause the following recovery steps to be taken to automatically recover from this error state.

    But when I review KB292438 (Troubleshooting journal_wrap errors on Sysvol and DFS replica sets) it specifically states:

    Important Microsoft does not recommend that you use this registry setting, and it should not be used post-Windows 2000 SP3. Appropriate options to reduce journal wrap errors include:

    • Place the FRS-replicated content on less busy volumes.
    • Keep the FRS service running.
    • Avoid making changes to FRS-replicated content while the service is turned off.
    • Increase the USN journal size.

    So which is it?

    Answer

    The KB is correct, not the event log message. If you enable the registry setting you can get caught in a journal wrap recovery “loop” where the root cause keeps happening and getting fixed, but then happens again immediately and gets fixed, and so on: replication may sort of work – inconsistently – and you are just masking the greater problem. You should be fixing the real cause of the journal wraps.

    As to why this message is still there after 10 years and four operating systems? Inertia and our unwillingness to incur the test/localization cost of changing the event. When you have to rewrite something in all these regions and languages, the price really adds up. I am way more likely to get a bug fix from the product group that changes complex code than one that changes some text.

    Question

    I was wondering if it is intentional that the "attribute editor" tab is not visible when you use "Find" on an object in AD Users and Computers?

    Answer

    Ughh. Nope, that’s a known issue. Unfortunately for you, the business justification to fix it was not convincing. This happens in Win2008/Vista also and no Premier customer has ever put up a real struggle.

    However, you have another option: use “Find” in ADAC (aka AD Admin Center, aka DSAC.EXE). This lets you find users, and when you open them you will see the Attribute Editor property sheet. If everyone here hasn’t already figured it out, ADAC is the future due to its PowerShell integration, and ADUC doesn’t appear to be getting any further love.

    Question

    Are there any issues with putting DC’s behind load-balancers?

    Answer

    If you put a domain controller behind a load balancer you will often find that LDAP/S or Kerberos authentication fails. Keep in mind that an SPN can only be associated with one computer account, so Kerberos is going to go kaput. You will also have to issue certificates manually to the domain controllers if you want LDAP/S connectivity, because the Subject and Subject Alternative Name need to match the DNS name of the load-balanced address.

    Domain controllers are already load balanced in the sense that there are multiples of them. If your application needs to find a domain controller correctly, it should do a DC Locator or LDAP SRV record lookup like a proper citizen.
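
    You can test both paths yourself – a rough sketch with a placeholder domain name; the first line shows the raw SRV records DC Locator relies on, and the second runs the actual DC Locator, honoring sites:

    nslookup -type=SRV _ldap._tcp.dc._msdcs.contoso.com
    nltest /dsgetdc:contoso.com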

    Answer courtesy of Rob “Sasquatch” Greene, our tame authentication yeti.

    image

    Question

    The documentation on DFSR's cross-file RDC is pretty unclear – do I need two Enterprise Edition servers or just one? Also, can you provide a bit more detail on what cross-file RDC does?

    Answer

    Just one of the two servers in a given partnership – i.e. two servers replicating over a DFSR connection – needs to be running Enterprise Edition in order for both servers to use cross-file RDC. Proof. There is no difference between the DFSR code in Standard Edition and Enterprise Edition; once the servers agree that at least one of them is Enterprise, both will use cross-file RDC. Otherwise, any time you got a hotfix from us there’d be one for each edition, right? But there never are: http://support.microsoft.com/kb/968429 (and yes, this article has gotten a bit out of sync with reality; we’re working on that).

    As for what cross-file RDC does: if you are already familiar with normal Remote Differential Compression, you understand that it takes a staged and compressed copy of a file and creates MD4 signatures based on “chunks” of the file:

    image

    This means that when a file is altered (even in the middle), we can efficiently see which signatures changed and then send along just the matching data blocks. So a 50MB doc that changes one paragraph only replicates a few KB. An overall SHA-1 hash is used for the entire file – including attributes, security info, alternate data streams, etc. – as a way to know whether two files match perfectly. DFSR can also make signatures of signatures, up to 8 levels deep, to handle very large changes in a big file more efficiently.

    Cross-file RDC takes this slightly further: by using a special hidden sparse file (located in <drive>:\system volume information\dfsr\similaritytable_1) to track all these signatures, we can use similar files that we already have to build our copy of a new file locally. Up to five of these similar files can be used. So if an upstream server says “I have file X and here are its RDC signatures”, the downstream server can say “ah, I don’t have file X. But I do have files Y and Z that share some of those signatures, so I’ll grab data from them locally and save you having to transmit it to me over the wire.” Since files are often just copies of other files with a little modification, we gain a lot of over-the-wire efficiency and minimize bandwidth usage.

    Slick, eh?

    Question

    I’m seeing DFS namespace clients going out of site for referrals. I’ve been through this article “What can cause clients to be referred to unexpected targets.” Is there anything else I’m missing?

    Answer

    There has been an explosion of so-called “WAN optimizer” products in the past few years, and it seems like everyone’s buying them. These devices can be very problematic for DFS namespace clients, as they tend to use Network Address Translation (NAT). This means they change the IP header info on all your SMB packets to match the subnets of the appliance endpoints – and that means that when DFS tries to figure out your subnet to give you the nearest targets, it gets the subnet of the WAN appliance, not yours. So you end up using DFS targets in a totally different site, defeating the purpose of DFS in the first place – a WAN de-optimizer. :)

    A double-sided network capture will show this very clearly – packets that leave one computer will arrive at your DFS root server with a completely different IP address. Reconfigure the WAN appliance not to do this or contact their vendor about other options.
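
    A quick way to see which referral targets the client actually received, and which one it is using, is dfsutil on the client. On Vista/2008 and later the cache form of the command shows the referral cache (older clients use dfsutil /pktinfo instead):

    dfsutil cache referral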

    Question

    I have created/purchased a product that will extend my active directory schema. Since it was not made or tested by Microsoft, I am understandably nervous that I am about to irrevocably destroy my AD universe. How can I test out the LDF file(s) that will be modifying my schema to ensure it is not going to ruin my weekend?

    Answer

    What you need is the free AD Schema Extension Conflict Analyzer. This script can be run anywhere you have installed PowerShell 2.0 and does not require you to use AD PowerShell (for all you late bloomers that have not yet rolled out Win7/R2).

    All you do is point this script at your LDF file(s) and your AD schema and let it decide how things look:

    set-executionpolicy unrestricted

    C:\temp\ADSchemaExtensionConflictAnalyzer.ps1 -inputfile D:\scratch\FooBarExtend-ned.ldf -outputfile results.txt

    image

    image

    image

    It will find syntax errors, mismatched attribute data types, conflicting objects, etc. plus give advice. Like here it warned me that my new attributes will be in the Global Catalog (in the “partial attribute set”). The script makes no changes to your production forest at all, but if you’re nervous anyway you can export your production schema with:

    ldifde.exe -f myschema.ldf -d cn=schema,cn=configuration,dc=contoso,dc=com

    … and have the script just compare the two files (if you’re paying attention you’ll see it call LDIFDE in a separate console window already though. You big baby.).

    Question

    I <blah blah blah> Windows <blah blah blah> running on VMWare.

    Answer

    You must be made of money, Jack. You’re already paying us for the OS you’re running everywhere. Then instead of using our free hypervisor and way less expensive management system you’re paying someone else a bunch of dough.

    “But Ned, we want dynamic memory usage, Linux support, and instantaneous guest migration between hosts”.

    Ok:

    If you really want to give your CFO a coronary, try this link: http://www.microsoft.com/virtualization/en/us/cost-compare-calculator.aspx

    Then while the EMT’s are working on him to start his ticker back up, take out your CIO with this:

    Support policy for Microsoft software running in non-Microsoft hardware virtualization software
    http://support.microsoft.com/kb/897615/

    … Microsoft will support server operating systems subject to the Microsoft Support Lifecycle policy for its customers who have support agreements when the operating system runs virtualized on non-Microsoft hardware virtualization software. This support will include coordinating with the vendor to jointly investigate support issues. As part of the investigation, Microsoft may still require the issue to be reproduced independently from the non-Microsoft hardware virtualization software.

    This is more common than you might think; we find VMware-only issues all the time, and our customer is then up a creek. There are troubleshooting steps – especially with debugging – that we simply cannot perform due to the VMware architecture. Hence you will need to reproduce the issue on physical hardware or Hyper-V, where we can gather data. Although when we find that it no longer repros off VMware… now what?

    And of course, when all those VMware ESX servers stopped working for 2 days last year, their workaround could not be performed on DCs as it involved rolling back time. I know that sounds like schadenfreude, but when a customer’s DCs all go offline, we get called in even if it’s nothing to do with us - just ask me how it was when McAfee and CA decided to delete core Windows files. Spoiler alert: it blows.

    I feel strongly about this…

     

    Finally, I want to welcome Scott Goad to our fold – you have probably noticed that the KB/Blog aggregations have started again. If you look carefully you’ll see that Scott has taken that over from Craig Landis, who has moved on to getting us better equipped to support ADFS 2.0. Scott used to be a cop and he also has been working on those podcast pieces with Russ.

    image
    Naturally, Office has clipart for that precise scenario

    Welcome Scooter and thanks for all the hard work Craig!

    - Ned “I’ll let you try my Clip-Tang style!” Pyle