• What to do with FSMO roles...

    We recently hired a new engineer to a team which manages some of the internal MS environments...  We were discussing FSMO role placement and he sent me mail (snippet below slightly edited) which I thought was interesting...

    The reason why we separated the roles at my last company was due to the FSMO role seizure process. You are correct, although the server is still a single point of failure, we can mitigate this single point of failure by placing the forest roles on one box and the domain roles on another. In the event that we unexpectedly lose a DC that is either a forest or domain FSMO role holder, the process of seizing the roles is minimized (fewer roles to seize). Also, it had been our experience that the forest roles aren't really used that often. You are correct, FSMO roles are still a single point of failure, however, unless we really need to perform any forest related “stuff”, the single point of failure (from a forest FSMO perspective) is a non-issue. This is not the case with the domain FSMO roles, specifically the PDCE. At my last company, we felt that due to the PDCE functions it was necessary to place the domain FSMO roles on a separate box...

    I wanted to share this, because it reminded me of a FSMO related interview question which I've used in some variation or another:

    Suppose you're paged in the middle of the night and told that one of the 150 domain controllers in your single domain forest crashed.  Your first thought is likely "So what, I'll deal with it in the morning," but then you remember it's the one holding all 5 FSMO roles.  If you could only pick one FSMO role to seize, which one would let you go back to sleep without worrying about the next day?

    There are many people that I've asked this question to...the large majority of whom answered, "The Schema Master, because without the schema the AD can't function."...  Hopefully they aren't reading this blog from whichever other job they landed in...

    So back to the whole FSMO single-point-of-failure and redundancy thing...

    I figured there were two possible reasons they arrived at the idea that separating FSMO roles along the forest/domain division was logical:

    1. There was some sort of fault tolerance between FSMO roles which could be preserved in a failure
    2. There was some urgency (specifically user impact) to getting a role holder back online immediately should a failure occur

    The first reason is obviously false.  FSMO stands for "Flexible Single Master Operations," with the emphasis on "single"...the whole point of these roles was that even though Active Directory is a distributed system, there were just some things that could only be done in one place at a time.  So let's just take the generally accepted knowledge that each FSMO role provides specific functionality which only exists in that role.
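    As an aside, before you can transfer or seize anything you need to know where the roles currently live.  A quick sketch using the netdom tool from the Support Tools (run from any domain-joined machine):

    ```shell
    REM Lists the current holders of all five FSMO roles
    REM (schema, domain naming, PDC, RID, infrastructure)
    netdom query fsmo
    ```
    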

    The second reason takes a bit more thought, but what really happens when a FSMO role holder fails?  Let's look at each role, the impact of it being offline, and the urgency:

    Schema Master – Schema updates are not available – These are generally planned changes, and the first step when doing a schema change is normally something like "make sure your environment is healthy".  There isn't any urgency if the schema master fails; having it offline is largely irrelevant until you want to make a schema change.

    Domain Naming Master – No new domains or application partitions can be added – This sort of falls into the same "healthy environment" bucket as the schema master.  I don't know of anyone who has just randomly decided to add a new domain to a forest without much thought or planning...of course, then again, I don't know all that many people either...  You might wonder why I mentioned app partitions there as well...personal experience.  When we upgraded the first DC to a beta Server 2003 OS which included the code to create the DNS application partitions, we couldn't figure out why they weren't instantiated...until we realized that the server hosting the DNM was offline (being upgraded) at the same time.  Sure enough, it came back online and there they were...  But I've never said we were perfect here...

    Infrastructure Master – No cross-domain updates, can't run any domain preps – Domain preps are planned (again)...But no cross-domain updates.  Hmmm...that could be important if you have a multi-domain environment with a lot of changes occurring...but wait...the IM tasks are throttled to run over a 2-day period (by default), so how much urgency does that really imply?  I guess you'd have to call it as you see it in your environment, but it's probably not 3am urgency...for my buddy the new engineer, he's only working in single domain forests anyway, so urgency = zero.

    RID Master – New RID pools can't be issued to DCs – This gets a bit more complicated, but let me see if I can make it easy.  Every DC is initially issued 500 RIDs.  When it gets down to 50% (250) it requests a second pool of RIDs from the RID master.  So when the RID master goes offline, every DC has anywhere between 250 and 750 RIDs available (depending on whether it's hit 50% and received the new pool).  So the urgency question is: how long will it take your environment to exhaust the RIDs on a given DC?  My guess is that in most environments, this isn't that urgent either.  Oh yeah, and don't forget that if you do seize the RID master during a failure, the original holder can never come back online...that's an automatic flatten and rebuild of that server.
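    If you want to gauge how urgent this really is in your environment, dcdiag can show how deep into its RID pool a DC is.  A sketch (run on, or pointed at, the DC in question):

    ```shell
    REM Verbose RID manager test: reports the DC's currently allocated
    REM RID pool and whether it can reach the RID master for the next one
    dcdiag /test:ridmanager /v
    ```
    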

    PDC – Time, logins, password changes, trusts – So we made it to the bottom of the list, and by this point you've figured out that the PDC has to be the most urgent FSMO role holder to get back online...the rest of them can be offline for varying amounts of time with no impact at all...so what about this one?  Yes, you should get the PDC back online whenever you can, but it's not even something that I'd jump out of bed to do...let's call it the "first thing in the morning" list.  Time sync is important, but w32time does a pretty good job, and nothing's going to diverge between today and tomorrow enough to impact you...users may see funky behavior if they changed their password, but replication will probably have completed before they call the help desk, so nothing to worry about, and trusts go back to that whole "healthy forest" thing again...  The biggest impact we see internally at Microsoft from the PDC being offline is all of the applications written in the NT 4.0 timeframe that are biased towards it.  Now that's something to consider.
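    For what it's worth, if the PDC emulator is the one role you decide to seize, the ntdsutil session looks roughly like this (DC02 is a placeholder for the surviving DC you're moving the role to; "seize" attempts a graceful transfer first and only forces the seizure if the old holder is unreachable):

    ```shell
    REM Interactive ntdsutil session to seize the PDC emulator role
    ntdsutil
    roles
    connections
    connect to server DC02
    quit
    seize PDC
    quit
    quit
    ```
    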

    So when it really comes down to it, is there any benefit to separating the forest and domain roles onto separate servers?  Probably not...is there any harm in it?  Nahh...let's just chalk it up to "operational preference", since the guys who are watching this stuff day to day need to be comfortable with the way the environments are configured.

    Pop Quiz Time:

    Raise your hand, if when your phone rings in the middle of the night and you get that call...you transfer the PDC role and go back to sleep...

    ...

    ...

    now keep your hand in the air if you also reconfigured the server you transferred the role to as the authoritative time source...  I think I found a topic for my next blog...
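    Since the PDC emulator is the default authoritative time source for the domain, the new role holder should be pointed at an external time source and marked reliable.  A sketch of the usual w32tm incantation (time.windows.com is just an example peer, substitute your own):

    ```shell
    REM Point the new PDC emulator at an external NTP peer and
    REM advertise it as a reliable time source for the domain
    w32tm /config /manualpeerlist:time.windows.com /syncfromflags:manual /reliable:yes /update

    REM Restart the time service so the new configuration takes effect
    net stop w32time
    net start w32time
    ```
    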

    If I don't see you before then, Merry Christmas, Happy Holidays...or like that commercial says, Merry Chrismahanakwanzaka.

  • Comment on ADFS Liability

    My favorite Calgary-ian Pam left the following comment on my last blog post:

    Hm.  In a perfect world, there would need to be a contractual component to any and all technical federations, and those contractual components should go through review by the privacy officer, and also by the admin team.  

    Companies and admin groups need to get religion over the process involved with creation of federations, if for no other reason than to protect themselves from liability.  

    Here is more about liability and federation:  
    http://www.rsasecurity.com/go/siliconcom/liability.asp

    Cheers,

    Pam

    I am SOOOO glad she did too, because liability is one of the hardest problems to deal with when deploying ADFS, and something that I've personally been harping on to our internal deployment team as we develop our onboarding process for new federations.  In fact, one of the topics that I've been presenting at various TechEds and the upcoming ITForum is "How Microsoft IT Deployed Active Directory Federation Services".  In that talk, I've dedicated an entire slide to just some of the impacts that liability can have, whether you're providing the user authentication or the resources.  In fact, one of the comments that I make during my talk is:

    I've been involved with Microsoft's Active Directory for 5 years, and never had any reason to meet with an attorney.  But I was tasked with deploying AD Federation Services, and within a week of the project starting up, I had met with some attorneys in our Legal and Corporate Affairs group.

    A great example of technology vs. liability is the ongoing discussion that we're having with one of our business partners about providing federated access to their internet portal.  This partner happens to be one of the providers of financial services to Microsoft employees.  From the partner's perspective, the idea of federation is wonderful...they see it increasing their security, reducing their risk (since they still allow SSNs as user names), and reducing the amount of overhead they have for constantly resetting users' passwords.  In fact, one of their architects commented that nearly as many users required a password reset EVERY TIME they attempted a login as didn't.

    Enter the Microsoft attorneys...

    They looked at the technology, and got a pretty quick understanding of the risks, limitations, and potential uses for ADFS.  They just as quickly built the following scenario:

    So Joe User's password gets compromised.  Not only can someone use it to gain access to some set of corporate resources, but now they can also go in and mess around with his retirement portfolio?  And they would do this, because during the logon attempt, "Microsoft" verified that the user was actually Joe?  Ummmm....No.

    This is basically the story of how Microsoft has ended up asking some of its higher-impact business partners to create a two-tiered authentication model.  In this case, a user can log in using ADFS authentication to view their information...but as soon as they want to make a change to their information, they'll need to enter their application-specific credentials.

    According to the partner, approximately 85% of all logons are just to view the data anyway, so it's still a win...but it also virtually guarantees that when a user does want to make a trade, they'll need to reset the password, because now they DEFINITELY are not going to remember what it is.

    So what does all this mean - it means that I agree 100% with Pam's comment, that IT people are going to have to get religion over the process of creating federations, and the impact that it has on their business.

     

  • AD and DC Builds, tweaks, configurations... (1)

    I received a mail from a blog reader (Jim) who asked:

    "Can you provide any insight regarding and tweaks or configuration settings you guys use on your DC builds?"

    Sure, I'm happy to do this, so here I am typing happily along, and I realized that there are a lot more configuration tweaks and settings in use than I should reasonably put into a single blog entry.  Instead, this will be the first of multiple entries...

    So, let's start at the very beginning (it's a very good place to start)...with our standard hardware platforms.  All MS IT domain controllers are based on either our "large" or "small" SKU...internally, we call these the DC-E (enterprise) and DC-F (field) platforms.

    The DC-E specs are:

    • DL585
    • 2 x 1.8GHz AMD Opteron (64-bit) dual core processors
    • 16GB RAM
    • 172GB total storage
      • Internal Array Controller - 2 x 72GB - RAID 1
        • 50GB OS partition
        • 18.8GB partition for Log files (L: Drive)
      • Array Controller 1 - External Storage - 6 x 36GB - RAID 0+1
        • 103.2GB partition for DIT, SYSVOL, Backups (M: Drive)

    The DC-F specs are:

    • DL385
    • 1 x 2.2GHz AMD Opteron (64-bit) dual core processor
    • 8GB RAM
    • 137GB total storage
      • Internal Array Controller
      • Disk 0 - RAID 1 - 2 x 72GB
        • 50GB OS partition
        • 18.8GB partition for Log files (L: Drive)
      • Disk 1 - RAID 0 + 1 - 4 x 36GB
        • 68.8GB partition for DIT, SYSVOL, Backups (M: Drive)

    All of our DC's run x64 OS's...well...unless we have some dogfood requirement for a 32-bit OS runtime (which we periodically do)...but for all intents and purposes, let's just pretend they all do, because we really WANT to run all 64-bit OS's.

    Somewhere previously I mentioned that our average DIT size is 10-11GB on disk.  The DC-E with 16GB of RAM lets us cache the entire database with room for growth; the DC-F with only 8GB of RAM is usually deployed where we need services but don't have the load, so caching is less of an issue.  In that case, the DC-F is significantly cheaper for us.

     

  • AD and DC Builds, tweaks, configurations... The Registry

    The first installment, what our hardware looks like, may have been useful...but I know that's not really the juicy gossip that everyone is looking for...so here's a quick follow-up with the registry tweaks that we set internally...

    Strict Replication is enabled on Windows Server 2003 - For Windows 2000 there is the "Correct Missing Objects" key, which has similar (though reversed) functionality.  Basically, this stops a DC from replicating lingering objects.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "Strict Replication Consistency" /t REG_DWORD /d 1

    The Exchange team requires this for RPC over HTTPS functionality.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "NSPI interface protocol sequences" /t REG_MULTI_SZ /d "ncacn_http:6004"

    Causes an event to be logged after each online defrag task.  The event includes file statistics about the DIT, including whitespace.  We run a separate task to harvest these events for database file maintenance.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics" /v "6 Garbage Collection" /t REG_DWORD /d 1

    Setting this to 5 causes an event to be logged for "expensive" and "inefficient" queries.  Extremely useful when troubleshooting isolated load issues.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics" /v "15 Field Engineering" /t REG_DWORD /d 5

    The following keys enable the database perfmon counters (note that these are just the reg keys; you have to enable the counters themselves as well using "Lodctr.exe Esentprf.ini"):
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\ESENT\Performance" /v "Open" /t REG_SZ /d "OpenPerformanceData"
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\ESENT\Performance" /v "Collect" /t REG_SZ /d "CollectPerformanceData"
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\ESENT\Performance" /v "Close" /t REG_SZ /d "ClosePerformanceData"
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\ESENT\Performance" /v "Library" /t REG_SZ /d "%systemroot%\system32\esentprf.dll"
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\ESENT\Performance" /v "Squeaky Lobster" /t REG_DWORD /d 1

    Just what it sounds like.  Causes DFS to use site-costed referrals.
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Dfs\Parameters" /v "SiteCostedReferrals" /t REG_DWORD /d 1

    Last but not least, on some of the servers we set LdapSrvPriority and LdapSrvWeight.  These are used for load balancing and isolation, but are not consistent across all of our servers.  Older/slower hardware gets lower weight, and special case servers that we want to shield from general traffic get higher priorities.  Check here for more info on these keys:  http://support.microsoft.com/?id=306602
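    For reference, both values live under Netlogon's parameters key.  The numbers below are illustrative only, not our production settings; remember that a higher LdapSrvPriority number makes a DC *less* preferred, while a higher LdapSrvWeight draws proportionally *more* traffic among DCs at the same priority:

    ```shell
    REM Example only: give an older/slower DC a reduced share of traffic
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" /v "LdapSrvWeight" /t REG_DWORD /d 50

    REM Example only: shield a special-case DC from general traffic by
    REM making it a last-resort choice in the SRV records
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" /v "LdapSrvPriority" /t REG_DWORD /d 200
    ```
    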

     

  • x64 Domain Controllers

    Had an e-mail thread with Joe recently, which also resulted in this blog entry.  He's a consultant for another big tech company, and was working with a customer that was migrating a lot of non-domain-joined machines to AD as well as deploying other AD-aware applications.  The net result, though, is that he was in the unenviable position of having no performance baseline to go off of, and a customer asking how many 64-bit domain controllers they needed to buy.  And therein lies the problem: there just aren't that many 64-bit DC's deployed out there (yet), so if you're starting from scratch, where do you start?

    Well, to make a long story short (too late), a few e-mails back and forth later and I fired off some of the stats that we use internally here at Microsoft.  In the spirit of copy/paste, here's the mail I sent (slightly edited to protect the innocent); if you don't have anything else to go on or just want some general reference...then you can use this.

    REMEMBER - "IT DEPENDS" and "YOUR MILEAGE WILL VARY"

    ________________________________________
    From: Brian Puhl [mailto:Brian.Puhl@microsoft.com]
    Sent: Wednesday, September 06, 2006 6:11 PM
    To: Joe
    Subject: RE: Ping...

    Well, like you said, “it depends” and “your mileage WILL vary.” 

    It’s tough, because we don’t plan based on numbers of users, workstations, or anything like that…  We base capacity on performance trends, which I realize is ultimately where you’re trying to get <customer> to…  So instead, here are some details from our Redmond domain.  These are live numbers, which you can use to approximate.  Remember that MS is probably a higher utilization environment than <customer>, so you can use these to build a deployment plan with the expectation that you could end up slightly over capacity. 

    Domain Details:
       99%+ of the users are in a single AD site, so assume that this is all for a single site.
       49K user accounts (includes service accounts, etc…)
       160K computer accounts 
       17 DC’s for authentication load, app’s – everything but exchange
       5 DC’s in a separate dedicated Exchange site, shielded from auth load

    Typical auth DC spec
       HP DL585
       4 x 2.2GHz AMD64
       16GB RAM (12 GB dit file)
       2 or 4 spindles (0+1) for OS and logs
       6 spindles (0+1) for dit, backup, and sysvol
               
    Typical load profile (randomly picked a DC and pulled open perfmon while I’m typing this mail) – see note below
       Ave CPU – 55% 
       Ave Disk Queue – 0.1

       Server Sessions – 585
       NTLM Auths – 215
       Kerb Auths – 92
       DS Client Binds/Sec – 44

       Gigabit NIC card
       NIC Output Queue – 0

    Major thing to note about the perf data – We’ve got 3 DC’s offline at the moment due to dogfooding, so this perf load would be with 14 DC’s online.  Our target utilization is 20-40% sustained peak CPU.

    Also, based on our experience, we’re rarely NIC bound.  When we see overloaded DC’s, they typically tend to be disk bound or processor bound.  Even when we had x86 with 4GB of RAM, the memory pressure just translated into disk queues, so when you’re spec’ing out your servers I would be least concerned about the connectivity.  You probably also noticed in the whitepaper that x64 doesn’t give you a whole lot of benefit in a pure auth environment.  These operations tend to be disk bound even in a 64-bit OS.

    I think you’re hoping for a “5000-10000 user” type answer, but even if I gave you a completely wild guess, It would probably do more harm than good in your conversations with the customer.

    Does this give you a better idea?  Are there other details that would help you make a better guess? 

    The whitepaper that I referred to is the Active Directory 64-bit Performance Comparison paper, located here.
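    If you want to pull the same load-profile counters from your own DCs to build a baseline like the one in the mail above, typeperf is a quick way to sample them.  A sketch (counter names assume an English-language OS with the NTDS performance object present):

    ```shell
    REM Sample CPU, disk queue, sessions, and DS binds every 15 seconds
    typeperf "\Processor(_Total)\% Processor Time" "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\Server\Server Sessions" "\NTDS\DS Client Binds/sec" -si 15
    ```
    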