Engineers in Microsoft IT spend an unusually large amount of time talking to customers, answering questions that start with:
How Does Microsoft IT Do...<fill in the blank here>?
I'm going to try to post some of the more common questions and answers in this blog to "share the wealth", but before I get to the first of these HDMSITD posts, I wanted to put in my "default disclaimer" and a little background information to help ease some confusion...so here goes:
Default Disclaimer:
Odds are extremely high that your environment does not look like Microsoft's. In fact, I could guarantee it. We do a lot to our internal deployments that most people would consider reckless, all for the sake of "dogfooding" (testing things on ourselves first). So while I'm babbling on and on about how we do things internally, you should look for the "golden nuggets" that you find interesting. Please don't take what we do wholesale and try it out against your production environment.
Ok, that much being said...I'm going to ramble on a bit about our environment so everyone will have some context of where I'm coming from, again to help find the nuggets that may be useful:
We have multiple production, pre-production, and test forests for various business purposes, most of them are 1 or 2 domain forests but our largest contains an empty root and 8 children geographically dispersed...But wait...doesn't MS say not to use an empty root now? Yes...we do...and given the chance to start all over again, we'd probably have one big happy domain...but I digress...
Our main forest has about 100K user accounts and 300K machine accounts, which represent MS employees in 400+ sites worldwide...our AD database is ~10GB on Windows Server 2003 (~18GB on Windows 2000), and about half of our ~200 DC's are 64-bit, with plans to be fully 64-bit by next summer.
One of the nicer things about our environment is that we generally don't have problems with bandwidth, and this is one of the places where we diverge from many of our customers.
So now you've got some background on where I'm coming from and what our internal environment looks like, which should help put some of our HDMSITD... questions in context.
In the past couple of months, I've been asked at least 3 or 4 times how MS IT determines where on our network to place domain controllers. The questions usually come from larger, enterprise-type customers and usually sound something like this:
The short answers to these are:
But short answers really miss the whole point; what people are really looking for is how we do DC placement. To start with, we review our DC placement twice yearly, and our capacity planning (performance reviews) 4 times yearly. Based on experience, we should actually do both more often, but it's just not practical. For example, in a single 6 month period we had over a dozen sites in South America switch their WAN hub from North Carolina to Redmond... (this is where you might ask why the network guys aren't talking to the AD guys?...good question...)
So during our DC placement reviews, we're looking at the following network criteria:
WAN availability - Greater than 99.5% uptime between the end user and their nearest DC
Max Average Latency - Less than 500ms...this is a loose target, based on feedback from our users in the regions. This tends to change with our environment, but the feedback we get is that if a user enters their username/pw, goes to get coffee, comes back and it's still "Apply Policies"...they'll call help desk and let us know.
Max 95th percentile utilization - Less than 90%. Typically this isn't an issue, although some of the sites in Africa and the Caribbean come close...
With the network topology understood, we then consider some other factors:
Site Classification - This is one of our general categories that tells us whether the building has a secure physical location for a DC, and what the primary type of user is in the site (i.e. Sales, PSS, Dev, etc...)
Critical Applications - There was a time when this category was called "Exchange"; however, our Exchange team has done a massive amount of consolidation, and we've uncovered some other applications which are business critical
With all of this information in hand, we throw it together in a pot, sprinkle a little sweat on the keyboard and a dash of Excel, and come up with any places where we either need to add or remove DC's to support our users.
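The "sweat and Excel" step above boils down to checking each site against the network thresholds and the other factors. Here's a minimal sketch of that decision in Python; the field names and the needs_local_dc() logic are my own illustration, not MS IT's actual spreadsheet, but the thresholds (99.5% uptime, 500ms latency, 90% utilization) are the ones listed above:

```python
# Illustrative sketch of the DC placement criteria described above.
# SiteMetrics fields and needs_local_dc() are hypothetical; only the
# numeric thresholds come from the text.
from dataclasses import dataclass


@dataclass
class SiteMetrics:
    wan_uptime_pct: float      # measured WAN uptime to the nearest DC
    avg_latency_ms: float      # max average latency to the nearest DC
    util_95th_pct: float       # 95th percentile WAN utilization
    has_secure_room: bool      # site classification: secure spot for a DC
    runs_critical_apps: bool   # hosts business-critical applications


def needs_local_dc(site: SiteMetrics) -> bool:
    """A site gets a local DC if its network misses any threshold, or if
    it runs critical apps - and only if it can physically secure a DC."""
    network_ok = (
        site.wan_uptime_pct > 99.5
        and site.avg_latency_ms < 500
        and site.util_95th_pct < 90
    )
    return (not network_ok or site.runs_critical_apps) and site.has_secure_room
```

Note that user count never appears as an input, which matches the point made below about the Texas call center.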
This is normally the part where people say..."Yeah, but what about the number of users in a site? Doesn't that matter?" Not for us; we don't actually count the number of users in a site when determining DC placement. There are two great examples. One is a call center in Texas that has several thousand users, redundant links, high bandwidth, low latency, etc... No local DC. By contrast, Microsoft Game Studios has a development center in the same area with 50 developers who are using applications with extremely sensitive authentication requirements, and their business has such aggressive deadlines that they couldn't tolerate ANY possible network outage impact. They get a DC. Does the number of users matter to us? No, but how the users leverage the DC's does matter when it comes to determining where to deploy the servers.
I was in Office Depot a few weeks ago, and saw that they had fountain pens on display, so I picked up a Waterman Phileas. A few weeks have passed with it sitting on my desk, but this afternoon I decided to open it up and put it together. Thus, pen in hand, surrounded by 3 computers, 2 monitors, and countless other high tech gadgets, I pulled out a piece of paper and started to write a letter to my wife (wouldn't she be surprised)...
...Yeah...can you believe that I can't remember how to write cursive? I don't know how long it's been since I actually HAD to write cursive, at least other than signing my name...and for the life of me I've been trying to do it for the past half hour...
I'm still going to surprise my wife with a nice hand written letter...but it's going to take a day or two.
When a user resets their password, what happens? What about if ALL your users reset their passwords? Can your infrastructure handle it? Are there "special" changes that you'd want to make? More importantly, this is probably such an edge case that it's not on your top 150 list of things to do. Well, just for kicks, let's look at bulk password resets.
Consider the following scenario:
For whatever reason floated their boat that day, your security team wakes you up in the middle of the night with a mandate that all 20,000 users in your domain need to be forced to change their passwords when they log in the next morning. Your first thought is probably something about finding a script that can expire the passwords on all of these accounts, but at some point it's likely going to occur to you that the majority of these 20,000 users are all going to come walking into their offices around the same time (8am, 9am, whenever...) and try to log in. Can your DC's handle it?
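As for that script: one common approach is to set each account's pwdLastSet attribute to 0, which makes AD demand a password change at next logon. Here's a small sketch that just generates an LDIF change file you could feed to a tool like ldifde; the function name and the DNs are placeholders, and this isn't the script MS IT used:

```python
# Sketch: build an LDIF change file that expires passwords in bulk by
# setting each account's pwdLastSet to 0 (AD then forces a password
# change at the user's next logon). Apply with e.g. "ldifde -i -f
# expire.ldf". The DNs below are placeholders.

def expire_passwords_ldif(user_dns):
    """Return LDIF modify records, one per user DN."""
    entries = []
    for dn in user_dns:
        entries.append(
            f"dn: {dn}\n"
            "changetype: modify\n"
            "replace: pwdLastSet\n"
            "pwdLastSet: 0\n"
            "-\n"
        )
    return "\n".join(entries)


print(expire_passwords_ldif([
    "CN=Alice,OU=Users,DC=corp,DC=example,DC=com",
    "CN=Bob,OU=Users,DC=corp,DC=example,DC=com",
]))
```

The expiry itself is the easy part; the interesting question, as described next, is whether your DC's survive the morning after.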
It turns out that this was a scenario that we wanted to test (without the security boat floating) in January 2003 while we were dogfooding Windows Server 2003. Recently the topic came up again in a discussion, so I dusted off the original report that was written to refresh my memory of what we saw and what we'll do if we ever need to do a mass password reset.
To start with, you first have to assess the true impact. If your users are spread out across many time zones, then "next morning" is really a relative thing, and you should consider whether any of your users are actually in the middle of their day. Those are the folks that are going to call help desk on you. In our case, we intentionally reset the passwords of our users here in Redmond, which ensured that nearly everyone was going to come in around the same time (10am'ish) and reset.
Before we get too far into this though, let's look at password reset behavior. When a user logs on and changes their password, there is an extra step that occurs: the NetLogon service of the DC making the password change forwards the new password to the PDC for the domain. Generally this is a good thing, because if the user attempts to authenticate using their new password against a DC that hasn't replicated it yet, the "failed" password is checked against the PDC for the domain and the user will succeed. Both of these behaviors, the password update and the additional check, can present a problem during a bulk reset, because the volume of changes being pushed from many DC's to one PDC can melt your PDC into slag on the data center floor...and that's always messy to clean up. So the solution is to hide your PDC from the sudden flood of traffic that's going to come from the DC's.
To solve this problem, there is a registry value called AvoidPdcOnWan that needs to be set on every DC (a FOR loop with "reg add" is your friend here). Setting this value tells every DC to let normal AD replication take care of getting the password to the PDC. Also, if presented with a bad password, the DC won't check with the PDC but will instead just fail the authentication. The second step is to move the PDC into its own AD site. In our case, all of the DC's for these 20K users were in a single AD site, which included the PDC. So we created a new site called PDC and linked it to our main site with the minimum replication interval. This effectively put the PDC "on the WAN" to all the other DC's.
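That FOR-loop-with-reg-add step might look like the sketch below: a little dry-run generator that prints one "reg add" command per DC, targeting the standard Netlogon parameters key. The DC names are placeholders, and whether you run the output from a batch file or a cmd window is up to you:

```python
# Sketch: print the "reg add" command that sets AvoidPdcOnWan=1 on
# each DC (AvoidPdcOnWan is a REG_DWORD under the Netlogon Parameters
# key). DC names below are placeholders. Setting the value back to 0
# restores the normal check-the-PDC-on-bad-password behavior.

NETLOGON_PARAMS = r"SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"


def avoid_pdc_commands(dc_names, value=1):
    """Return one remote 'reg add' command line per DC name."""
    return [
        rf"reg add \\{dc}\HKLM\{NETLOGON_PARAMS}"
        rf" /v AvoidPdcOnWan /t REG_DWORD /d {value} /f"
        for dc in dc_names
    ]


for cmd in avoid_pdc_commands(["DC01", "DC02"]):
    print(cmd)
```

Generating the commands and eyeballing them before you run anything against a couple hundred DC's is cheap insurance.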
The net result of these changes is that the bulk of your DC's will do the work AND they'll play nice with the PDC for the domain (always a good thing). Unfortunately, there is a behavior change: the DC's won't check a "bad" password with the PDC, so you wouldn't want to leave this config in place all the time, as that's a "bad user experience" which normally translates into help desk calls.
Hopefully you never have to reset the passwords for a large number of users, but if you do, then setting this reg value, creating a new site, and moving the PDC into it will at least save you from the secondary thrash of watching your PDC slowly fall onto its side and beg for mercy.