Exchange 200x clusters install in two separate phases. In the first phase, the administrator runs SETUP.EXE from the CD. The Exchange binaries are copied from the CD to the local hard drive, some IIS bindings are created, and miscellaneous local-to-the-node setup happens. A number of things are read from the AD, but little (if anything) is actually written to the AD in this step. After this first step has been run on every node of the cluster, the administrator moves on to the second phase. In the second phase of the install, the administrator fires up the Cluster Administrator interface (CluAdmin) and creates a System Attendant resource. During this second phase, the Exchange Virtual Server (EVS) is created in the AD -- lots of writing to the AD.
The important takeaway from all this is that we don't create an Exchange server object in the configuration container during the first part of the setup process on an Exchange 200x cluster, only during the second part.
And that leads to a minor difference between Exchange standalone servers and Exchange clusters. Since running SETUP.EXE on a non-clustered server will basically combine the two cluster setup phases into one big SETUP run, you have to make lots of decisions in the SETUP interface. I'm going to focus on the decision you have to make about what Admin group the server will join (if you have more than one). In a non-clustered server, you have to decide this from within the SETUP program -- BEFORE it ever copies the binaries.
Exchange Clusters, however, prompt you for that decision only during the creation of the System Attendant Resource. But it's important to note that underneath the covers, Exchange Setup still needs to have an Admin Group selected. It needs this even though it's not actually going to use it for anything but filler during this first phase of setup. What it appears to do is select the first Admin Group in the organization, alphabetically.
This can be confusing because if you read through the Exchange Server Setup Progress.Log file, it will imply that setup will be installing this cluster into some random AG in your org -- most likely one you don't want it installed into! But remember, it's not actually going to install it into the AG. Selecting the AG is a part of the phase 2 setup process.
Long story short -- if for whatever reason, the Exchange 200x setup (phase 1) on a cluster stops and prompts you for 5.5 credentials to join an Admin Group *AND* all the fields are blank and unchangeable, you now have the advantage. Knowing that it's simply trying to read the first alphabetical AG from the AD and is evidently failing in some way (permissions, problems talking to the 5.5 server directly, etc), an easy workaround might be to simply create a pure E2k/E2k3 AG through ESM and name it so it will be the first alphabetical AG in your org! Cluster will then read THAT AG instead, and voila, you're past the prompt and no worse for it.
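To make the workaround concrete, here is a minimal Python sketch of the selection as described above. It assumes setup simply takes the name that sorts first, ignoring case; the exact ordering setup uses is not documented here, and the Admin Group names are hypothetical.

```python
def first_admin_group(names):
    """Return the name that sorts first, ignoring case (assumed ordering)."""
    return min(names, key=str.casefold)

# Hypothetical Admin Groups in a mixed 5.5 / E2k3 organization.
existing = ["Europe 5.5 Site", "North America", "Asia Pacific"]
print(first_admin_group(existing))

# After creating a pure E2k/E2k3 AG whose name sorts ahead of the others.
with_new = existing + ["AAA Cluster Staging"]
print(first_admin_group(with_new))
```

The point is only that the placeholder choice depends on naming, so a deliberately named AG changes which one phase 1 tries to read.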
This is the most significant problem with Microsoft's software. One just cannot figure out what it is doing without googling, asking, calling MS support, etc. This is where Open Source has a definite edge. I can see what is going on.
Being a software developer, I try to enable different levels of logging, so the maintenance people can:
1. learn how software does what it does
2. quickly diagnose the problems by enabling the relevant traces.
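As a minimal sketch of what such switchable tracing can look like, here is a Python example using the standard `logging` module. The component names (`app.setup`, `app.directory`) and the helper `enable_trace` are hypothetical and exist only for illustration.

```python
import logging

# Send everything to one handler; keep the application quiet by default.
logging.basicConfig(format="%(name)s %(levelname)s %(message)s")
logging.getLogger("app").setLevel(logging.WARNING)

setup_log = logging.getLogger("app.setup")
directory_log = logging.getLogger("app.directory")

def enable_trace(component):
    """Turn on DEBUG output for one component without touching the others."""
    logging.getLogger(component).setLevel(logging.DEBUG)

def pick_group(groups):
    """Example routine that explains what it is doing when tracing is on."""
    directory_log.debug("candidate groups: %s", groups)
    chosen = min(groups, key=str.casefold)
    directory_log.debug("using %r as placeholder group", chosen)
    return chosen

# Maintenance staff enable only the trace relevant to the problem at hand.
enable_trace("app.directory")
pick_group(["North America", "Asia Pacific"])
```

With the trace off, the same call runs silently; with it on, the log states which decision was made and from which inputs.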
This is a real quote from Slashdot:
"We have a custom COM object that talks to the Exchange server to get calendar/email for users to display on our corporate portal. Exchange 2003 will lock the user that the COM object runs as out ever 2 hours our so. We have to have 2 compiled version of the COM object with different a username/password and swap that object on the server every two hours just to get it to work. This code has been running perfectly under Exchange 2k for some time now. The MS guys have audited the code of the COM object and found no problems it and they are still scrathing their head on what is wrong with their latest and greatest. ".
So even MS maintenance people can't figure out what is wrong - how do you expect us to do that? What tools do you give us to figure out those problems by ourselves? Now we have to pay big money to "consultants" who have seen it before. This isn't meant to be offensive, but while working with some of Microsoft's server products I feel like I am very dumb and I DON'T LIKE TO FEEL LIKE THAT.
Hoping it helps Microsoft,
mbergal at meta-comm.com
You make a very valid point about complexity and diagnosis, and it's an area I sense we struggle very hard with. Technology is constantly becoming more and more capable, and so supporting it becomes likewise more difficult.
As pain points are encountered, either the product group or PSS put together diagnostic tools that can be used to more rapidly diagnose the problems. Similarly, many third-party vendors produce tools that can assist with various aspects of the server maintenance and problem diagnosis.
I get the feeling that it's very much a desire of MS to provide transparency in product function to our customers, while at the same time protecting MS intellectual property, etc. It's a delicate balance, and these blogs at blogs.msdn.com are an example of that balance.
If you encounter specific pain points while resolving a problem, by all means please send email to email@example.com, bring them to the attention of your PSS engineer the next time you're calling in about something, or post them here. MS is definitely watching these blogs and as a PSS engineer myself, I can tell you we're all about trying to make the product easier to support! :)
Thanks for your comment!
How would you recommend providing recovery for an entire data centre? We have a cluster in a data centre, with SRDF available to mirror data / logs to a remote site. How can we recover using the data at the DR site if we can't use setup with the /disasterrecovery switch?
Do you have any recommendations?
Pete - sounds like you are interested in a geospan cluster (geocluster, geographically dispersed cluster, etc.) solution. You may want to have a look at the Windows 2003 Geographically dispersed clusters whitepaper at: http://www.microsoft.com/windowsserver2003/techinfo/overview/clustergeo.mspx
Also, here's a great case study on Geocluster success with Exchange 2003: http://www.microsoft.com/resources/casestudies/CaseStudy.asp?CaseStudyID=14580
Note that Geoclustering is pretty serious, and can be quite expensive. If you have SRDF mirroring in place to your remote site already, you're probably almost there. If you find that geoclustering is NOT for you for some reason, you still have a few options:
1) You're willing to bring up another cluster in the DR site (note that you can even have a single-node cluster pre-loaded with Exchange, just waiting to create the network name and create the System Attendant resource...). In this case, simply make sure your original AD is available, create the Network Name to exactly match the original server's NN, and then make a new SA resource dependent on the NN. This will link back into the existing AD Exchange Virtual Server object and you'll be ready to restore the databases directly.
2) Or, if you want to use a non-clustered server, you can follow KB.323016 (http://support.microsoft.com/?id=323016) to get back into the data in the DR site.
Thanks. The solution we are looking at for the production site is a 3+1 cluster. Our storage vendor currently only supports geocluster in a 2 node setup which is kind of limiting. We do have the budget to set up another cluster in the Disaster Recovery data centre which we could bring up in the event of a disaster (we have an 8-12 hour SLA). Would we need to delete stuff out of the AD before we recreate the SA ?
Thanks again for your help - these blogs are really good
> Would we need to delete stuff out of the AD before we recreate the SA ?
Nope, in Exchange 2003 cluster when you create a system attendant resource dependent on a network name for which there is already an existing Exchange Virtual Server object in the AD, it will link back into that object and no further changes will be required for the AD -- it's essentially a "/disasterrecovery" switch without the switch...
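The link-back behaviour can be pictured with a toy model in Python. This is not how Exchange implements it: a plain dict stands in for the AD, the function name `create_system_attendant` is hypothetical, and the case-insensitive name match is an assumption.

```python
def create_system_attendant(directory, network_name):
    """Toy model: reuse the EVS object for network_name if one already exists."""
    key = network_name.upper()  # assumed: names are matched ignoring case
    if key in directory:
        # Existing object found, so nothing new is written to the directory.
        return directory[key], "linked to existing object"
    directory[key] = {"name": network_name, "databases": []}
    return directory[key], "created new object"

ad = {}
# Original install on the production cluster creates the object.
_, status = create_system_attendant(ad, "EVS1")
print(status)
# Recreating the SA on the DR cluster with the same Network Name reuses it.
_, status = create_system_attendant(ad, "EVS1")
print(status)
```

The design point it illustrates is that the Network Name is the lookup key, which is why it has to match the original server's NN exactly.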
thanks - sounds sensible. We are going into proof of concept testing on this in a few weeks so I will let you know how we get on
The local Microsoft support guys here are a bit concerned about the support of this option for Exchange 2003. As such I think we will be going down the "boot from SAN" route. Which is supported on our storage and possible on Win2k3 clusters. I would really like to know if the option of moving the EVS discussed here is going to appear in any future E2k3 DR guide or if there are people you know of doing it for real. With the data all replicated and the EVS sitting installed it seems a really neat solution - and my thinking is it must work if products like Doubletake are available. However support and thorough testing don't seem to be around for this option.
Any thoughts? The reason I like this option is it gives you a totally separate cluster on which you can do testing / pre-production work during normal operations. And with the nodes on-line, you know they do not have hardware problems which would kill you in DR.
Perhaps this is academic, as there may be more subtle reasons why this won't work or can't be supported which I have missed, in which case I will RTFM :-)
Not sure if there are plans to incorporate this DR option in future documentation. That said, it works just fine and I've used it in test a number of times. If you're working with MS folks, please feel free to have them email me internally and perhaps we can sort out their concerns.
Also can't speak for how Doubletake does it -- again the only clustering solutions MS can support are those on the HCL. These other solutions may work just fine, but it would change your support options a bit as the vendor would need to be your primary source of support.
Just to be clear, I don't see how option #1 as I described it will give you the ability to do testing without affecting your production environment. #2 might, but that would still require a bit of work in the test environment to get it prepped.
Can you clarify your question a bit more and perhaps I can help out a bit?
Thanks. In a DR test we would power off the production cluster, break the SRDF links and bring up the remote cluster. We would check that we can send and receive email from a test account on each VS and then restart production. All this would be done at something like 03:00 on a Sunday morning. The SRDF would then need to re-sync from prod, so you have a small window (2-3 hours) where you do not have a fully synced copy. Also, any email sent during the DR test is lost (but that is ok).
So during the testing all email would be offline effectively for the duration of the test and advertised as such to the users.
If there is something preventing us doing this repeatedly (once a quarter) or flicking back again then it rules out our using it. The clusters we are using are certified 4-node clusters and we would set them up as normal clusters. I am just looking for a way to have the DR cluster on-line during normal operation - though with the Exchange VS off-line of course.
Thanks again for your assistance on this
It should be fairly easy and non-destructive to test, so long as you end up discarding the SRDF replica. Before I would rely on this as a solution, I'd set it up in a lab (SRDF not required) and verify that deleting the SA on the "production" (in the lab) cluster and recreating it on the "remote" (in the lab) cluster is successful. Once you've verified in test that you can move these resources between the two clusters without affecting the AD (still in the lab), you may be ready to move on to production.
I can't think of any reason why this wouldn't be a supported solution by MS (there's even an event logged to tell you how to fix it if the SA is deleted -- same result), but you'll want to work with your direct MS support folks to make sure they're sold...
Thanks. We should have the resources to look at this next week (the kit is in the truck on its way to the test lab as I speak). We should be able to test this without affecting the plan. I will be speaking to the local MS guys tomorrow and will mention our chat (if you are ok with that) and suggest that they email you.
By way of follow-up on this to anyone watching for a response -- internal discussions were had and it was determined that although there are no known reasons why this wouldn't work, it's not a recommended solution (proper Geoclustering to the remote location is the recommended solution in this scenario). It's also not a fully tested solution by the Exchange clustering team, so it can't be called officially "supported".
Translation -- like a lot of things it'll probably work fine, but use at your own risk. If you're spending that kind of "availability $$" on your solution, best bet is to use something officially "supported"...