I’ve recently returned from TechEd North America 2011 in Atlanta, Georgia, where I had a wonderful time seeing old friends and new, and talking with customers and partners about my favorite subject: high availability in Exchange Server 2010. In case you missed TechEd, or were there and missed some sessions, you can download slide decks and watch presentations on Channel 9.
While at TechEd, I noticed several misconceptions that were being repeated around certain aspects of Exchange HA. I thought a blog post might help clear up these misconceptions.
But first let’s start with some terminology to make sure everyone understands the components, settings and concepts I’ll be discussing.
OK, that’s enough terms for now.
Now let’s discuss (and dispel) these misconceptions, in no particular order.
The actual name – Alternate Witness Server – originates from the fact that its intended purpose is to provide a replacement witness server for a DAG to use after a datacenter switchover. When you are performing a datacenter switchover, you’re restoring service and data to an alternate or standby datacenter after you’ve deemed your primary datacenter unusable from a messaging service perspective.
Although you can configure an Alternate Witness Server (and corresponding Alternate Witness Directory) for a DAG at any time, the Alternate Witness Server will not be used by the DAG until part-way through a datacenter switchover; specifically, when the Restore-DatabaseAvailabilityGroup cmdlet is used.
The Alternate Witness Server itself does not provide any redundancy for the Witness Server, and DAGs do not dynamically switch witness servers, nor do they automatically start using the Alternate Witness Server in the event of a problem with the Witness Server.
The reality is that the Witness Server does not need to be made redundant. In the event the server acting as the Witness Server is lost, it is a quick and easy operation to configure a replacement Witness Server from either the Exchange Management Console or the Exchange Management Shell.
In this scenario, you have one DAG member in the primary datacenter (Portland) and one DAG member in a secondary datacenter (Redmond). Because this is a two-member DAG, it will use a Witness Server. Our recommendation is (and has always been) to locate the Witness Server in the primary datacenter, as shown below.
Figure 1: When extending a two-member DAG across two datacenters, locate the Witness Server in the primary datacenter
In this example, Portland is the primary datacenter because it contains the majority of the user population. As illustrated below, in the event of a WAN outage (which will always result in the loss of communication between some DAG members when a DAG is extended across a WAN), the DAG member in the Portland datacenter will maintain quorum and continue servicing the local user population, and the DAG member in the Redmond datacenter will lose quorum and will require manual intervention to restore to service after WAN connectivity is restored.
Figure 2: In the event of a WAN outage, the DAG member in the primary datacenter will maintain quorum and continue servicing local users
The reason for this behavior has to do with the core rules around quorum and DAGs, specifically:
Going back to our example, consider the placement of the Witness Server in a third datacenter, which would look like the following:
Figure 3: Locating the Witness Server in a third datacenter does not provide you with any different behavior
The above configuration does not provide you with any different behavior. In the event WAN connectivity is lost between Portland and Redmond, one DAG member will retain quorum and one DAG member will lose quorum, as illustrated below:
Figure 4: In the event of a WAN outage between the two datacenters, one DAG member will retain quorum
Here we have two DAG members; thus two voters. Using the formula V/2 + 1, we need at least 2 votes to maintain quorum. When the WAN connection between Portland and Redmond is lost, it causes the DAG’s underlying cluster to verify that it still has quorum.
In this example, the DAG member in Portland is able to place an SMB lock on the witness.log file on the Witness Server in Olympia. Because the DAG member in Portland is the locking node, it gets the weighted vote, and now therefore holds the two votes necessary to retain quorum and keep its cluster and DAG functions operating.
Although the DAG member in Redmond can communicate with the Witness Server in Olympia, it cannot place an SMB lock on the witness.log file because one already exists. And because it cannot communicate with the locking node, the Redmond DAG member is in the minority, it loses quorum, and it terminates its cluster and DAG functions. Remember, it doesn’t matter if the other DAG members can communicate with the Witness Server; they need to be able to communicate with the locking node in order to participate in quorum and remain functional.
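To make the vote counting concrete, here is a minimal Python sketch of the scenario above. It is only a model of the arithmetic, not how the cluster service is implemented; the function names and the member names `PortlandMember` and `RedmondMember` are made up for illustration.

```python
from typing import Iterable, Optional


def votes_needed(voters: int) -> int:
    # V/2 + 1 with integer division: 2 voters -> 2 votes.
    return voters // 2 + 1


def partition_has_quorum(partition: Iterable[str], all_members: Iterable[str],
                         locking_node: Optional[str]) -> bool:
    partition = set(partition)
    votes = len(partition)
    # The witness vote counts only for the side that contains the locking node.
    if locking_node in partition:
        votes += 1
    return votes >= votes_needed(len(set(all_members)))


members = ["PortlandMember", "RedmondMember"]
# WAN link is down, so each member sees only itself. Portland holds the witness.log lock.
portland_ok = partition_has_quorum(["PortlandMember"], members, "PortlandMember")
redmond_ok = partition_has_quorum(["RedmondMember"], members, "PortlandMember")
print(portland_ok, redmond_ok)  # True False
```

Note that the location of the Witness Server never appears in the calculation. Only the identity of the locking node matters, which is why moving the witness to a third datacenter does not change the outcome.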
As documented in Managing Database Availability Groups on TechNet, if you have a DAG extended across two sites, we recommend that you place the Witness Server in the datacenter that you consider to be your primary datacenter based on the location of your user population. If you have multiple datacenters with active user populations, we recommend using two DAGs (also as documented in Database Availability Group Design Examples on TechNet).
In addition to Misconception Number 2, there is a related misconception that extending an even-member DAG to two datacenters and using a witness server in a third enables greater resilience because it allows you to configure the system to perform a “datacenter failover.” You may have noticed that the term “datacenter failover” is not defined above in the Terminology section. From an Exchange perspective, there’s no such thing. As a result, no configuration can enable a true datacenter failover for Exchange.
Remember, failover is corrective action performed automatically by the system. There is no mechanism to achieve this for datacenter-level failures in Exchange 2010. While the above configuration may enable server failovers and database failovers, it cannot enable datacenter failovers. Instead, the process for recovering from a datacenter-level failure or disaster is a manual process called a datacenter switchover, and that process always begins with humans making the decision to activate a second or standby datacenter.
Activating a second datacenter is not a trivial task, and it involves much more than the inner workings of a DAG. It also involves moving messaging namespaces from the primary datacenter to the second datacenter. Moreover, it assumes that the primary datacenter is no longer able to provide a sufficient level of service to meet the needs of the organization. This is a condition that the system simply cannot detect on its own. It has no awareness of the nature or duration of the outage. Thus, a datacenter switchover is always a manual process that begins with the decision-making process itself.
Once the decision to perform a datacenter switchover has been made, performing one is a straightforward process that is well-documented in Datacenter Switchovers.
Datacenter Activation Coordination (DAC) mode has nothing whatsoever to do with failover. DAC mode is a property of the DAG that, when enabled, forces starting DAG members to acquire permission from other DAG members in order to mount mailbox databases. DAC mode was created to handle the following basic scenario:
In this scenario, the starting DAG members in the primary datacenter have no idea that a datacenter switchover has occurred. They still believe they are responsible for hosting active copies of databases, and without DAC mode, if they have a sufficient number of votes to establish quorum, they would try to mount their active databases. This would result in a bad condition called split brain, which would occur at the database level. In this condition, multiple DAG members that cannot communicate with each other each host an active copy of the same mailbox database. This would be a very unfortunate condition that increases the chances of data loss and makes data recovery challenging and lengthy (possible, but definitely not a situation we would want any customer to be in).
The way databases are mounted in Exchange 2010 has changed. Yes, the Information Store still performs the mount, but it will only do so if Active Manager asks it to. Even when an administrator right-clicks a mailbox database in the EMC and selects Mount Database, it is Active Manager that provides the administrative interface for that task, and performs the RPC request into the Information Store to perform the mount operation (even on Mailbox servers that are not members of a DAG).
Thus, when every DAG member starts, it is Active Manager that decides whether or not to send a mount request for a mailbox database to the Information Store. When a DAG is enabled for DAC mode, this startup and decision-making process by Active Manager is altered. Specifically, in DAC mode, a starting DAG member must ask for permission from other DAG members before it can mount any databases.
DAC mode works by using a bit stored in memory by Active Manager called the Datacenter Activation Coordination Protocol (DACP). That’s a very fancy name for something that is simply a bit in memory set to either a 1 or a 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.
The starting bit is always 0, and because the bit is held in memory, any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted, the bit reverts to 0. In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:
If either condition is true, Active Manager on a starting DAG member will issue mount requests for the active database copies it hosts. If neither condition is true, Active Manager will not issue any mount requests.
Returning to the intended DAC mode scenario, when power is restored to the primary datacenter without WAN connectivity, the DAG members starting up in that datacenter can communicate only with each other. And because they are starting up from a power loss, their DACP bit will be set to 0. As a result, none of the starting DAG members in the primary datacenter are able to meet either of the conditions above, and they are therefore unable to change their DACP bit to 1 and issue mount requests.
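The startup check can be sketched as a small Python model. This is a hedged illustration, not Active Manager's code: it assumes the two conditions are the ones described in the TechNet DAC mode topic (reaching any member whose DACP bit is already 1, or reaching every member on the started servers list), and the server names EX1 through EX4, the class, and the function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class DagMember:
    name: str
    dacp_bit: int = 0  # in-memory only, so it starts at 0 after every restart


def try_enable_mounts(starting: DagMember, reachable: Iterable[DagMember],
                      started_servers: Set[str]) -> bool:
    reachable = list(reachable)
    # Assumed condition 1: a reachable member already has its DACP bit set to 1.
    peer_has_bit = any(m.dacp_bit == 1 for m in reachable)
    # Assumed condition 2: every server on the started list can be reached.
    visible = {m.name for m in reachable} | {starting.name}
    sees_all_started = started_servers <= visible
    if peer_has_bit or sees_all_started:
        starting.dacp_bit = 1
    return starting.dacp_bit == 1


started = {"EX1", "EX2", "EX3", "EX4"}          # hypothetical four-member DAG
ex1, ex2 = DagMember("EX1"), DagMember("EX2")   # primary datacenter, just powered on
ex3 = DagMember("EX3", dacp_bit=1)              # second datacenter, already in service

no_wan = try_enable_mounts(ex1, [ex2], started)          # can see only EX2
with_wan = try_enable_mounts(ex1, [ex2, ex3], started)   # WAN is back, EX3 is visible
print(no_wan, with_wan)  # False True
```

In the model, as in the scenario, the members in the primary datacenter stay at 0 for as long as they can see only each other.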
So that’s how DAC mode prevents split brain at the database level. It has nothing whatsoever to do with failovers, and therefore leaving DAC mode disabled will not enable automatic datacenter failovers.
By the way, as documented in Understanding Datacenter Activation Coordination Mode on TechNet, a nice side benefit of DAC mode is that it also provides you with the ability to use the built-in Exchange site resilience tasks.
This is a case where two separate functions are being combined to form this misperception: the AutoDatabaseMountDial setting and a feature known as Incremental Resync (aka Incremental Reseed v2). These features are actually not related, but they appear to be because they deal with roughly the same number of log files on different copies of the same database.
When a failure occurs in a DAG that affects the active copy of a replicated mailbox database, a passive copy of that database is activated one of two ways: either automatically by the system, or manually by an administrator. The automatic recovery action is based on the value of the AutoDatabaseMountDial setting.
As documented in Understanding Datacenter Activation Coordination Mode, this dial setting is the administrator’s way of telling a DAG member the maximum number of log files that can be missing while still allowing its database copies to be mounted. The default setting is GoodAvailability, which translates to 6 or fewer logs missing. This means if 6 or fewer log files never made it from the active copy to this passive copy, it is still OK for the server to mount this database copy as the new active copy. This scenario is referred to as a lossy failover, and it is Exchange doing what it was designed to do. Other settings include BestAvailability (12 or fewer logs missing) and Lossless (0 logs missing).
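The dial can be thought of as a simple threshold check. The Python sketch below uses the three values quoted above. The function name is hypothetical, and the real decision involves more than this single comparison.

```python
# Maximum number of missing log files allowed by each dial setting, as stated above.
MAX_MISSING_LOGS = {
    "Lossless": 0,
    "GoodAvailability": 6,
    "BestAvailability": 12,
}


def can_auto_mount(dial_setting: str, missing_logs: int) -> bool:
    # The copy may mount automatically only if it is missing no more logs than allowed.
    return missing_logs <= MAX_MISSING_LOGS[dial_setting]


print(can_auto_mount("GoodAvailability", 4))   # True
print(can_auto_mount("GoodAvailability", 7))   # False
print(can_auto_mount("Lossless", 4))           # False
```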
After a passive copy has been activated in a lossy failover, it will create log files continuing the log generation sequence based on the last log file it received from the active copy (either through normal replication, or as a result of successful copying during the ACLL process). To illustrate this, let’s look at the scenario in detail, starting before a failure occurs.
We have two copies of DB1; the active copy is hosted on EX1 and the passive copy is hosted on EX2. The current settings and mailbox database copy status at the time of failure are as follows:
At this point, someone accidentally powers off EX1, and we have a lossy failover in which DB1\EX2 is mounted as the new active copy of the database. Because E0000000006 is the last log file DB1\EX2 has, it continues the generation stream, creating log files E0000000007, E0000000008, E0000000009, E0000000010, and so forth.
An administrator notices that EX1 is turned off and they restart EX1. EX1 boots up and among other things, the Microsoft Exchange Replication service starts. The Active Manager component, which runs inside this service, detects that:
Any time a lossy failover occurs where the original active copy may be viable for use, there is always divergence in the log stream that the system must deal with. This state causes DB1\EX1 to automatically invoke a process called Incremental Resync, which is designed to deal with divergence in the log stream after a lossy failover has occurred. Its purpose is to resynchronize database copies so that when certain failure conditions occur, you don’t have to perform a full reseed of a database copy.
In this example, divergence occurred with log generation E0000000007, as illustrated below:
Figure 5: Divergence in the log stream occurred with log E0000000007
DB1\EX2 received generations 1 through 6 from DB1\EX1 when DB1\EX1 was the active copy. But a failover occurred, and logs 7 through 10 were never copied from EX1 to EX2. Thus, when DB1\EX2 became the active copy, it continued the log generation sequence from the last log that it had, log 6. As a result, DB1\EX2 generated its own logs 7-10 that now contain data that is different from the data contained in logs 7-10 that were generated by DB1\EX1.
To detect (and resolve) this divergence, the Incremental Resync feature starts with the latest log generation on each database copy (in this example, log file 10), and it compares the two different log files, working back in the sequence until it finds a matching pair. In this example, log generation 6 is the last log file that is the same on both systems. Because DB1\EX1 is now a passive copy, and because its logs 7 through 10 are diverged from logs 7 through 10 on DB1\EX2, which is now the active copy, these log files will be thrown away by the system. Of course, this does not represent lost messages because the messages themselves are recoverable through the Transport Dumpster mechanism.
Then, logs 7 through 10 on DB1\EX2 will be replicated to DB1\EX1, and DB1\EX1 will be a healthy up-to-date copy of DB1\EX2, as illustrated below:
Figure 6: Incremental Resync corrects divergence in the log stream
I should point out that I am oversimplifying the complete Incremental Resync process, and that it is more complicated than what I have described here; however, for purposes of this discussion only a basic understanding is needed.
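In the same simplified spirit, the walk-back comparison can be sketched in Python. The log contents here are placeholder strings standing in for real log files, and the function name is hypothetical. The sketch only mirrors the steps described above: find the last matching generation, discard the diverged logs on the old active copy, and copy over the logs from the new active copy.

```python
from typing import Dict


def last_common_generation(copy_a: Dict[int, str], copy_b: Dict[int, str]) -> int:
    """Walk back from the latest generation until both copies hold the same log."""
    generation = min(max(copy_a), max(copy_b))
    while generation > 0:
        if (generation in copy_a and generation in copy_b
                and copy_a[generation] == copy_b[generation]):
            return generation
        generation -= 1
    return 0  # nothing in common


# Generations 1-6 were replicated; each copy then wrote its own generations 7-10.
shared = {g: f"shared-{g}" for g in range(1, 7)}
ex1_logs = {**shared, **{g: f"ex1-{g}" for g in range(7, 11)}}  # old active copy
ex2_logs = {**shared, **{g: f"ex2-{g}" for g in range(7, 11)}}  # new active copy

common = last_common_generation(ex1_logs, ex2_logs)
discarded = sorted(g for g in ex1_logs if g > common)

# Drop the diverged logs on DB1\EX1, then replicate the newer logs from DB1\EX2.
resynced = {g: v for g, v in ex1_logs.items() if g <= common}
resynced.update({g: v for g, v in ex2_logs.items() if g > common})

print(common, discarded, resynced == ex2_logs)  # 6 [7, 8, 9, 10] True
```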
As we saw in this example, even though DB1\EX2 lost four log files, it was still able to mount as the new active database copy because the number of missing log files was within EX2’s configured value for AutoDatabaseMountDial. And we also saw that, in order to correct divergence in the log stream after a lossy failover, the Incremental Resync function threw away four log files.
But the fact that both operations dealt with four log files does not make them related, nor does it mean that the system is throwing away log files based on the AutoDatabaseMountDial setting.
To help understand why these are really not related functions, and why AutoDatabaseMountDial does not throw away log files, consider the failure scenario itself. AutoDatabaseMountDial simply determines whether a database copy will mount during activation based on the number of missing log files. The key here is the word missing. We’re talking about log files that have not been replicated to this activated copy. If they have not been replicated, they don’t exist on this copy, and therefore, they cannot be thrown away. You can’t throw away something you don’t have.
It is also important to understand that the Incremental Resync process can only work if the previous active copy is still viable. In our example, someone accidentally shut down the server, and typically, that act should not adversely affect the mailbox database or its log stream. Thus, it left the original active copy intact and viable, making it a great candidate for Incremental Resync.
But let’s say instead that the failure was actually a storage failure, and that we’ve lost DB1\EX1 altogether. Without a viable database, Incremental Resync can’t help here, and all you can do to recover is to perform a reseed operation.
So, as you can see:
This has been followed by statements like:
a Hub Transport server with 16 GB of memory runs twice as slow as a Hub Transport server with 8 GB of memory, and the Exchange 2010 server roles were optimized to run with only 4 to 8 GB of memory.
This misconception isn’t directly related to high availability, per se, but because scalability and cost all factor into any Exchange high availability solution, it’s important to discuss this, as well, so that you can be confident that your servers are sized appropriately and that you have the proper server role ratio.
It is also important to address this misconception because it’s blatantly wrong. You can read our recommendations for memory and processors for all server roles and multi-role servers in TechNet. At no time have we ever said to limit memory to 8 GB or less on a Hub Transport or Client Access server. In fact, examining our published guidance will show you that the exact opposite is true.
Consider the recommended maximum number of processor cores we state that you should have for a Client Access or Hub Transport server. It’s 12. Now consider that our memory guidance for Client Access servers is 2 GB per core and for Hub Transport it is 1 GB per core. Thus, if you have a 12-core Client Access server, you’d install 24 GB of memory, and if you had a 12-core Hub Transport server, you would install 12 GB of memory.
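The arithmetic is simple enough to capture in a few lines of Python. The per-core figures are the ones quoted above, and the role labels and function name are just illustrative.

```python
# Memory guidance quoted above, in GB per processor core.
GB_PER_CORE = {"ClientAccess": 2, "HubTransport": 1}


def recommended_memory_gb(role: str, cores: int) -> int:
    return GB_PER_CORE[role] * cores


print(recommended_memory_gb("ClientAccess", 12))  # 24
print(recommended_memory_gb("HubTransport", 12))  # 12
```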
Exchange 2010 is a high-performance, highly-scalable, resource-efficient, enterprise-class application. In this 64-bit world of ever-increasing socket and core count and memory slots, of course Exchange 2010 is designed to handle much more than 4-8 GB of memory.
Microsoft’s internal IT department, MSIT, knows first-hand how well Exchange 2010 scales beyond 8 GB. As detailed in the white paper, Exchange Server 2010 Design and Architecture at Microsoft: How Microsoft IT Deployed Exchange Server 2010, MSIT deployed single-role Hub Transport and Client Access servers with 16 GB of memory.
It has been suggested that a possible basis for this misconception is a statement we have in Understanding Memory Configurations and Exchange Performance on TechNet that reads as follows:
Be aware that some servers experience a performance improvement when more memory slots are filled, while others experience a reduction in performance. Check with your hardware vendor to understand this effect on your server architecture.
The reality is that the statement is there because if you fail to follow your hardware vendor’s recommendation for memory layout, you can adversely affect performance of the server. This statement, while important for Exchange environments, has nothing whatsoever to do with Exchange, or any other specific application. It’s there because server vendors have specific configurations for memory based on a variety of elements, such as chipset, type of memory, socket configuration, processor configuration, and more. By no means does it mean that if you add more than 8 GB, Exchange performance will suffer. It just means you should make sure your hardware is configured correctly.
As stated in the article, and as mentioned above:
This misconception is really related more to Misconception Number 5 than to high availability, because again it’s addressing the scalability of the solution itself. Like Misconception Number 5, this one is also blatantly wrong.
The fact is, a properly sized two-member DAG can host thousands of mailboxes, scaling far beyond 250 users. For example, consider the HP E5000 Messaging System for Exchange 2010, which is a pre-configured solution that uses a two-member DAG to provide high availability solutions for customers with a mailbox count ranging from 250 up to 15,000.
Ultimately, the true size and design of your DAG will depend on a variety of factors, such as your high availability requirements, your service level agreements, and other business requirements. When sizing your servers, be sure to use the guidance and information documented in Understanding Exchange Performance, as it will help ensure your servers are sized appropriately to handle your organization’s messaging workload.
Have you heard any Exchange high availability misconceptions? Feel free to share the details with me in email. Who knows, it might just spawn another blog post!
For more information on the high availability and site resilience features of Exchange Server 2010, check out these resources: