  • Witness Server Warning Message When Using Certain Database Availability Group Tasks

    Recently, some customers reported that when they create a DAG, they get a warning message that states the following:

    The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server <ServerName>.

    In these cases, the customer’s intended witness server was not an Exchange 2010 server.  As documented in TechNet, if the witness server you specify isn't an Exchange 2010 server, you must add the Exchange Trusted Subsystem (ETS) universal security group (USG) to the local Administrators group on the witness server. These security permissions are necessary to ensure that Exchange can create a directory and share on the witness server as needed.

    After some inspection, the customers confirmed that, contrary to the warning message, the ETS USG was a member of the local Administrators group on their intended witness server.  Moreover, even though this warning appeared, there were no ill effects on functionality.  The directory and share on the witness server were created as needed, the file share witness cluster resource was online, and the DAG passed all replication health checks.

    After hearing about this, I went to my lab and was able to reproduce the issue.  I added the ETS USG to the local Administrators group on my witness server (a Windows Server 2008 file server) and ran New-DatabaseAvailabilityGroup, specifying my witness server.  I received the same warning message, and verified that despite the message, the DAG was perfectly healthy: there were no permission problems, no witness server or cluster problems, and no other issues.

    Even though it appeared as though this warning message could be safely ignored, I wondered why we were getting it in the first place.  So I went digging into the source code to find out.

    Let me describe what is happening and why you, too, can safely ignore the warning message.

    During various DAG-related tasks that configure witness server properties (namely, New-DatabaseAvailabilityGroup, Set-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup), the code is actually checking to see if the witness server is a member of the Exchange Trusted Subsystem USG.

    As you may know, there is no requirement that the witness server be a member of the ETS USG.  Nonetheless, the code for these tasks does check for this, and if it finds that the witness server is not a member of the ETS USG, it issues a warning message.

    Unfortunately, to confuse things even more, the warning message says:

    The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server <ServerName>.

    It says nothing about the witness server not being a member of the ETS USG, even though the code is checking for that.  Instead, it makes it appear as though the permission prerequisites have not been satisfied, even though they actually have.

    But even though the message does not pertain to the actual check that failed, that does not make this a string bug.  This is a code bug, as there is no requirement that the witness server be a member of the ETS USG.  Thus, the code should not be checking for this condition.  If this bug is fixed and the check is removed, the string will be removed with it. Unless and until that happens, if you are seeing this warning message when you are using any of the above-mentioned tasks, and you have verified that the ETS USG is a member of the local Administrators group on your witness server, then you can likely safely ignore the warning message. You should run Test-ReplicationHealth to verify the health of the DAG once members have been added to it.

    Because we are doing this check in code, you can of course add the witness server to the ETS group, and also make the ETS group a member of the local Administrators group on the witness server, and all of these tasks will complete without this warning message. But don't do that in production because (1) it is not needed and (2) it gives the witness server way more permissions than it should ever have (unless, of course, the witness server is an Exchange 2010 server).
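
    To make the mismatch concrete, here is a minimal Python sketch of the two conditions described above. It is a toy model only, not the actual task code: the Directory class, the function names, and the server names EX1, EX2 and FS1 are all hypothetical, and nothing here calls a real Exchange or Active Directory API.

```python
from dataclasses import dataclass, field

ETS = "Exchange Trusted Subsystem"


@dataclass
class Directory:
    """Toy stand-in for group membership data (hypothetical, not an Exchange API)."""
    ets_members: set = field(default_factory=set)      # computer accounts in the ETS USG
    local_admins: dict = field(default_factory=dict)   # server name -> local Administrators members


def task_check_passes(directory: Directory, witness: str) -> bool:
    """The condition the tasks test, per the post: is the witness server in the ETS USG?"""
    return witness in directory.ets_members


def real_requirement_met(directory: Directory, witness: str) -> bool:
    """The documented requirement for a non-Exchange witness: ETS in local Administrators."""
    return ETS in directory.local_admins.get(witness, set())


def warning_shown(directory: Directory, witness: str) -> bool:
    """The warning appears whenever the (unnecessary) membership check fails."""
    return not task_check_passes(directory, witness)


# A correctly configured non-Exchange witness server still triggers the warning.
lab = Directory(ets_members={"EX1", "EX2"}, local_admins={"FS1": {ETS}})
print(real_requirement_met(lab, "FS1"), warning_shown(lab, "FS1"))
```

    In this model, the warning depends only on whether the witness server is in the ETS USG, which is why it can appear even when the real permission requirement is satisfied.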

  • Exchange 2010 High Availability Misconceptions Addressed

    We just published a blog post I wrote about some common Exchange 2010 high availability misconceptions that I heard repeated at Tech.Ed North America.  In it, I discuss and dispel several misconceptions:

    • Misconception Number 1: The Alternate Witness Server (AWS) provides redundancy for the Witness Server (WS)
    • Misconception Number 2: Microsoft recommends that you deploy the Witness Server in a third datacenter when extending a two-member DAG across two datacenters
    • Misconception Number 2a: When I have a DAG with an even number of members that is extended to two datacenters, placing the witness server in a third datacenter enhances resilience
    • Misconception Number 3: Enabling DAC mode prevents automatic failover between datacenters; therefore, if I want to create a datacenter failover configuration, I shouldn’t enable DAC mode for my DAG
    • Misconception Number 4: The AutoDatabaseMountDial setting controls how many log files are thrown away by the system in order to mount a database
    • Misconception Number 5: Hub Transport and Client Access servers should not have more than 8 GB of memory because they run slower if you install more than that
    • Misconception Number 6: A Two-Member DAG is designed for a small office with 250 mailboxes or less

    Enjoy!

  • Understanding and Troubleshooting Microsoft Exchange Server Integration

    Today we released a new Word document that introduces you to some of the new client features that are available when Microsoft Lync Server 2010 is integrated with Exchange Server 2010.

    Lync 2010, as well as other unified communications (UC) clients and devices, interact with Microsoft Exchange and Microsoft Outlook to provide integrated features to the end user, such as:

    • Contact Information
    • Calendar Information
    • Conversation History
    • Missed Conversations
    • Missed Calls
    • Voice Mail Playback

    Successfully integrating these two enterprise communications solutions can be challenging, especially considering that there are subtle differences in the way that services from each product are leveraged by Lync Server 2010 clients.  The information contained in the document is not intended to be authoritative with regard to these topics. Rather, it is a collection of information that the author (Dave Howe) gathered from various product specifications as well as some general troubleshooting information.

    Enjoy!

  • Exchange 2010 SP1 and Windows Bugchecks

    In case you aren’t familiar with the word, a bugcheck is one of several technical terms used to describe the situation in which an operating system halts because it has encountered an error that prevents it from safely continuing to operate.  Other technical terms used to describe this condition include:

    • Kernel panic
    • System halt
    • Fatal system error
    • Stop error

    And some non-technical terms to describe this condition include:

    • System crash
    • Blue screen of death (BSOD)

    When this condition occurs, the system creates a system dump (also known as a memory dump or crash dump) that records what the system was doing at the time. This information can be very useful in debugging the problem and determining why the bugcheck occurred in the first place.  Depending on how the administrator has configured the operating system, after the system dump is written to disk (if possible), the operating system may restart itself as a form of self-corrective action.

    Exploiting Bugcheck Behavior

    I sometimes hear administrators describe a bugcheck as a bad thing.  The bugcheck behavior itself is a good thing.  It’s the problem that caused the bugcheck to occur that is the bad thing.  Simply put, bugchecking is there because it enables the system to try to recover from an otherwise unrecoverable error.  Understanding bugchecks for what they are lends itself to understanding how an application might exploit this behavior to its own advantage. For example, in Windows Server 2008 R2, new logic was added to Windows Failover Clustering (WFC) that enabled WFC to self-recover under specific conditions.  When certain catastrophic and unrecoverable errors occur in a cluster running Windows Server 2008 R2, WFC will intentionally bugcheck the server as a last-resort method of recovery.

    Exchange 2010 SP1 Bugcheck Behavior

    In Exchange 2010 SP1, we added logic to the system that leverages bugcheck behavior when certain conditions occur, specifically when hung IO occurs.  In SP1, the Extensible Storage Engine (ESE) has been updated to detect hung IO and to take corrective action to automatically recover the server.  ESE keeps an IO watchdog thread that detects when an IO has been outstanding for too long. If the IO is outstanding for more than 1 minute, ESE will log an event. If an Exchange database has an IO outstanding for more than 4 minutes, ESE will log a specific failure event, if it is possible to do so. ESE event 507, 508, 509 or 510 may or may not be logged, depending on the nature of the hung IO.  Obviously, if the problem is such that the OS volume is affected or the ability to write to the event log is affected, the events will not be logged. If the events are logged, the Microsoft Exchange Replication service (MSExchangeRepl.exe) will detect those failure events and intentionally cause a bugcheck of Windows by terminating the wininit.exe process.
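
    The threshold logic described above can be sketched roughly as follows. This is an illustrative Python model only, not ESE or MSExchangeRepl code: the constant names, the Action enum and both functions are hypothetical, and the sketch covers only the path in which the failure event actually reaches the event log.

```python
from enum import Enum

WARN_AFTER_SECONDS = 60        # IO outstanding for more than one minute
FAIL_AFTER_SECONDS = 4 * 60    # IO outstanding for more than four minutes


class Action(Enum):
    NONE = "none"
    LOG_EVENT = "log event"
    LOG_FAILURE_EVENT = "log failure event"


def classify_outstanding_io(outstanding_seconds: float) -> Action:
    """Map how long an IO has been outstanding to the watchdog's reaction."""
    if outstanding_seconds > FAIL_AFTER_SECONDS:
        return Action.LOG_FAILURE_EVENT
    if outstanding_seconds > WARN_AFTER_SECONDS:
        return Action.LOG_EVENT
    return Action.NONE


def replication_service_bugchecks(action: Action, event_log_writable: bool) -> bool:
    """The replication service can only react to a failure event that was actually logged."""
    return action is Action.LOG_FAILURE_EVENT and event_log_writable


for seconds in (30, 90, 300):
    action = classify_outstanding_io(seconds)
    print(seconds, action.value, replication_service_bugchecks(action, True))
```

    Note that in this model a failure event that cannot be written never triggers the bugcheck, which is the gap that a separate check on the event log itself has to cover.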

    In many of the hung IO incidents we have seen, the entire stack has been affected by the hang, making it impossible to write failure events to the crimson channel or any other area of the event log.  So ESE also monitors the crimson channel by verifying that the event log can be written to. If writing to the event log fails for a long period of time, MSExchangeRepl will intentionally cause a bugcheck of Windows by terminating wininit.exe. When this condition occurs, obviously the system is unable to write any ESE events to the event log.

    When the bugcheck does occur, it will always be as follows:

    CRITICAL_OBJECT_TERMINATION (f4)
    A process or thread crucial to system operation has unexpectedly exited or been terminated.

    NOTE: the presence of this bugcheck does not necessarily mean Exchange was the cause.  Any termination of wininit.exe, including one performed by an administrator using Task Manager or some other task management tool, will cause this bugcheck error code.

    Conclusion

    The hung IO detection feature in Exchange 2010 is designed to make recovery from hung IO or a hung controller fast, rather than retrying or waiting until the storage stack raises an error that causes failover.  It’s a great addition to the set of high availability features built into Exchange 2010.

  • UDP Notification is coming to Exchange 2010

    We just made another cool announcement this week about UDP notification and Exchange 2010.  If you have clients running Outlook 2003 against Exchange 2010, check out this post.