• Exchange 2010 SP1: StartDagServerMaintenance.ps1 fails on databases that have only two database copies.

    In Exchange 2010 Service Pack 1 we introduced some new DAG management scripts. These scripts can be found in the Exchange Server installation directory \ scripts. (This is usually c:\Program Files\Microsoft\Exchange Server\v14\scripts).

     

    One of the scripts introduced is the StartDagServerMaintenance.ps1 script. More information on this script can be found at:

    http://technet.microsoft.com/en-us/library/ff625233.aspx

    http://technet.microsoft.com/en-us/library/dd298065.aspx

     

    When administrators utilize this script the following actions are being taken:

    1) All database copies are moved to another server in the DAG based on the selection of the next best copy.

    2) If the cluster core resources are owned on the node the resources are arbitrated to a different DAG member (thereby moving the Primary Active Manager functionality to another node).

    3) The DatabaseCopyAutoActivationPolicy property of the mailbox server is set to a value of BLOCKED thereby preventing the DAG member from receiving or activating database copies.

    4) The individual database copies hosted on the DAG member are activation suspended.

    5) The node is paused within the cluster service preventing the cluster core resources from arbitrating to the node (and thereby preventing the node from becoming the Primary Active Manager).

     

    When an administrator attempts to place a DAG member into maintenance mode and the DAG member hosts an ACTIVE database that has only two copies the following occurs:

    1)  The database copy is moved to the other node hosting the passive copy (provided that copy is healthy).

    2)  The command fails with the following error after the database is moved.  (In this example the mounted copy is on server DAG-4).

     

    *Pre StartDagServerMaintenance*

    Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                                  Length    Length                             State
    ----                                          ------          --------- ----------- --------------------   ------------

    TESTSCRIPT\DAG-4                              Mounted         0         0                                  Healthy

    TESTSCRIPT\DAG-3                              Healthy         0         0           7/25/2011 10:17:30 AM  Healthy

    *StartDagServerMaintenance*

     

    [PS] C:\Program Files\Microsoft\Exchange Server\V14\Scripts>.\StartDagServerMaintenance.ps1 DAG-4
    The following objects are hosted by 'DAG-4', before attempting to move them off: `n(Database='TESTSCRIPT', Reason='Copy is active'))
    Write-Error : The following objects are still hosted by 'DAG-4', even after attempting to move them off: `n(Database='TESTSCRIPT', Reason='Copy is critical for redundancy according to Red Alert script'))
    At C:\Program Files\Microsoft\Exchange Server\V14\Scripts\StartDagServerMaintenance.ps1:216 char:16
    +                 write-error <<<<  ($StartDagServerMaintenance_LocalizedStrings.res_0014 -f ( PrintCriticalMailboxResourcesOutput($criticalMailboxResources)),$shortServerName) -erroraction:stop
        + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
        + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Microsoft.PowerShell.Commands.WriteErrorCommand

    *Post StartDagServerMaintenance*

     

    Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                                  Length    Length                             State
    ----                                          ------          --------- ----------- --------------------   ------------
    TESTSCRIPT\DAG-3                              Mounted         0         0                                  Healthy
    TESTSCRIPT\DAG-4                              Healthy         0         0           7/25/2011 10:33:57 AM  Healthy

    When an administrator attempts to place a DAG member into maintenance mode and the DAG member hosts a PASSIVE database that has only two copies the following occurs:

    1) The command fails with the following error.  (In this example the passive copy is on server DAG-4.)

     

    *Pre StartDagServerMaintenance*

     

    Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                                  Length    Length                             State
    ----                                          ------          --------- ----------- --------------------   ------------
    TESTSCRIPT\DAG-3                              Mounted         0         0                                  Healthy
    TESTSCRIPT\DAG-4                              Healthy         0         0           7/25/2011 10:33:57 AM  Healthy

     

    *StartDagServerMaintenance*

     

    [PS] C:\Program Files\Microsoft\Exchange Server\V14\Scripts>.\StartDagServerMaintenance.ps1 DAG-4
    The following objects are hosted by 'DAG-4', before attempting to move them off: `n(Database='TESTSCRIPT', Reason='Copy is active'))
    Write-Error : The following objects are still hosted by 'DAG-4', even after attempting to move them off: `n(Database='TESTSCRIPT', Reason='Copy is critical for redundancy according to Red Alert script'))
    At C:\Program Files\Microsoft\Exchange Server\V14\Scripts\StartDagServerMaintenance.ps1:216 char:16
    + write-error <<<< ($StartDagServerMaintenance_LocalizedStrings.res_0014 -f ( PrintCriticalMailboxResourcesOutput($criticalMailboxResources)),$shortServerName) -erroraction:stop
    + CategoryInfo : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Microsoft.PowerShell.Commands.WriteErrorCommand

     

    *Post StartDagServerMaintenance*

     

    Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                                  Length    Length                             State
    ----                                          ------          --------- ----------- --------------------   ------------
    TESTSCRIPT\DAG-3                              Mounted         0         0                                  Healthy
    TESTSCRIPT\DAG-4                              Healthy         0         0           7/25/2011 10:33:57 AM  Healthy

    Administrators can find manual maintenance mode instructions available in the following blog post:

    http://blogs.technet.com/b/timmcmic/archive/2011/07/25/exchange-2010-sp1-startdagservermaintenance-ps1-fails-when-a-server-contains-databases-with-a-single-copy.aspx

     

    After completing the manual instructions and when maintenance mode is no longer needed the administrator may utilize the StopDagServerMaintenance.ps1 script to revert the manual changes.
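
    Before running the script it can help to know which databases have only two copies.  The following is a minimal sketch, not part of the shipped scripts: Get-LowCopyCountDatabase is a hypothetical helper name, and it assumes each input object has a Name property in the form Database\Server, as in the Get-MailboxDatabaseCopyStatus output shown above.  It only counts copies; it does not reproduce the redundancy check the script itself performs.

    ```powershell
    # Hypothetical helper: group copy-status rows by database name and
    # return the databases that have $MaxCopies or fewer copies.
    # Assumes Name looks like 'TESTSCRIPT\DAG-4' (Database\Server).
    function Get-LowCopyCountDatabase {
        param(
            [object[]]$CopyStatus,
            [int]$MaxCopies = 2
        )
        $CopyStatus |
            Group-Object { ($_.Name -split '\\')[0] } |
            Where-Object { $_.Count -le $MaxCopies } |
            ForEach-Object {
                New-Object PSObject -Property @{
                    Database = $_.Name    # group name = database name
                    Copies   = $_.Count   # number of copies found
                }
            }
    }
    ```

    A possible invocation from the Exchange Management Shell would be Get-LowCopyCountDatabase -CopyStatus (Get-MailboxDatabaseCopyStatus *).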

  • Exchange 2010: Log Truncation and Checkpoint At Log Creation in a Database Availability Group

    In previous versions of Exchange, when a backup was completed, almost all log files prior to the current log file were truncated from the system.  Administrators monitoring the log directory would see many logs before the backup and only a few afterwards.  In Exchange 2010 Service Pack 1 and later, administrators note that many log files remain on disk after a backup, or that it appears no log files have truncated at all.  In many cases this leads to a belief that logs are not truncating successfully or that there is an issue with backups.

     

    Why do we see logs remaining on disk for longer?  Exchange 2010 SP1 introduced a change in the behavior of log truncation.  The change was made to ensure that replicated copies of databases within a database availability group always have the appropriate log files on the source server to complete an incremental resynchronization.

     

    The change to log truncation is the tracking of Checkpoint At Log Creation.  Remember that in a database availability group we can expect the checkpoint to be approximately 100 logs (or slightly more) off the current log file – this is known as checkpoint depth.  As Exchange creates each new log file we stamp into its header the log file the checkpoint was pointing at when that log was created.  For example, let us say that log file 0xA679 (42617) was just created as the current ENN.log.  We can expect that the checkpoint at log creation value stamped within the header of this log file would be approximately 0xA615 (42517).  You can see the checkpoint at log creation value by using eseutil /ml <logfilename> to dump the header of a log file.

     

    [PS] P:\DAG\DAG-DB0\DAG-DB0-Logs>eseutil /ml .\E020000A67E.log

    Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
    Version 14.02
    Copyright (C) Microsoft Corporation. All Rights Reserved.

    Initiating FILE DUMP mode...

          Base name: E02
          Log file: .\E020000A67E.log
          lGeneration: 42622 (0xA67E)
          Checkpoint: (0xA679,8,0)
          creation time: 03/11/2012 06:00:48
          prev gen time: 03/11/2012 04:01:17
          Format LGVersion: (7.3704.16.2)
          Engine LGVersion: (7.3704.16.2)
          Signature: Create time:05/02/2010 18:04:08 Rand:399094376 Computer:
          Env SystemPath: d:\DAG\DAG-DB0\DAG-DB0-Logs\
          Env LogFilePath: d:\DAG\DAG-DB0\DAG-DB0-Logs\
          Env Log Sec size: 512 (matches)
          Env (CircLog,Session,Opentbl,VerPage,Cursors,LogBufs,LogFile,Buffers)
              (    off,   1027,  51350,  16384,  51350,   2048,   2048,  29487)
          Using Reserved Log File: false
          Circular Logging Flag (current file): off
          Circular Logging Flag (past files): off
          Checkpoint at log creation time: (0xA679,8,0)
          1 d:\DAG\DAG-DB0\DAG-DB0-Database\DAG-DB0.edb
                     dbtime: 18078306 (0-18078306)
                     objidLast: 2957
                     Signature: Create time:05/02/2010 18:04:08 Rand:399127765 Computer:
                     MaxDbSize: 0 pages
                     Last Attach: (0xA348,9,86)
                     Last Consistent: (0xA346,9,B5)

          Last Lgpos: (0xa67e,252,0)

    Number of database page references:  770

    Integrity check passed for log file: .\E020000A67E.log


    Operation completed successfully in 0.265 seconds.

     

    In the previous example the checkpoint at log creation is 0xA679.
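
    If you need to pull this value out of many log headers, the line can be parsed rather than read by eye.  The following is a rough sketch only: Get-CheckpointAtLogCreation is a hypothetical helper name, and it assumes the header line is formatted exactly as in the dump above.

    ```powershell
    # Hypothetical helper: find the 'Checkpoint at log creation time' line in
    # eseutil /ml output and return the log generation in hex and decimal.
    function Get-CheckpointAtLogCreation {
        param([string[]]$HeaderLines)
        foreach ($line in $HeaderLines) {
            # Capture the first (generation) field of the (0xGEN,sector,offset) triple.
            if ($line -match 'Checkpoint at log creation time:\s*\(0x([0-9A-Fa-f]+),') {
                $hex = $Matches[1]
                return New-Object PSObject -Property @{
                    Hex     = '0x' + $hex.ToUpper()
                    Decimal = [Convert]::ToInt64($hex, 16)
                }
            }
        }
    }
    ```

    It could be fed directly from the utility, for example Get-CheckpointAtLogCreation -HeaderLines (eseutil /ml .\E020000A67E.log), which for the dump above would report 0xA679 (42617).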

     

    Within a DAG all servers that contain a replicated copy of a database report the maximum log file that is eligible for truncation.  These values are reported to the active node which subsequently calculates the maximum log file for truncation.  In simplest terms the following process occurs:

     

    • Passive copy on Node-2 reports OK to truncate log 0xA679 (42617).
    • Passive copy on Node-3 reports OK to truncate log 0xA678 (42616).
    • Passive copy on Node-4 reports OK to truncate log 0xA679 (42617).
    • The active node determines that the best log file eligible for truncation based on the passive copies is 0xA678 (42616).  [This is essentially the minimum of all reported OK logs to truncate.]
    • The active node then looks at the checkpoint at log creation of 0xA678 (42616) and determines that value is 0xA614 (42516).  In this example that would be 100 logs off the best log reported for truncation of the passive copies.
    • The active node sets the truncation point to be log 0xA614 (42516).
    • Therefore, after a successful backup, logs prior to 0xA614 (42516) would truncate.
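
    The arithmetic in the list above can be sketched as follows.  This is an illustration only, not how the replication service is implemented: Get-TruncationPoint is a hypothetical name, and instead of reading the checkpoint at log creation from the log header it simply assumes a fixed checkpoint depth of 100 logs.

    ```powershell
    # Hypothetical helper: take the lowest generation reported by the passive
    # copies, then step back by an ASSUMED checkpoint depth (default 100).
    function Get-TruncationPoint {
        param(
            [long[]]$PassiveReportedGenerations,
            [int]$CheckpointDepth = 100
        )
        # Minimum of all "OK to truncate" values reported by the passive copies.
        $lowest = [long]($PassiveReportedGenerations | Measure-Object -Minimum).Minimum
        $truncation = $lowest - $CheckpointDepth
        New-Object PSObject -Property @{
            LowestReported    = ('0x{0:X}' -f $lowest)
            TruncationPoint   = ('0x{0:X}' -f $truncation)
            TruncationDecimal = $truncation
        }
    }
    ```

    With the three values from the example, Get-TruncationPoint -PassiveReportedGenerations 0xA679,0xA678,0xA679 gives a lowest reported log of 0xA678 and a truncation point of 0xA614 (42516).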

     

    This essentially means that 100 additional logs that would have previously truncated prior to this change do not truncate.

     

    Taking checkpoint at log creation into account, administrators can better understand how log files are truncated and why log files that would have been truncated in prior versions remain on disk after a backup.

     

    ============================

    Update 5/16/2012

    Corrected hex conversions in example.

    ============================

  • MSExchangeRepl 2147 / MSExchangeRepl 2104 / MSExchangeRepl 2127 occurring on Windows 2008 or Windows 2008 R2 with Exchange 2007 Cluster Continuous Replication (CCR)

    When Exchange 2007 CCR is installed on Windows 2008 or Windows 2008 R2 the following error may be noted in the application log of the passive node:

    Log Name: Application
    Source: MSExchangeRepl
    Event ID: 2104
    Task Category: Service
    Level: Error
    Keywords: Classic
    User: N/A
    Computer: MACHINE
    Description:
    Log file action LogCopy failed for storage group EXCLUST01\SG2. Reason:
    CreateFile(
    \\Server\StorageGroupGUID$\LogFile.log) = 2

    If the CCR cluster is not utilizing continuous replication host names the following event series may also be noted:

    Event ID : 2147
    Raw Event ID : 2147
    Source : MSExchangeRepl
    Type : Error
    Machine : SERVER
    Message : There was a problem with 'ActiveNode', which is an alternate name for 'ActiveNode'. The list of aliases is now 'ActiveNode', and the alias 'was' removed from the list. The specific problem is 'CreateFile(
    \\ActiveNode\StorageGroupGuid$\LogFile.log) = 2'.

    ID:       2127
    Level:    Information
    Provider: MSExchangeRepl
    Machine:  SERVER
    Message:  The system has detected a change in the available replication networks.  The system is now using network 'ActiveNode' instead of network 'ActiveNode' for log copying from node ActiveNode.

    In this situation, if the solution is aggressively monitored, you may note that replication is temporarily failed and then resumes automatically as healthy.  This occurs due to a temporary pause in replication when the error condition is detected, while the replication service attempts to find other replication paths, and then automatically re-attempts the same copy operation.

    If the CCR cluster is utilizing continuous replication host names the following event series may also be noted:

    Event ID : 2147
    Raw Event ID : 2147
    Source : MSExchangeRepl
    Type : Error
    Machine : SERVER
    Message : There was a problem with ‘ReplicationHostName’, which is an alternate name for 'ActiveNode'. The list of aliases is now 'ActiveNode', and the alias 'was' removed from the list. The specific problem is 'CreateFile(
    \\ReplicationHostName\StorageGroupGUID$\LogFile.log) = 2'.

    ID:       2127
    Level:    Information
    Provider: MSExchangeRepl
    Machine:  SERVER
    Message:  The system has detected a change in the available replication networks.  The system is now using network 'ActiveNode' instead of network ‘ReplicationHostName’ for log copying from node ActiveNode.

    Error 2 is ERROR_FILE_NOT_FOUND

    In this situation the error is detected on the replication host name.  The replication service will temporarily pause replication while other network paths are enumerated.  If other continuous replication host names are in use, the replication service will select an alternate replication host name and automatically resume log copying.  If the only valid path is the “public” path, the replication service will begin copying log files over the “public” network.  Eventually this error occurs on the public network, forcing network re-enumeration to occur and replication to automatically switch back to the replication network.  If the solution is aggressively monitored, the replication status may be failed during this switch but will automatically resume healthy.

    In almost all instances these errors are considered benign to the operation of the Exchange Server.

    The replication service is extremely aggressive in its attempts to copy log files.  The replication service is always aware of the next log file in the series that requires copying to the passive node.  As part of normal processes the replication service may query multiple times for the presence of this file and make copy attempts.  These attempts may result in the replication service querying for a log file that is not fully available.  Under Windows 2003 this was not necessarily an issue.  Windows 2008 introduces a component into SMBv2 that may cause this to be a problem.

    SMBv2 introduces status caching into the LanManWorkstation service.  When an application requests information from a file share, the workstation service caches the response from the server hosting the share.  Subsequent requests for the same information are returned from cache rather than re-contacting the server hosting the share.  Eventually this cache will expire (in our case it expires by the time replication is failed / resumed <or> a switch between replication host names occurs).  The replication service has received feedback that the log file in question should now be available for copy, attempts to copy it, and receives an older cached return status that the file is not ready (even though the file does exist on the source at the time the attempt is made).  In turn the replication service detects this as an error condition and takes action.

    From a Windows 2008 / Windows 2008 R2 perspective this is by design.

    To correct these errors on an Exchange 2007 / Windows 2008 <or> Exchange 2007 / Windows 2008 R2 implementation, the following registry values should be set to zero (0) and the nodes rebooted:

    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanworkstation\Parameters

    FileInfoCacheLifetime [DWORD]

    FileNotFoundCacheLifetime [DWORD]

    DirectoryCacheLifetime [DWORD]

    If the DWORDs are not present they may need to be created.  The recommended value is HEX / DEC 0.
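
    For administrators who prefer to script the change, here is a minimal sketch using the standard PowerShell registry provider.  Set-SmbCacheLifetimeToZero and the two variables are hypothetical names used for illustration; the sketch assumes it is run from an elevated session on each node, and it defines the function without calling it.

    ```powershell
    # Registry path and value names as listed in this post.
    $lanmanParameters = 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters'
    $cacheValueNames  = 'FileInfoCacheLifetime', 'FileNotFoundCacheLifetime', 'DirectoryCacheLifetime'

    # Hypothetical helper: create each DWORD if it is missing, or overwrite it
    # if it already exists, setting the value to 0.
    function Set-SmbCacheLifetimeToZero {
        param([string]$Path = $lanmanParameters)
        foreach ($name in $cacheValueNames) {
            # -Force lets New-ItemProperty replace an existing value.
            New-ItemProperty -Path $Path -Name $name -PropertyType DWord -Value 0 -Force | Out-Null
        }
    }
    ```

    After reviewing the path, calling Set-SmbCacheLifetimeToZero on each node and then rebooting would apply the change described above.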

    More information on these values can be found here: http://technet.microsoft.com/en-us/library/ff686200(WS.10).aspx  (Note that the registry path in the article is missing the SERVICES key; the path shown in this blog post is correct.)

  • Exchange 2010 SP1: StartDagServerMaintenance.ps1 fails when a server contains databases with a single copy.

    In Exchange 2010 Service Pack 1 we introduced some new DAG management scripts.  These scripts can be found in the Exchange Server installation directory \ scripts.  (This is usually c:\Program Files\Microsoft\Exchange Server\v14\scripts).

     

    One of the scripts introduced is the StartDagServerMaintenance.ps1 script.  More information on this script can be found at:

     

    http://technet.microsoft.com/en-us/library/ff625233.aspx

    http://technet.microsoft.com/en-us/library/dd298065.aspx

     

    When administrators utilize this script the following actions are being taken:

    1)  All database copies are moved to another server in the DAG based on the selection of the next best copy.

    2)  If the cluster core resources are owned on the node the resources are arbitrated to a different DAG member (thereby moving the Primary Active Manager functionality to another node).

    3)  The DatabaseCopyAutoActivationPolicy property of the mailbox server is set to a value of BLOCKED thereby preventing the DAG member from receiving or activating database copies.

    4)  The individual database copies hosted on the DAG member are activation suspended.

    5)  The node is paused within the cluster service preventing the cluster core resources from arbitrating to the node (and thereby preventing the node from becoming the Primary Active Manager).

     

    When utilizing a DAG it is not necessary to replicate all databases that exist on DAG members.  It is not uncommon to have standalone databases (databases that are on a DAG member but not replicated to another member) present on a member where the StartDagServerMaintenance.ps1 script will be utilized.  Unfortunately when utilizing the script in its current form in this configuration the script fails to complete its tasks and cannot completely put the node into maintenance mode.   (Only databases are successfully moved off the member).

     

    The administrator may note the following when executing the script on a member that contains a single database copy:

     

    [PS] C:\Program Files\Microsoft\Exchange Server\V14\Scripts>.\StartDagServerMaintenance.ps1 -serverName DAG-1


    The following objects are hosted by 'DAG-1', before attempting to move them off: `n(Primary Active Manager=DAG-1) (Mailbox='Discovery Search Mailbox', Reason='Mailbox is hosted on 'DAG-1-DB0', which is not a replicated database. ) (Mailbox='Journal Internal', Reason='Mailbox is hosted on 'DAG-1-DB0', which is not a replicated database. ) (Mailbox='MicrosoftExchange Approval Assistant', Reason='Arbitration Mailbox is hosted on 'DAG-1-DB0', which is not a replicated database.) (Database='DAG-DB0', Reason='Copy is active'))

    Write-Error : The following objects are still hosted by 'DAG-1', even after attempting to move them off: `n(Mailbox='Discovery Search Mailbox', Reason='Mailbox is hosted on 'DAG-1-DB0', which is not a replicated database. ) (Mailbox='Journal Internal', Reason='Mailbox is hosted on 'DAG-1-DB0', which is not a replicated database. ) (Mailbox='Microsoft Exchange Approval Assistant', Reason='Arbitration Mailbox is hosted on 'DAG-1-DB0', which is not a replicated database. ))
    At C:\Program Files\Microsoft\Exchange Server\V14\Scripts\StartDagServerMaintenance.ps1:216 char:16
    +                 write-error <<<<  ($StartDagServerMaintenance_LocalizedStrings.res_0014 -f ( PrintCriticalMailboxResourcesOutput($criticalMailboxResources)),$shortServerName) -erroraction:stop
        + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
        + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Microsoft.PowerShell.Commands.WriteErrorCommand

     

    If an administrator encounters this condition the following process can be utilized to place the DAG member into maintenance mode.  (In our example server DAG-1 in the DAG named “DAG” [pretty creative eh?] is the server we will be placing in maintenance mode)

     

    1)  Execute a get-mailboxdatabasecopystatus * and verify that at least one other non-lagged copy of each replicated database is healthy.

     

    [PS] C:\>Get-MailboxDatabaseCopyStatus *

    Name                                          Status          CopyQueue ReplayQueue LastInspectedLogTime   ContentIndex
                                                                  Length    Length                             State
    ----                                          ------          --------- ----------- --------------------   ------------
    DAG-1-DB0\DAG-1                               Mounted         0         0                                  Healthy
    DAG-DB0\DAG-1                                 Mounted         0         0                                  Healthy
    DAG-DB1\DAG-1                                 Healthy         0         0           7/13/2011 8:22:55 AM   Healthy
    DAG-2-DB0\DAG-2                               Mounted         0         0                                  Healthy
    DAG-DB1\DAG-2                                 Mounted         0         0                                  Healthy
    DAG-DB0\DAG-2                                 Healthy         0         4           7/13/2011 8:48:34 AM   Healthy
    DAG-DB0\DAG-3                                 Healthy         0         147         7/13/2011 8:48:34 AM   Healthy
    DAG-DB1\DAG-3                                 Healthy         0         140         7/13/2011 8:22:55 AM   Healthy
    DAG-DB0\DAG-4                                 Healthy         0         409         7/13/2011 8:48:34 AM   Healthy
    DAG-DB1\DAG-4                                 Healthy         0         307         7/13/2011 8:22:55 AM   Healthy
    MBX-1-DB0\MBX-1                               Mounted         0         0                                  Healthy
    MBX-1-RDB\MBX-1                               Mounted         0         0                                  Healthy

    2)  Execute a move of all active database copies off the server.  This can be done with the command move-activemailboxdatabase –server <MaintenanceServer>  (Note:  No target server is specified which means the next best copy will be automatically selected for activation)

     

    [PS] C:\>Move-ActiveMailboxDatabase -Server DAG-1 -Confirm:$FALSE

    Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                    tart            nd                                            Objective     AtMoveStart AtMoveEnd
    --------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
    DAG-1-DB1       dag-1           dag-1           Warning                                     Mounted     Mounted
    DAG-1-DB0       dag-1           dag-1           Warning                                     Mounted     Mounted
    DAG-DB0         dag-1           dag-2           Succeeded  0                  7/13/2011 8:5 Mounted     Mounted
                                                                                  3:24 AM
    WARNING: An Active Manager operation failed. Error: The database action failed. Error: You cannot perform a switchover
    operation on database 'DAG-1-DB1' because the database is not configured for replication.. [Database: DAG-1-DB1,
    Server: DAG-1.domain.com]
    WARNING: An Active Manager operation failed. Error: The database action failed. Error: You cannot perform a switchover
    operation on database 'DAG-1-DB0' because the database is not configured for replication.. [Database: DAG-1-DB0,
    Server: DAG-1.domain.com]

    3)  Move the cluster core resources to another node within the DAG.  This can be accomplished using the command cluster.exe <DAGFQDN> group “Cluster Group” /moveto:<NODE>

     

    [PS] C:\>cluster DAG.domain.com group "Cluster Group" /moveto:DAG-2

    Moving resource group 'Cluster Group'...

    Group                Node            Status
    -------------------- --------------- ------
    Cluster Group        DAG-2           Online

    4) Pause the node within the cluster.  This can be done utilizing the command cluster.exe <DAGFQDN> node <NODENAME> /pause

     

    [PS] C:\>cluster DAG.domain.com node DAG-1 /pause

    Pausing node 'DAG-1'...

    Node           Node ID Status
    -------------- ------- ---------------------
    DAG-1                1 Paused

    5) Set the DatabaseCopyAutoActivationPolicy of the server to BLOCKED.  This can be done using the command set-mailboxserver –identity <DAGMember> –databasecopyautoactivationpolicy:BLOCKED

     

    [PS] C:\>Set-MailboxServer -Identity DAG-1 -DatabaseCopyAutoActivationPolicy:BLOCKED

    6) Suspend all individual copies for activation.  This can be done using the command get-mailboxdatabasecopystatus *\<DAGMember> | suspend-mailboxdatabasecopy –activationOnly:$TRUE

     

    [PS] C:\>Get-MailboxDatabaseCopyStatus *\DAG-1 | Suspend-MailboxDatabaseCopy -ActivationOnly:$TRUE
    Database "DAG-1-DB0\DAG-1" has only one copy. This task is supported only for databases that have more than one copy.
        + CategoryInfo          : InvalidOperation: (DAG-1-DB0:ADObjectId) [Suspend-MailboxDatabaseCopy], InvalidOperation
       Exception
        + FullyQualifiedErrorId : 7325D1AB,Microsoft.Exchange.Management.SystemConfigurationTasks.SuspendDatabaseCopy


    Confirm
    Are you sure you want to perform this action?
    Suspending activation of mailbox database copy "DAG-DB0" on server "DAG-1".
    [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): a

     

    At this time it should be safe for the administrator to perform DAG server maintenance.  When the maintenance is complete the script StopDagServerMaintenance.ps1 can be utilized to take the DAG member out of maintenance mode.
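
    Steps 2 through 6 above can be collected into a single function.  This is only a sketch of the manual process using the same commands shown in the steps; Start-ManualDagMaintenance and its parameter names are hypothetical, it does not perform the health verification in step 1, and the warnings and errors for non-replicated databases shown above should still be expected.

    ```powershell
    # Hypothetical wrapper around the manual steps in this post.
    # Verify copy health (step 1) yourself before calling it.
    function Start-ManualDagMaintenance {
        param(
            [string]$Server,      # DAG member going into maintenance, e.g. DAG-1
            [string]$DagFqdn,     # DAG name, e.g. DAG.domain.com
            [string]$TargetNode   # node that should take the cluster core resources
        )
        # Step 2: move active copies off the server (next best copy is selected).
        Move-ActiveMailboxDatabase -Server $Server -Confirm:$false
        # Step 3: move the cluster core resources to another node.
        cluster.exe $DagFqdn group "Cluster Group" "/moveto:$TargetNode"
        # Step 4: pause the node within the cluster.
        cluster.exe $DagFqdn node $Server /pause
        # Step 5: block automatic activation on the server.
        Set-MailboxServer -Identity $Server -DatabaseCopyAutoActivationPolicy Blocked
        # Step 6: suspend the individual copies for activation only.
        Get-MailboxDatabaseCopyStatus "*\$Server" |
            Suspend-MailboxDatabaseCopy -ActivationOnly:$true -Confirm:$false
    }
    ```

    For the example in this post the call would be Start-ManualDagMaintenance -Server DAG-1 -DagFqdn DAG.domain.com -TargetNode DAG-2.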

  • Exchange 2010: The mystery of the 9223372036854775766 copy queue…

    Database Copy Self-Protection Mode

     

    What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? By the way, if you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator, and so big that if it represented actual transaction log files, it would require 8 yottabytes of storage, and sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.

     

    But fear not, as you don’t need to start collecting all of the storage on the planet to plan for a CQL this large, but it is a real and valid value for a passive copy’s CQL, and in recent weeks, I’ve worked with a few customers who have experienced this.

     

    Background

     

    During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on DAG members hosting an active database copy write two values to the cluster registry at HKLM\Cluster\ExchangeActiveManager\LastLog:

     

    • DatabaseGUID with a decimal value representing the last log generated by the active copy
    • DatabaseGUID_TIME with the system time of when that log file was generated

     

    Here is an example of these entries in the cluster registry:

     

    [Image: DatabaseGUID and DatabaseGUID_TIME entries under HKLM\Cluster\ExchangeActiveManager\LastLog]

     

    To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:

     

    [PS] C:\Windows\system32>Get-MailboxDatabase dag-db0 | fl name,*guid*


    Name : DAG-DB0
    Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698

    These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy) along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, as the CQL is used by Active Manager when evaluating whether or not to mount a database in response to a failover.

     

    A Larger-than-Life CQL

     

    In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.

     

    We use the DatabaseGUID_TIME entry for this purpose. If the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy differ by more than 12 minutes, the Microsoft Exchange Replication service on the server hosting the passive copy places the database copy into a self-protection mode. This is done by setting the CQL for that passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is equal to or less than the configured value for AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value for AutoDatabaseMountDial.
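
    The decision described above can be modeled in a few lines.  This is a simplified sketch of the logic as described in this post, not the replication service's actual code; the function names, parameter names and the mount dial threshold passed to Test-CanAutoMount are all assumptions made for illustration.

    ```powershell
    # Sentinel CQL value quoted in this post for self-protection mode.
    $SelfProtectionCql = [long]9223372036854775766

    # Hypothetical model: return the CQL a passive copy would report.
    function Get-EffectiveCopyQueueLength {
        param(
            [long]$LastLogGenerated,       # last log generated by the active copy
            [long]$LastLogCopied,          # last log replicated to the passive copy
            [datetime]$LastLogTime,        # the DatabaseGUID_TIME timestamp
            [datetime]$Now = (Get-Date),   # system time on the passive server
            [int]$MaxAgeMinutes = 12
        )
        # A stale timestamp means the last-log value cannot be trusted.
        if (($Now - $LastLogTime).TotalMinutes -gt $MaxAgeMinutes) {
            return $SelfProtectionCql
        }
        return [Math]::Max([long]0, $LastLogGenerated - $LastLogCopied)
    }

    # Hypothetical check: automatic mount is allowed only at or below the threshold.
    function Test-CanAutoMount {
        param([long]$CopyQueueLength, [long]$MountDialThreshold)
        $CopyQueueLength -le $MountDialThreshold
    }
    ```

    With a timestamp five minutes old and three logs outstanding the model returns a CQL of 3; with a timestamp thirteen minutes old it returns 9223372036854775766, which fails Test-CanAutoMount for any realistic threshold.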

     

    Where Did the Time Go?

     

    So why would a condition exist that causes the time discrepancy to be greater than 12 minutes in the first place? This can actually happen for a few reasons:

     

    • The Cluster service on the server hosting the active copy might be having a problem writing updates even though the node remains in cluster membership.
    • The Cluster service on the server hosting the passive copy might be having a problem receiving updates even though the node remains in cluster membership.
    • The Information Store service and Exchange Replication service could be stopped on the server hosting the active copy. (Remember that an “active” copy simply signifies which node owns the copy, not the actual mount / dismount state of the database.)
    • A datacenter switchover is being performed and more than 12 minutes have elapsed between the time when the failed DAG members were stopped and when the remote DAG members were activated.

     

    What to Do When CQL Reaches 9223372036854775766?

     

    To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover will need to be performed using the Exchange Management Shell, as the administrator will need to force the move by using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, it should be used only when you know that the copies to be activated were healthy prior to the copy going into self-protection mode.

     

    Attempted move without overrides:

     

    Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2

    Confirm
    Are you sure you want to perform this action?
    Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
    [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

    Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                    tart            nd                                            Objective     AtMoveStart AtMoveEnd
    --------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
    DAG-DB0         dag-1           dag-1           Failed                                      Dismounted  Dismounted
    An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy 'DAG-DB0' on server 'DAG-2.domain.com' has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn't mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.. [Database: DAG-DB0, Server: DAG-2.domain.com]
        + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
        + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase

     

    Successful move with overrides:

     

    Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

    Confirm
    Are you sure you want to perform this action?
    Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
    [Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

    Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                    tart            nd                                            Objective     AtMoveStart AtMoveEnd
    --------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
    DAG-DB0         dag-1           dag-2           Succeeded  922337203685472... 5/29/2012 ... Dismounted  Dismounted

     

    At this point, the active copy has moved to the remote server, and the database can be mounted there.
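
    A minimal sketch of that final step, using the same database name as the example above. Both cmdlets are standard Exchange Management Shell cmdlets, but the exact output and any prompts will depend on your environment.

```powershell
# Mount the database on its new active server (DAG-2 in this example).
Mount-Database -Identity DAG-DB0

# Review the copy status and queue lengths after the mount.
Get-MailboxDatabaseCopyStatus -Identity DAG-DB0 |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```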

    Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.
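
    To see how close these values sit to the Int64 ceiling, the arithmetic can be checked in any PowerShell session. The variable names below are illustrative. The two CQL values are the ones quoted in this article: the self-protection value and the value shown in the error output.

```powershell
# Largest value a signed 64-bit integer can hold.
$maxInt64 = [long]::MaxValue          # 9223372036854775807

# CQL values quoted in the article.
$selfProtectionCql = 9223372036854775766
$errorOutputCql    = 9223372036854725486

# Distance of each value from the Int64 maximum.
$gapSelfProtection = $maxInt64 - $selfProtectionCql   # 41
$gapErrorOutput    = $maxInt64 - $errorOutputCql      # 50321

$gapSelfProtection
$gapErrorOutput
```

    Both values are within a tiny fraction of the maximum, far above any CQL that a healthy copy would report.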