Tim McMichael

Navigating the world of high availability...and occasionally sticking my head in the cloud...

Exchange 2010: The mystery of the 9223372036854775766 copy queue…

Exchange 2010: The mystery of the 9223372036854775766 copy queue…

  • Comments 11
  • Likes

Database Copy Self-Protection Mode

 

What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? By the way, if you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator, and so big that if it represented actual transaction log files, it would require 8 yottabytes of storage, and sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.

 

But fear not, as you don’t need to start collecting all of the storage on the planet to plan for a CQL this large, but it is a real and valid value for a passive copy’s CQL, and in recent weeks, I’ve worked with a few customers who have experienced this.

 

Background

 

During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on DAG members hosting an active database copy write two values to the cluster registry at HKLM\Cluster\ExchangeActiveManager\LastLog:

 

  • DatabaseGUID with a decimal value representing the last log generated by the active copy
  • DatabaseGUID_TIME with the system time of when that log file was generated

 

Here is an example of these entries in the cluster registry:

 

image

 

To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:

 

[PS] C:\Windows\system32>Get-MailboxDatabase dag-db0 | fl name,*guid*


Name : DAG-DB0
Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698

These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy) along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, as the CQL is used by Active Manager to evaluating whether or not to mount a database in response to a failover.

 

A Larger-than-Life CQL

 

In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.

 

We use the DATABASEGUID_Time entry for this purpose. If the difference between the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy is off by more than 12 minutes, the Microsoft Exchange Replication service on the server hosting the passive copy places the database copy into a self-protection mode. This is done by setting the CQL for that passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is equal to or less than the configured value for AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value for AutoDatabaseMountDial.

 

Where Did the Time Go?

 

So why would a condition exist that causes the time discrepancy to be greater than 12 minutes in the first place? This can actually happen for a few reasons:

 

  • The Cluster service on the server hosting the active copy might be having a problem writing updates even though the node remains in cluster membership.
  • The Cluster service on the server hosting the passive copy might be having a problem receiving updates even though they remain in cluster membership.
  • The Information Store service and Exchange Replication service could be stopped on the server hosting the active copy. (Remember that a copy that is “active” simply signifies the node that owns the copy not the actual mount / dismount state of the database).
  • A datacenter switchover is being performed and more than 12 minutes have elapsed between the time when the failed DAG members were stopped and when the remote DAG members were activated.

 

What to Do When CQL Reaches 9223372036854775766?

 

To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover will need to be performed using the Exchange Management Shell, as the administrator will need to force the move by using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, it should be used only when you know that the copies to be activated were healthy prior to the copy going into self-protection mode.

 

Attempted move without overrides:

 

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
DAG-DB0         dag-1           dag-1           Failed                                      Dismounted  Dismounted
An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy 'DAG-DB0' on server 'DAG-2.domain.com' has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn't mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.. [Database: DAG-DB0, Server: DAG-2.domain.com]
    + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
    + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase

 

Successful move with overrides:

 

Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
DAG-DB0         dag-1           dag-2           Succeeded  922337203685472... 5/29/2012 ... Dismounted  Dismounted

 

At this point, the database can be mounted on the remote server after moving the active copy.

Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.

Comments
  • 50,321 delta from [int64]::maxvalue? where the heck did THAT come from? :-)

  • How long does it take to run the command for 60GB database? It's taking very long time.

  • Database size has nothing to do with this.

    Assuming a copy queue that was zero and a replay queue that was zero the command should complete fairly quickly.

    If there is any more of a replay queue then we end up waiting on log replay to finish before the database can be mounted.  In this case that time is dependent on the number of logs left to replay.

    TIMMCMIC

  • I have the same situation as described.  I have attempted to move the active DB several times with the above switches, but the shell freezes and will not move past "A remote procedure call has been issued to database...". I ran it verbose and it doesn't move past "Resolved current organization", which is right after I confirm the action.  I let it sit for about 8 hours with no activity in the shell or in the event viewer.  This is present on all DBs (28) in a 2 server DAG group, and DBs are mounted, healthy copy status, healthy content index and a 0 replay queue length.  We have half as active on 1 server, and half active on the other.  Our latency is less than 1MS as both servers are located on our property, a local network with a 100Mbit connection about 1 mile apart from each other.  Any ideas on how to proceed?

  • Great article and a good story.

    Just one thing -- the number is equivalient to 9 quadrillion, not 9 quintillion.

  • @James:

    Thanks for posting.  I'm glad you found the story good.

    If my research is correct a quintillion is a X followed by 18 zero values (or 1* 10 to the 18th power).

    For example:

    1 000 000 000 000 000 000

    If you match up the digit placement of a quintillion to the value mentioned above:

    9 223 372 036 854 775 766

    1 000 000 000 000 000 000

    That places the 9 in the positional equivalent of a quintillion.

    By contrast a quadrillion is X followed by 15 zeros.

    1 000 000 000 000 000

    The following references are what I utilized for this:

    en.wikipedia.org/.../Names_of_large_numbers

    math.about.com/.../zeros.htm

    If expanded I believe this would be 9 quintillion, 223 quadrillion, 372 trillion, 36 billion, 854 million, 775 thousand 766 hundred.

    I'm sure we'll become more familiar with quintillion as the national debt continues to grow.  Until then - hopefully this will suffice.

    TIMMCMIC

  • Hi, i have the same issue as Bill, any chance someone over at MS can help out?

    cheers

  • @Kev:

    If you are running into this issue where the databases are ONLINE and not during a datacenter switchover this is usually due to the cluster service not being able to take updates.

    When you issue the move, the PAM attempts to read the cluster registry as well as write to it - and is equally unsuccessful.  For the same reasons that the information store service cannot update the current log file.

    You most likely will not be able to move the database.

    In this case you'll end up killing the cluster service - as it will most likely not respond to restarting the service with service control manager.

    TIMMCMIC

  • Hi

    Thanks! On closer inspection it was the cluster services that had failed, the node that failed held the IP and Name resources too. when I killed the service it did essentially stop all clustering, which was not better than before, so I just ended up shutting them all down and bringing them up one ata time and that fixed it!

    cheers, this is good to know

    Kev

  • Great article. it just saved my hide with a service outage. Many thanks.

  • @Christian...

    Excellent - glad this was helpful.

    TIMMCMIC

Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment