What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? In case you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator. If it represented actual transaction log files, it would require roughly 8 yottabytes of storage; and sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.
But fear not: you don't need to start collecting all of the storage on the planet to plan for a CQL this large. It is, however, a real and valid value for a passive copy's CQL, and in recent weeks I've worked with a few customers who have experienced it.
During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on the DAG member hosting an active database copy write two values to the cluster registry at HKLM\Cluster\ExchangeActiveManager\LastLog: an entry named for the database's GUID, which records the generation number of the last log file created by the active copy, and a corresponding DATABASEGUID_Time entry, which records the time that log file was generated.
Here is an example of these entries in the cluster registry:
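In sketch form, with the value data shown as placeholders (the GUID is that of the example database below):

```
HKLM\Cluster\ExchangeActiveManager\LastLog
    2abcac37-1b5d-4b9c-8472-e33c65379698         <last log generation number>
    2abcac37-1b5d-4b9c-8472-e33c65379698_Time    <time the last log was generated>
```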
To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:
[PS] C:\Windows\system32>Get-MailboxDatabase dag-db0 | fl name,*guid*
Name : DAG-DB0 Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698
These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy), along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, as Active Manager uses the CQL to evaluate whether or not to mount a database in response to a failover.
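As a sketch of the arithmetic (in Python rather than Exchange's own code; the log generation numbers here are hypothetical), the passive copy's CQL is simply the gap between the two generation numbers:

```python
def copy_queue_length(active_last_log_generated: int,
                      passive_last_log_replicated: int) -> int:
    """CQL = number of log files the active copy has generated that the
    passive copy has not yet received (never negative)."""
    return max(0, active_last_log_generated - passive_last_log_replicated)

# Hypothetical values: the active copy has generated log generation 8807,
# and the passive copy has replicated through generation 8804.
print(copy_queue_length(8807, 8804))  # -> 3
```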
In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.
We use the DATABASEGUID_Time entry for this purpose. If the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy differ by more than 12 minutes, the Microsoft Exchange Replication service on the server hosting the passive copy places the database copy into a self-protection mode. It does this by setting the CQL for that passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is less than or equal to the configured value for AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value for AutoDatabaseMountDial.
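The self-protection behavior can be sketched as follows. This is an illustration in Python, not actual Exchange code; the 12-minute threshold and the sentinel value come from the text above, while the function names and the example mount dial value are assumptions for the sketch:

```python
from datetime import datetime, timedelta

SELF_PROTECT_CQL = 9223372036854775766        # sentinel CQL from the article
STALENESS_THRESHOLD = timedelta(minutes=12)   # allowed clock/timestamp skew

def effective_cql(raw_cql: int,
                  last_log_time: datetime,
                  passive_system_time: datetime) -> int:
    """If the active copy's last-log timestamp in the cluster registry is
    stale relative to the passive server's clock, report the sentinel CQL."""
    if abs(passive_system_time - last_log_time) > STALENESS_THRESHOLD:
        return SELF_PROTECT_CQL
    return raw_cql

def can_auto_mount(cql: int, auto_database_mount_dial: int) -> bool:
    # A passive copy may be automatically mounted only when its CQL is
    # less than or equal to AutoDatabaseMountDial.
    return cql <= auto_database_mount_dial

now = datetime(2012, 5, 29, 12, 0, 0)
stale_cql = effective_cql(3, now - timedelta(minutes=20), now)
print(stale_cql)                      # -> 9223372036854775766
print(can_auto_mount(stale_cql, 12))  # -> False, regardless of the dial setting
```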
So why would the time discrepancy exceed 12 minutes in the first place? This can actually happen for a few reasons:
To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover will need to be performed using the Exchange Management Shell, as the administrator will need to force the move by using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, it should be used only when you know that the copy to be activated was healthy prior to going into self-protection mode.
Attempted move without overrides:
Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2
Confirm
Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "DAG-1.domain.com" to server "DAG-2.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y
Identity  ActiveServerAtStart  ActiveServerAtEnd  Status  NumberOfLogsLost  RecoveryPointObjective  MountStatusAtMoveStart  MountStatusAtMoveEnd
--------  -------------------  -----------------  ------  ----------------  ----------------------  ----------------------  --------------------
DAG-DB0   dag-1                dag-1              Failed                                            Dismounted              Dismounted

An Active Manager operation failed. Error: The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy 'DAG-DB0' on server 'DAG-2.domain.com' has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn't mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database. [Database: DAG-DB0, Server: DAG-2.domain.com]
    + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
    + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase
Successful move with overrides:
Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT
Identity  ActiveServerAtStart  ActiveServerAtEnd  Status     NumberOfLogsLost    RecoveryPointObjective  MountStatusAtMoveStart  MountStatusAtMoveEnd
--------  -------------------  -----------------  ---------  ----------------    ----------------------  ----------------------  --------------------
DAG-DB0   dag-1                dag-2              Succeeded  922337203685472...  5/29/2012 ...           Dismounted              Dismounted
At this point, the active copy has been moved, and the database can be mounted on the new server (using the Mount-Database cmdlet if it does not mount automatically).
Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.
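For the curious, the arithmetic works out as follows (a quick check in Python; in the Exchange Management Shell you could do the same with [int64]::MaxValue). The two constants are the sentinel value from the text above and the copy queue length reported in the failed-move error earlier:

```python
MAX_INT64 = 2**63 - 1           # 9223372036854775807, i.e. [int64]::MaxValue

SENTINEL = 9223372036854775766  # the self-protection CQL written by the replication service
REPORTED = 9223372036854725486  # the CQL shown in the error message above

print(MAX_INT64 - SENTINEL)     # -> 41
print(MAX_INT64 - REPORTED)     # -> 50321
print(SENTINEL - REPORTED)      # -> 50280
```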
50,321 delta from [int64]::maxvalue? where the heck did THAT come from? :-)
How long does it take to run the command for a 60 GB database? It's taking a very long time.
Database size has nothing to do with this.
Assuming a copy queue of zero and a replay queue of zero, the command should complete fairly quickly.
If there is a replay queue, we end up waiting for log replay to finish before the database can be mounted. In that case, the time required depends on the number of logs left to replay.
I have the same situation as described. I have attempted to move the active DB several times with the above switches, but the shell freezes and will not move past "A remote procedure call has been issued to database...". I ran it verbose, and it doesn't move past "Resolved current organization", which is right after I confirm the action. I let it sit for about 8 hours with no activity in the shell or in the event viewer. This is present on all 28 DBs in a 2-server DAG, and the DBs are mounted, with healthy copy status, healthy content index, and a replay queue length of 0. Half are active on one server and half on the other. Our latency is less than 1 ms, as both servers are located on our property, on a local network with a 100 Mbit connection, about 1 mile apart. Any ideas on how to proceed?
Great article and a good story.
Just one thing -- the number is equivalent to 9 quadrillion, not 9 quintillion.
Thanks for posting. I'm glad you found the story good.
If my research is correct, a quintillion is a 1 followed by 18 zeros (or 1 x 10 to the 18th power).
1 000 000 000 000 000 000
If you match up the digit placement of a quintillion to the value mentioned above:
9 223 372 036 854 775 766
That places the 9 in the positional equivalent of a quintillion.
By contrast, a quadrillion is a 1 followed by 15 zeros.
1 000 000 000 000 000
The following references are what I utilized for this:
If expanded, I believe this would be 9 quintillion, 223 quadrillion, 372 trillion, 36 billion, 854 million, 775 thousand, 766.
I'm sure we'll become more familiar with quintillion as the national debt continues to grow. Until then - hopefully this will suffice.
Hi, I have the same issue as Bill; any chance someone over at MS can help out?
If you are running into this issue while the databases are online, and not during a datacenter switchover, it is usually because the cluster service is not able to take updates.
When you issue the move, the PAM attempts to read from and write to the cluster registry, and it is equally unsuccessful, for the same reason the Information Store service cannot update the current log file entry.
You most likely will not be able to move the database.
In this case you'll end up killing the cluster service, as it will most likely not respond to a restart through Service Control Manager.
Thanks! On closer inspection, it was the cluster service that had failed, and the node that failed held the IP and Name resources too. When I killed the service, it essentially stopped all clustering, which was no better than before, so I ended up shutting them all down and bringing them up one at a time, and that fixed it!
cheers, this is good to know
Great article. it just saved my hide with a service outage. Many thanks.
Excellent - glad this was helpful.
Thank you very much, this worked wonderfully for me.
We encountered this issue following a Broadcom teaming bug (both adapters were marked as deactivated).
The cluster service restarted and MSExchangeIS crashed.
The problem was resolved by rebooting the affected servers.
My customer asked me whether this problem could also occur with Exchange 2013.
I answered yes, because I assume this protection mechanism also exists in Exchange 2013 DAGs.
Can you confirm?
Thanks for posting such an informative article!
Now I know why that quintillion number shows up :)
I think in my case, the cluster service is not taking the updates in a timely manner. One more query: if there are stale entries in the NETFT table, with the result that a node is not able to join the cluster group, do we have any quick option to point out the culprit node (other than reading the cluster logs on all nodes)? ;)
To my knowledge there's no immediate way to determine who has the lock.