I recently worked with a customer who presented an interesting issue. The customer had installed an Exchange 2007 Cluster Continuous Replication solution where nodes leveraged independent direct attached storage. On the active node of the cluster the customer began to experience storage issues that lead to physical corruption of the Exchange database files. In this case the corruption noted was a –1018. In almost all cases a –1018 is indicative of a storage level error that occurs at a layer lower than the operating system.
In most cases customers would choose to switchover the solution to the passive copies - after all the passive copies are completely independent and physical corruption does not replicate through the log file stream. In this instance that is what the customer choose to do. Although the move appeared successful (in terms of resources moving from NodeA to NodeB) several databases did not mount. Subsequent attempts to manually mount the database also failed. When reviewing the event logs the following events were noted on the passive node (now attempting to become active) for the databases that failed to mount:
Time: 9/22/2011 7:19:59 PM ID: 516
Source: ESE Machine: Node.company.com
Message: Microsoft.Exchange.Cluster.ReplayService (6032) Recovery E03 SGSGCDBG001\SGSGCDBG001-SG21: Database m:\SG21MDB\SGSGCDBG001-SG21-JE-T2.edb: Page 118956 (0x0001d0ac) failed verification due to a timestamp mismatch. The expected timestamp was 0x39cd983 but the actual timestamp on the page was 0x39cdd83. Recovery/restore will fail with error -567. If this condition persists then please restore the database from a previous backup. This problem is likely due to faulty hardware "losing" one or more flushes on this page sometime in the past. Please contact your hardware vendor for further assistance diagnosing the problem.
Time: 9/22/2011 7:19:28 PM ID: 2095
Message: Log file X:\SG38Logs\E250002D510.log in SGSGCDBG001\SGSGCDBG001-SG38 could not be replayed. Re-seeding the passive node is now required. Use the Update-StorageGroupCopy cmdlet in the Exchange Management Shell to perform a re-seed operation.
Time: 9/22/2011 7:19:28 PM ID: 2097
Machine: Node.company.com Message: The Microsoft Exchange Replication Service encountered an unexpected Extensible Storage Engine (ESE) exception in storage group 'SGSGCDBG001\SGSGCDBG001-SG38'. The ESE exception is dbtime on page in advance of the dbtimeBefore in record (-567) ().
When utilizing the ERR tool we note that the error –567 is:
# for decimal -567 / hex 0xfffffdc9 JET_errDbTimeTooNew esent98.h # /* dbtime on page in advance of the dbtimeBefore in record # */ # 1 matches found for "-567"
In general DBTime errors are usually attributed also to failing storage. In this case though something did not seem right – would it be possible to have two independent storage devices experiences failures almost simultaneously. In all honesty it’s not outside the realm of possibility that those circumstances may have actually occurred. Unlike the one set of storage which was clearly displaying hardware faults on the storage devices the second storage device displayed no faults at all. There was no way to attribute this particular error to a local storage failure.
Is it possible that a storage failure on the previously active node caused corruption in a passive database copy? In this case the answer is…yes…
First and foremost one must draw a distinction between the types of corruption we can have within an Exchange database. You have physical corruption and you have logical corruption. Logical corruption is the type of corruption that will replicate through the log stream and subsequently may cause corruption within a passive database copy. In this case DBTime errors are considered logical corruption. Therefore, this logical corruption is introduced into the log file stream and subsequently replicated over to the passive node. In terms of DBTime corruption this will halt logging recovery of the database instance and prevent it from being a healthy replica (when not being activated) or preventing mounting the database (when being activated).
So how is it possible that this type of logical corruption occurred because of a storage failure on the source copy? Let’s investigate…
DBTime is a counter within an Exchange database that is incremented for each page level change that occurs – think of DBTime as the odometer in your car. If one determines the highest DBTime, and locates the page within the database that matches that DBTime, they have identified the last page changed within the database. When a page change occurs and that change is recorded to the log file we record three pieces of important information (although there is much more recorded). This information includes:
1) The page number that is being changed.
2) The current DBTime on that page.
3) The new DBTime for the page.
When logging recovery occurs the following rules are applied where DBTime is concerned:
1) Does the page in the log record exist in the database?
2) Is the previous DBTime recorded in the log record for the page match the current DBTime stamped on the page?
3) Is the new DBTime recorded in the log record for the page greater than the current DBTime stamped on the page?
When all page modifications have occurred to a database in the correct order then the three rules are always met. In this case the JET_errDBTimeTooNew occurs when check number 2 fails. Simply put, when I compare the current DBTime of the page recorded in the log record to the current DBTime stamped on the same page within the database, the database has a newer value. Therefore if I were to allow my change to be committed data loss would occur – logging recovery is halted.
Still – how did this happen because of storage issues on the previously active node? Let’s explore that further…
So let’s assume we can isolate all changes to a single page. In this case we will say page 45. Page 45 has a current DBTime of 400:
At this time we make a change to page 45. The log record records page 45, current DBTime 400, and new DBTime of 410. When the change is committed to the database the DBTime on the page is updated to the new DBTime.
Another page change to page 45 occurs. The log record records page 45, current DBTime 410, and new DBTime of 500. When the change is committed to the database the DBTime on the page is updated to the new DBTime.
Now this is where the storage issue is introduced. In this case Exchange was informed by the storage subsystem that the write to the database was successful. When this information was processed Exchange purged the change from the database cache. Unfortunately the write was “lost” and therefore the DBTime on the page never really updated to reflect the correct values. So now a new change to page 45 is issued. The log record records page 45, current DBTime 410, and new dbTime 600. (See the issue – the current DBTime on the page should have reflected 500 but because of the lost flush reflects 410).
At this time the administrator moves the clustered mailbox server to the passive node. The log file referenced in this example now needs to be replayed into the database. Remember the rules above for log replay in terms of DB Time. So in our example the first log record is encountered:
Does page 45 exist = Yes.
Is the current dbTime on the page 400 = Yes.
Is the new dbTime greater than current dbTime on the page = Yes.
Transaction allowed to proceed = Yes.
The second log record is now evaluated.
Is the current dbTime on the page 410 = Yes.
Note that the dbTime was correctly updated to 500. This was the previous write that was lost to the original database (and therefore the dbTime was never updated in the original database).
The third log record is now evaluated.
Is the current dbTime on the page 410 = NO
Is the new dbTIME greater than current dbTime on the page = Yes.
Transaction allowed to proceed = NO JET_errDBTimeTooNew
As you can see the dbTime on the page which is currently 500 is greater (or newer) then the previous dbTime recorded for the page from the previous active database (410). Therefore the commit is blocked and logging recovery fails. In our case this prevented the CMS from coming online on the second node.
To rectify this situation we would have normally just moved the CMS back to the original node and then reseeded these database copies (therefore bypassing the need to replay the log file in question). In this case a restoration was necessary since both database copies were essentially unusable due to the mix of corruption present. Unfortunately this type of corruption is only ever detected during logging recovery – there is no proactive way to detect this form of logical corruption.
Would this same type of corruption potentially be able to could occur with Exchange 2010 DAG-based replication?
Thanks for sharing it. Its very useful who are working in DAG.
I think to prevent above situation we can configure lagged copy for 2 days or 1 day, so we can avoid any logical corruption.
I'm not fully confident that a lagged copy would have helped here - the reason being that we'd need to roll forward all the logs in order to have the lagged copy be up to date. If one were willing to truncate the log sequence this might be possible - but with data loss.
Quite a scary scenario, isn't it...
Is such a corruption is still possible with Exchange 2013 DAG replication?
How to fix it? Locate the corrupted mailbox and move it to another DB?
I mean - is it possible to fix the corruption in the active DB?
The corruption should eventually fix itself as it becomes no longer necessary to replay through the log file that references the page or another page update occurs.