Recently one of our Exchange MVPs posted a question to our support engineers…Is it necessary to move the cluster group / cluster core resources in Exchange 2007 / Exchange 2010?
As a reminder…the cluster core resources are generally the cluster name and cluster IP address. This is the management name specified when the cluster is created.
In Windows 2003 this would be known as the “Cluster Group” and would contain the cluster name, cluster IP address, and majority node set or quorum disk resource.
In Windows 2008 and Windows 2008 R2 this would be known as the “Cluster Core Resources” (the actual group in cluster though is still called the “Cluster Group”) and generally contains the cluster name, cluster IP address, and file share witness or quorum disk resource.
In Windows 2003 the cluster group is visible and manageable in Cluster Administrator. In Windows 2008 you can view the cluster core resources, but management requires using the cluster.exe command line (or PowerShell in Windows 2008 R2).
So…enough background and back to the question…
From the platform's perspective it should not be necessary to maintain either of these groups or move the cluster core resources when rebooting nodes, etc. When the resources are owned by a node where the cluster service is stopping, the resources will automatically be arbitrated to another node in the cluster.
Personally, though, I do not prefer to take this approach. I prefer to know that the resources have been moved off the node that I'm managing and have been successfully arbitrated to another node in the cluster.
Therefore, I recommend the following…
Prior to managing a node verify whether or not the node owns the cluster core resources.
Run the following command – cluster.exe <FQDNCluster> group
C:\>cluster dag.exchange.msft group
Listing status for all available resource groups:

Group                Node            Status
-------------------- --------------- ------
Cluster Group        DAG-3           Online
Available Storage    DAG-1           Offline
As you can see from the output the “Cluster Group” currently resides on DAG-3 and is online.
If DAG-3 is the node to be rebooted, the cluster group can be moved to another node.
Run the following command – cluster.exe <FQDN> group “Cluster Group” /moveto:<NODE>
Here is an example of where the resources did not arbitrate successfully. You can see that the group successfully moved to the specified node, but the status is partially online.
C:\>cluster.exe dag.exchange.msft group "Cluster Group" /moveto:DAG-1
Moving resource group 'Cluster Group'...
Group                Node            Status
-------------------- --------------- ------
Cluster Group        DAG-1           Partially Online
Any status other than Online should be investigated and corrected prior to rebooting or managing the cluster services on the nodes.
Here is an example of a successful move.
C:\>cluster.exe dag.Exchange.msft group "Cluster Group" /moveto:DAG-1
Group                Node            Status
-------------------- --------------- ------
Cluster Group        DAG-1           Online
If the group successfully moved, the node information should display the name of the node where the group was moved and the status should show online.
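To make the verification step concrete, here is an illustrative sketch (my own, not an official tool) of how the tabular output of `cluster.exe <FQDN> group` could be parsed to confirm which node owns a group and whether its status is Online before you reboot that node:

```python
def find_group_status(output, group_name):
    """Return (node, status) for group_name, or None if not listed."""
    for line in output.splitlines():
        if line.startswith(group_name):
            rest = line[len(group_name):].split()
            # Node is the first token; the status may be multi-word
            # ("Partially Online"), so rejoin the remaining tokens.
            return rest[0], " ".join(rest[1:])
    return None

# Sample output matching the example above.
sample = """\
Group                Node            Status
-------------------- --------------- ------
Cluster Group        DAG-3           Online
Available Storage    DAG-1           Offline
"""

node, status = find_group_status(sample, "Cluster Group")
print(node, status)
if status != "Online":
    print("Investigate before rebooting", node)
```

Anything but an exact "Online" status should stop you before you continue with maintenance on that node.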
In Exchange 2010 when using the Database Availability Group (DAG) we leverage the cluster services in Windows 2008 and Windows 2008 R2.
When utilizing the cluster services in Windows 2008 and Windows 2008 R2, one of the cluster core resources – the cluster name – is a Kerberos-enabled name. This requires that a machine account be created within the directory for association with the cluster name resource. This is known as the CNO, or cluster name object.
With previous versions of Exchange, this Kerberos-enabled machine account was created in the domain leveraging the credentials / rights of the user who was logged into the machine and using the failover cluster management tools to establish the clustered services. This required that either the user's account have permissions to create and enable machine accounts within the domain, or that the machine account be pre-staged with the appropriate rights assigned.
In Exchange 2010 the establishment of the cluster and the cluster name object is performed by using either the Exchange Management Console or the Exchange Management Shell. Specifically, cluster creation occurs when running Add-DatabaseAvailabilityGroupServer to add the first server to the database availability group. Since we are leveraging the Exchange management tools to establish the cluster service, this also means that we are utilizing RBAC.

When the first server is added to the DAG, Remote PowerShell contacts the replication service of the first node to be added. In turn, the replication service on that machine installs the failover cluster components and begins cluster creation using the credentials assigned to LOCAL SYSTEM (since the replication service runs on the machine in the LOCAL SYSTEM security context). Therefore, the individual user's rights to create machine accounts in the domain are not leveraged when establishing the clustered services for the database availability group, but rather those of the replication service.
In environments where computer account creation is restricted, it may become necessary to pre-stage the CNO for the clustered services and assign the appropriate rights. There are two methods which work to establish this security context:
1) Assign the machine account of the first node added to the DAG with full control of the pre-staged object.
2) Assign the Exchange Trusted Subsystem universal security group with full control of the pre-staged object.
The first method works by ensuring that the LOCAL SYSTEM security context will be able to fully manage the pre-staged computer account. The second method works because the Exchange Trusted Subsystem group contains the machine accounts of all Exchange servers in the domain. If utilizing the first method, you need to ensure that the first member added to the DAG is the machine granted permissions. If the second method is utilized, any DAG member can be added first, since all members have rights to manage the account.
Here is an example of how to pre-stage the machine account utilizing the local system rights of the first DAG member.
1) Select the appropriate container in Active Directory where you wish the account to be created. Right click, select NEW –> COMPUTER.
2) In the Computer Name field, type the name that was assigned to the database availability group.
3) After the account is created, right click on the account and select properties. Select the security tab. (Note: You may have to enable advanced features in Active Directory Users and Computers).
4) Select the ADD button to present the add dialog.
5) Select the Object Types button. De-select all options and select the COMPUTERS option. Press OK.
6) In the Enter the object name to select field, type the name of the first DAG member. Select Check Names to ensure the name resolves to a valid account in the directory. Select OK to add the account.
7) Select the machine account in the Group or User Names field, assign the machine account full control. Press OK.
8) Locate the account in the container where it was created. Right click on the account and select Disable Account.
9) When presented with the disable confirmation dialog, select YES.
10) After the object has been successfully disabled, press OK on the success confirmation.
11) At this point, allow time for Active Directory replication to occur. Once replication has completed, the first database availability group member can be added and the cluster services established.
======================================
Update 3/14/2010
It was recently asked how these permissions can be established programmatically. Thanks to Jeff Kizner and Robert Gillies for providing these instructions.
# Requires the Active Directory PowerShell module (Windows 2008 R2 / RSAT)
Import-Module ActiveDirectory
# Switch to the Active Directory provider drive
cd AD:
# The pre-staged CNO for the DAG
$comp = Get-ADComputer DAG01
# SID of the Exchange Trusted Subsystem universal security group
$sid = (Get-ADGroup "Exchange Trusted Subsystem").sid
$rights = [System.DirectoryServices.ActiveDirectoryRights]::GenericAll
$perm = [System.Security.AccessControl.AccessControlType]::Allow
# Read the current ACL, add a full-control ACE for the group, and write it back
$acl = get-acl $comp
$ace = new-object System.DirectoryServices.ActiveDirectoryAccessRule $sid, $rights, $perm
$acl.AddAccessRule($ace)
set-acl -AclObject $acl $comp
The official TechNet documentation for establishing a Windows 2003 / Exchange 2007 SP1 cluster can be found at:
What I want to do in this blog post is outline an alternate method for installing the cluster that differs from the instructions referenced above for Windows 2003 clustered nodes.
The first recommendation that I make when establishing the cluster service is to not use the GUI mode setup included with Exchange 2007. My main rationale for this advice is that when GUI mode setup fails, you generally have to revert to command line setup to correct the condition and repeat the setup operations. With this in mind, it's much easier to start with the command line setup and continue through with it.
In each of the links above you will see that the basic structure of a command line setup for the active roles installation is:
setup.com /mode:install /roles:mailbox /newCMS /cmsName /cmsIPAddress [/css] [/cmsDataPath]
The passive role installation is:
setup.com /mode:install /roles:mailbox
The way to think about the active role command is the combination of the passive role (setup.com /mode:install /roles:mailbox) with the establishment of the clustered services (setup.com /newCMS /cmsName /cmsIPAddress [/css] [/cmsDataPath]).
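The composition above can be illustrated with a small sketch (my own illustration; the values plugged in below reuse the example names from later in this post) showing how the active-role command is just the passive-role install plus the /newCMS clustering arguments:

```python
def passive_install():
    """The passive role installation command."""
    return "setup.com /mode:install /roles:mailbox"

def new_cms(cms_name, cms_ip, css=None, data_path=None):
    """Build the clustered-services command; /css and /cmsDataPath
    are optional (used for SCC-style installations)."""
    cmd = f"setup.com /newCMS /cmsName:{cms_name} /cmsIPAddress:{cms_ip}"
    if css:
        cmd += f" /css:{css}"
    if data_path:
        cmd += f" /cmsDataPath:{data_path}"
    return cmd

print(passive_install())
print(new_cms("2003-MBX3", "192.168.0.14"))
```

Running the two pieces separately is exactly what the next recommendation builds on.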
These commands do not have to be run together. With this in mind my second recommendation is to establish the passive role first on each clustered node. I recommend this for a few reasons:
Now I will outline for you the steps that I use when establishing a Windows 2003 / Exchange 2007 SP1 Cluster. Please refer to these additional blog posts for some other helpful information regarding these installations:
In all instructions below the assumption is that the cluster service and appropriate quorum type have been established and fully configured. The instructions also outline only the establishment of the Exchange resources, the TechNet documentation referenced above should be used to establish the specific cluster service requirements and other installation / management tasks for an Exchange 2007 SP1 clustered installation on Windows 2003.
Windows 2003 / Exchange 2007 SP1 Cluster Continuous Replication (CCR)
At this point the replication service will begin copying log files from the new databases created and bring both sides into synchronization. This completes the setup of a Windows 2003 / Exchange 2007 SP1 CCR cluster.
Windows 2003 / Exchange 2007 SP1 Single Copy Cluster (SCC)
By using these steps you should hopefully be able to simplify your Exchange 2007 SP1 clustered installation on Windows 2003 and more easily address errors that may arise during the clustered setup.
The official TechNet documentation for establishing a Windows 2008 / Exchange 2007 SP1 cluster can be found at:
Now I will outline for you the steps that I use when establishing a Windows 2008 / Exchange 2007 SP1 Cluster. Please refer to these additional blog posts for some other helpful information regarding these installations:
In all instructions below the assumption is that the cluster service and appropriate quorum type have been established and fully configured. The instructions also outline only the establishment of the Exchange resources, the TechNet documentation referenced above should be used to establish the specific cluster service requirements and other installation / management tasks for an Exchange 2007 SP1 clustered installation on Windows 2008.
Windows 2008 / Exchange 2007 SP1 Cluster Continuous Replication (CCR)
At this point the replication service will begin copying log files from the new databases created and bring both sides into synchronization. This completes the setup of a Windows 2008 / Exchange 2007 SP1 CCR cluster.
Windows 2008 / Exchange 2007 SP1 Single Copy Cluster (SCC)
By using these steps you should hopefully be able to simplify your Exchange 2007 SP1 clustered installation on Windows 2008 and more easily address errors that may arise during the clustered setup.
In order for Exchange to function correctly administrators are generally advised to make two important configuration changes to their file level antivirus scanners:
When antivirus exclusions are not properly set servers may experience performance issues or database availability issues.
Today I worked an interesting issue with a customer. A common symptom when antivirus settings are not correct is that the ENNtmp.log file in the Exchange database log directory is deleted or quarantined during the rename procedure. Remember that while a database is mounted, any current transactions are being written to the ENN.log file. When the ENN.log file is full, the lock on it is released and the file is renamed to its full generation name (for example, ENN00001af.log). While ENN.log is being written, the next log file in the series is being built as ENNtmp.log. When ENN.log is full and its rename is occurring, another rename operation occurs as well – renaming ENNtmp.log to ENN.log. As locks are released on these files to facilitate the renames, a mis-configured file level antivirus scanner can insert itself and operate against these files. In this case the file was deleted as a virus and the database was forcefully dismounted.
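The rename cycle described above can be modeled with a toy simulation (my own sketch, using an in-memory set of file names; E00 stands in for the ENN prefix, and the hex-generation format is illustrative):

```python
def roll_log(files, prefix="E00", gen=0x1AF):
    """Model the rename cycle: the full current log takes its
    generation name, then the tmp log is renamed into place as the
    new current log (the engine then starts building a new tmp)."""
    files.remove(f"{prefix}.log")
    files.add(f"{prefix}{gen:08X}.log")   # e.g. E00000001AF.log
    files.remove(f"{prefix}tmp.log")
    files.add(f"{prefix}.log")
    return files

logs = {"E00.log", "E00tmp.log", "E00000001AE.log"}
roll_log(logs, "E00", 0x1AF)
print(sorted(logs))
```

If an antivirus scanner deletes or quarantines the tmp file between the two renames, the second rename has nothing to rename, which is exactly the failure mode described above.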
In our case events were clearly logged by the antivirus scanner indicating an action was performed on the log file in question. This naturally led us to reviewing the file level antivirus exclusions.
The structure of the server had all log directories stored on L:\ and all databases stored on M:\. When the file system exclusions were reviewed, you could clearly see that exclusions for L and M existed and were set to include sub-directories.
Upon further investigation it was determined that the log file directories were actually mountpoints. For example, L:\SG1, where SG1 is a volume mounted as a mountpoint.
From here it was determined that the antivirus scanner was actually evaluating exclusions at the volume / disk level. Therefore, the L:\ exclusions did not apply, since SG1 was a mounted physical disk and not a folder on the L:\ file system.
Given the above example:
A folder L:\Folder would be excluded, since it's a folder existing on the L:\ file system.
A mountpoint L:\MountPoint would not be excluded, since MountPoint is a separate mounted physical disk and no exclusion existed specifically for that mounted disk.
A folder L:\Folder\MountPoint exists. In this case L:\Folder would be excluded, but L:\Folder\MountPoint would not be, since MountPoint is a mounted physical disk.
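The three cases can be captured in a small sketch (my own model of the behavior, not any vendor's actual logic): each path resolves to the volume it physically lives on, and only excluded volumes are skipped.

```python
# Hypothetical path-to-volume mapping matching the examples above.
volume_of = {
    r"L:\Folder": "L-disk",               # ordinary folder on the L volume
    r"L:\MountPoint": "SG1-disk",         # mounted physical disk
    r"L:\Folder\MountPoint": "SG2-disk",  # mounted disk under a folder
}

def is_excluded(path, excluded_volumes):
    # The scanner resolves the path to the volume it actually lives on.
    return volume_of[path] in excluded_volumes

excluded = {"L-disk"}   # the admin excluded "L:\ and subdirectories"
print(is_excluded(r"L:\Folder", excluded))
print(is_excluded(r"L:\MountPoint", excluded))
print(is_excluded(r"L:\Folder\MountPoint", excluded))
```

Only the plain folder is protected by the L:\ exclusion; both mounted disks fall through unless excluded individually.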
The moral of the story…please make sure you are following up with your antivirus vendor on the appropriate method to apply exclusions when files exist on both lettered physical disks and mounted physical disks as mountpoints.
It is not uncommon in many installations today, especially clustered installations, to see the use of mountpoints.
When using mountpoints with clustering it is a requirement that the physical disk resources created for the mountpoints be dependent on the lettered physical volume that hosts them.
The use of mountpoints also requires that the lettered physical volume hosting them be available at all times, since Exchange uses the full path – for example X:\MountPoint\data…
I wanted to highlight a design consideration that I encourage you to consider when utilizing mountpoints.
I see two common designs where mountpoints are utilized:
1) All database and log mountpoints hosted off of a single lettered volume (for example, X).
2) Database mountpoints hosted off of one lettered volume (X) and log mountpoints hosted off of a second lettered volume (Y).
These two designs do give logical (and possibly physical, depending on storage design) separation between the storage for databases and the storage for log files. They also limit the number of lettered physical volumes required to support the desired solution.
The issue they introduce, though, is a single point of failure.
In the first example, in order for the mountpoints to function, the X drive must always be available. If the X drive is lost, all mountpoints associated with the X drive are lost, and the database instances dependent on those mountpoints are lost.
In the second example you have the same single point of failure. A loss of either the X or the Y disk causes the mountpoints associated with that disk to become unavailable. Since an Exchange database instance requires dependencies on both the mountpoint hosting the database and the mountpoint hosting the logs, the database instance becomes unavailable.
You can increase the availability of the overall solution by spreading out the mountpoints over a series of lettered volumes. For example, let's take a solution where there will be 25 storage groups. You could in this instance create 5 lettered volumes. Off of each of these 5 lettered volumes you would create 10 mountpoints. Each mountpoint pair would serve as the database and log drives for a single database instance. That would mean that the loss of a single lettered volume would only affect users hosted in 5 of the databases in the solution, versus all 25 databases being affected when using either of the designs outlined in the previous examples.
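The arithmetic behind this recommendation is simple enough to spell out (a quick worked example, not a sizing tool):

```python
total_sgs = 25   # storage groups in the solution
volumes = 5      # lettered volumes hosting the mountpoints

# Spreading evenly: each volume hosts the DB + log mountpoint pairs
# for a fraction of the storage groups.
sgs_per_volume = total_sgs // volumes        # database instances per volume
mountpoints_per_volume = sgs_per_volume * 2  # one DB + one log mountpoint each

print(sgs_per_volume)   # databases lost if a single lettered volume fails
print(total_sgs)        # databases lost in the single-root-volume design
```

Losing one of the five volumes costs you 5 databases instead of all 25.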
Here is a small picture example of this concept.
Another consideration is that when a root volume hosting mountpoints is lost and subsequently replaced, the administrator must manually recreate the necessary folder structure and map the mountpoints back. By spreading the mountpoints out, you can hopefully reduce the amount of work that would be necessary to restore the solution.
By reducing the single points of failure you can increase the overall availability of the solution.
When running ExBPA against an Exchange 2007 cluster, the following error text may be noted:
===========================================================
Error: 'ClusteredMailboxServer' is partially configured
Server: <SERVER>
'ClusteredMailboxServer' is partially configured on server <SERVER>.contoso.com. Either setup is configuring it right now or setup failed while configuring it. Please rerun setup if it isn't currently running.
The link text from ExBPA advises users on how to correct this issue:
Possible CMS Setup Failure
Topic Last Modified: 2008-04-01
The Microsoft® Exchange Server Analyzer Tool reads the following registry branch for the presence of the Action or Watermark registry values:
HKEY_LOCAL_MACHINE\Software\Microsoft\Exchange\v8.0\ClusteredMailboxServer
If the Exchange Server Analyzer finds that either the Action or Watermark registry values are present, an error is displayed.
The presence of the Action or Watermark registry values on this registry branch indicates that either there was a previously unsuccessful installation of the Cluster Mailbox Server (CMS) to the cluster node, or that the CMS installation was in progress when the Exchange Server Analyzer was doing its analysis.
If the CMS installation failed on this node but was successful on other nodes of the cluster, it could appear that the cluster network is functioning normally when that is not the case.
To address this error, determine whether CMS installation is still in progress on the affected node and, if not, rerun setup to complete the failed CMS installation.
One of the ExBPA rules checks the registry to find out the install state of roles on the server. Exchange 2007 uses watermark keys to track setup progress. If a failure is encountered during setup, when setup is rerun the watermark is consulted and setup resumed at or near the point of failure. When setup completes successfully the watermark value is cleared.
The following is a sample export of a clustered server mailbox role watermark.
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Exchange\v8.0\ClusteredMailboxServer]
"Action"="Install"
"Watermark"=dword:7448a073
It is not uncommon to see this rule triggered against clustered nodes – especially if there was at some point a setup failure. Sometimes when a failure is encountered during installation and the clustered mailbox server is partially configured, the resources that were created are moved between nodes. Setup is then run on the other node in hopes that it will complete successfully. In some cases it will complete successfully and in other cases it will fail.
The watermark is not checkpointed / replicated between the registries of the nodes. In the event that setup fails, each node will have a watermark indicating the point in the installation where the failure occurred. Should one of the nodes complete setup successfully, that node will have its watermark cleared while the other retains the watermark.
The presence of this key triggers the ExBPA rule.
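Here is a sketch of the check the rule performs, with the registry modeled as a nested dict (a hypothetical stand-in for the real registry API, using the values from the sample export above):

```python
KEY = r"HKLM\SOFTWARE\Microsoft\Exchange\v8.0\ClusteredMailboxServer"

def cms_partially_configured(registry):
    """Either leftover value means setup is in progress or failed."""
    values = registry.get(KEY, {})
    return "Action" in values or "Watermark" in values

# A node where setup failed mid-install vs. one where it completed
# (watermark cleared on success).
failed_node = {KEY: {"Action": "Install", "Watermark": 0x7448A073}}
clean_node = {KEY: {}}

print(cms_partially_configured(failed_node))
print(cms_partially_configured(clean_node))
```

Because the watermark is per-node, the rule can fire on one node of a cluster while the other node reports clean.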
In order to tell what actions need to be taken, we must first determine whether installation completed successfully. For this we can turn to the Exchange setup logs that are created on each node and review them to see whether setup completed.
The setup logs are located at c:\ExchangeSetupLogs.
We are looking for the file ExchangeSetup.log. This file can be opened in a text viewer.
If the following text can be found in the setup log it should be safe to assume that setup completed successfully for that CMS on at least one of the nodes.
The start of the clustered mailbox server installation is indicated in the logs by the following series of text (or similar depending on single copy cluster or cluster continuous replication installation):
<DATE> [0] Setup will run the task 'new-ClusteredMailboxServer'
<DATE> [1] Setup launched task 'new-ClusteredMailboxServer -DomainController DC -PublicFolderDatabase $false -updatesdir 'Y:\x64\Updates' -Name <CMSNAME> -IPAddress <IPAddress> -IPv4Addresses $null -IPv4Networks $null -IPv6Networks $null -SharedStorage $false -DataPath <DATAPATH>'
If you continue to scroll through the log and reach the following text the setup command completed successfully.
<DATE> [0] The Microsoft Exchange Server setup operation completed successfully.
<DATE> [0] End of Setup
Once you have verified that the command was run and completed successfully, the watermark can be exported and safely removed using registry editor.
If for any reason you cannot verify that setup finished for this CMS instance on at least one of the nodes, the only safe procedure for removing the watermark is to allow setup to complete successfully. (This needs to occur even if the resources are already online and even if they are servicing users).
If you have the setup log, this can be an easy task, since the parameters that were used for setup are also logged there. The following line will tell you what parameters were used when establishing the clustered mailbox server installation.
<DATE> [0] ExSetup was started with the following command: '/NoLogo /newCMS /cmsName:2003-MBX3 /cmsIPAddress:192.168.0.14 /sourcedir:Y:\x64 /FromSetup'.
In this case you can see that the CMSName parameter used was 2003-MBX3 and the CMSIPAddress was 192.168.0.14. The absence of additional switches (like /CSS and /CMSDataPath) indicates this was a CCR installation.
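If you are working through a long setup log, the switch values can be pulled out mechanically. This is an illustrative parser of my own (not a Microsoft tool) run against the log line quoted above:

```python
import re

line = ("<DATE> [0] ExSetup was started with the following command: "
        "'/NoLogo /newCMS /cmsName:2003-MBX3 /cmsIPAddress:192.168.0.14 "
        "/sourcedir:Y:\\x64 /FromSetup'.")

def setup_params(log_line):
    """Collect every /switch:value pair into a dict."""
    return dict(re.findall(r"/(\w+):([^\s']+)", log_line))

params = setup_params(line)
print(params["cmsName"], params["cmsIPAddress"])
```

Valueless switches such as /NoLogo and /FromSetup are deliberately skipped; only the parameters you need to rebuild the /newCMS command are captured.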
If you do not have the setup log you will most likely need to use cluster administrator to get the necessary information. Using cluster administrator locate the CMS Network Name resource and CMS IPv4 Resource. Make note of the values of each of these resources.
Once you have the values necessary to complete the command, the CMS must be moved to the node where the watermark exists. If the CMS is not currently on that node, issue a Move-ClusteredMailboxServer using the Exchange Management Shell.
Once the CMS has been moved to the node with the watermark, run setup using the command line setup with appropriate values. Using my example above:
setup.com /newCMS /cmsName:2003-MBX3 /cmsIPAddress:192.168.0.14
Setup will consult the watermark on the machine and, when it completes successfully, clear the watermark value. This ensures that the CMS is fully configured to support Exchange.
There may be times when it is necessary to completely reseed an LCR, CCR, or SCR database copy from the source database. In order to reseed the copy we use the Update-StorageGroupCopy cmdlet.
When Update-StorageGroupCopy is run, the database is pulled from the source machine to the target machine using the ESE online streaming backup API. If the database is copied without error, the replication instance is automatically resumed. No log files are pulled or copied as part of the Update-StorageGroupCopy process. It is not until after the Update-StorageGroupCopy process is completed, and replication is resumed, that the header of the database is reviewed and the replication service determines which logs need to be copied.
In this blog post I want to highlight how the replication service decides which log files need to be copied after a reseed of the database. I will use examples from cluster.replay tracing (which should only be done in consultation with product support services).
*Databases copied offline between servers (clean shutdown).
When databases are copied offline between nodes this is a manual seeding operation. By default a database that is offline and copied between nodes is in a clean shutdown state.
Here is a sample header dump.
Extensible Storage Engine Utilities for Microsoft(R) Exchange Server
Version 08.01
Copyright (C) Microsoft Corporation. All Rights Reserved.
Initiating FILE DUMP mode...
Database: 2003-MBX3-SG1-DB1.edb

File Type: Database
Format ulMagic: 0x89abcdef
Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,12
Engine ulVersion: 0x620,12
Created ulVersion: 0x620,12
DB Signature: Create time:08/09/2009 14:10:12 Rand:7948610 Computer:
cbDbPage: 8192
dbtime: 20053 (0x4e55)
State: Clean Shutdown
Log Required: 0-0 (0x0-0x0)
Log Committed: 0-0 (0x0-0x0)
Streaming File: No
Shadowed: Yes
Last Objid: 133
Scrub Dbtime: 0 (0x0)
Scrub Date: 00/00/1900 00:00:00
Repair Count: 0
Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
Last Consistent: (0x9,A,1C1) 08/09/2009 14:12:18
Last Attach: (0x8,9,86) 08/09/2009 14:12:15
Last Detach: (0x9,A,1C1) 08/09/2009 14:12:18
Dbid: 1
Log Signature: Create time:08/09/2009 14:10:08 Rand:7930576 Computer:
OS Version: (5.2.3790 SP 2)
Previous Full Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00
Previous Incremental Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00
Previous Copy Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00
Previous Differential Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00
Current Full Backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00
Current Shadow copy backup: Log Gen: 0-0 (0x0-0x0) Mark: (0x0,0,0) Mark: 00/00/1900 00:00:00
cpgUpgrade55Format: 0
cpgUpgradeFreePages: 0
cpgUpgradeSpaceMapPages: 0
ECC Fix Success Count: none
Old ECC Fix Success Count: none
ECC Fix Error Count: none
Old ECC Fix Error Count: none
Bad Checksum Error Count: none
Old bad Checksum Error Count: none
Operation completed successfully in 0.63 seconds.
When replication is resumed, the header of the database is consulted. Here is an example trace tag from non-customer viewable tracing.
2826 74006100440074 2256 Cluster.Replay FileChecker RunChecks is successful. FileState is: LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 0 HighestGenerationRequired: 0 LastGenerationBackedUp: 0 CheckpointGeneration: 0 LogfileSignature: LatestFullBackupTime: LatestIncrementalBackupTime: LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: SnapshotLatestFullBackup: SnapshotLatestIncrementalBackup: SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: True

2827 74006100440074 2256 Cluster.Replay ReplicaInstance SetReplayState(): LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 0 HighestGenerationRequired: 0 LastGenerationBackedUp: 0 CheckpointGeneration: 0 LogfileSignature: LatestFullBackupTime: LatestIncrementalBackupTime: LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: SnapshotLatestFullBackup: SnapshotLatestIncrementalBackup: SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: True
You can see here that the replication service, after reading the status of the database, has detected a clean shutdown database (ConsistentDatabase: True).
Since no backup has been performed on the database, and the log directory on the target was empty prior to resuming the storage group copy, the replication service determines that no minimum log file is necessary. Log copy will start from the first log available on the source server and continue through the highest generation on the source server. As long as the generation sequence is contiguous, replication will proceed and remain healthy after the manual database seed. (It is best practice with an offline reseed to clear the log directory on the target prior to resuming the database copy.)
2871 030F3F44 2256 Cluster.Replay ReplicaInstance No logfiles present, no backup information, no required generation
2872 030F3F44 2256 Cluster.Replay ReplicaInstance Log copying will start from generation 0
The replication service begins the replication instance and queries the source directory. In this case it was determined that log generation 5 was the first available, so the replication service starts by copying that log file.
2903 0385717B 2256 Cluster.Replay NetPath ClusterPathManager.GetPath() returns \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$
2904 007D2DBB 2256 Cluster.Replay LogCopy First generation for \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$ is 00000005
2905 0385717B 2256 Cluster.Replay NetPath ClusterPathManager.GetPath() returns \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$
2906 007D2DBB 2256 Cluster.Replay PFD PFD CRS 18907 First generation for \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$ is 00000005
2907 0385717B 2256 Cluster.Replay NetPath ClusterPathManager.GetPath() returns \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$
2908 007D2DBB 2256 Cluster.Replay ShipLog LogCopy: Trying to find file \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$\E0000000005.log
2909 007D2DBB 2256 Cluster.Replay ShipLog LogCopy: Found file E0000000005.log
2910 007D2DBB 2256 Cluster.Replay PFD PFD CRS 18395 LogCopy: Found file E0000000005.log
This is also confirmed by the event in the application log indicating that log copy began by successfully copying log generation 5 (0x5).
Event Type: Information
Event Source: MSExchangeRepl
Event Category: Service
Event ID: 2114
Date: 8/9/2009
Time: 2:15:22 PM
User: N/A
Computer: 2003-NODE2
Description:
The replication instance for storage group 2003-MBX3\2003-MBX3-SG1 has started copying transaction log files. The first log file successfully copied was generation 5.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
At minimum we must have all logs from the Last Consistent log generation forward in order to maintain replication. This makes sense: if I did not have all logs from the last consistent log (where the database was shut down) forward, how could I bring the passive copy up to the current point in time?
As long as the generation sequence is contiguous, and all logs from last consistent through the current log are present on the source, replication will proceed and remain healthy after the manual database seed.
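The decision for this clean-shutdown case can be sketched in a few lines (my own illustration of the logic, not the replication service's actual code): with no required or backed-up generations, copying starts at the lowest log available on the source, and health depends on the sequence being contiguous from there.

```python
def copy_plan(source_gens, lowest_required=0):
    """Return (starting generation, whether the needed range is
    contiguous on the source). lowest_required=0 models the
    clean-shutdown case with no minimum log requirement."""
    start = lowest_required or min(source_gens)
    wanted = range(start, max(source_gens) + 1)
    contiguous = all(g in source_gens for g in wanted)
    return start, contiguous

print(copy_plan({5, 6, 7, 8}))   # starts at generation 5, healthy
print(copy_plan({5, 6, 8}))      # gap at 7: replication cannot proceed
```

The first call mirrors the trace above, where log generation 5 was the first available on the source.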
*Database seeded using Update-StorageGroupCopy where no full or incremental backup was performed.
In this example we have a database on a source server that has had neither a full nor an incremental backup performed on it. The storage group replication between nodes was suspended using Suspend-StorageGroupCopy. Then the Update-StorageGroupCopy command was used to stream the database to the target server (the -ManualResume switch was also used so I could generate the header dumps). Below is a sample header dump of a database after an Update-StorageGroupCopy.
File Type: Database
Format ulMagic: 0x89abcdef
Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,12
Engine ulVersion: 0x620,12
Created ulVersion: 0x620,12
DB Signature: Create time:08/09/2009 14:10:12 Rand:7948610 Computer:
cbDbPage: 8192
dbtime: 20053 (0x4e55)
State: Dirty Shutdown
Log Required: 11-11 (0xb-0xb)
Log Committed: 0-14 (0x0-0xe)
Streaming File: No
Shadowed: Yes
Last Objid: 133
Scrub Dbtime: 0 (0x0)
Scrub Date: 00/00/1900 00:00:00
Repair Count: 0
Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
Last Consistent: (0x9,A,1C1) 08/09/2009 14:12:18
Last Attach: (0xB,9,86) 08/09/2009 14:28:20
Last Detach: (0x0,0,0) 00/00/1900 00:00:00
Dbid: 1
Log Signature: Create time:08/09/2009 14:10:08 Rand:7930576 Computer:
OS Version: (5.2.3790 SP 2)
Current Full Backup: Log Gen: 11-14 (0xb-0xe) Mark: (0xE,188,167) Mark: 08/09/2009 14:29:54
Operation completed successfully in 0.31 seconds.
In this header dump you will notice that the database is in dirty shutdown. This is expected of a database that has come from an online seeding operation. You will also note that the Current Full Backup header section of the database is populated. The low log value here is 11 (0xb) and the high log value is 14 (0xe).
After the header dump was generated I resumed storage group copy (normally after a successful update-storagegroupcopy this is done for you automatically). When replication is resumed the header of the database is consulted. Here is a sample output from non-customer viewable tracing.
5107 61007400610044 2256 Cluster.Replay FileChecker RunChecks is successful. FileState is: LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 11 HighestGenerationRequired: 11 LastGenerationBackedUp: 0 CheckpointGeneration: 0 LogfileSignature: MJET_SIGNATURE(Random = 7930576,CreationTime = 8/9/2009 2:10:08 PM) LatestFullBackupTime: LatestIncrementalBackupTime: LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: SnapshotLatestFullBackup: SnapshotLatestIncrementalBackup: SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: False
5108 61007400610044 2256 Cluster.Replay ReplicaInstance SetReplayState(): LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 11 HighestGenerationRequired: 11 LastGenerationBackedUp: 0 CheckpointGeneration: 0 LogfileSignature: MJET_SIGNATURE(Random = 7930576,CreationTime = 8/9/2009 2:10:08 PM) LatestFullBackupTime: LatestIncrementalBackupTime: LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: SnapshotLatestFullBackup: SnapshotLatestIncrementalBackup: SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: False
You will note from this output that the HighestGenerationRequired and LowestGenerationRequired are 11 (0xb). This is based on the current full backup information in the header of the database. The lowest log recorded in current full backup represents the lowest log necessary to complete the source database at the time the update-storagegroupcopy was run.
You will note that the events in the application log indicate log copy started with log 11 (0xb).
Event Type: Information
Event Source: MSExchangeRepl
Event Category: Service
Event ID: 2114
Date: 8/9/2009
Time: 2:36:00 PM
User: N/A
Computer: 2003-NODE2
Description: The replication instance for storage group 2003-MBX3\2003-MBX3-SG1 has started copying transaction log files. The first log file successfully copied was generation 11.
After an update-storagegroupcopy, replication will remain healthy provided that all logs are present and contiguous on the source server from the time the update-storagegroupcopy was initiated until it completed successfully.
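The "Current Full Backup: Log Gen: 11-14 (0xb-0xe)" line in the header dump above is what drives LowestGenerationRequired. As a rough sketch (the field layout is taken from the eseutil /mh output shown above; the parsing itself is my own illustration), the range can be pulled out like this:

```python
import re

def parse_backup_range(header_line):
    """Extract the (low, high) log generation pair from a backup
    section line of an eseutil /mh header dump."""
    m = re.search(r"Log Gen:\s*(\d+)-(\d+)", header_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = "Current Full Backup: Log Gen: 11-14 (0xb-0xe) Mark: (0xE,188,167)"
low, high = parse_backup_range(line)
print(low, hex(low))  # 11 0xb -- matches LowestGenerationRequired: 11 in the trace
```

The low value of that pair is the anchor log stamped at seeding time, which is why the trace reports LowestGenerationRequired: 11.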
*Database seeded using Update-StorageGroupCopy where a full backup was performed on the source database.
In this example we have a database on a source server that has had a full backup performed on it (in this case an ESE online streaming backup). The storage group replication between nodes was suspended using suspend-storagegroupcopy. Then the update-storagegroupcopy command was used to stream the database to the target server (the –manualResume switch was also used so I could generate the header dumps). Below is a sample header dump of a database post an update-storagegroupcopy.
File Type: Database
Format ulMagic: 0x89abcdef
Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,12
Engine ulVersion: 0x620,12
Created ulVersion: 0x620,12
DB Signature: Create time:08/09/2009 14:10:12 Rand:7948610 Computer:
cbDbPage: 8192
dbtime: 22631 (0x5867)
State: Dirty Shutdown
Log Required: 31-31 (0x1f-0x1f)
Log Committed: 0-32 (0x0-0x20)
Streaming File: No
Shadowed: Yes
Last Objid: 134
Scrub Dbtime: 0 (0x0)
Scrub Date: 00/00/1900 00:00:00
Repair Count: 0
Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
Last Consistent: (0x1D,A,1C1) 08/09/2009 14:48:13
Last Attach: (0x1F,9,86) 08/09/2009 14:48:15
Last Detach: (0x0,0,0) 00/00/1900 00:00:00
Dbid: 1
Log Signature: Create time:08/09/2009 14:10:08 Rand:7930576 Computer:
OS Version: (5.2.3790 SP 2)
Previous Full Backup: Log Gen: 20-21 (0x14-0x15) Mark: (0x15,D,195) Mark: 08/09/2009 14:45:43
Current Full Backup: Log Gen: 31-32 (0x1f-0x20) Mark: (0x20,E,185) Mark: 08/09/2009 14:49:06
Operation completed successfully in 0.46 seconds.
In this header dump you will note that Previous Full Backup is populated. The low log generation is 20 (0x14) and the high log generation is 21 (0x15).
6593 61007400610044 2472 Cluster.Replay FileChecker RunChecks is successful. FileState is: LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 31 HighestGenerationRequired: 31 LastGenerationBackedUp: 21 CheckpointGeneration: 0 LogfileSignature: MJET_SIGNATURE(Random = 7930576,CreationTime = 8/9/2009 2:10:08 PM) LatestFullBackupTime: 8/9/2009 2:45:43 PM LatestIncrementalBackupTime: LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: False SnapshotLatestFullBackup: False SnapshotLatestIncrementalBackup: SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: False
6594 61007400610044 2472 Cluster.Replay ReplicaInstance SetReplayState(): LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 31 HighestGenerationRequired: 31 LastGenerationBackedUp: 21 CheckpointGeneration: 0 LogfileSignature: MJET_SIGNATURE(Random = 7930576,CreationTime = 8/9/2009 2:10:08 PM) LatestFullBackupTime: 8/9/2009 2:45:43 PM LatestIncrementalBackupTime: LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: False SnapshotLatestFullBackup: False SnapshotLatestIncrementalBackup: SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: False
In this output you will note that LastGenerationBackedUp is 21 (0x15). This corresponds to the high log generation as stamped in previous full backup. You'll also note that the LowestGenerationRequired and HighestGenerationRequired are 31 (0x1f), which corresponds to the low log value stamped in current full backup.
In this case log file copy will start at generation 21 (0x15). Events in the application log correspond with this:
Event Type: Information
Event Source: MSExchangeRepl
Event Category: Service
Event ID: 2114
Date: 8/9/2009
Time: 2:51:20 PM
User: N/A
Computer: 2003-NODE2
Description: The replication instance for storage group 2003-MBX3\2003-MBX3-SG1 has started copying transaction log files. The first log file successfully copied was generation 21.
The difference between this example and the previous one is that a full backup was performed. The decision to start copy at log 21 (0x15), which is based on previous full backup, makes sense if you think about the replication service. Remember that a database can be backed up from either the active or passive node. If I did not base my log file copy on previous full backup, I would not have all the logs on my passive copy since the last full backup. This would essentially prevent me from performing an incremental backup at a later point in time. (Remember, an incremental backup requires that all log files from the previous full backup be present.)
When a database has had a full backup on it replication will remain healthy as long as all logs are contiguous on the source from the high log generation as stamped in previous full backup to the current log.
*Database seeded using Update-StorageGroupCopy where a full and incremental backup was performed on the source database.
In this example we have a database on a source server that has had a full and incremental backup performed on it (in this case an ESE online streaming backup). The storage group replication between nodes was suspended using suspend-storagegroupcopy. Then the update-storagegroupcopy command was used to stream the database to the target server (the –manualResume switch was also used so I could generate the header dumps). Below is a sample header dump of a database post an update-storagegroupcopy.
File Type: Database
Format ulMagic: 0x89abcdef
Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,12
Engine ulVersion: 0x620,12
Created ulVersion: 0x620,12
DB Signature: Create time:08/09/2009 14:10:12 Rand:7948610 Computer:
cbDbPage: 8192
dbtime: 22745 (0x58d9)
State: Dirty Shutdown
Log Required: 50-50 (0x32-0x32)
Log Committed: 0-51 (0x0-0x33)
Streaming File: No
Shadowed: Yes
Last Objid: 134
Scrub Dbtime: 0 (0x0)
Scrub Date: 00/00/1900 00:00:00
Repair Count: 0
Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
Last Consistent: (0x30,A,1C1) 08/09/2009 14:59:22
Last Attach: (0x32,9,86) 08/09/2009 14:59:24
Last Detach: (0x0,0,0) 00/00/1900 00:00:00
Dbid: 1
Log Signature: Create time:08/09/2009 14:10:08 Rand:7930576 Computer:
OS Version: (5.2.3790 SP 2)
Previous Incremental Backup: Log Gen: 5-34 (0x5-0x22) Mark: (0x23,8,16) Mark: 08/09/2009 14:59:00
Current Full Backup: Log Gen: 50-51 (0x32-0x33) Mark: (0x33,F,29) Mark: 08/09/2009 15:00:05
Operation completed successfully in 0.78 seconds.
In this header dump you will note that Previous Incremental Backup is populated. The low log generation is 5 (0x5) and the high log generation is 34 (0x22).
8933 61007400610044 2472 Cluster.Replay ReplicaInstance SetReplayState(): LowestGenerationPresent: 0 HighestGenerationPresent: 0 LowestGenerationRequired: 50 HighestGenerationRequired: 50 LastGenerationBackedUp: 34 CheckpointGeneration: 0 LogfileSignature: MJET_SIGNATURE(Random = 7930576,CreationTime = 8/9/2009 2:10:08 PM) LatestFullBackupTime: 8/9/2009 2:45:43 PM LatestIncrementalBackupTime: 8/9/2009 2:59:00 PM LatestDifferentialBackupTime: LatestCopyBackupTime: SnapshotBackup: False SnapshotLatestFullBackup: False SnapshotLatestIncrementalBackup: False SnapshotLatestDifferentialBackup: SnapshotLatestCopyBackup: ConsistentDatabase: False
8934 020C9AB5 2472 Cluster.Replay State CopyGenerationNumber is changing to 0 on replica 3d0099f3-ff35-46ea-8a2f-39eb50923209
In this output you will note that LastGenerationBackedUp is 34 (0x22). This corresponds to the high log generation as stamped in previous incremental backup. You'll also note that the LowestGenerationRequired and HighestGenerationRequired are 50 (0x32), which corresponds to the low log value stamped in current full backup.
In this case log file copy will start at generation 34 (0x22). Events in the application log correspond with this:
Event Type: Information
Event Source: MSExchangeRepl
Event Category: Service
Event ID: 2114
Date: 8/9/2009
Time: 2:51:20 PM
User: N/A
Computer: 2003-NODE2
Description: The replication instance for storage group 2003-MBX3\2003-MBX3-SG1 has started copying transaction log files. The first log file successfully copied was generation 34.
The difference between this example and the previous ones is that both a full and an incremental backup were performed. The decision to start copy at log 34 (0x22), which is based on previous incremental backup, makes sense if you think about the replication service. Remember that a database can be backed up from either the active or passive node. If I did not base my log file copy on previous incremental backup, I would not have all the logs on my passive copy since the last incremental backup. This would essentially prevent me from performing, at a later point in time, an incremental backup of the passive copy. (Remember, an incremental backup requires that all log files from the previous incremental backup be present.)
When a database has had an incremental backup performed on it, replication will remain healthy as long as all logs are contiguous on the source from the high log generation as stamped in previous incremental backup to the current log.
*So what does an example look like where the necessary logs are not present?
In this example I have a database that has had a full and incremental backup performed on it. I have suspended the storage group copy between nodes and forced log generation to occur. I then went into the source log directory, and removed two logs from the end of the log stream.
This is not an uncommon example. While storage group copy is failed or suspended, logs continue to generate on the source server. Full and incremental backups of the source will continue to succeed, but logs will not purge. Depending on the size of your log file drive, and the amount of time that copy is suspended or failed, the log drive may begin to fill up. This may lead administrators to manually purge the log file series.
Here is an eseutil /ml output of the source log directory showing the gap.
Initiating FILE DUMP mode...
Verifying log files... Base name: e00
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000005.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000006.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000007.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000008.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000009.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000000A.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000000B.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000000C.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000000D.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000000E.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000000F.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000010.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000011.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000012.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000013.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000014.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000015.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000016.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000017.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000018.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000019.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000001A.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000001B.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000001C.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000001D.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000001E.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000001F.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000020.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000021.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000022.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000023.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000024.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000025.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000026.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000027.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000028.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000029.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000002A.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000002B.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000002C.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000002D.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000002E.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000002F.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000030.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000031.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000032.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000033.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000034.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000035.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000036.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000037.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000038.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000039.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000003A.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000003B.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000003C.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000003D.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000003E.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000003F.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000040.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000041.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000042.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000043.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000044.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000045.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000046.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000047.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000048.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000049.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000004A.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000004B.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000004C.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000004D.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000004E.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000004F.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000050.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000051.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000052.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000053.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000054.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000055.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000056.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000057.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E0000000058.log - OK
Missing log files: e00{00000059 - 0000005A}.log
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000005B.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000005C.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E000000005D.log - OK
Log file: D:\2003-MBX3\2003-MBX3-SG1-Logs\E00.log - OK
Operation terminated with error -528 (JET_errMissingLogFile, Current log file missing) after 5.203 seconds.
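The generation number is encoded in hex in each log file name (E0000000059.log is generation 0x59), so the gap eseutil reports can also be found by scanning the directory listing. Here is a small sketch of that check (my own illustration, not an Exchange tool):

```python
import re

def missing_generations(filenames, prefix="E00"):
    """Return the log generations absent from an otherwise contiguous
    sequence of log files named <prefix><8 hex digits>.log."""
    gens = sorted(
        int(m.group(1), 16)
        for f in filenames
        if (m := re.fullmatch(prefix + r"([0-9A-Fa-f]{8})\.log", f))
    )
    if not gens:
        return []
    present = set(gens)
    return [g for g in range(gens[0], gens[-1] + 1) if g not in present]

# Generations 0x56-0x5D with 0x59 and 0x5A deleted, as in this example:
files = [f"E00{g:08X}.log" for g in range(0x56, 0x5E) if g not in (0x59, 0x5A)]
print([hex(g) for g in missing_generations(files)])  # ['0x59', '0x5a']
```

This reproduces exactly what the "Missing log files: e00{00000059 - 0000005A}.log" line above reported.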
On the passive node I then issued my update-storagegroupcopy. Prior to resuming storage group copy I dumped the header of the database; here is the output.
File Type: Database
Format ulMagic: 0x89abcdef
Engine ulMagic: 0x89abcdef
Format ulVersion: 0x620,12
Engine ulVersion: 0x620,12
Created ulVersion: 0x620,12
DB Signature: Create time:08/09/2009 14:10:12 Rand:7948610 Computer:
cbDbPage: 8192
dbtime: 23064 (0x5a18)
State: Dirty Shutdown
Log Required: 94-94 (0x5e-0x5e)
Log Committed: 0-95 (0x0-0x5f)
Streaming File: No
Shadowed: Yes
Last Objid: 134
Scrub Dbtime: 0 (0x0)
Scrub Date: 00/00/1900 00:00:00
Repair Count: 0
Repair Date: 00/00/1900 00:00:00
Old Repair Count: 0
Last Consistent: (0x5C,13,AC) 08/09/2009 15:20:34
Last Attach: (0x5E,9,86) 08/09/2009 15:21:58
Last Detach: (0x0,0,0) 00/00/1900 00:00:00
Dbid: 1
Log Signature: Create time:08/09/2009 14:10:08 Rand:7930576 Computer:
OS Version: (5.2.3790 SP 2)
Previous Full Backup: Log Gen: 70-71 (0x46-0x47) Mark: (0x47,E,4F) Mark: 08/09/2009 15:06:34
Previous Incremental Backup: Log Gen: 5-75 (0x5-0x4b) Mark: (0x4C,8,16) Mark: 08/09/2009 15:17:03
Current Full Backup: Log Gen: 94-95 (0x5e-0x5f) Mark: (0x5F,C,198) Mark: 08/09/2009 15:22:10
Operation completed successfully in 0.62 seconds.
Based on the header of the database I can see that previous incremental backup is populated. Knowing our rules of replication, a database with an incremental backup requires all logs from the high log generation stamped in previous incremental backup, in this case 75 (0x4b), to the current point in time to be present on the source, contiguous, and able to be copied to the target.
From our /ml output you can see that I removed logs 0x59 and 0x5A (decimal 89 and 90).
I resumed storage group copy using resume-storagegroupcopy.
The following event was logged indicating that copy started at log 75 (0x4b).
Event Type: Information
Event Source: MSExchangeRepl
Event Category: Service
Event ID: 2114
Date: 8/9/2009
Time: 3:26:06 PM
User: N/A
Computer: 2003-NODE2
Description: The replication instance for storage group 2003-MBX3\2003-MBX3-SG1 has started copying transaction log files. The first log file successfully copied was generation 75.
As logs copied between the nodes, I noticed using get-storagegroupcopystatus that the storage group in question was in a failed state.
Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2003-MBX3-SG1  Failed             0                0
2003-MBX3-SG2  Suspended          0                0                  8/9/2009 ...
By reviewing the application log I noticed that our failure was due to the inability to copy the 0x59 (89) log file from the source. This makes sense since I knowingly deleted it, and it is expected since I know that all logs from the high log generation stamped in previous incremental backup to the current time must be present on the source, contiguous, and able to be copied to the target.
Here is the sample error text:
Event Type: Error
Event Source: MSExchangeRepl
Event Category: Service
Event ID: 2059
Date: 8/9/2009
Time: 3:27:07 PM
User: N/A
Computer: 2003-NODE2
Description: The log file \\2003-mbx3-replc\3d0099f3-ff35-46ea-8a2f-39eb50923209$\E0000000059.log for 2003-MBX3\2003-MBX3-SG1 is missing on the production copy. Continuous replication for this storage group is blocked. If you removed the log file, please replace it. If the log is lost, the passive copy will need to be reseeded using the Update-StorageGroupCopy cmdlet in the Exchange Management Shell.
What is confusing about this event is that the administrator is advised to run update-storagegroupcopy, yet that is exactly the command we just ran that generated the error. Based solely on the event, and without knowledge of which logs the replication service requires for a database in this state, one could end up in an endless loop of update-storagegroupcopy and log file copy failures.
Now, how can this condition be corrected? The condition can be corrected by running a new full backup on the database prior to running update-storagegroupcopy. The full backup will reset the previous incremental backup information and stamp new values in the previous full backup section based on the logs that are not missing.
*So what are the log file copy rules:
If a database is offline (clean shutdown), then all logs from the last consistent value to the current point must be present on the source, must be contiguous, and must be able to be copied to the target.
If a database is online but has never had a full or incremental backup, then all logs from the anchor log at the time the update-storagegroupcopy was initiated to the current log must exist on the source, must be contiguous, and must be able to be copied to the target.
If a database is online and has had a full backup performed on it, then all logs from the high log generation, as stamped in previous full backup, to the current log must be present on the source, must be contiguous, and must be able to be copied to the target.
If a database is online and has had a full and incremental backup performed on it, then all logs from the high log generation, as stamped in previous incremental backup, to the current log must be present on the source, must be contiguous, and must be able to be copied to the target.
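Taken together, the four rules amount to choosing a single starting generation for log copy. A minimal sketch of that decision (hypothetical field names; this is not the replication service's actual code):

```python
def first_log_to_copy(header):
    """Pick the first log generation the passive copy must obtain,
    based on the backup state recorded in the database header."""
    if header.get("clean_shutdown"):
        # Offline / clean shutdown: everything from Last Consistent forward.
        return header["last_consistent"]
    if header.get("prev_incremental_high") is not None:
        # Full + incremental backup: high generation of Previous Incremental Backup.
        return header["prev_incremental_high"]
    if header.get("prev_full_high") is not None:
        # Full backup only: high generation of Previous Full Backup.
        return header["prev_full_high"]
    # Never backed up: the anchor log recorded when update-storagegroupcopy
    # ran (the low value stamped in Current Full Backup).
    return header["current_full_low"]

# The three seeded examples walked through in this post:
print(first_log_to_copy({"current_full_low": 11}))                        # 11
print(first_log_to_copy({"prev_full_high": 21, "current_full_low": 31}))  # 21
print(first_log_to_copy({"prev_incremental_high": 34,
                         "prev_full_high": 21,
                         "current_full_low": 50}))                        # 34
```

The ordering matters: an incremental backup takes precedence over a full backup, which is why the third example starts at 34 rather than 21.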
In Windows 2008 clusters, by default, all network name resources are enabled for Kerberos. This causes the cluster service to create a machine account for the network name resource. This is known as a VCO, or Virtual Computer Object.
When the machine account associated with a network name is deleted, the network name resource in the cluster will fail to come online.
There are events in the system log associated with this action which help to explain why.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 8/16/2009 3:31:40 PM
Event ID: 1207
Task Category: Network Name Resource
Level: Error
Keywords:
User: SYSTEM
Computer: Node-2.domain.com
Description: Cluster network name resource 'Network Name (MBX-1)' cannot be brought online. The computer object associated with the resource could not be updated in domain 'domain.com' for the following reason: Unable to find computer account on DC where it was created.
The text for the associated error code is: There is no such object on the server.
The cluster identity 'CLUSTER-1$' may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain.
The cluster tracks the DC where the object was created, stamping it as a private property (CreatingDC) of the network name resource.
cluster.exe <clusterFQDN> res "Network Name (MBX-1)" /priv
Listing private properties for 'Network Name (MBX-1)':
T Resource Name Value
-- -------------------- ------------------------------ -----------------------
BR Network Name (MBX-1) ResourceData 01 00 00 00 ... (260 bytes)
DR Network Name (MBX-1) StatusNetBIOS 0 (0x0)
DR Network Name (MBX-1) StatusDNS 0 (0x0)
DR Network Name (MBX-1) StatusKerberos 8240 (0x2030)
SR Network Name (MBX-1) CreatingDC \\DC-1.domain.com
FTR Network Name (MBX-1) LastDNSUpdateTime 8/14/2009 3:07:59 AM
SR Network Name (MBX-1) ObjectGUID 01e46402b3cc8a4fa124bd76a3801f69
S Network Name (MBX-1) Name MBX-1
S Network Name (MBX-1) DnsName MBX-1
D Network Name (MBX-1) RemapPipeNames 0 (0x0)
D Network Name (MBX-1) RequireDNS 0 (0x0)
D Network Name (MBX-1) RequireKerberos 1 (0x1)
D Network Name (MBX-1) HostRecordTTL 1200 (0x4b0)
D Network Name (MBX-1) RegisterAllProvidersIP 0 (0x0)
D Network Name (MBX-1) PublishPTRRecords 0 (0x0)
D Network Name (MBX-1) TimerCallbackAdditionalThreshold 5 (0x5)
D Network Name (MBX-1) MSExchange_NetName 1 (0x1)
You'll also note that the RequireKerberos setting is set to 1 (enabled).
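When scripting checks against this output, the /priv listing can be scraped for a single property. The following is a toy parser for the line format shown above (my own illustration; the real cluster.exe output is fixed-width, and a production script would be better served by the cluster API or PowerShell):

```python
import re

def get_private_property(listing, resource, prop):
    """Return the first value token for a given resource/property pair
    in a 'cluster.exe res <name> /priv' style listing."""
    pattern = re.compile(
        r"^\w+\s+" + re.escape(resource) + r"\s+" + re.escape(prop) + r"\s+(\S+)"
    )
    for line in listing.splitlines():
        m = pattern.match(line.strip())
        if m:
            return m.group(1)
    return None

listing = """D Network Name (MBX-1) RequireKerberos 1 (0x1)
D Network Name (MBX-1) HostRecordTTL 1200 (0x4b0)"""
print(get_private_property(listing, "Network Name (MBX-1)", "RequireKerberos"))  # 1
```

A value of "1" for RequireKerberos confirms the network name is Kerberos-enabled and therefore backed by a VCO.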
There are other ways to recover the VCO, but from an Exchange standpoint I find these to be the easiest…
1) Create a new machine account in the desired container with the same name as the deleted VCO.
2) Using this blog post establish the permissions for the CNO on the new VCO. (http://blogs.technet.com/timmcmic/archive/2009/02/24/permissions-required-for-the-cno-cluster-name-object-in-windows-2008-for-exchange-2007-sp1-setup-operations.aspx)
3) Ensure the new machine account is disabled and allow time for AD replication.
4) Ensure that you have your Exchange 2007 SP1 media on hand.
5) Ensure that all resources in the CMS cluster group have been taken offline.
6) Using the media and a command prompt, run the following command –> setup.com /clearLocalCMS.
7) Recover the CMS to the cluster –> setup.com /recoverCMS /cmsName:<NAME> /cmsIPAddress:<IPAddress> or setup.com /recoverCMS /cmsName:<NAME> /cmsIPv4Addresses:<IPAddress1>,<IPAddress2>
When these steps are completed, all Exchange resources should be available and online, and the new machine account will have been brought into an enabled state.
When Exchange 2007 is installed on a cluster, several Exchange-specific resources are created. These include the Exchange System Attendant Instance and the Exchange Information Store Instance resources.
Sometimes these clustered resources are accidentally deleted using cluster administrator or failover cluster manager. This results in a portion of the solution not functioning.
Some clustered applications allow you to recreate individual clustered resources by using cluster administrator (or failover cluster management).
Although such a resource is created successfully within the cluster, it will ultimately fail for Exchange use.
Each Exchange resource that is created by the integrated setup routine is stamped with Exchange specific values. This is what allows the integration between Exchange and Windows cluster to function.
Let’s take a look at some of these values.
Exchange System Attendant Instance (CMSName)
Listing private properties for 'Exchange System Attendant Instance (2008-MBX3)':
S Exchange System Attendant Instance (2008-MBX3) NetworkName 2008-MBX3
The network name private property links the system attendant resource to the appropriate network name. This value is not stamped by simply recreating the resource.
Exchange Information Store Instance (CMSName)
Listing private properties for 'Exchange Information Store Instance (2008-MBX3)':
D Exchange Information Store Instance (2008-MBX3) ResourceVersion 524289 (0x80001)
D Exchange Information Store Instance (2008-MBX3) ResourceBuild 671088646 (0x28000006)
S Exchange Information Store Instance (2008-MBX3) NetworkName 2008-MBX3
S Exchange Information Store Instance (2008-MBX3) DestPath C:\Program Files\Microsoft\Exchange Server\Mailbox\MDBDATA
D Exchange Information Store Instance (2008-MBX3) ClusteredStorageType 1 (0x1)
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_Seeding False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ReplicaInitializing False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ConfigBroken False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_CopyNotificationGenerationNumber 123
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_CopyGenerationNumber 123
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_InspectorGenerationNumber 122
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ReplayGenerationNumber 121
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestCopyNotificationTime 128853118429157196
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestCopyTime 128853118429157196
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestInspectorTime 128854139574491853
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestReplayTime 128853118426031936
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_CurrentReplayTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_NoLoss True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_MountAllowed True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestFullBackupTime 128666480930000000
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestIncrementalBackupTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestDifferentialBackupTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestCopyBackupTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotLatestFullBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotLatestIncrementalBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotLatestDifferentialBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotLatestCopyBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LastFailoverTime 128661508136602054
S Exchange Information Store Instance (2008-MBX3) LatestOnlineTime 128882659585976345
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_MountAllowed True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_NoLoss True
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SuspendCurrentOwner Idle
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ReplicaInitializing True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ConfigBroken True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ConfigBrokenMessage Status information cannot be displayed correctly because the storage group is running on a later version of Exchange Server than the client that is requesting the status information.
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_CopyGenerationNumber 66
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_CopyNotificationGenerationNumber 66
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_InspectorGenerationNumber 65
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ReplayGenerationNumber 64
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_CurrentReplayTime 128882708597977356
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestCopyTime 128882708598133624
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestCopyNotificationTime 128882708598133624
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestInspectorTime 128891788251028675
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestReplayTime 128882708591257832
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SuspendSuspendWanted False
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SuspendMessage
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_SuspendCurrentOwner Idle
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_ReplicaInitializing True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_ConfigBroken True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_ConfigBrokenMessage Status information cannot be displayed correctly because the storage group is running on a later version of Exchange Server than the client that is requesting the status information.
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_MountAllowed True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_NoLoss True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LastFailoverTime 128661508146915082
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_ConfigBroken False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_ConfigBrokenMessage
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_Seeding False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_ReplicaInitializing False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_CopyNotificationGenerationNumber 119
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_CopyGenerationNumber 119
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_InspectorGenerationNumber 118
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_ReplayGenerationNumber 117
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestCopyNotificationTime 128853118430094774
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestCopyTime 128853118430094774
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestInspectorTime 128854139607929353
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestReplayTime 128853118426500725
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_CurrentReplayTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_NoLoss True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_MountAllowed True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestFullBackupTime 128666481160000000
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestIncrementalBackupTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestDifferentialBackupTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_LatestCopyBackupTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotLatestFullBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotLatestIncrementalBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotLatestDifferentialBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotLatestCopyBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_CopyGenerationNumber 64
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_CopyNotificationGenerationNumber 64
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_InspectorGenerationNumber 63
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_ReplayGenerationNumber 62
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_CurrentReplayTime 128882708597821088
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_LatestCopyTime 128882708598133624
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_LatestCopyNotificationTime 128882708598133624
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_LatestInspectorTime 128891788251028675
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_LatestReplayTime 128882708593914388
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_SuspendSuspendWanted False
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_SuspendMessage
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_Seeding False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_Seeding False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LastFailoverTime 128661383610946829
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node2_58da7215-7d0c-4d18-835a-848bde0ce408_LastFailoverTime 128661383799855560
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_DumpsterRedeliveryCreationTime 180000000000
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_DumpsterRedeliveryEndTime 180000000000
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_DumpsterRedeliveryRequired False
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_DumpsterRedeliveryStartTime 633572168518476877
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_DumpsterRedeliveryCreationTime 180000000000
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_DumpsterRedeliveryEndTime 180000000000
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_DumpsterRedeliveryRequired False
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_DumpsterRedeliveryStartTime 633572168264869789
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408_DumpsterRedeliveryServers
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5_DumpsterRedeliveryServers
S Exchange Information Store Instance (2008-MBX3) Replay_2008-Node1_7096c806-d69d-41b8-ae1d-50ada0b0dce5_ConfigBrokenMessage
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_LatestFullBackupTime 128860168890000000
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_7096c806-d69d-41b8-ae1d-50ada0b0dce5_SnapshotLatestFullBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_LatestFullBackupTime 128860169140000000
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_58da7215-7d0c-4d18-835a-848bde0ce408_SnapshotLatestFullBackup False
S Exchange Information Store Instance (2008-MBX3) Replay_2008-node2_25b8ef30-3bae-474f-b075-8068fb524308_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_SuspendCurrentOwner Idle
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_ReplicaInitializing True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_ConfigBroken True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_ConfigBrokenMessage Status information cannot be displayed correctly because the storage group is running on a later version of Exchange Server than the client that is requesting the status information.
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_SuspendCurrentOwner Idle
S Exchange Information Store Instance (2008-MBX3) Replay_[LOCKS]_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_ReplicaInitializing True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_ConfigBroken True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_ConfigBrokenMessage Status information cannot be displayed correctly because the storage group is running on a later version of Exchange Server than the client that is requesting the status information.
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_LatestCopyNotificationTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_LatestInspectorTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_LatestCopyNotificationTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE1.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_LatestInspectorTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_ConfigBroken True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_ConfigBrokenMessage Status information cannot be displayed correctly because the storage group is running on a later version of Exchange Server than the client that is requesting the status information.
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_ConfigBroken True
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_ConfigBrokenMessage Status information cannot be displayed correctly because the storage group is running on a later version of Exchange Server than the client that is requesting the status information.
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_TargetReplicaInstanceState NotRunning
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_LatestCopyNotificationTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_7096c806-d69d-41b8-ae1d-50ada0b0dce5|Standby_LatestInspectorTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_LatestCopyNotificationTime 0
S Exchange Information Store Instance (2008-MBX3) Replay_2008-NODE2.exchange.msft_58da7215-7d0c-4d18-835a-848bde0ce408|Standby_LatestInspectorTime 0
In this case the information store resource has several private properties that are not re-created by simply creating the resource. These include the network name (similar to the system attendant resource), the cluster storage type indicating the type of cluster in use (CCR or SCC), and other private properties corresponding to the replication status of the databases hosted on the server.
Exchange Database Instances
Listing private properties for '2008-MBX3-SG2/2008-MBX3-SG2-DB1 (2008-MBX3)':
S 2008-MBX3-SG2/2008-MBX3-SG2-DB1 (2008-MBX3) DatabaseGuid 5be19b1d-845b-4a23-8aa2-d98abbd06274
S 2008-MBX3-SG2/2008-MBX3-SG2-DB1 (2008-MBX3) StorageGroupGuid 58da7215-7d0c-4d18-835a-848bde0ce408
S 2008-MBX3-SG2/2008-MBX3-SG2-DB1 (2008-MBX3) NetworkName 2008-MBX3
S 2008-MBX3-SG2/2008-MBX3-SG2-DB1 (2008-MBX3) LatestOfflineTime 128882708598133624
S 2008-MBX3-SG2/2008-MBX3-SG2-DB1 (2008-MBX3) LastMountedOnServer 2008-NODE1
The database instances also have links that are missing when resources are created through cluster administrator. For example, database instances are linked to their storage groups and databases by stamping GUIDs onto the cluster resource. In this case the database GUID is stamped into the private property DatabaseGuid and the storage group GUID into the private property StorageGroupGuid. Without these attributes the database instances will not function.
It is possible in some instances to manually go back and re-stamp these private properties. Particular care must be taken to ensure this is done correctly; if it is not, unpredictable results may occur and the resource may not function.
IN GENERAL I DISCOURAGE ATTEMPTING TO MANUALLY RECREATE CLUSTERED RESOURCES!
In the event a deletion occurs, the following steps can be used to recover from the deletion.
1) Navigate to the node that currently owns the Exchange resources.
2) Make note of the CMS name and CMS IP address (properties of the Exchange network name resource and the Exchange IP resource).
3) Using the Exchange Management Shell, issue a stop-clusteredmailboxserver. Fill in the prompted information as necessary. This will take the CMS offline.
4) Using your Exchange 2007 SP1 media, issue a setup.com /clearLocalCMS /cmsName:<CMSName>.
http://technet.microsoft.com/en-us/library/cc164362.aspx
By clearing the CMS you have removed the clustered configuration associated with that CMS.
5) Recover the CMS to the cluster by using the Exchange 2007 SP1 media and issuing setup.com /recoverCMS /cmsName:<CMSName> /cmsIPv4Addresses:<IPAddress>,<IPAddress> or setup.com /recoverCMS /cmsName:<CMSName> /cmsIPv4Address:<IPAddress>
http://technet.microsoft.com/en-us/library/bb124095.aspx
By recovering the CMS all clustered resources are refreshed and recreated. This ensures that all attributes are stamped onto the cluster resources and the cluster should function as expected.
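The recovery steps above can be sketched as one sequence. The server name, IP address, and media path below are hypothetical placeholders, and the exact prompts from stop-clusteredmailboxserver may vary by environment:

```powershell
# Run from the node that currently owns the Exchange resources.
# MBX-CMS1, 192.168.1.50, and D:\ are placeholder values.

# 1) Take the CMS offline.
stop-clusteredmailboxserver MBX-CMS1 -StopReason "Recovering deleted cluster resources"

# 2) Remove the clustered configuration associated with the CMS.
D:\setup.com /clearLocalCMS /cmsName:MBX-CMS1

# 3) Recover the CMS; this recreates all clustered resources with their
#    private properties (GUIDs, network name, etc.) correctly stamped.
D:\setup.com /recoverCMS /cmsName:MBX-CMS1 /cmsIPv4Address:192.168.1.50
```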
===================================================================================================
Update: 6/5/2011
I was recently contacted by a co-worker who found a more efficient way to recover a deleted database instance from a clustered mailbox server. Let’s take a look at an example:
In the example cluster MBX-3 I have two existing database instances:
Using Failover Cluster Manager I delete one of the database instances. Here is the resulting view in Failover Cluster Manager:
Using the above instructions would require the administrator to recover the entire CMS. It turns out that if you create another storage group and database, the database instances within the cluster are refreshed. For example:
new-storagegroup -name <SG-NAME> -server <CMSName>
new-mailboxdatabase -name <DB-NAME> -storagegroup <CMSName>\<SG-NAME>
This creates a new database instance within the cluster for the new database and re-creates the missing database instance for you.
The administrator can then remove the newly created mailbox database and storage group and bring the previously missing database instance online.
remove-mailboxdatabase <DB-NAME>
remove-storagegroup <SG-NAME>
start-clusteredmailboxserver <CMSName>
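Putting the refresh trick together end to end, with hypothetical names (TempSG and TempDB standing in for the throwaway storage group and database on a CMS named MBX-3):

```powershell
# Creating a new storage group and database forces the cluster's database
# instances to be refreshed, recreating the missing instance.
new-storagegroup -name TempSG -server MBX-3
new-mailboxdatabase -name TempDB -storagegroup MBX-3\TempSG

# The missing database instance should now exist again; remove the
# temporary database and storage group...
remove-mailboxdatabase MBX-3\TempSG\TempDB
remove-storagegroup MBX-3\TempSG

# ...and bring the clustered mailbox server (and the previously missing
# database instance) online.
start-clusteredmailboxserver MBX-3
```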
These steps were verified on a Windows 2008 R2 / Exchange 2007 SP3 CCR cluster. A special thanks to my co-worker Michael Barta for pointing this out to me!
When running Exchange 2007 SP1 and the command
setup.com /recoverCMS /cmsName:<NAME> /cmsIPAddress:<IP>
you may receive the following warning:
The installed version of Exchange Server 2007 may be different from the version you are trying to install. The current installed version is '8.1.359.2', the last installed version was '8.1.240.6'.
In this case the version 8.1.240.6 represents Exchange 2007 SP1. The version 8.1.359.2 represents Exchange 2007 SP1 RU7.
One of the pre-requisite checks for /recoverCMS warns when the version of Exchange installed on the node where recovery is run does not match the version recorded on the Exchange Active Directory object.
The issue here is that when a rollup update is installed, the version attribute in Active Directory is not updated (this is by design). For example, here is an LDP dump of an Exchange CMS object and its associated version. Reviewing the serialNumber attribute you will see Version 8.1 (Build 30240.6). This translates to version 8.1.240.6 (combine the version and build numbers and drop the leading 30).
Expanding base 'CN=2003-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Microsoft,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft'... Result <0>: (null) Matched DNs: Getting 1 entries: >> Dn: CN=2003-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Microsoft,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft 3> objectClass: top; server; msExchExchangeServer; 1> cn: 2003-MBX1; 1> serialNumber: Version 8.1 (Build 30240.6); 1> distinguishedName: CN=2003-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Microsoft,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft; 1> instanceType: 0x4 = ( IT_WRITE ); 1> whenCreated: 10/01/2008 07:07:30 Eastern Standard Time Eastern Standard Time; 1> whenChanged: 10/01/2008 07:19:24 Eastern Standard Time Eastern Standard Time; 1> uSNCreated: 37192; 1> uSNChanged: 37276; 1> showInAdvancedViewOnly: TRUE; 1> adminDisplayName: 2003-MBX1; 6> networkAddress: ncacn_vns_spp:2003-MBX1; netbios:2003-MBX1; ncacn_np:2003-MBX1; ncacn_spx:2003-MBX1; ncacn_ip_tcp:2003-MBX1.exchange.msft; ncalrpc:2003-MBX1; 1> name: 2003-MBX1; 1> objectGUID: d7452d19-806a-43ea-b163-24d265f096d7; 1> versionNumber: 1912701168; 1> serverRole: 0; 1> systemFlags: 0x52000000 = ( FLAG_CONFIG_ALLOW_RENAME | FLAG_CONFIG_ALLOW_LIMITED_MOVE | FLAG_DISALLOW_MOVE_ON_DELETE ); 1> legacyExchangeDN: /o=Microsoft/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Configuration/cn=Servers/cn=2003-MBX1; 1> objectCategory: CN=ms-Exch-Exchange-Server,CN=Schema,CN=Configuration,DC=exchange,DC=msft; 1> msExchTransportTransientFailureRetryInterval: 300; 1> msExchTransportMessageRetryInterval: 60; 1> msExchTransportInternalMaxDSNMessageAttachmentSize: 10485760; 1> msExchMailboxManagerActivationSchedule: ; 1> msExchTransportMessageTrackingPath: C:\Program Files\Microsoft\Exchange 
Server\TransportRoles\Logs\MessageTracking; 1> msExchTransportMaxConcurrentMailboxSubmissions: 20; 1> msExchMailboxManagerActivationStyle: 0; 1> type: <ldp: Binary blob>; 1> msExchMessageTrackLogFilter: -262145; 1> msExchMinAdminVersion: -2147453113; 1> msExchInstallPath: C:\Program Files\Microsoft\Exchange Server; 1> msExchMailboxManagerAdminMode: 2; 1> msExchTransportMaxConnectivityLogAge: 2592000; 1> msExchTransportOutboundConnectionFailureRetryInterval: 600; 1> msExchTransportMaxPickupDirectoryMessagesPerMinute: 100; 1> messageTrackingEnabled: FALSE; 1> msExchServerSite: CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=exchange,DC=msft; 1> msExchTransportMaxConcurrentMailboxDeliveries: 7; 1> msExchServerRole: 0; 1> msExchEdgeSyncAdamSSLPort: 50636; 1> msExchTransportMaxMessageTrackingDirectorySize: 262144000; 1> msExchTrkLogCleaningInterval: 7; 1> msExchTransportDelayNotificationTimeout: 14400; 1> msExchTransportRoutingLogMaxAge: 604800; 1> msExchTransportMaxMessageTrackingFileSize: 10485760; 1> msExchTransportExternalDefaultLanguage: en-US; 1> msExchResponsibleMTAServer: CN=Microsoft MTA,CN=2003-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Microsoft,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft; 1> msExchCurrentServerRoles: 2; 1> msExchTransportMaxMessageTrackingLogAge: 2592000; 1> msExchTransportPoisonMessageThreshold: 2; 1> msExchDataLossForAutoDatabaseMount: 0; 1> msExchTransportMaxPickupDirectoryHeaderSize: 65536; 1> msExchTransportMaxPickupDirectoryRecipients: 100; 1> msExchDataPath: C:\Program Files\Microsoft\Exchange Server\Mailbox; 1> msExchSmtpReceiveMaxConnectionRatePerMinute: 1200; 1> msExchTransportMaxQueueIdleTime: 180; 1> msExchTransportMaxReceiveProtocolLogAge: 2592000; 1> msExchTransportExternalMaxDSNMessageAttachmentSize: 10485760; 1> heuristics: 268435456; 1> msExchTransportFlags: 17401; 1> msExchMonitoringResources: 2:1:Default Microsoft Exchange 
Services:MSExchangeSA:MSExchangeMTA:RESvc:SMTPSVC:MSExchangeIS:W3SVC:; 1> msExchTransportInternalDefaultLanguage: en-US; 1> msExchELCAuditLogFileAgeLimit: 0; 1> msExchHomeRoutingGroup: CN=Exchange Routing Group (DWBGZMFD01QNBJR),CN=Routing Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Microsoft,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft; 1> msExchTransportMaxSendProtocolLogAge: 2592000; 1> msExchELCAuditLogFileSizeLimit: 10485760; 1> msExchELCAuditLogPath: C:\Program Files\Microsoft\Exchange Server\Logging\Managed Folder Assistant; 1> msExchTransportTransientFailureRetryCount: 6; 1> msExchTransportMessageExpirationTimeout: 172800; 1> msExchVersion: 4535486012416; -----------
Now that we know how to determine the Exchange version stamped in Active Directory, how do we determine the version of Exchange installed on the machine? The installed version is determined by the version of ExSETUP.exe on the machine. ExSETUP.exe can be found in X:\Program Files\Microsoft\Exchange Server\Bin (assuming X: is the drive where Exchange was installed and that the default path was used). Here is an example version of ExSETUP.exe.
ExSETUP is updated with each rollup that is installed onto the machine.
Since version 8.1.240.6 does not equal 8.1.359.2, the error / warning is thrown.
With later versions of the XML pre-reqs file this condition is a warning. With prior versions it is a hard stop error preventing /recoverCMS from completing successfully.
If you have hit the error condition, and cannot proceed, the following instructions should allow you to continue:
1) Locate ExSETUP.exe in X:\Program Files\Microsoft\Exchange Server\Bin.
2) Rename the local version of ExSETUP.exe to ExSETUP.exe.original.
3) From your Exchange installation media, locate the version of ExSETUP.exe. (Media\Setup\ServerRoles\Common).
4) Copy this ExSETUP.exe to the bin directory.
5) Run your setup command and allow it to complete successfully.
6) When completed, either reinstall the last rollup update applied or delete the copied ExSETUP.exe and rename the original back to ExSETUP.exe.
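The ExSETUP.exe swap above can be sketched as follows, assuming the default installation path and installation media mounted as D:\ (both hypothetical here):

```powershell
$bin = 'C:\Program Files\Microsoft\Exchange Server\Bin'

# Preserve the rollup-updated binary...
Rename-Item "$bin\ExSETUP.exe" 'ExSETUP.exe.original'

# ...and drop in the media's SP1 build so the version check passes.
Copy-Item 'D:\Setup\ServerRoles\Common\ExSETUP.exe' $bin

# Run the setup command (e.g. setup.com /recoverCMS ...) here.

# Afterwards, either reinstall the last rollup update applied, or restore
# the original binary:
Remove-Item "$bin\ExSETUP.exe"
Rename-Item "$bin\ExSETUP.exe.original" 'ExSETUP.exe'
```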
When planning an Exchange 2007 installation storage considerations are a very important factor. I wanted to call attention to a sizing consideration that may not be accounted for in other places. This sizing consideration should be accounted for when planning an environment that may use Cluster Continuous Replication, Local Continuous Replication, or Standby Continuous Replication.
When planning your log file drive sizing it is important to consider when log files are actually purged. Log file purging cannot occur unless:
1) The log file has been successfully backed up.
2) The log file is below the database checkpoint.
3) The log file has been copied to, and inspected by, the passive copy.
When these criteria cannot be met, log files will continue to fill the log file volume even if a backup completes successfully; the backup alone will not truncate them.
Depending on the number of days that the replication partner is not available, this may result in a large number of log files remaining on the log file drive and in some instances a log file drive full condition results. When the log file drive is full and logs can no longer be created, the database instance(s) will dismount and be unavailable.
This would require the administrator either to return the log copy target to availability so that logs can be copied and purged, or to purge log files manually. In the event log files are purged manually, a full re-seed of the passive copy would be necessary.
In planning you should consider factors that might cause nodes to be unavailable for an extended period of time – for example WAN issues. If necessary, increase the size of your log file volume to accommodate periods where replication cannot occur. For example, if your log generation per day is estimated at 2000 logs, and you estimate that an outage of a node or the network could last up to 5 days, you need to plan for up to 5 days of non-replicated log files.
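To put numbers on that example (a sketch; the 1 MB figure is the fixed transaction log file size in Exchange 2007):

```powershell
$logsPerDay = 2000     # estimated log generation per day
$outageDays = 5        # longest outage you plan to ride out
$logSizeMB  = 1        # Exchange 2007 transaction logs are 1 MB each

$extraLogs = $logsPerDay * $outageDays
$extraGB   = $extraLogs * $logSizeMB / 1024
# 10,000 logs, or roughly 10 GB of additional headroom on the log volume
```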
The storage calculator can assist you in this planning; it includes inputs that account for planned outages of both replication and backups, providing a more accurate log file drive sizing.
Download the storage calculator here -
http://msexchangeteam.com/files/12/attachments/entry438481.aspx
Read about the storage calculator here -
http://msexchangeteam.com/archive/2007/01/15/432207.aspx
By planning for a lack of replication ahead of time you can hopefully ward off out-of-disk-space conditions or conditions that may make a full re-seed necessary.
Many customers have requested instructions on how to enable standby continuous replication to use an alternate network interface. By design standby continuous replication always uses the “public” interface to ship logs and seed the database.
Over the past few weeks we have been working with the Exchange product group on a “supported” method to allow standby continuous replication to use an alternate network interface. This blog will detail how to implement these steps and what effects it has on the overall solution.
First if you are reading this post you should review the replication service deep dive whitepaper located at http://technet.microsoft.com/en-us/library/cc535020.aspx (“White Paper: Continuous Replication Deep Dive"). When reviewing this whitepaper it is important to pay attention to what sources are involved in replication when using standby continuous replication. For example:
Keeping these parameters in mind will help you understand how the following changes will allow for standby continuous replication to use an alternate network interface.
The steps to implement this vary little by operating system. Windows 2008, though, does introduce some changes to the way file shares are handled. Please review this blog post for information on how share scoping in Windows 2008 affects the operation of the replication service. (http://blogs.technet.com/timmcmic/archive/2008/12/23/exchange-replication-service-exchange-2007-sp1-and-windows-2008-clusters.aspx)
The following instructions are based on Exchange 2007 SP1 with RU7. All customers implementing these instructions are encouraged to do so on Exchange 2007 SP1 RU7.
Replication behavior when using standby continuous replication over an alternate network interface.
When the instructions are implemented as documented, all network traffic from the SCR target to the SCR source is first routed through the private interface. This can be verified with netmon by reviewing SMB (Windows 2003) or SMBv2 (Windows 2008) traffic.
It is important to note that these instructions only affect the LOG SHIPPING functionality of SCR. Other functions, such as update-storagegroupcopy, occur only over the public interface. This requires that both the source and target be able to communicate over both the public and private interfaces. Network sizing should take into account that re-seeding operations using update-storagegroupcopy must occur over the public interface.
Unlike the continuous replication host names feature in CCR, there is no automatic failover between interfaces. Should the private interface serving log shipping become unavailable for any reason, log shipping will fail. With this in mind, appropriate monitoring of log copy operations is necessary to ensure replication is functioning. In the event that the network link serving replication is not available, the hosts file entries should be removed and replication resumed over the public interface. As mentioned earlier, your network design should take into account the need to communicate over both the public and private interfaces, as well as the potential need to perform log shipping operations over the public interface.
For the solution to be fully supported network connectivity must be available between the source and target on both the private and public interfaces. All replication operations must be able to function on both interfaces.
When engaging product support services for assistance with replication when these steps are used you may be requested to remove the host file and verify that log shipping works as originally designed with no modifications.
Behavior of the cmdlets used for implementing / managing standby continuous replication when replication is enabled to use an alternate interface.
Get-storagegroupcopystatus: No issues noted.
Enable-storagegroupcopy: No issues noted.
Disable-storagegroupcopy: No issues noted.
Restore-storagegroupcopy: No issues noted when machines involved are running Exchange 2007 SP1 RU7. Prior to RU7 it may be necessary to use restore-storagegroupcopy –force for the command to complete successfully.
Update-storagegroupcopy: Because update-storagegroupcopy uses online streaming functionality to seed the database to the target, the associated network traffic occurs over the public interface.
Suspend-storagegroupcopy: No issues noted.
Resume-storagegroupcopy: No issues noted.
Changes to the SCR activation process when replication is enabled to use an alternate interface.
Whether using the database portability method or the single node cluster method, after running restore-storagegroupcopy the entries in the hosts file should be removed or commented out. Once the removal is complete, the DNS resolver cache should be flushed (ipconfig /flushdns) and a ping from the target machine to its own name performed to ensure DNS resolves to the correct IP address on the public interface.
When name resolution occurs successfully, move-mailbox –configurationonly or setup.com /recoverCMS can be run to complete the activation process.
Configuring networks and network interfaces to support standby continuous replication using an alternate network interface on Windows 2008.
The first step is to configure the network settings for the network interface that will be used for standby continuous replication. These instructions are performed on both the source and target machines. To configure these settings:
The network configuration process is then completed by updating the network binding orders. To update the network binding orders:
This completes the base networking configuration for standalone machines and clustered nodes.
Additional configuration steps for SCR source servers on Windows 2008.
Additional configuration steps for SCR Targets on Windows 2008.
These instructions apply to both standalone and single node SCR targets based on Windows 2008.
Using notepad, open the hosts file located at c:\Windows\System32\Drivers\Etc.
Depending on the source make the following changes:
Here is a sample hosts file.
# Copyright (c) 1993-2006 Microsoft Corp.
#
# This is a sample HOSTS file used by Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to host names. Each
# entry should be kept on an individual line. The IP address should
# be placed in the first column followed by the corresponding host name.
# The IP address and the host name should be separated by at least one
# space.
#
# Additionally, comments (such as these) may be inserted on individual
# lines or following the machine name denoted by a '#' symbol.
#
# For example:
#
#      102.54.94.97     rhino.acme.com          # source server
#       38.25.63.10     x.acme.com              # x client host

127.0.0.1       localhost
::1             localhost
#Exchange 2007 SP1 / Windows 2008 / Standalone Mailbox Server
10.1.1.1       2008-MBX1
10.1.1.1       2008-MBX1.exchange.msft
#Exchange 2007 SP1 / Windows 2008 / Cluster Continuous Replication (CCR)
10.1.1.3       2008-Node1
10.1.1.3       2008-Node1.exchange.msft
10.1.1.4       2008-Node2
10.1.1.4       2008-Node2.exchange.msft
10.1.1.8       2008-Node5
10.1.1.8       2008-Node5.exchange.msft
10.1.1.9       2008-Node6
10.1.1.9       2008-Node6.exchange.msft
#Exchange 2007 SP1 / Windows 2008 / Single Copy Cluster (SCC)
10.1.1.7       2008-MBX4
10.1.1.7       2008-MBX4.exchange.msft
Additionally, the replication service may on occasion have to resort to NetBIOS name resolution. To ensure that the correct replication address is always returned, edit the LMHOSTS file and add entries for the NetBIOS name and corresponding IP address.
Using notepad, open the LMHOSTS file located at c:\Windows\System32\Drivers\Etc.
Here is a sample LMHOSTS file.
# Copyright (c) 1993-1999 Microsoft Corp.
#
# This is a sample LMHOSTS file used by the Microsoft TCP/IP for Windows.
#
# This file contains the mappings of IP addresses to computernames
# (NetBIOS) names. Each entry should be kept on an individual line.
# The IP address should be placed in the first column followed by the
# corresponding computername. The address and the computername
# should be separated by at least one space or tab. The "#" character
# is generally used to denote the start of a comment (see the exceptions
# below).
#
# This file is compatible with Microsoft LAN Manager 2.x TCP/IP lmhosts
# files and offers the following extensions:
#
#      #PRE
#      #DOM:<domain>
#      #INCLUDE <filename>
#      #BEGIN_ALTERNATE
#      #END_ALTERNATE
#      \0xnn (non-printing character support)
#
# Following any entry in the file with the characters "#PRE" will cause
# the entry to be preloaded into the name cache. By default, entries are
# not preloaded, but are parsed only after dynamic name resolution fails.
#
# Following an entry with the "#DOM:<domain>" tag will associate the
# entry with the domain specified by <domain>. This affects how the
# browser and logon services behave in TCP/IP environments. To preload
# the host name associated with #DOM entry, it is necessary to also add a
# #PRE to the line. The <domain> is always preloaded although it will not
# be shown when the name cache is viewed.
#
# Specifying "#INCLUDE <filename>" will force the RFC NetBIOS (NBT)
# software to seek the specified <filename> and parse it as if it were
# local. <filename> is generally a UNC-based name, allowing a
# centralized lmhosts file to be maintained on a server.
# It is ALWAYS necessary to provide a mapping for the IP address of the
# server prior to the #INCLUDE. This mapping must use the #PRE directive.
# In addition the share "public" in the example below must be in the
# LanManServer list of "NullSessionShares" in order for client machines to
# be able to read the lmhosts file successfully. This key is under
# \machine\system\currentcontrolset\services\lanmanserver\parameters\nullsessionshares
# in the registry. Simply add "public" to the list found there.
#
# The #BEGIN_ and #END_ALTERNATE keywords allow multiple #INCLUDE
# statements to be grouped together. Any single successful include
# will cause the group to succeed.
#
# Finally, non-printing characters can be embedded in mappings by
# first surrounding the NetBIOS name in quotations, then using the
# \0xnn notation to specify a hex value for a non-printing character.
#
# The following example illustrates all of these extensions:
#
# 102.54.94.97     rhino         #PRE #DOM:networking  #net group's DC
# 102.54.94.102    "appname  \0x14"                    #special app server
# 102.54.94.123    popular            #PRE             #source server
# 102.54.94.117    localsrv           #PRE             #needed for the include
#
# #BEGIN_ALTERNATE
# #INCLUDE \\localsrv\public\lmhosts
# #INCLUDE \\rhino\public\lmhosts
# #END_ALTERNATE
#
# In the above example, the "appname" server contains a special
# character in its name, the "popular" and "localsrv" server names are
# preloaded, and the "rhino" server name is specified so it can be used
# to later #INCLUDE a centrally maintained lmhosts file if the "localsrv"
# system is unavailable.
#
# Note that the whole file is parsed including comments on each lookup,
# so keeping the number of comments to a minimum will improve performance.
# Therefore it is not advisable to simply add lmhosts file entries onto the
# end of this file.
10.1.1.1 2008-MBX1
10.1.1.3       2008-Node1
10.1.1.4       2008-Node2
10.1.1.8       2008-Node5
10.1.1.9       2008-Node6
10.1.1.7 2008-MBX4
This completes the configuration steps for Windows 2008.
Configuring networks and network interfaces to support standby continuous replication using an alternate network interface on Windows 2003.
Additional configuration steps for SCR source servers on Windows 2003.
Additional configuration steps for SCR Targets on Windows 2003.
These instructions apply to both standalone and single node SCR targets based on Windows 2003.
127.0.0.1 localhost
#Exchange 2007 SP1 / Windows 2003 / Standalone Mailbox Server
10.1.1.1       2003-MBX1
10.1.1.1       2003-MBX1.exchange.msft
#Exchange 2007 SP1 / Windows 2003 / Cluster Continuous Replication (CCR)
10.1.1.3       2003-Node1
10.1.1.3       2003-Node1.exchange.msft
10.1.1.4       2003-Node2
10.1.1.4       2003-Node2.exchange.msft
#Exchange 2007 SP1 / Windows 2003 / Single Copy Cluster (SCC)
10.1.1.7       2003-MBX4
10.1.1.7       2003-MBX4.exchange.msft
10.1.1.1 2003-MBX1
10.1.1.3       2003-Node1
10.1.1.4       2003-Node2
10.1.1.7 2003-MBX4
This completes the configuration steps for Windows 2003.
===============================
Updated Sunday, August 9th, 2009 with LMHOST instructions.
When using any form of multi-machine Exchange 2007 replication (CCR / SCR), Kerberos authentication is very important. We leverage the rights of Exchange server machine accounts for several functions, including the ability to replicate log files and to utilize the remote registry service for cmdlets.
Some Background…
In terms of the replication service we copy logs between CCR and SCR servers using SMB file shares. These shares are created by the replication service. Permissions to access these shares are derived by assigning the share permission READ to the Exchange Servers group.
Note: We use a very restrictive read heuristic which does not fully equal the same read permission as set through the GUI, so you’ll have to trust me here – the group’s effective share permission is read.
In the Exchange Servers group we automatically place the machine account for each Exchange server installed.
By adding each Exchange server's machine account to the Exchange Servers group, and granting the Exchange Servers group the appropriate permissions, we’ve effectively allowed the machine accounts read access to the shares. The replication service, which runs under the local system security context, can then access the shares to pull log files.
The Issue…
It is becoming more common to see environments with no WINS installation. It is still a requirement of the product, though, that short name resolution work. Many administrators address this by using a DNS suffix list, which can be set in the advanced settings of TCP/IP.
When a short name resolution request is made, the domains are appended in order as specified on the list. If the name can be found in the DNS zone of the name as appended, then this combination is returned as the fully qualified domain name.
For example, if I have a host Server1 that is registered only in the DNS namespace exchange.msft, when I issue a ping request to Server1 the first domain appended is external.exchange.msft. Since the machine is not registered in that DNS domain, the second suffix is appended, in this case exchange.msft. With the machine registered in this domain, a successful name query response is received and the ping continues successfully to server1.exchange.msft.
An ipconfig /all will display the appended DNS suffixes and their order.
In many circumstances the machine will resolve in only a single DNS domain. If this is the case, the suffix list will not affect Kerberos authentication. An issue occurs when the machine resolves in more than one domain, and the domain in which it resolves first does not match the Active Directory domain of the machine account (which is what is registered in the service principal name records). Let’s look at an example of how this can be an issue.
In this example the machine Server1 is registered in the DNS domains external.exchange.msft and exchange.msft. The Active Directory DNS namespace for the domain that the machine is a member of is exchange.msft.
My appended DNS suffix list is ordered as follows:
external.exchange.msft
exchange.msft
domain.com
When the replication service attempts to copy a log file, it begins the authentication process to the share. The first step in this process is obtaining a Kerberos ticket so we can leverage the permissions of the machine account (local system) for share access. The first name in the suffix list is appended, and a successful name resolution occurs. In this case the fully qualified domain name is believed to be Server1.external.exchange.msft. At this time the Kerberos key distribution center is contacted, and a ticket is issued for Server1.external.exchange.msft. The next step is to access the share presenting this Kerberos ticket. At this point access denied is returned by the share, and logs cannot be copied.
The access denied is returned because the service principal names stamped on the machine account in Active Directory for the server do not include Server1.external.exchange.msft; they include only Server1.exchange.msft (the AD domain name). You can see the SPNs registered on the server by doing an LDP dump of the computer object in the Active Directory domain container. Here is an example:
servicePrincipalName (4): MSServerClusterMgmtAPI/2008-NODE1; MSServerClusterMgmtAPI/2008-Node1.exchange.msft; HOST/2008-Node1; HOST/2008-Node1.exchange.msft;
The issue in this case is easily corrected: change the appended DNS suffix list so that the Active Directory domain is first. For example:

exchange.msft

external.exchange.msft

domain.com
With the updated DNS suffix list the server name determined is server1.exchange.msft. This name matches the service principal name entries, authentication occurs successfully, and log replication can therefore occur without issues.
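The suffix-appending behavior above can be modeled in a few lines. This is an illustrative sketch, not the Windows resolver itself; the zone contents and suffix lists mirror the Server1 example, and the helper name is made up for illustration.

```python
# Sketch of how a DNS suffix search list turns a short name into an FQDN.
# The zone data and suffix lists are illustrative only.

def resolve_short_name(short_name, suffix_list, dns_zones):
    """Append each suffix in order; return the first FQDN present in DNS."""
    for suffix in suffix_list:
        fqdn = f"{short_name}.{suffix}"
        if fqdn in dns_zones:
            return fqdn
    return None

# Server1 is registered in both namespaces, as in the example above.
zones = {"server1.external.exchange.msft", "server1.exchange.msft"}

# Problematic order: the first match is NOT the AD domain name, so the
# Kerberos ticket would be requested for a name with no matching SPN.
bad_order = ["external.exchange.msft", "exchange.msft", "domain.com"]
print(resolve_short_name("server1", bad_order, zones))

# With the AD domain listed first, the resolved FQDN matches the SPN.
good_order = ["exchange.msft", "external.exchange.msft", "domain.com"]
print(resolve_short_name("server1", good_order, zones))
```

The fix changes only the search order, not the zone data, which is why it is safe to apply without touching DNS itself.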
Other functions besides log replication can be impacted by the appended DNS suffix list. For example, certain cmdlets such as get-storagegroupcopystatus and update-storagegroupcopy leverage the rights of the local system to access the remote registry service. These cmdlets can also suffer access denied conditions when authenticated remote registry connections between servers fail.
Here is a sample of the error text from a failed get-storagegroupcopystatus:
Microsoft Exchange Replication service RPC failed : Microsoft.Exchange.Rpc.RpcException: Error e0434f4d from cli_GetCopyStatusEx
at Microsoft.Exchange.Rpc.Cluster.ReplayRpcClient.GetCopyStatusEx(Guid[] sgGuids, RpcStorageGroupCopyStatus[]& sgStatuses)
at Microsoft.Exchange.Cluster.Replay.ReplayRpcClientWrapper.InternalGetCopyStatus(String serverName, Guid[] sgGuids, RpcStorageGroupCopyStatus[]& sgStatuses, Int32 serverVersion)
at Microsoft.Exchange.Cluster.Replay.RpcCopyStatusInfo.GetMergedStatusResults()
at Microsoft.Exchange.Management.SystemConfigurationTasks.GetStorageGroupCopyStatus.PrepareStatusEntryFromRpc(Boolean fCcr, Server server, StorageGroup storageGroup, StorageGroupCopyStatusEntry& entry)
The moral of the story…
Replication and cmdlet issues on Exchange servers can be avoided when using an appended DNS suffix list by ensuring that the Active Directory DNS domain is the first suffix appended.
When attempting to establish the cluster service on nodes that utilize a disjoint DNS namespace, the following errors may be encountered:
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: Date_Time
Event ID: 1127
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: ComputerName
Description: Cluster Network interface InterfaceName for cluster node NodeName on network NetworkName failed. Run the Validate a Configuration wizard to check your network configuration.
Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: Date_Time
Event ID: 1207
Task Category: Network Name Resource
Level: Error
Keywords:
User: SYSTEM
Computer: Computer-name.domain.com
Description: Cluster network name resource 'Cluster Name' cannot be brought online. The computer object associated with the resource could not be updated in domain 'disjoined.domain.com' for the following reason: Unable to update password for computer account. The text for the associated error code is: The password does not meet the password policy requirements. Check the minimum password length, password complexity and password history requirements. The cluster identity 'Cluster-name$' may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain.
If you see errors similar to these, check out the following two links, which may apply.
http://technet.microsoft.com/en-us/library/cc755926(WS.10).aspx
http://support.microsoft.com/kb/952247/en-us
If you are using a Windows 2008 / Exchange 2007 Single Copy Cluster (SCC), you should read my co-worker's blog post regarding disk resource is-alive checking.
http://blogs.technet.com/brian_kern/archive/2009/06/04/windows-2008-and-the-isalive-sanity-check.aspx
Tim
Recently there have been some questions about transaction log rolling and continuous replication. In some cases these questions often surround storage group copy status showing an initializing state (http://blogs.technet.com/timmcmic/archive/2009/01/26/get-storagegroupcopystatus-initializing.aspx).
Under normal circumstances, the only time a log rolls is when we’ve reached a log full condition. If the server is being utilized this is not a problem, as logs roll naturally as the server processes activity.
There are times, though, when the server is relatively idle. The current log generation would then not receive enough transaction activity to cause it to roll over. This is where “transaction log roll” is important. If the current log file (ENN.log) contains a durable (or hard) commit, and that log is not filled within a period of time, it will be rolled over and shipped to the other side. (This is not an immediate process; if we rolled a log over every time there was a durable commit we'd generate a ton of logs.) The article referenced above gives examples of how to calculate the time at which a log rolls over should it contain a durable (hard) commit. It also contains the following text highlighting this behavior:
“The log roll mechanism does not generate transaction logs in the absence of user or other database activity. In fact, log roll is designed to occur only when there is a partially filled log.”
This information is important to us for several reasons.
The first question is generally: if logs roll, why do my storage groups stay in an initializing state for hours at a time? The answer is that the current log does not contain a durable commit. If you restart the replication service, or suspend and resume a replication instance manually, the first replication state you will encounter is initializing. We remain in initializing until a log is generated, copied, inspected, and put out for replay with divergence information determined. If no durable (hard) commit exists in the source log stream, the logs may not roll over until there is a durable commit or user activity, which means replication can stay in an initializing state for a while. My suggestion, if this is a test environment, is to simply send mail, dismount the source databases, etc. In production, I've seen people script email to test mailboxes at a scheduled time, with a test mailbox located in each database. This causes a durable commit, which eventually results in log file roll over and shipment to the other side.
The second reason is that log file roll can cause churn in the log file stream which does not appear normal. If you reference the link above, you can see that an idle storage group could generate up to 960 log files a day. This is especially true if the storage group contains some type of system mailbox (which Exchange accesses, causing a durable commit) or test mailboxes which a user is accessing. In either scenario there may not be enough load to force log roll to occur naturally, so Exchange rolls the log for you after a certain time. This causes some concern, especially when looking at the log file drive on a test server and questioning why so many logs were generated. There wasn't enough traffic to generate 960 MB of log data, which is probably correct, but there was enough traffic to put a durable commit into each of those 960 logs, such that we rolled and shipped them without their being full in an attempt to keep both sides up to date.
The third reason I point this out is that there seems to be confusion about when log roll should occur. This leads people to believe log roll should occur no matter what, when, as indicated, it should only occur if the log contains a durable (hard) commit.
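The 960-logs-a-day figure above is easy to sanity check. This sketch assumes the idle log roll interval works out to one partially filled log every 90 seconds, which is what the 960/day figure implies; see the linked article for how the actual interval is calculated.

```python
# Back-of-the-envelope check of the "up to 960 logs a day" figure for an
# idle storage group with durable commits in every log.

SECONDS_PER_DAY = 24 * 60 * 60
roll_interval_seconds = 90   # assumed idle roll interval (implied by 960/day)

max_idle_logs_per_day = SECONDS_PER_DAY // roll_interval_seconds
print(max_idle_logs_per_day)
```

At 1 MB per log, that is roughly 960 MB a day of shipped logs from a storage group that is doing almost nothing, which is why the churn can look alarming on a quiet server.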
There are other operations besides user activity or a durable (hard) commit which will cause the current transaction log to roll:
I hope everyone finds this information helpful.
Recently there was a lively internal debate regarding how to use restore-storagegroupcopy and the –force switch.
The documentation regarding the restore-storagegroupcopy command can be found at http://technet.microsoft.com/en-us/library/aa996024.aspx.
According to the TechNet documentation:
“The Force parameter can be used when the task is run programmatically and prompting for administrative input is inappropriate. If Force is not provided in the cmdlet, administrative input is prompted. If Force is provided in the cmdlet, but the value is omitted, its default value is $true. When the Restore-StorageGroupCopy cmdlet is run to make an SCR target viable for mounting, the Force parameter must be included when the SCR source is not available.”
You’ll notice in this text that –force is required for standby continuous replication when the SCR source is not available.
So the first question is what constitutes the source being unavailable. In the most general terms the source is unavailable when the shares where the log files reside are not available such that the restore-storagegroupcopy command can be run and the remaining logs copied between machines.
For Windows 2003 based sources, and Windows 2008 non-shared storage clusters, the shares are generally not available when the entire machine is offline. For Windows 2008 shared storage clusters, the shares may not be available because their corresponding file server resources are offline in the clustered mailbox server group (for example, a stop-clusteredmailboxserver was issued taking the entire CMS offline, including the file server resources). Of course there are other reasons that shares may not be available, like network issues / misc hardware issues / etc.
The reason I point this out is that if the source is available, and –force is used, we will not copy the delta logs over to the SCR target before marking the databases mountable. This effectively causes the database mount process to fail, indicating that log files necessary for recovery are not present. Manual recovery using eseutil /r /a would have to be performed in order for the databases to mount.
The second question is: how can I avoid this happening to me? The answer is simple. If you run restore-storagegroupcopy without –force, we will attempt to copy the delta logs. Should the source be unavailable, the copy procedure fails with a meaningful message indicating that the delta logs cannot be copied and that –force is necessary. After receiving this error you can repeat the restore-storagegroupcopy, this time specifying –force. Since –force was required, the logs will not be copied (the source is unavailable) but the databases will be marked mountable.
Rule of Thumb: First try restore-storagegroupcopy and only run restore-storagegroupcopy –force if indicated to do so in the error text of the command.
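The rule of thumb above can be expressed as a tiny decision sketch. This models the operator procedure only, not the actual cmdlet internals; the function name and return shape are made up for illustration.

```python
# Illustrative model of the rule of thumb: try the restore without -Force
# first, and only add -Force when the error text tells you to (i.e. the
# source shares are unreachable). NOT the real cmdlet logic.

def plan_restore(source_shares_available: bool) -> list[str]:
    """Return the sequence of restore attempts an operator should make."""
    attempts = ["Restore-StorageGroupCopy"]
    if not source_shares_available:
        # First attempt fails with "use the -Force parameter"; re-run with it.
        attempts.append("Restore-StorageGroupCopy -Force")
    return attempts

print(plan_restore(True))   # source up: delta logs copied, no -Force needed
print(plan_restore(False))  # source down: -Force required on the second try
```

Following this order guarantees delta logs are copied whenever the source shares are actually reachable, avoiding the unnecessary data loss described above.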
Example of successful activation using restore-storagegroupcopy where the shares are available (no –force used).
Environment: Source cluster / target standalone.
The source clustered mailbox server was stopped using stop-clusteredmailboxserver.
An eseutil /ml of the source log directory was run; the end of the output can be seen here. You will see that the log stream is complete through E01.log.
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000070.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000071.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000072.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000073.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000074.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000075.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000076.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E01.log - OK
No damaged log files were found.
Operation completed successfully in 14.921 seconds.
Prior to running the restore-storagegroupcopy, an eseutil /ml was run against the logs on the SCR target. You will note that the same logs are present, with the exception of E01.log. (This is expected; even when the source CMS is shut down gracefully the last log in the series is not copied to the SCR target.)
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000070.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000071.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000072.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000073.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000074.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000075.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000076.log - OK
Operation completed successfully in 7.63 seconds.
At this time the shares on the source are available, and the mailbox stores are dismounted. A restore-storagegroupcopy –standbymachine <machine> is run and completes without error. The following events are noted in the application log.
Log Name: Application
Source: MSExchangeRepl
Date: 4/30/2009 8:20:16 AM
Event ID: 2114
Task Category: Service
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: The replication instance for storage group MBX-2\MBX-2-SG2 has started copying transaction log files. The first log file successfully copied was generation 119.

Log Name: Application
Source: MSExchangeRepl
Date: 4/30/2009 8:20:16 AM
Event ID: 2085
Task Category: Action
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: The Restore-StorageGroupCopy operation on MBX-2\MBX-2-SG2 was successful. All logs were successfully copied.
I then followed up with an eseutil /ml of the target log directory. You will note that after the restore-storagegroupcopy –standbymachine:<machine> the E01.log is now present; it was successfully copied as part of the restore process.
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000071.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000072.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000073.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000074.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000075.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000076.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E01.log - OK
Operation completed successfully in 7.250 seconds.
The last operation was to mount the databases. At this time the databases mounted successfully – eseutil /r /a was not required.
Log Name: Application
Source: MSExchangeIS Mailbox Store
Date: 4/30/2009 8:25:06 AM
Event ID: 9523
Task Category: General
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: The Microsoft Exchange Database "MBX-3-SG2\MBX-3-SG2-DB1" has been started.

Database File: G:\MBX-2\MBX-2-SG2-Database\MBX-2-SG2-DB1.edb
Transaction Logfiles: F:\MBX-2\MBX-2-SG2-Logs\
Base Name (logfile prefix): E01
System Path: E:\MBX-2\MBX-2-SG2-System\
Example of successful activation using restore-storagegroupcopy where the shares are not available (-force used).
The clustered nodes comprising the source solution were completely shut down, making the source unavailable.
Prior to shutting the nodes down, and after issuing a stop-clusteredmailboxserver, an eseutil /ml was run against the log directory. You will see the log stream is complete through E01.log.
Verifying log files...
Base name: e01
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000092.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000093.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000094.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E01.log - OK
Operation completed successfully in 0.78 seconds
Prior to running the restore-storagegroupcopy, an eseutil /ml was run against the logs on the SCR target. You will note that the E01.log is not present.
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000092.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000093.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000094.log - OK
Operation completed successfully in 0.64 seconds.
At this time a restore-storagegroupcopy –standbymachine <MACHINE> was issued. The following error was noted and expected since the source is no longer available.
[PS] G:\>Restore-StorageGroupCopy -Identity MBX-2\MBX-2-SG2 -StandbyMachine MBX-3
Restore-StorageGroupCopy : Restore failed to verify if the database on 'MBX-2' is mounted. Verify that the database is dismounted and then use the -Force parameter to restore the storage group copy.
At line:1 char:25
+ Restore-StorageGroupCopy <<<< -Identity MBX-2\MBX-2-SG2 -StandbyMachine MBX-3
After receiving the error that –force was necessary, the command was re-run using restore-storagegroupcopy –standbymachine <machine> –force. The following information was presented in the Exchange Management Shell window:
[PS] G:\>Restore-StorageGroupCopy -Identity MBX-2\MBX-2-SG2 -StandbyMachine MBX-3 -force
WARNING: Performing a Restore-StorageGroupCopy operation on storage group 'MBX-2-SG2' with the Force option. Data loss is expected for this storage group.
The following events were noted in the application log:
Log Name: Application
Source: MSExchangeRepl
Date: 5/3/2009 10:37:39 AM
Event ID: 2139
Task Category: Action
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: The forced Restore-StorageGroupCopy operation on MBX-2\MBX-2-SG2 was successful. However, there may be some data loss.
After the command completed successfully, an eseutil /ml was performed against the log stream. You will note that the E01.log is not present in the target log directory, since the remaining logs could not be copied due to the SCR source being unavailable.
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000092.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000093.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000094.log - OK
At this time the database was successfully mounted as indicated by the following event in the application log.
Log Name: Application
Source: MSExchangeIS Mailbox Store
Date: 5/3/2009 10:44:06 AM
Event ID: 9523
Task Category: General
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: The Microsoft Exchange Database "MBX-3-SG2\MBX-3-SG2-DB1" has been started.
Example of successful activation using restore-storagegroupcopy where the shares are available (-force used).
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007A.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007B.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007C.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007D.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007E.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E01.log - OK
Operation completed successfully in 16.219 seconds.
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000079.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007A.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007B.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007C.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007D.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007E.log - OK
Operation completed successfully in 0.359 seconds.
At this time a restore-storagegroupcopy with the –force parameter was run. Please note: the source shares are available, so –force is NOT NECESSARY. Here is sample Exchange Management Shell output.
[PS] C:\Windows\System32>Restore-StorageGroupCopy -Identity MBX-2\MBX-2-SG2 –StandbyMachine MBX-3 –force
WARNING: Performing a Restore-StorageGroupCopy operation on storage group 'MBX-2-SG2' with the Force option. Data loss is expected for this storage group.
The command completed successfully as indicated by returning to the Exchange Management Shell prompt without error. The following event was noted in the application log.
Log Name: Application
Source: MSExchangeRepl
Date: 5/1/2009 8:29:41 AM
Event ID: 2139
Task Category: Action
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: The forced Restore-StorageGroupCopy operation on MBX-2\MBX-2-SG2 was successful. However, there may be some data loss.
As a follow-up, eseutil /ml was run against the logs on the SCR target machine. You will note that the E01.log was not copied even though the restore-storagegroupcopy –force command completed successfully.
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000077.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000078.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E0100000079.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007A.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007B.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007C.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007D.log - OK
Log file: F:\MBX-2\MBX-2-SG2-Logs\E010000007E.log - OK
Operation completed successfully in 0.187 seconds.
At this time a database mount attempt was performed, and failed with the following events noted in the application log.
Log Name: Application
Source: MSExchangeIS
Date: 5/1/2009 8:32:13 AM
Event ID: 9518
Task Category: General
Level: Error
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: Error Current log file missing starting Storage Group /DC=com/DC=domain/DC=domain/CN=Configuration/CN=Services/CN=Microsoft Exchange/CN=Organization/CN=Administrative Groups/CN=Exchange Administrative Group (FYDIBOHF23SPDLT)/CN=Servers/CN=MBX-3/CN=InformationStore/CN=MBX-3-SG2 on the Microsoft Exchange Information Store. Storage Group - Initialization of Jet failed.
Log Name: Application
Source: ESE
Date: 5/1/2009 8:32:13 AM
Event ID: 455
Task Category: Logging/Recovery
Level: Error
Keywords: Classic
User: N/A
Computer: MBX-3.exchange.msft
Description: MSExchangeIS (2984) MBX-3-SG2: Error -1811 (0xfffff8ed) occurred while opening logfile f:\MBX-2\MBX-2-SG2-Logs\E01.log.
The –1811 error translates to:
# for decimal -1811 / hex 0xfffff8ed
JET_errFileNotFound # /* File not found */
JET_errFileNotFound # /* File not found */
JET_errFileNotFound # /* File not found */
# 3 matches found for "-1811"
In this case the –force switch was improperly used, resulting in logs not being copied to the SCR target. The databases could still be mounted if they were manually recovered using eseutil /r /a, or if the logs were manually copied to the SCR target.
This behavior is BY DESIGN. The –force switch does not check to see if the SCR source is available; therefore no log file copy attempts are made.
There may be times, in either Windows 2003 or Windows 2008, when it becomes necessary to evict a clustered node that has Exchange 2007 installed on it.
Under normal circumstances evicting a clustered node is a benign procedure. When the node has Exchange 2007 installed on it, however, special precautions must be taken.
When Exchange 2007 is installed on a clustered node, a special DLL (exres.dll) is registered with the cluster service. This DLL contains the cluster extensions that define the system attendant, information store, and database instance clustered resources. You can see the resource definitions in the cluster registry hive (HKLM –> Cluster –> ResourceTypes).
If you select one of the Exchange resource types, you will see that the DLL that defines it (DLLName) is exres.dll.
The resource types that are registered in a cluster are local to each node. When a node is evicted from a cluster, the local configuration is destroyed. If the node is joined back to an existing cluster, the Exchange resource types are no longer registered. This will effectively prevent this node from participating in the cluster.
In terms of Exchange there is no manual way to re-register the cluster extensions, and Exchange 2007 does not have a reinstall procedure. If you attempt to rerun setup for the passive mailbox role, an error is generated indicating the role is already installed (because technically it is). In some cases you may be able to uninstall the mailbox role successfully; where the uninstall is not successful, though, there are no manual removal steps that can be used. The worst case scenario is that the entire operating system must be rebuilt in order to facilitate installing Exchange.
To avoid this, use the following steps to successfully remove Exchange to facilitate evicting a clustered node:
1) Run setup.com /mode:uninstall /roles:mt,mb
(Note: MT is necessary to remove the management tools. By default, any role install also includes the management tools, but any uninstall only applies to the role specified – for a complete removal you must specify both the mailbox role and the management tools role.)
2) Evict the node from the cluster.
3) Re-join the node to the cluster.
4) Run setup.com /mode:install /roles:mailbox to re-establish the passive node mailbox role installation.
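The steps above can be sketched as a command sequence. This is a hedged outline run from the Exchange setup location on the passive node being evicted; the cluster and node names are placeholders:

```cmd
REM 1) Remove the mailbox role AND the management tools from the node.
setup.com /mode:uninstall /roles:mt,mb

REM 2) Evict the node from the cluster (placeholders for your names).
cluster.exe <clusterFQDN> node <NodeName> /evict

REM 3) Re-join the node to the existing cluster using Cluster Administrator
REM    or Failover Cluster Management, then:

REM 4) Re-establish the passive node mailbox role installation.
setup.com /mode:install /roles:mailbox
```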
Occasionally I have an opportunity to work with our Exchange MVPs. Neil Hobson, one of our Exchange MVPs, has recently started a series on how to test SCR in a production environment.
If you have the time I would suggest checking it out!
Part 1: http://www.msexchange.org/articles_tutorials/exchange-server-2007/high-availability-recovery/testing-scr-production-environment-part1.html
Part 2: http://www.msexchange.org/articles_tutorials/exchange-server-2007/high-availability-recovery/testing-scr-production-environment-part2.html
Look for parts 3 and 4 in the upcoming weeks.
When running ExBPA against a Windows 2008 cluster (either CCR or SCC) the following best practices may be flagged:
These rules only apply to Windows 2003 clusters and the recommendations should not be followed for Exchange 2007 / Windows 2008 clusters.
This is scheduled to be corrected in a future ExBPA rules update.
As many probably already know Microsoft has announced the next version of Exchange – Exchange Server 2010.
Since I like to write about high availability, I thought I’d let you know of two things that are no longer present in Exchange 2010 from a high availability standpoint.
In Exchange 2010 the high availability landscape has changed to something known as the Database Availability Group.
I’ll write more about Exchange 2010 later and as things progress…but keep this information in mind as you think about your HA and Exchange 2010 deployments.
(An Exchange 2010 version of this article can be found here: http://blogs.technet.com/b/timmcmic/archive/2010/05/30/network-port-design-and-exchange-2010-database-availability-groups.aspx)
A question that I often see come up is how many network ports should I have in my clustered nodes and how should I use them.
I generally see three different hardware configurations:
In some hardware there are now 4 port cards. The information contained here can be expanded to include additional hardware / port configurations as they become available.
You’ll note that there is no configuration with a single network port – all clustered installations must at minimum have two network interfaces. (Note: VLANS to a single port are not two network interfaces).
When I’m advising customers, the number of ports I recommend is dependent on the installation of Exchange.
Using these recommendations I’ll break down their uses below using Windows 2008 clustering terminology.
Network Teaming
In the recommendations I’ll outline next you will see references to the use of network teaming. It’s important to note that Microsoft does not support network teaming itself, as this is hardware-vendor designed and supported technology. There is, though, a recognition that in the absence of any other way to provide multiple client-facing ports for Exchange, network teaming has a valid place in the overall high availability design.
When using network teaming, only the client-facing network should be a teamed adapter, and at all times the team should be created for NETWORK FAULT TOLERANCE. Do not, for an Exchange instance, use any type of load balancing between ports.
For non-client-facing networks (these would typically be your “heartbeat” networks) it is neither supported nor necessary to implement a network team. (Refer to: http://support.microsoft.com/kb/254101). Windows clustering has the ability to balance and use all interfaces designated for cluster use without the need to establish teaming for cluster / heartbeat communications.
From a support perspective any customer that establishes a teamed interface for the client side network should recognize that they may be asked to dissolve the team to support troubleshooting efforts.
Exchange 2007 SP1 SCC (Single Copy Cluster) – Two Network Ports
When using a single copy cluster with two network ports, your options are limited. Consider the following design:
Exchange 2007 SP1 SCC (Single Copy Cluster) – Three Network Ports
This option provides you with some additional flexibility and also allows you to mitigate issues on the client facing network.
Exchange 2007 SP1 CCR (Cluster Continuous Replication) – Two Network Ports
When using a cluster continuous replication cluster with two network ports, your options are limited. Consider the following design:
You’ll notice that in this configuration both networks are set to “allow clients to connect through this network”. This is necessary in order to establish the “private” network for use with log shipping functions.
To establish this network for log shipping functions, refer to the enable-continuousreplicationhostnames commandlet. (http://technet.microsoft.com/en-us/library/bb690985.aspx / http://technet.microsoft.com/en-us/library/bb629629.aspx)
If used, the replication service will prefer to perform log shipping functions over the “private” network. Should the private network be unavailable, the replication service will resume log shipping functions over the “public” network.
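To sketch how the “private” network might be designated for log shipping, here is a hedged example of the commandlet referenced above. The node names, host names, and addresses are hypothetical; consult the linked documentation for the authoritative parameter set:

```powershell
# Hypothetical example: register a replication host name on each node's
# second ("private") network so the replication service prefers it.
Enable-ContinuousReplicationHostName -TargetMachine NodeA -HostName NodeA-Repl -IPv4Address 10.0.0.1
Enable-ContinuousReplicationHostName -TargetMachine NodeB -HostName NodeB-Repl -IPv4Address 10.0.0.2
```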
Exchange 2007 SP1 CCR (Cluster Continuous Replication) – Three Network Ports
This option for CCR provides some additional flexibility. We can use a minimum of three ports to assist us in mitigating two factors that affect the overall high availability of this solution.
Consider using these ports in the following manner:
Exchange 2007 SP1 CCR (Cluster Continuous Replication) – Four Network Ports
This option provides even greater flexibility in terms of minimizing single points of failure and assisting to ensure log shipping functions can be successful.
I generally see the four port design implemented in two methods:
Method 1 allows individuals to forgo network teaming on the public interface. In this case the public interface is a single point of failure. It does, though, allow for two secondary replication networks for log shipping functions, providing three overall paths for replication service use.
Method 2, which is my personal preferred method, allows for individuals to establish a team on the public interface. This provides high availability for the client facing network ports. In addition, the remaining two interfaces are enabled for continuous replication host names. This allows the replication service to have two secondary replication networks for log shipping functions providing three overall paths for replication service use.
You’ll notice that in this configuration the non-client-facing networks are set to “allow clients to connect through this network”. This is necessary in order to establish the “private” network for use with log shipping functions.
I hope this information was helpful as you consider the number of network ports for your Exchange clustered nodes.
*Updates
5/19/09 – Two port CCR scenario updated to include the recommendation to use a continuous replication host name for the second port.
With Exchange 2007 Cluster Continuous Replication clusters the recommended quorum type for the host cluster is Majority Node Set with File Share Witness (Windows 2003) or Node Majority and File Share Witness (Windows 2008).
In this blog post I want to talk about two things that have an influence on these decisions.
(Note: All information in this blog assumes a two node scenario since that is the maximum node count supported on Exchange 2007 CCR based clustered installations.)
The first item is placement of the file share witness.
In order for a two node solution to maintain quorum, we have to have a minimum of two votes. In our two node cluster scenarios we attempt to maintain two votes by locking the file share witness location. When a node has the ability to establish an SMB file lock on the file share witness, that node gets the benefit of the vote. The node that holds the minimum two votes necessary has quorum, and will stay functional and host applications. The node left with the remaining single vote has lost quorum, and will terminate its cluster service.
When both nodes are in the same data center the placement of the file share witness is generally not an issue. When multiple data centers / physical locations are involved, where WAN connections are used to maintain connectivity between them, the placement of the file share witness is important.
In many scenarios customers are only dealing with a primary and a secondary data center. Generally I would recommend that the file share witness be placed in the location where Exchange will service user accounts. In this case, if the link between the two nodes is down (for example – WAN failure), Exchange will stay functioning on the server where users will be serviced. This is due to the fact that two votes are available in the primary data center, so that node has quorum, while only one vote is available in the secondary data center, so that node has lost quorum. In the event that the primary data center is actually lost, and the secondary data center must be activated, users could follow the appropriate forceQuorum instructions for their operating system to force the solution online.
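As a hedged sketch of the forceQuorum step mentioned above (run locally on the surviving node; the node name is a placeholder):

```cmd
REM Windows 2003 MNS clusters: force quorum listing the surviving node(s).
net start clussvc /forcequorum:NodeB

REM Windows 2008: force quorum on the local node.
net start clussvc /fq
```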
A consideration with the aforementioned scenario is that when connectivity is lost between the two data centers, Exchange stays functioning in the primary data center. Manual activation of the secondary data center would be necessary in the event of a full primary data center loss. Should the active node in the primary data center stop functioning, the solution would still function using the node in the secondary data center and the file share witness in the primary data center.
Another scenario is where the file share witness is placed in the secondary data center. Given the same WAN failure as outlined before, Exchange would automatically be moved to the node in the secondary data center since that is the only node that can maintain quorum (i.e. has two votes). The node in the primary data center does not have access to the file share witness, and will terminate its cluster service (lost quorum). This scenario does appeal to some. For example, should the primary data center be lost, Exchange would automatically come online in the secondary data center. What I consider a drawback of this design is that any communications loss between the primary and secondary data centers would result in Exchange automatically coming online only in the secondary data center, where it may be unable to service users (assuming users use the same WAN connection between data centers). As in the previous scenario, should the WAN be functioning and the node in the secondary data center be lost, Exchange would function in the primary data center using the file share witness in the remote data center to maintain quorum.
The last scenario is for customers that have at least three data centers. In this scenario, the assumption is that each data center has direct connectivity to each other (think triangle here). For example, Node A would be placed in DataCenter1, Node B in DataCenter2, and the File Share Witness in DataCenter3. Should DataCenter1 and DataCenter2 lose connectivity, each will have equal access to the file share witness. The first to successfully lock the file share witness gets the benefit of the vote, and can maintain quorum. Any node maintaining quorum in this scenario will continue to host existing applications, and will arbitrate applications from nodes that have lost quorum.
In the previous example you get automatic activation should either primary or secondary data center be unavailable, protection from a single WAN failure between any two datacenters, and automatic activation for any node failure.
In the first two examples above it is generally not relevant which node owns the cluster group. The ability to lock the file share witness is derived from its placement on either side of the WAN and the ability to maintain that WAN connection. It is in the three data center scenario that the location of the cluster group is important. Let’s take a look at that…
The second item – which node owns the cluster group (Applies to Windows 2003 Only).
In Windows 2003 the cluster group contains the cluster name, cluster IP address, and majority node set resource (configured to use file share witness).
If you review the private properties of the majority node set resource, you will see a timer value called MNSFileShareDelay. (cluster <clusterFQDN> res “Majority Node Set” /priv)
Cluster.exe cluster-1.exchange.msft res “Majority Node Set” /priv
Listing private properties for 'Majority Node Set':
S Majority Node Set MNSFileShare \\2003-DC1\MNS_FSW_Cluster-1
D Majority Node Set MNSFileShareCheckInterval 240 (0xf0)
D Majority Node Set MNSFileShareDelay 4 (0x4)
By default the MNSFileShareDelay is 4 seconds. You can configure this to a different value but in general this is not necessary.
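Should you ever need to change it, the same cluster.exe /priv syntax can set the value; a hedged sketch (the value 8 is an arbitrary example, and the cluster name follows this example):

```cmd
REM Example only: set MNSFileShareDelay to 8 seconds.
cluster.exe cluster-1.exchange.msft res "Majority Node Set" /priv MNSFileShareDelay=8
```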
When there is a condition where the two member nodes cannot communicate, and there is a need to use the file share witness to maintain quorum, the node that owns the cluster group gets the first chance to lock the file share witness. The node that does not own the cluster group sleeps for MNSFileShareDelay – in this case 4 seconds.
The second item – which node owns the cluster group (Applies to Windows 2008 Only).
In Windows 2008 the cluster group is partially abstracted from the users. The items that comprise the cluster group – ip address, network name, and quorum resource are now known as cluster core resources.
Like Windows 2003, Windows 2008 also implements a delay for nodes not owning the cluster core resources when attempting to lock the file share witness.
If you review the private properties of the File Share Witness resource, you will see a value called ArbitrationDelay.
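The listing that follows can be produced with cluster.exe, similar to the Windows 2003 example earlier; a sketch (the resource name is taken from this example's output):

```cmd
cluster.exe <clusterFQDN> res "File Share Witness (\\HT-2\MNS_FSW_MBX-1)" /priv
```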
Listing private properties for 'File Share Witness (\\HT-2\MNS_FSW_MBX-1)':
S File Share Witness SharePath \\HT-2\MNS_FSW_MBX-1
(\\HT-2\MNS_FSW_MBX-1)
D File Share Witness ArbitrationDelay 6 (0x6)
The default arbitration delay value is 6 seconds and it is generally not necessary to change this value.
When there is a condition where the two member nodes (or greater since FSW can be used with more than two nodes in Windows 2008) can no longer communicate, and utilization of the file share witness is necessary in order to maintain quorum, the node that owns the cluster core resources gets the first attempt to lock the file share witness. Challenging nodes will sleep for 6 seconds before attempting to lock the witness directory.
So…why does this delay matter?
Take the example of the three data center scenario. Datacenter1 hosts NodeA currently running a clustered mailbox server, Datacenter2 hosts NodeB currently running the cluster group, and DataCenter3 hosts the file share witness. The link between DataCenter1 and DataCenter2 is interrupted, no interruption exists between DataCenter1 and DataCenter3 or DataCenter2 and DataCenter3 – all nodes have equal access to the file share witness. Since the cluster group is owned on NodeB, NodeB will immediately lock the file share witness. NodeA, since a lock already exists, will be unable to lock the file share witness and will terminate its cluster service. NodeB will arbitrate the Exchange resources and bring them online. Because of this delay, in the three location scenario, you may end up with results that were unexpected (for example, expecting NodeA to continue running Exchange without interruption).
When using the Exchange commandlets to manage the cluster (move-clusteredmailboxserver) we do not take any actions in regards to the cluster group; we only act on the Exchange group. Taking into account the above example, you might find it necessary to modify how you move the Exchange and cluster resources between nodes. Let me give a few examples of where you might modify how you move resources between nodes.
Example #1: You have the three data center scenario outlined before. Your primary client base accessing Exchange is in DataCenter1. You have decided to run Exchange on NodeB in DataCenter2. The cluster group remains on NodeA in DataCenter1. The link between DataCenter1 and DataCenter2 is interrupted. Connections from each data center to DataCenter3 are not impacted. NodeA, which owns the cluster resources, is first to lock the file share witness. NodeB, after waiting its delay period, finds an existing lock and is unable to maintain quorum – the cluster service terminates. NodeA successfully arbitrates the Exchange resources. In this case, by leaving the cluster group on the node in the main data center, when the link was lost Exchange came home so that user service could be continued.
Example #2: You have the three data center scenario outlined before. Your primary client base accessing Exchange is in DataCenter1. It is time to apply patches to your operating system requiring a reboot. You successfully apply the patches to NodeB in DataCenter2. Post reboot, you issue a move command for Exchange resources (move-clusteredmailboxserver –identity <CMSNAME> –targetNode NodeB) and the resources move successfully. You then patch NodeA and issue a reboot. During the reboot process, the cluster automatically arbitrates the cluster group to NodeB. When NodeA has completed rebooting, you issue a command to move the Exchange resources back to NodeA. Sometime after these moves occur the link between DataCenter1 and DataCenter2 is interrupted. The link between each data center and DataCenter3 is not impacted. NodeB, currently owning the cluster group, is allowed first access to the file share witness and is successful in establishing a lock. NodeA, which also has access, is unable to establish a lock and terminates its cluster service. In this case Exchange is moved from NodeA to NodeB (and presumably users are now cut off from mail services since the link between DataCenter1 and DataCenter2 is not available).
Example #3: You have the three data center scenario outlined before. Your primary client base accessing Exchange is in DataCenter1. It is time to apply patches to your operating system requiring a reboot. You successfully apply the patches to NodeB in DataCenter2. Post reboot, you issue a move command for Exchange resources (move-clusteredmailboxserver –identity <CMSNAME> –targetNode NodeB) and the resources move successfully. You then patch NodeA and issue a reboot. During the reboot process, the cluster automatically arbitrates the cluster group to NodeB. When NodeA has completed rebooting, you issue a command to move the Exchange resources back to NodeA. You also issue a command to move the cluster group back to NodeA (presumably because you’ve read and understood this blog). (Cluster <clusterFQDN> group “Cluster Group” /moveto:<NODE>). Sometime after these moves occur the link between DataCenter1 and DataCenter2 is interrupted. The link between each data center and DataCenter3 is not impacted. NodeA, currently owning the cluster group, is allowed first access to the file share witness and is successful in establishing a lock. NodeB, which also has access, is unable to establish a lock and terminates its cluster service. In this case Exchange is not impacted.
In most installations I work on it is not necessary to manage the cluster group – both nodes are located at the same location with the file share witness in the same location as the nodes. If using multiple data centers, consider what is outlined here in the management of your Exchange and cluster resources.
***SEE UPDATE***
With the enhancements in Windows 2008 to allow for multi-subnet clustering it is becoming more common to see this utilized with Exchange 2007 SP1 installations.
When implementing a clustered solution, it is a requirement that there be a minimum of two interfaces on each node, and that each node can maintain communications across those interfaces. I see administrators implement this requirement in two different fashions with multi-subnet clusters:
If you are the second bullet, you’ll want to continue reading this blog. (If you are the first bullet you’ll probably want to read it anyway since you’ve made it this far…)
For users that have a configuration where both network interfaces are in different subnets this will generally require routing between those two subnets. A common mis-configuration that I see in this design is the use of default gateways on both of these network interfaces.
When a user attempts to configure two network interfaces each with a default gateway, the following error is noted from the operating system:
The text in this message is specifically important as it highlights at this time that this configuration will not produce the desired results.
The most likely cluster configuration where Exchange is used, with this type of clustering, is cluster continuous replication (CCR). When multiple default gateways are defined, users may see inconsistent results in the performance and ability to replicate logs between the nodes. The replication issues between nodes are exacerbated when continuous replication hostnames are used on the secondary networks with the default gateway assigned. These issues are secondary to any issues that the cluster service may have maintaining communications between the nodes, and any communications issues clients may have connecting to the nodes.
If the default gateways are removed from the “private” adapters, reliable routed communications can only occur over the “public” interface. So…if two default gateways cannot be used, how should we ensure proper communications over both the “public” interface and the “private” interface where both reside in different routed subnets?
The first part of this solution is to ensure that the binding order of the network interfaces is set correctly in the operating system. To confirm the binding order:
The second part of the solution is to maintain the default gateway on the “public” interface.
The third part of the solution is to enable persistent static routes on the “private” interfaces. In terms of the routes, we simply need to configure routes to the other “private” networks using gateway addresses that can route between those “private” networks. All other traffic not matching these routes will be handled by the default gateway of the “public” adapter.
Let’s take a look at an example.
I desire to have a two node Exchange 2007 SP1 CCR cluster on Windows 2008 with each node residing in a different subnet.
NodeA:
Public
Private
NodeB:
Public:
(Note that the gateway listed on the private network is not configured as the default gateway setting; it is simply the gateway on the private interface’s network that can route packets to the private networks of the other nodes.)
In this case I would want to establish the necessary persistent static routes on each node. In order to accomplish this, I can use the route add command. The structure of the route command:
NodeA: Route add 10.0.1.0 mask 255.255.255.0 10.0.0.254 –p
NodeB: Route add 10.0.0.0 mask 255.255.255.0 10.0.1.254 –p
The –p switch ensures that the routes are persistent, surviving a reboot. Failure to use –p will result in the routes being removed after a reboot.
You can verify that the routes are correct by running route print and reviewing the persistent route information.
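A quick sketch of that verification step:

```cmd
REM Display the routing table; the routes added above should appear
REM in the "Persistent Routes" section of the output.
route print
```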
By utilizing only a default gateway on the “public” adapter, and static routes on the “private” adapters, you can ensure safe routed paths for client communications, cluster communications, and replication service log shipping.
========================================================
Update – 1-18-2010
With Windows 2008 and Windows 2008 R2 the recommendation for managing static routes has changed. Although route add should still work, route management has technically been replaced with functionality in netsh. Therefore, it is recommended that the netsh commands be utilized to implement and manage static routes.
I will leave the previous information un-edited in the blog since many people have used it.
The first step in implementing static routes with the netsh command is to determine the interface names. The interface name is the logical name assigned to the network connection – for example Local Area Connection 1. It is recommended that these networks be renamed to something more logical, for example LAN-Replication-A. The logical network names may be the same on all nodes.
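A sketch of listing the interface names with netsh:

```cmd
REM List the interfaces (and their logical names) known to netsh.
netsh interface ipv4 show interfaces
```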
You can also determine the adapter name from an ipconfig /all (note the adapter names in the output below):
Windows IP Configuration

   Host Name . . . . . . . . . . . . : DAG-1
   Primary Dns Suffix  . . . . . . . : exchange.msft
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
   DNS Suffix Search List. . . . . . : exchange.msft

Ethernet adapter LAN:

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Microsoft Virtual Machine Bus Network Adapter
   Physical Address. . . . . . . . . : 00-15-5D-00-02-07
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   Link-local IPv6 Address . . . . . : fe80::dd27:d7f6:549f:6b9b%11(Preferred)
   IPv4 Address. . . . . . . . . . . : 192.168.0.1(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   IPv4 Address. . . . . . . . . . . : 192.168.0.2(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 192.168.0.254
   DHCPv6 IAID . . . . . . . . . . . : 234886493
   DHCPv6 Client DUID. . . . . . . . : 00-01-00-01-12-45-7C-F8-00-15-5D-00-02-07
   DNS Servers . . . . . . . . . . . : 192.168.0.253
                                       192.168.0.252
                                       192.168.0.251
   Primary WINS Server . . . . . . . : 192.168.0.253
   Secondary WINS Server . . . . . . : 192.168.0.252
                                       192.168.0.251
   NetBIOS over Tcpip. . . . . . . . : Enabled

Ethernet adapter LAN-Replication-A:

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Microsoft Virtual Machine Bus Network Adapter #2
   Physical Address. . . . . . . . . : 00-15-5D-00-02-08
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   IPv4 Address. . . . . . . . . . . : 10.0.0.1(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . :
   NetBIOS over Tcpip. . . . . . . . : Disabled
The netsh command format to add static routes looks like:
netsh interface ipv4 add route <IP/MaskLength> "<InterfaceName>" <Gateway>
Using the information from the above example, the following netsh commands would be utilized in place of route add:
NodeA: netsh interface ipv4 add route 10.0.1.0/24 “LAN-Replication-A” 10.0.0.254
NodeB: netsh interface ipv4 add route 10.0.0.0/24 “LAN-Replication-A” 10.0.1.254
Unless otherwise specified in the command, netsh adds the route as persistent, so no additional switch is required for the route to survive a reboot.
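The only real translation from the older route add form is the mask notation: netsh takes a CIDR prefix length instead of a dotted subnet mask. Python's ipaddress module can do that conversion, as a quick sketch using NodeA's route from the examples above:

```python
import ipaddress

# NodeA's route in the old route add form: destination, dotted mask, gateway.
dest, mask, gateway = "10.0.1.0", "255.255.255.0", "10.0.0.254"

# ipaddress normalizes a dotted mask into CIDR prefix notation.
net = ipaddress.ip_network(f"{dest}/{mask}")

command = f'netsh interface ipv4 add route {net} "LAN-Replication-A" {gateway}'
print(command)
# -> netsh interface ipv4 add route 10.0.1.0/24 "LAN-Replication-A" 10.0.0.254
```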
If the command completes successfully the route addition can be verified by running:
netsh interface ipv4 show route
The following is sample output showing the added route (output truncated to the sample line including prefix and gateway):
C:\>netsh interface ipv4 show route
Prefix Idx Gateway/Interface Name
------------------------ --- ------------------------
10.0.1.0/24 11 LAN-Replication-A
This is how the netsh command can be used to accomplish what would have previously been done with route add.
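On a larger deployment this verification step can also be scripted: capture the show route output and confirm that the expected prefix points at the expected interface. A hedged Python sketch of that check (the output lines below are illustrative, modeled on the truncated sample above):

```python
# Illustrative (truncated) output from: netsh interface ipv4 show route
sample_output = """\
Prefix                   Idx  Gateway/Interface Name
------------------------ ---  ------------------------
0.0.0.0/0                 12  192.168.0.254
10.0.1.0/24               11  LAN-Replication-A
"""

def route_present(output, prefix, via):
    # Scan each data line for the expected prefix and gateway/interface.
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == prefix and fields[-1] == via:
            return True
    return False

print(route_present(sample_output, "10.0.1.0/24", "LAN-Replication-A"))  # True
```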
Update 9/18/2012:
Updated the netsh verification command to show correct syntax.