• Recommended Hotfixes for Windows Server 2012 Failover Clusters

    Today, Microsoft released Knowledge Base article 2784261, which describes the hotfixes we recommend installing on all Windows Server 2012 failover clusters.  This includes Exchange database availability groups that are running on Windows Server 2012.

    Although these hotfixes are generally recommended for all customers, we recommend that you evaluate each fix to determine whether it applies to your environment. If you determine a server in your organization is affected by the problem(s) that a fix addresses, install the fix on each cluster node by using the procedures that are described in Knowledge Base article 174799: Patching Windows Server Failover Clusters.

    The recommended hotfixes include two important updates:

    • 2848344 – This is an updated roll-up hotfix that replaces 2838669.  This fix is most important for any Hyper-V deployment, and critical for any deployment using DPM for backup.
    • 2838043 – This is a fix for NetName that addresses a password synchronization issue.  This issue has been a top generator of CSS support calls, for which the Repair action was used as a workaround in the past.

    The hotfixes we recommend for Windows Server 2012 failover clusters are as follows:

    Date added: 6/14/2013
    Title and link: Update improves cluster resiliency in Windows Server 2012
    Component: Failover Cluster Service, CSV Filter and NTFS.sys
    Why we recommend this hotfix: This hotfix prevents a CSV failover by fixing an underlying issue in NTFS and by increasing the overall resilience of the Cluster service and CSV during expected pause states. Available for individual download.

    Date added: 6/14/2013
    Title and link: Can't access a resource that is hosted on a Windows 8 or Windows Server 2012-based failover cluster
    Component: Failover Cluster Resource DLL
    Why we recommend this hotfix: This hotfix prevents an error when accessing resources hosted on a Windows 8-based or Windows Server 2012-based failover cluster from a Windows XP-based or Windows Server 2003-based client computer. It also resolves Event ID 1196 with the error "The handle is invalid" when the cluster network name resource fails to come online and register in DNS. Available for individual download.

    Date added: 1/23/2013
    Title and link: Failover Cluster Management snap-in crashes after you install update 2750149 on a Windows Server 2012-based failover cluster
    Component: Failover Cluster Management snap-in
    Why we recommend this hotfix: Resolves a crash in the Failover Cluster Management snap-in after update 2750149 is installed on a Windows Server 2012-based failover cluster. Available from Windows Update or the Microsoft Download Center.

    Date added: 11/13/2012
    Title and link: Windows 8 and Windows Server 2012 cumulative update: November 2012
    Component: Multiple
    Why we recommend this hotfix: Improves clustered server performance and reliability in Hyper-V and Scale-Out File Server scenarios. Improves SMB service and client reliability under certain stress conditions. Install update 2770917 by using Windows Update in order to receive the cumulative update as described in KB 2770917.

    Date added: 11/13/2012
    Title and link: Error code when the kpasswd protocol fails after you perform an authoritative restore: "KDC_ERROR_S_PRINCIPAL_UNKNOWN"
    Component: KDCSVC
    Why we recommend this hotfix: Install on every domain controller running Windows Server 2008 Service Pack 2 or Windows Server 2008 R2 in order to add a Windows Server 2012 failover cluster. Otherwise, Create Cluster may fail when attempting to set the password for the cluster computer object with the error message: CreateClusterNameCOIfNotExists (6783): Unable to set password on <ClusterName$>. This hotfix is included in Windows Server 2008 R2 Service Pack 1.

    For more information, see http://support.microsoft.com/kb/2784261.

  • Witness Server Warning Message When Using Certain Database Availability Group Tasks

    Recently, some customers reported that when they create a DAG, they get a warning message that states the following:

    The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server <ServerName>.

    In these cases, the customer’s intended witness server was not an Exchange 2010 server.  As documented in TechNet, if the witness server you specify isn't an Exchange 2010 server, you must add the Exchange Trusted Subsystem (ETS) universal security group (USG) to the local Administrators group on the witness server. These security permissions are necessary to ensure that Exchange can create a directory and share on the witness server as needed.

    After some inspection, the customers confirmed that, contrary to the warning message, the ETS USG was a member of the local administrators group on their intended witness server.  Moreover, even though this warning appeared, there were no ill effects in functionality.  The directory and share on the witness server were created as needed, the file share witness cluster resource was online, and the DAG passed all replication health checks.

    After hearing about this, I went to my lab to test this, and I was able to reproduce the issue.  I added the ETS USG to the local administrators group on my witness server (a Windows 2008 file server) and ran New-DatabaseAvailabilityGroup, specifying my witness server.  I received the same warning message, and verified that despite the message, the DAG was perfectly healthy: there were no permission problems, no witness server or cluster problems, and no other issues.

    Even though it appeared as though this warning message could be safely ignored, I wondered why we were getting it in the first place.  So I went digging into the source code to find out.

    Let me describe what is happening and why you, too, can safely ignore the warning message.

    During various DAG-related tasks that configure witness server properties (namely, New-DatabaseAvailabilityGroup, Set-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup), the code is actually checking to see if the witness server is a member of the Exchange Trusted Subsystem USG.

    As you may know, there is no requirement that the witness server be a member of the ETS USG.  Nonetheless, the code for these tasks does check for this, and if it finds that the witness server is not a member of the ETS USG, it issues a warning message.

    Unfortunately, to confuse things even more, the warning message says:

    The Exchange Trusted Subsystem is not a member of the local Administrators group on specified witness server <ServerName>.

    It says nothing about the witness server not being a member of the ETS USG, even though the code is checking for that.  Instead, it makes it appear as though the permission prerequisites have not been satisfied, even though they actually have.

    But even though the message does not pertain to the actual check that failed, that does not make this merely a string bug.  This is a code bug: there is no requirement that the witness server be a member of the ETS USG, so the code should not be checking for this condition at all.  If this bug is fixed and the check is removed, the string will be removed with it. Unless and until that happens, if you see this warning message when using any of the above-mentioned tasks, and you have verified that the ETS USG is a member of the local administrators group on your witness server, then you can safely ignore the warning message. You should run Test-ReplicationHealth to verify the health of the DAG once members have been added to it.
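    To illustrate why the warning fires, here is a minimal Python sketch of the mismatch. This is illustrative only, not actual Exchange source; the function and data names are hypothetical:

```python
# Illustrative sketch (not actual Exchange source): the task checks one
# condition but emits a warning string written for a different one.

def check_witness_server(witness_server, ets_group_members):
    """Hypothetical recreation of the check/message mismatch described above."""
    # The code actually tests whether the witness server is a member
    # of the Exchange Trusted Subsystem (ETS) USG...
    if witness_server not in ets_group_members:
        # ...but the warning string talks about local Administrators
        # membership instead, which is why it misleads administrators.
        return ("The Exchange Trusted Subsystem is not a member of the local "
                f"Administrators group on specified witness server {witness_server}.")
    return None

# A non-Exchange witness server is typically not in the ETS group,
# so the warning fires even when permissions are correct.
print(check_witness_server("FS01", {"EXCH01", "EXCH02"}) is not None)  # True
```

    In other words, the condition being evaluated and the string being printed describe two different requirements, which is exactly what the customers observed.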

    Because we are doing this check in code, you can of course add the witness server to the ETS group, and also make the ETS group a member of the local administrators group on the witness server, and all of these tasks will complete without this warning message. But, don't do that in production because (1) it is not needed and (2) it gives the witness server way more permissions than it should ever have (unless of course, the witness server is an Exchange 2010 server).

  • New Managed Availability Documentation Available

    Ensuring that users have a good email experience has always been the primary objective for messaging system administrators. To help ensure the availability and reliability of your messaging system, all aspects of the system must be actively monitored, and any detected issues must be resolved quickly.

    In previous versions of Exchange, monitoring critical system components often involved using an external application such as Microsoft System Center 2012 Operations Manager to collect data, and to provide recovery action for problems detected as a result of analyzing the collected data. Exchange 2010 and previous versions included health manifests and correlation engines in the form of management packs. These components enabled Operations Manager to make a determination as to whether a particular component was healthy or unhealthy. In addition, Operations Manager also used the diagnostic cmdlet infrastructure built into Exchange 2010 to run synthetic transactions against various aspects of the system.

    Exchange 2013 takes a new approach to monitoring and preserving the end user experience natively using a feature called Managed Availability that provides built-in monitoring and recovery actions.

    Overview

    Managed availability, also known as Active Monitoring or Local Active Monitoring, is the integration of built-in monitoring and recovery actions with the Exchange high availability platform. It's designed to detect and recover from problems as soon as they occur and are discovered by the system. Unlike previous external monitoring solutions and techniques for Exchange, managed availability doesn't try to identify or communicate the root cause of an issue. It's instead focused on recovery aspects that address three key areas of the user experience:

    • Availability   Can users access the service?
    • Latency   How is the experience for users?
    • Errors   Are users able to accomplish what they want?

    Managed availability is an internal process that runs on every Exchange 2013 server. It polls and analyzes hundreds of health metrics every second. If something is found to be wrong, most of the time it will be fixed automatically. But there will always be issues that managed availability won’t be able to fix on its own. In those cases, managed availability will escalate the issue to an administrator by means of event logging.

    For more information about this new feature, see the newly published topic Managed Availability.

    Health Sets

    From a reporting perspective, managed availability has two views of health, one internal and one external. The internal view uses health sets. Each component in Exchange 2013 (for example, Outlook Web App, Exchange ActiveSync, the Information Store service, content indexing, transport services, and so on) is monitored by managed availability using probes, monitors, and responders. The group of probes, monitors, and responders for a given component is called a health set, and it determines whether that component is healthy. The current state of a health set (that is, whether it is healthy or unhealthy) is determined by the state of the health set’s monitors. If all of a health set’s monitors are healthy, then the health set is in a healthy state. If any monitor is not in a healthy state, then the health set state is determined by its least healthy monitor.
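    The least-healthy-monitor rollup can be sketched as follows. This is an illustrative Python simulation; the "Degraded" state and the numeric ordering are assumptions made for the example, not the actual Exchange state model:

```python
# Illustrative sketch: a health set's state is that of its least healthy
# monitor. Lower rank = less healthy (assumed ordering for this example).

HEALTH_RANK = {"Unhealthy": 0, "Degraded": 1, "Healthy": 2}

def health_set_state(monitor_states):
    """Return the state of the least healthy monitor in the health set."""
    return min(monitor_states, key=lambda s: HEALTH_RANK[s])

print(health_set_state(["Healthy", "Healthy", "Healthy"]))   # Healthy
print(health_set_state(["Healthy", "Degraded", "Healthy"]))  # Degraded
print(health_set_state(["Degraded", "Unhealthy"]))           # Unhealthy
```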

    For detailed steps to view server health or health sets state, see the newly published topic Manage Health Sets and Server Health.  For information on troubleshooting health sets, see this topic.

    Health Groups

    The external view of managed availability is composed of health groups. Health groups are exposed to System Center Operations Manager 2007 R2 and System Center Operations Manager 2012.

    There are four primary health groups:

    • Customer Touch Points Components that affect real-time user interactions, such as protocols, or the Information Store
    • Service Components Components without direct, real-time user interactions, such as the Microsoft Exchange Mailbox Replication service, or the offline address book generation process (OABGen)
    • Server Components The physical resources of the server, such as disk space, memory and networking
    • Dependency Availability The server’s ability to access necessary dependencies, such as Active Directory, DNS, etc.

    When the Exchange 2013 Management Pack is installed, System Center Operations Manager (SCOM) acts as a health portal for viewing information related to the Exchange environment. The SCOM dashboard includes three views of Exchange server health:

    1. Active Alerts Escalation Responders write events to the Windows event log that are consumed by the monitor within SCOM. These appear as alerts in the Active Alerts view.
    2. Organization Health A rollup summary of the overall health of the Exchange organization is displayed in this view. These rollups include health for individual database availability groups and health within specific Active Directory sites.
    3. Server Health Related health sets are combined into health groups and summarized in this view.

    Overrides

    Overrides provide an administrator with the ability to configure some aspects of the managed availability probes, monitors, and responders. Overrides can be used to fine tune some of the thresholds used by managed availability. They can also be used to enable emergency actions for unexpected events that may require configuration settings that are different from the out-of-box defaults.

    Overrides can be created and applied to a single server (this is known as a server override), or they can be applied to a group of servers (this is known as a global override). Server override configuration data is stored in the Windows registry on the server on which the override is applied. Global override configuration data is stored in Active Directory.

    Overrides can be configured to last indefinitely, or they can be configured for a specific duration. In addition, global overrides can be configured to apply to all servers, or only servers running a specific version of Exchange.
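    The -Duration values shown in the examples in this post use the .NET TimeSpan format d.hh:mm:ss (so 60.00:00:00 is 60 days). As an illustration, here is a Python sketch that decodes that format; it is a simplification that handles only the d.hh:mm:ss and hh:mm:ss shapes:

```python
from datetime import timedelta

def parse_timespan(value):
    """Parse a .NET-style TimeSpan string like '60.00:00:00' (d.hh:mm:ss)."""
    days = 0
    if "." in value.split(":")[0]:
        # A leading 'd.' day component is present before the hours field.
        day_part, value = value.split(".", 1)
        days = int(day_part)
    h, m, s = (int(p) for p in value.split(":"))
    return timedelta(days=days, hours=h, minutes=m, seconds=s)

print(parse_timespan("60.00:00:00").days)                 # 60
print(parse_timespan("7.00:00:00") == timedelta(days=7))  # True
```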

    For detailed steps to view or configure server or global overrides, see Configure Managed Availability Overrides.

    When you configure an override, it will not take effect immediately. The Microsoft Exchange Health Manager service checks for updated configuration data every 10 minutes. In addition, global overrides will be dependent on Active Directory replication latency.

    Below are some examples of adding and removing global and server overrides:

    Example 1 - Make Information Store maintenance assistant alerts non-urgent for 60 days:

    Add-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType Responder -PropertyName NotificationServiceClass -PropertyValue 1 -Duration 60.00:00:00

    Example 2 - Change the maintenance assistant monitor to look for 32 hours of failures for 30 days:

    Add-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 115200 -Duration 30.00:00:00

    Example 3 - Remove the maintenance assistant monitor override added in Example 2:

    Remove-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds

    Example 4 - Remove the Information Store maintenance assistant alerts non-urgent override added in Example 1:

    Remove-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType Responder -PropertyName NotificationServiceClass

    Example 5 - Apply the database repeatedly mounting threshold override (change to 60 minutes) for a period of 60 days:

    Add-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 3600 -Duration 60.00:00:00

    Example 6 - Remove the database repeatedly mounting threshold override added in Example 5:

    Remove-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds

    Example 7 - Change the database dismounted alert from HA to Store for a period of 7 days:

    Add-GlobalMonitoringOverride -Identity Store\DatabaseAvailabilityEscalate -ItemType Responder -PropertyName ExtensionAttributes.Microsoft.Mapi.MapiExceptionMdbOffline -PropertyValue Store -Duration 7.00:00:00

    Example 8 - Disable VersionBucketsAllocated monitor for a period of 60 days:

    Add-GlobalMonitoringOverride -Identity Store\VersionBucketsAllocatedMonitor -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00

    Example 9 - Update logs threshold in DatabaseSize monitor for a period of 60 days:

    Add-GlobalMonitoringOverride -Identity MailboxSpace\DatabaseSizeMonitor -ItemType Monitor -PropertyName ExtensionAttributes.DatabaseLogsThreshold -PropertyValue 100GB -Duration 60.00:00:00

    Example 10 - Applying a server override to disable quarantine monitor across all database copies for a period of 7 days:

    (get-mailboxDatabase <DB Name>).servers | %{Add-ServerMonitoringOverride -Server $_.name -Identity "Store\MailboxQuarantinedMonitor\<DB Name>" -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration:7.00:00:00 -Confirm:$false;}

    Management Tasks and Cmdlets

    There are three primary operational tasks that administrators will typically perform with respect to managed availability:

    • Extracting or viewing system health
    • Viewing health sets, and details about probes, monitors and responders
    • Managing overrides

    The two primary management tools for managed availability are the Windows Event Log and the Shell. Managed availability logs a large amount of information in the Exchange ActiveMonitoring and ManagedAvailability crimson channel event logs, such as:

    • Probe, monitor, and responder definitions, which are logged in the respective *Definition event logs.
    • Probe, monitor, and responder results, which are logged in the respective *Results event logs.
    • Details about responder recovery actions, including when the recovery action starts and when it is considered complete (whether successful or not), which are logged in the RecoveryActionResults event log.

    There are 12 cmdlets used for managed availability, which are described in the following table.

    Cmdlet Description
    Get-ServerHealth Used to get raw server health information, such as health sets and their current state (healthy or unhealthy), health set monitors, server components, target resources for probes, and timestamps related to probe or monitor start or stop times, and state transition times.
    Get-HealthReport Used to get a summary health view that includes health sets and their current state.
    Get-MonitoringItemIdentity Used to view the probes, monitors, and responders associated with a specific health set.
    Get-MonitoringItemHelp Used to view descriptions about some of the properties of probes, monitors, and responders.
    Add-ServerMonitoringOverride Used to create a local, server-specific override of a probe, monitor, or responder.
    Get-ServerMonitoringOverride Used to view a list of local overrides on the specified server.
    Remove-ServerMonitoringOverride Used to remove a local override from a specific server.
    Add-GlobalMonitoringOverride Used to create a global override for a group of servers.
    Get-GlobalMonitoringOverride Used to view a list of global overrides configured in the organization.
    Remove-GlobalMonitoringOverride Used to remove a global override.
    Set-ServerComponentState Used to configure the state of one or more server components.
    Get-ServerComponentState Used to view the state of one or more server components.
  • Using a Passive Node as an SCR Target

    As with local continuous replication (LCR) and cluster continuous replication (CCR), standby continuous replication (SCR) in Exchange 2007 Service Pack 1 uses the concept of storage group copies. Because SCR introduces the ability to have multiple copies of your data, we use slightly different terms to describe the replication endpoints.

    The starting point for a storage group that is enabled for SCR is called the SCR source. This can be any storage group, except a recovery storage group, on any of the following:

    • Stand-alone Mailbox server
    • Clustered mailbox server (CMS) in a single copy cluster (SCC)
    • CMS in a CCR environment

    The source must be running Exchange 2007 SP1.  When using a standalone Mailbox server as the SCR source, you can also have LCR enabled for one or more storage groups, including storage groups enabled for SCR.  You can have other roles (Client Access, Hub Transport, and/or Unified Messaging) installed, as well.

    The endpoint for SCR is called the target, and the target can be either of the following:

    • Stand-alone Mailbox server that does not have LCR enabled for any storage groups
    • Passive node in a failover cluster where the Mailbox role is installed, but no CMS has been installed in the cluster

    The target must also be running Exchange 2007 SP1.  There are other requirements, as well. See Standby Continuous Replication for more information on SCR.  In the case of both sources and targets, you can see the basic requirement for each: the Exchange 2007 SP1 Mailbox server role must be installed on both the source and target computers.

    The last bullet for the SCR target is the reason for this blog post.  There seems to be some confusion as to what we mean by a "Passive node in a failover cluster where the Mailbox role is installed, but no CMS has been installed in the cluster".

    To help explain what we mean, let me describe how Exchange is installed into a failover cluster.  You're probably familiar with the five server roles (Client Access, Hub Transport, Mailbox, Unified Messaging, and Edge Transport), but you might not realize there are two additional roles that can be installed, as well.  These "roles" are not Exchange server roles, but rather CMS roles: specifically, the active clustered mailbox role and the passive clustered mailbox role.

    The terms are used to tell Exchange Setup whether to install an active node or a passive node.  For Exchange Setup, installing an active node means installing the Mailbox server role, and then installing a CMS.  Installing a passive node means installing only the Mailbox server role.  You do not create or install a CMS when you install the passive clustered mailbox role.

    These roles are only expressed in the GUI version of Exchange Setup, so if you've installed your Exchange 2007 CMSs using only the command-line version of Setup, you won't see these terms.  At the command line, you'll simply see Mailbox server and Clustered Mailbox Server.  It is the /newcms Setup option (and accompanying options) that dictates whether the active or passive clustered mailbox role is installed. If you include /newcms, the active clustered mailbox role is installed; if you do not use /newcms, the passive clustered mailbox role is installed.

    When we say you can use a "Passive node in a failover cluster where the Mailbox role is installed, but no CMS has been installed in the cluster" we mean a Windows failover cluster in which one or more nodes exist, but only the passive clustered mailbox role is installed.  You cannot have the active clustered mailbox role installed on any of the nodes in the failover cluster containing the SCR target(s).  You can see a picture of what this looks like here.

  • Exchange 2007 - Continuous Replication Architecture and Behavior

    I've previously blogged about the two forms of continuous replication that are built into Exchange 2007: Local Continuous Replication (LCR) and Cluster Continuous Replication (CCR).  In those blogcasts, you can see replication at work, but we really don't get into the architecture under the covers. So in this blog, I'm going to describe exactly how replication works, what the various components are, and what the replication pipeline looks like.

    As you may have heard or read, continuous replication is also known as "log shipping." In Exchange 2007, log shipping is the process of automating the replication of closed transaction log files from a production storage group (called the "active" storage group) to a copy of that storage group (called the "passive" storage group) that is located on a second set of disks (LCR) or on another server altogether (CCR). Once copied to the second location, the log files are then replayed into the passive copy of the database, thereby keeping the storage groups in sync with a slight time lag.

    In simple terms, log shipping follows these steps:

    1. Seed the source database at the destination to create a target database.
    2. Monitor the source log directory for new logs by subscribing to Windows file system notification events for the directory.
    3. Copy any new log files to the inspector directory at the destination.
    4. Inspect the copied log files.
    5. After inspection passes, move the log files to the destination log directory and replay them into the copy of the database.
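    The steps above can be simulated end to end with ordinary files. The following Python sketch stands in for the copier, inspector, and replay move, using a SHA-256 checksum in place of real ESE log inspection; the directory layout and checksum test are assumptions for illustration, not Exchange internals:

```python
# Illustrative simulation of the copy -> inspect -> move-for-replay pipeline.
import hashlib
import os
import shutil
import tempfile

def checksum(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def ship_log(source_dir, inspector_dir, replay_dir, log_name):
    src = os.path.join(source_dir, log_name)
    inspected = os.path.join(inspector_dir, log_name)
    shutil.copy2(src, inspected)              # step 3: copy to inspector dir
    if checksum(inspected) != checksum(src):  # step 4: inspect the copy
        os.remove(inspected)
        return False                          # corrupt copy: request a re-copy
    # step 5: move to the destination log directory, ready for replay
    shutil.move(inspected, os.path.join(replay_dir, log_name))
    return True

root = tempfile.mkdtemp()
dirs = [os.path.join(root, d) for d in ("source", "inspector", "replay")]
for d in dirs:
    os.makedirs(d)
with open(os.path.join(dirs[0], "E0000000001.log"), "wb") as f:
    f.write(b"log record payload")
print(ship_log(*dirs, "E0000000001.log"))  # True
```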

    Microsoft Exchange Replication Service

    Exchange 2007 implements log shipping using the Microsoft Exchange Replication Service (the "Replication service"). This service is installed by default on the Mailbox server role. The executable behind the Replication service is called Microsoft.Exchange.Cluster.ReplayService.exe, and it is located in <install path>\bin. The Replication service is dependent upon the Microsoft Exchange Active Directory Topology Service. The Replication service can be stopped and started using the Services snap-in or from the command line. The Replication service is also configured to be automatically restarted in case of a failure or exception.

    Running Replication Service in Console Mode

    The Replication service can be started as a service or as a console application. Note, however, that running the service as a console application is strictly for troubleshooting and debugging purposes; this is not something that would be done as a regular administrative task. In console mode, the replication process checks for two parameters: -console and -noprompt.

    -Console

    If the -console switch is specified, or no parameter is provided, the process checks whether it was started as a service or as a console application. This is done by looking at the SIDs in the process token. If the process has a service SID, or no interactive SID, the process is considered to be running as a service.

    -NoPrompt

    By default, a shutdown prompt is on. You use the -noprompt switch to disable the shutdown prompt.
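    Under the behavior described above, the parameter handling amounts to something like this illustrative Python sketch; the SID inspection is reduced to a single boolean, and this is not the actual implementation:

```python
# Illustrative sketch of the -console / -noprompt startup handling.

def startup_mode(argv, has_service_sid):
    """Decide run mode and shutdown-prompt behavior (assumed logic)."""
    opts = {a.lower() for a in argv}
    prompt_on_shutdown = "-noprompt" not in opts  # -noprompt disables the prompt
    if "-console" in opts or not opts:
        # With -console (or no parameters), inspect the process token:
        # a service SID / no interactive SID means "running as a service".
        mode = "service" if has_service_sid else "console"
    else:
        mode = "console"
    return mode, prompt_on_shutdown

print(startup_mode(["-console"], has_service_sid=False))  # ('console', True)
print(startup_mode([], has_service_sid=True))             # ('service', True)
print(startup_mode(["-console", "-noprompt"], False))     # ('console', False)
```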

    The Replication Service Internals

    The Replication service is a managed code application that runs in the Microsoft.Exchange.Cluster.ReplayService.exe process.

     

    Replication Service Registry Values

    The Replication service keeps track of storage groups that are enabled for replication by keeping that information in the registry. The storage group replica information is stored in the registry under the object GUID of the storage group.

    State 

    The replay state of a storage group that has continuous replication enabled is stored at HKLM\Software\Microsoft\Exchange\Replay\State\GUID.

    StateLock

    Each replica state is controlled via a StateLock to make sure that access to the state information is gated. As its name implies, a StateLock is used to manipulate a state lock from inside the Replication service. Two StateLocks are created per storage group: one for the database file and one for the log files. These lock states are stored at HKLM\Software\Microsoft\Exchange\Replay\StateLock\GUID.

    Replication Service Diagnostics Key

    The Replication service stores its configuration information regarding diagnostics at HKLM\System\CCS\Services\MSExchange Repl\Diagnostics.

    You can query the current diagnostic level for the Replication service by using the Exchange Management Shell command Get-EventLogLevel -Identity "MsExchange Repl".  This also returns the diagnostic level for the Replication service's Exchange VSS Writer, which is another subject altogether (maybe something for a future blog).

    Replication Service Configuration Information in Active Directory

    The Replication service uses the msExchhasLocalCopy attribute to identify which storage groups are enabled for replication in an LCR environment. msExchhasLocalCopy is also set at the database level.

    In a CCR environment, the Replication service uses the cluster database to store this information.

    The Replication service uses an algorithm to search Active Directory for replica information:

    1. Find the Exchange Server object in Active Directory using the computer name. If there is no server object, then return.
    2. Enumerate all storage groups that are on this Exchange server.
    3. For each storage group with msExchhasLocalCopy set to true:

    a. Read the msExchESEParamCopySystemPath and msExchESEParamCopyLogFilePath attributes of the storage group.

    b. Read the msExchCopyEdbFile attribute for each database in the storage group.
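    That lookup can be sketched against mock directory data as follows. The attribute names come from the text above; the data shapes and the function name are assumptions for illustration:

```python
# Illustrative walk of the Active Directory search algorithm over mock data.

directory = {
    "EXCH01": {  # Exchange server object, keyed by computer name
        "SG1": {"msExchhasLocalCopy": True,
                "msExchESEParamCopySystemPath": r"D:\SG1\System",
                "msExchESEParamCopyLogFilePath": r"D:\SG1\Logs",
                "databases": {"DB1": {"msExchCopyEdbFile": r"D:\SG1\DB1.edb"}}},
        "SG2": {"msExchhasLocalCopy": False, "databases": {}},
    }
}

def find_replicas(computer_name):
    server = directory.get(computer_name)   # step 1: find the server object
    if server is None:
        return []
    replicas = []
    for name, sg in server.items():         # step 2: enumerate storage groups
        if sg.get("msExchhasLocalCopy"):    # step 3: replication-enabled only
            replicas.append({
                "storage_group": name,
                "system_path": sg["msExchESEParamCopySystemPath"],   # step 3a
                "log_path": sg["msExchESEParamCopyLogFilePath"],
                "edb_files": [db["msExchCopyEdbFile"]                # step 3b
                              for db in sg["databases"].values()],
            })
    return replicas

print([r["storage_group"] for r in find_replicas("EXCH01")])  # ['SG1']
print(find_replicas("UNKNOWN"))                               # []
```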

    Replication Components

    The Replication Service implements log shipping by using several components to provide replication between the active and passive storage groups.

    Replication Service Object Model

    The Replication service is responsible for creating an instance of the replica associated with a storage group. The object model below shows the different objects that are created for each storage group copy.

    Continuous Replication - Replication Object Model

    In a CCR environment, the Replication service runs on both the active node and the passive node.  As a result, both an active and a passive replica instance will be created.

    Copier 

    The copier is responsible for copying closed log files from the source to the destination. This is an asynchronous operation in which the Replication service continuously monitors the source. As soon as a new log file is closed on the source, the copier copies the log file to the inspector location on the target.

    Inspector

    The inspector is responsible for verifying that the log files are valid. It checks the destination inspector directory on a regular basis. When a new log file is available, it will be checked (checksummed for validity) and then copied to the database subdirectory. If a log file is found to be corrupt, the Replication service will request a re-copy of the file.

    LogReplayer

    The LogReplayer is responsible for replaying log files into the passive database. It can also batch multiple log files into a single replay. In LCR, replay is performed on the local machine, whereas in CCR, replay is performed on the passive node. This means that the performance impact of replay is higher for LCR than for CCR.

    Truncate Deletor

    The truncate deletor is responsible for deleting log files that have been successfully replayed into the passive database. This is especially important after an online backup is performed on the active copy, because online backups delete log files that are no longer required for recovery of the active database. The truncate deletor makes sure that any log files that have not yet been replicated and replayed into the passive copy are not deleted by an online backup of the active copy.
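    The safety rule reduces to a simple filter over the log sequence. This illustrative Python sketch assumes logs are identified by increasing generation numbers, which is a simplification of the real log naming:

```python
# Illustrative sketch of the truncation safety rule: a log generation may be
# deleted only once the passive copy has replayed it.

def safe_to_truncate(generations, last_replayed_generation):
    """Return the log generations an online backup may safely delete."""
    return [g for g in generations if g <= last_replayed_generation]

# Logs 1-10 exist; the passive copy has replayed through generation 7,
# so generations 8-10 must survive truncation.
print(safe_to_truncate(range(1, 11), last_replayed_generation=7))
# [1, 2, 3, 4, 5, 6, 7]
```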

    Incremental Reseeder

    The incremental reseeder is responsible for ensuring that the active and passive database copies have not diverged after a database restore has been performed, or after a failover in a CCR environment.

    Seeder

    The seeder is responsible for creating the baseline content of a storage group used to start replay processing. The Replication service performs automatic seeding for new storage groups.

    Replay Manager

    The replay manager is responsible for keeping track of all replica instances. It creates and destroys replica instances on demand, based on the online status of the storage group. The configuration of a replica instance is intended to be static; therefore, when a replica instance's configuration changes, the replica is restarted with the updated configuration. In addition, the configuration is not saved when the Replication service shuts down. As a result, each time the Replication service starts it has an empty replica instance list. At startup, the replay manager discovers the storage groups that are currently online to build a "running instance" list.

    The replay manager periodically runs a "configupdater" thread to scan for newly configured replica instances. The configupdater thread runs in the Replication service process every 30 seconds. It creates and destroys replica instances based on the current database state (e.g., whether the database is online or offline). The configupdater thread uses the following algorithm:

    1. Read instance configuration from Active Directory
    2. Compare list of configurations found in Active Directory against running storage groups/databases
    3. Produce a list of running instances to stop and a list of configurations to start
    4. Stop running instances on the stop list
    5. Start instances on the start list

    The replay manager therefore always has a dynamic list of the replica instances.
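    Steps 2 and 3 of the configupdater algorithm amount to a diff between desired and actual state. A minimal sketch, assuming configurations are simple comparable values keyed by storage group name (the real service works with richer objects):

```python
def reconcile(configured: dict, running: dict):
    """Illustrative configupdater pass: compare replica configurations read
    from Active Directory against the running instances, producing a stop
    list and a start list. A changed configuration appears on both lists,
    which restarts that instance with the updated settings."""
    stop = sorted(name for name, cfg in running.items()
                  if configured.get(name) != cfg)
    start = sorted(name for name, cfg in configured.items()
                   if running.get(name) != cfg)
    return stop, start
```

    Stopping everything on the stop list before starting the start list (steps 4 and 5) is what makes the restart-on-change behavior fall out naturally.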

    Replication Pipeline

    The replication pipeline implemented by the Replication service is shown below. In an LCR environment, the source database and target database are on the same machine. In a CCR environment, the source and target database are on different machines (different nodes in the same failover cluster).

    Continuous Replication - Replication Pipeline

    Log Shipping and Log File Management

    The Replication service uses an Extensible Storage Engine (ESE) API to inspect and replay log files that are copied over from the active storage group to the passive storage group. Once the log files are successfully copied to the inspector directory, the log inspector object associated with the replica instance verifies the log file header. If the header is correct, the log file will be moved to the target log directory and then replayed into the passive copy of the database.

    Log Shipping Directory Structure

    The Replication service creates a directory structure for each storage group copy. This per-storage group directory structure is identical in both LCR and CCR environments, with one exception: in a CCR environment, a content index catalog directory is also created.

    Inspector Directory

    The Inspector directory contains log files copied by the Copier component. Once the log inspector has verified that a log file is not corrupt, the log file is copied to the storage group copy directory and replayed into the passive copy of the database.

    IgnoredLogs Directory

    The IgnoredLogs directory is used to keep valid files that cannot be replayed for any reason (e.g., the log file is too old, the log file is corrupt, etc.). The IgnoredLogs directory might also contain the following subdirectories:

    E00OutofDate

    This is the subdirectory that holds any old E00.log file that was present on the passive copy at the time of failover. An E00.log file is present on the passive node if it was previously running as the active node. An event 2013 is logged in the Application event log to indicate the failure.

    InspectionFailed

    This is the subdirectory that holds log files that have failed inspection. An event 2013 is logged when a log file fails inspection, and the log file is then moved to the InspectionFailed directory. The log inspector uses Eseutil and other methods to verify that a log file is physically valid. Any exception returned by these checks is treated as a failure, and the log file is deemed to be corrupt.
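    The routing of unreplayable logs into the two IgnoredLogs subdirectories can be sketched as follows; the function name and the `reason` keys are illustrative assumptions, not part of the actual service.

```python
import shutil
from pathlib import Path

def quarantine_log(log: Path, ignored_root: Path, reason: str) -> Path:
    """Illustrative sketch: move a log that cannot be replayed into the
    matching IgnoredLogs subdirectory, 'outofdate' for a stale E00.log
    left behind by a failover, 'inspection' for a log that failed
    inspection."""
    subdir = {"outofdate": "E00OutofDate",
              "inspection": "InspectionFailed"}[reason]
    dest = ignored_root / subdir
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / log.name
    shutil.move(str(log), str(target))
    return target
```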

    Well, there you have it.  I hope you found this useful and informative.