Ensuring that users have a good email experience has always been the primary objective for messaging system administrators. To help ensure the availability and reliability of your messaging system, all aspects of the system must be actively monitored, and any detected issues must be resolved quickly.

In previous versions of Exchange, monitoring critical system components often involved using an external application such as Microsoft System Center 2012 Operations Manager to collect data and to provide recovery actions for problems detected by analyzing that data. Exchange 2010 and previous versions included health manifests and correlation engines in the form of management packs. These components enabled Operations Manager to determine whether a particular component was healthy or unhealthy. Operations Manager also used the diagnostic cmdlet infrastructure built into Exchange 2010 to run synthetic transactions against various aspects of the system.

Exchange 2013 takes a new approach: it monitors and preserves the end-user experience natively through a feature called Managed Availability, which provides built-in monitoring and recovery actions.

Overview

Managed availability, also known as Active Monitoring or Local Active Monitoring, is the integration of built-in monitoring and recovery actions with the Exchange high availability platform. It's designed to detect and recover from problems as soon as they occur and are discovered by the system. Unlike previous external monitoring solutions and techniques for Exchange, managed availability doesn't try to identify or communicate the root cause of an issue. It's instead focused on recovery aspects that address three key areas of the user experience:

  • Availability: Can users access the service?
  • Latency: How is the experience for users?
  • Errors: Are users able to accomplish what they want?

Managed availability is an internal process that runs on every Exchange 2013 server. It polls and analyzes hundreds of health metrics every second. If something is found to be wrong, most of the time it will be fixed automatically. But there will always be issues that managed availability won’t be able to fix on its own. In those cases, managed availability will escalate the issue to an administrator by means of event logging.

For more information about this new feature, see the newly published topic Managed Availability.

Health Sets

From a reporting perspective, managed availability has two views of health, one internal and one external. The internal view uses health sets. Each component in Exchange 2013 (for example, Outlook Web App, Exchange ActiveSync, the Information Store service, content indexing, and transport services) is monitored by managed availability using probes, monitors, and responders. The group of probes, monitors, and responders that determines whether a given component is healthy is called a health set. The current state of a health set (for example, healthy or unhealthy) is determined by the state of its monitors. If all of a health set’s monitors are healthy, the health set is healthy. If any monitor is not healthy, the health set takes the state of its least healthy monitor.
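
The roll-up rule described above can be sketched in a few lines of PowerShell. This is only an illustration of the "least healthy monitor wins" logic, not how the Health Manager service is actually implemented. The function name Get-HealthSetStateSketch and the severity ranking are hypothetical, and the Degraded state is an assumed intermediate value included only to show the ordering.

```powershell
# Hypothetical severity ranking: a higher number means less healthy.
# Healthy and Unhealthy come from the text; Degraded is an assumed
# intermediate state used here only to demonstrate the ordering.
$severity = @{ Healthy = 0; Degraded = 1; Unhealthy = 2 }

function Get-HealthSetStateSketch {
    param([string[]]$MonitorStates)

    # Start from Healthy and keep the least healthy state seen so far.
    $worst = 'Healthy'
    foreach ($state in $MonitorStates) {
        if ($severity[$state] -gt $severity[$worst]) { $worst = $state }
    }
    return $worst
}

# All monitors healthy: the health set is healthy.
Get-HealthSetStateSketch -MonitorStates 'Healthy', 'Healthy'     # Healthy

# One unhealthy monitor determines the state of the whole set.
Get-HealthSetStateSketch -MonitorStates 'Healthy', 'Unhealthy'   # Unhealthy
```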

For detailed steps to view server health or health set state, see the newly published topic Manage Health Sets and Server Health. For information on troubleshooting health sets, see this topic.

Health Groups

The external view of managed availability is composed of health groups. Health groups are exposed to System Center Operations Manager 2007 R2 and System Center Operations Manager 2012.

There are four primary health groups:

  • Customer Touch Points: Components that affect real-time user interactions, such as protocols, or the Information Store
  • Service Components: Components without direct, real-time user interactions, such as the Microsoft Exchange Mailbox Replication service, or the offline address book generation process (OABGen)
  • Server Components: The physical resources of the server, such as disk space, memory, and networking
  • Dependency Availability: The server’s ability to access necessary dependencies, such as Active Directory and DNS

When the Exchange 2013 Management Pack is installed, System Center Operations Manager (SCOM) acts as a health portal for viewing information related to the Exchange environment. The SCOM dashboard includes three views of Exchange server health:

  1. Active Alerts: Escalation responders write events to the Windows event log that are consumed by the monitor within SCOM. These appear as alerts in the Active Alerts view.
  2. Organization Health: A rollup summary of the overall health of the Exchange organization is displayed in this view. These rollups include health for individual database availability groups and health within specific Active Directory sites.
  3. Server Health: Related health sets are combined into health groups and summarized in this view.

Overrides

Overrides give an administrator the ability to configure some aspects of the managed availability probes, monitors, and responders. Overrides can be used to fine-tune some of the thresholds used by managed availability. They can also be used to enable emergency actions for unexpected events that may require configuration settings that differ from the out-of-box defaults.

Overrides can be created and applied to a single server (this is known as a server override), or they can be applied to a group of servers (this is known as a global override). Server override configuration data is stored in the Windows registry on the server on which the override is applied. Global override configuration data is stored in Active Directory.

Overrides can be configured to last indefinitely, or they can be configured for a specific duration. In addition, global overrides can be configured to apply to all servers, or only servers running a specific version of Exchange.

For detailed steps to view or configure server or global overrides, see Configure Managed Availability Overrides.

A newly configured override does not take effect immediately. The Microsoft Exchange Health Manager service checks for updated configuration data every 10 minutes. Global overrides additionally depend on Active Directory replication latency.
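
Because of this delay, it can help to confirm that an override was recorded before waiting for it to take effect. The following is a minimal sketch using the Get cmdlets listed in the table at the end of this post, run from the Exchange Management Shell. The server name EXCH01 is a placeholder, not a name from this article.

```powershell
# List the global overrides currently stored in Active Directory.
Get-GlobalMonitoringOverride

# List the local overrides stored in the registry of a single server.
# EXCH01 is a placeholder; substitute the name of one of your servers.
Get-ServerMonitoringOverride -Server EXCH01
```

These commands only show that the override configuration exists. They do not indicate whether the Health Manager service has already picked it up.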

Below are some examples of adding and removing global and server overrides:

Example 1 - Make Information Store maintenance assistant alerts non-urgent for 60 days:

Add-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType Responder -PropertyName NotificationServiceClass -PropertyValue 1 -Duration 60.00:00:00

Example 2 - Change the maintenance assistant monitor to look for 32 hours of failures for 30 days:

Add-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 115200 -Duration 30.00:00:00

Example 3 - Remove the maintenance assistant monitor override added in Example 2:

Remove-GlobalMonitoringOverride -Identity Store\DirectoryServiceAndStoreMaintenanceAssistantMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds

Example 4 - Remove the override added in Example 1 that made Information Store maintenance assistant alerts non-urgent:

Remove-GlobalMonitoringOverride -Identity Store\MaintenanceAssistantEscalate -ItemType Responder -PropertyName NotificationServiceClass

Example 5 - Change the database repeated mounts monitor threshold to 60 minutes for a period of 60 days:

Add-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds -PropertyValue 3600 -Duration 60.00:00:00

Example 6 - Remove the database repeated mounts threshold override added in Example 5:

Remove-GlobalMonitoringOverride -Identity Store\DatabaseRepeatedMountsMonitor -ItemType Monitor -PropertyName MonitoringIntervalSeconds

Example 7 - Change the database dismounted alert from HA to Store for a period of 7 days:

Add-GlobalMonitoringOverride -Identity Store\DatabaseAvailabilityEscalate -ItemType Responder -PropertyName ExtensionAttributes.Microsoft.Mapi.MapiExceptionMdbOffline -PropertyValue Store -Duration 7.00:00:00

Example 8 - Disable the VersionBucketsAllocated monitor for a period of 60 days:

Add-GlobalMonitoringOverride -Identity Store\VersionBucketsAllocatedMonitor -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration 60.00:00:00

Example 9 - Update the logs threshold in the DatabaseSize monitor for a period of 60 days:

Add-GlobalMonitoringOverride -Identity MailboxSpace\DatabaseSizeMonitor -ItemType Monitor -PropertyName ExtensionAttributes.DatabaseLogsThreshold -PropertyValue 100GB -Duration 60.00:00:00

Example 10 - Apply a server override to disable the quarantine monitor across all database copies for a period of 7 days:

(Get-MailboxDatabase <DB Name>).Servers | ForEach-Object { Add-ServerMonitoringOverride -Server $_.Name -Identity "Store\MailboxQuarantinedMonitor\<DB Name>" -ItemType Monitor -PropertyName Enabled -PropertyValue 0 -Duration 7.00:00:00 -Confirm:$false }
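
The interval values in the examples above are expressed in seconds, and the -Duration values use the d.hh:mm:ss time span format. The short sketch below relies only on the standard .NET TimeSpan type, which is available in any PowerShell session, and shows one way such values could be derived or checked instead of being calculated by hand. It does not call any Exchange cmdlets.

```powershell
# Interval values in seconds, as used in Example 2 (32 hours)
# and Example 5 (60 minutes).
[int]([TimeSpan]::FromHours(32).TotalSeconds)     # 115200
[int]([TimeSpan]::FromMinutes(60).TotalSeconds)   # 3600

# Parse a -Duration string to check that it means what you expect.
$duration = [TimeSpan]::Parse('60.00:00:00')
$duration.TotalDays                               # 60

# Build a -Duration string for 7 days, as used in Examples 7 and 10.
(New-TimeSpan -Days 7).ToString()                 # 7.00:00:00
```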

Management Tasks and Cmdlets

There are three primary operational tasks that administrators will typically perform with respect to managed availability:

  • Extracting or viewing system health
  • Viewing health sets, and details about probes, monitors and responders
  • Managing overrides

The two primary management tools for managed availability are the Windows Event Log and the Shell. Managed availability logs a large amount of information in the Exchange ActiveMonitoring and ManagedAvailability crimson channel event logs, such as:

  • Probe, monitor, and responder definitions, which are logged in the respective *Definition event logs.
  • Probe, monitor, and responder results, which are logged in the respective *Results event logs.
  • Details about responder recovery actions, including when the recovery action is started and when it is considered complete (whether successful or not), which are logged in the RecoveryActionResults event log.
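
As a starting point for exploring these logs from the Shell, the sketch below lists the matching crimson channels on the local server and reads a few recent events from the first channel that has any. It uses the standard Get-WinEvent cmdlet. The wildcard patterns are assumptions based on the log names mentioned above, so check the listing output for the exact channel names on your servers.

```powershell
# List the managed availability crimson channels on the local server.
# The wildcard patterns are assumptions derived from the log names above.
$logs = Get-WinEvent -ListLog 'Microsoft-Exchange-ActiveMonitoring/*',
                              'Microsoft-Exchange-ManagedAvailability/*' -ErrorAction SilentlyContinue
$logs | Select-Object LogName, RecordCount

# Read the five newest events from the first channel that contains records.
$first = $logs | Where-Object { $_.RecordCount -gt 0 } | Select-Object -First 1
if ($first) {
    Get-WinEvent -LogName $first.LogName -MaxEvents 5 |
        Select-Object TimeCreated, Id, Message
}
```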

There are 12 cmdlets used for managed availability, which are described in the following table.

Cmdlet                           Description
Get-ServerHealth                 Used to get raw server health information, such as health sets and their current state (healthy or unhealthy), health set monitors, server components, target resources for probes, and timestamps related to probe or monitor start or stop times, and state transition times.
Get-HealthReport                 Used to get a summary health view that includes health sets and their current state.
Get-MonitoringItemIdentity       Used to view the probes, monitors, and responders associated with a specific health set.
Get-MonitoringItemHelp           Used to view descriptions about some of the properties of probes, monitors, and responders.
Add-ServerMonitoringOverride     Used to create a local, server-specific override of a probe, monitor, or responder.
Get-ServerMonitoringOverride     Used to view a list of local overrides on the specified server.
Remove-ServerMonitoringOverride  Used to remove a local override from a specific server.
Add-GlobalMonitoringOverride     Used to create a global override for a group of servers.
Get-GlobalMonitoringOverride     Used to view a list of global overrides configured in the organization.
Remove-GlobalMonitoringOverride  Used to remove a global override.
Set-ServerComponentState         Used to configure the state of one or more server components.
Get-ServerComponentState         Used to view the state of one or more server components.
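
To show how the read-only cmdlets in the table fit together, here is a brief sketch that moves from a summary view down to individual monitoring items. The server name EXCH01 is a placeholder, and Store is used as the health set only because it appears in the override examples earlier in this post. The parameter names follow the published cmdlet reference as best understood here; treat them as assumptions and confirm them with Get-Help on your build.

```powershell
# EXCH01 is a placeholder server name; substitute your own.
$server = 'EXCH01'

# Summary view: health sets and their current state.
Get-HealthReport -Identity $server

# Raw view: the monitors behind a single health set.
Get-ServerHealth -Identity $server -HealthSet 'Store'

# The probes, monitors, and responders that make up that health set.
Get-MonitoringItemIdentity -Identity 'Store' -Server $server

# The state of the server components.
Get-ServerComponentState -Identity $server
```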