I’ve written a few blog posts now that get into the deep technical details of Managed Availability. I hope you’ve liked them, and I’m not about to stop! However, I’ve gotten a lot of feedback that we also need some simpler overview articles. Fortunately, we’ve just completed documentation on TechNet with an overview of Managed Availability. This was written to address how the feature may be managed day-to-day.
Even that documentation doesn’t address how you respond when Managed Availability cannot resolve a problem on its own. Responding to those failures is the most common interaction with Managed Availability, but we haven’t yet described specifically how to do it.
When Managed Availability is unable to recover the health of a server, it logs an event. Exchange Server has a long history of logging warning, error, and critical events into various channels when things go wrong. However, there are two things about Managed Availability events that make them more generally useful than our other error events:
When one of these events is logged on any server in our datacenters, a member of the product group team responsible for that health set gets an immediate phone call.
No one likes to wake up at 2 AM to investigate and fix a problem with a server. This keeps us motivated to only have Managed Availability alerts for problems that really matter, and also to eliminate the cause of the alert by fixing underlying code bugs or automating the recovery. At the same time, there is nothing worse than finding out about incidents from customer calls to support. Every time that happens we have painful meetings about how we should have detected the condition first and woken someone up. These two conflicting forces strongly motivate the entire engineering team to keep these events accurate and useful.
Along with a phone call, the on-call engineer receives an email with some information about the failure. The contents of this email are pulled from the event’s description.
The path in Event Viewer for these events is Microsoft-Exchange-ManagedAvailability/Monitoring. Error event 4 means that a health set has failed and gives the details of the monitor that has detected the failure. Information event 1 means that all monitors of a health set have become healthy.
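If you would rather read this channel from PowerShell than from Event Viewer, a query along the following lines should work. It is only a sketch: `EX01` is a placeholder server name, and the 50-event limit is arbitrary.

```powershell
# Sketch: list recent Managed Availability alert events from one server.
# 'EX01' is a placeholder server name, not something from this post.
$server = 'EX01'

Get-WinEvent -ComputerName $server -MaxEvents 50 -FilterHashtable @{
    LogName = 'Microsoft-Exchange-ManagedAvailability/Monitoring'
    Id      = 1, 4   # 4 = a health set has failed, 1 = a health set is healthy again
} |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-List
```

Note that Get-WinEvent reports an error rather than returning nothing when no events match the filter, so an error on a server that has never had a failure is expected.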
The Exchange 2013 Management Pack for System Center Operations Manager nicely shows only the health sets that are currently failed instead of the Event Viewer’s method of displaying all health sets that have ever failed. SCOM will also roll up health sets into four primary health groups or three views.
This wouldn’t be EHLO without some in-depth PowerShell scripts. The event viewer is nice and SCOM is great, but not everyone has SCOM. It would be pretty sweet to get the same behavior as SCOM to show only the health sets on a server that are currently failed.
Note: these logs serve a slightly different purpose than Get-HealthReport. Get-HealthReport shows the current health state of all of a server’s monitors. Events, on the other hand, are only logged in this channel once all the recovery actions for that monitor have been exhausted without fixing the problem. These events also describe the failure in detail. If you’re only going to take action based on one health metric, the events in this log are the better choice. Get-HealthReport is still the best tool to show you the up-to-the-minute user experience.
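For comparison, here is a rough sketch of the Get-HealthReport side of that trade-off, run from the Exchange Management Shell. `EX01` is again a placeholder server name, and the sketch assumes the report objects expose `HealthSet`, `AlertValue`, `State`, and `LastTransitionTime` properties.

```powershell
# Sketch: show only the health sets that are not currently healthy.
# 'EX01' is a placeholder; the property names below are assumed.
Get-HealthReport -Identity 'EX01' |
    Where-Object { $_.AlertValue -ne 'Healthy' } |
    Format-Table HealthSet, AlertValue, State, LastTransitionTime -AutoSize
```

Remember that this can list health sets whose recovery actions are still running, which is exactly why it differs from the event channel.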
We have a sample script that can help you with this; it is commented in a way that you can see what we were trying to accomplish. You can get the Get-ManagedAvailabilityAlerts.ps1 script here.
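The linked script is the one to use. Purely to illustrate the idea behind it, here is a minimal sketch of the "currently failed" logic: keep the most recent event for each health set, then report the health sets whose latest event is error event 4. The function name `Get-CurrentlyFailedHealthSet` and the input shape (objects with `TimeCreated`, `Id`, and `HealthSet` properties) are assumptions made for this example. Extracting the health set name from a real event is not shown, and the sample data below is made up.

```powershell
# Sketch only: not the logic of Get-ManagedAvailabilityAlerts.ps1 itself.
function Get-CurrentlyFailedHealthSet {
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyCollection()]
        [object[]]$AlertEvent   # hypothetical shape: TimeCreated, Id, HealthSet
    )

    # Walk the events oldest-first so that later events overwrite earlier ones.
    $latest = @{}
    foreach ($e in ($AlertEvent | Sort-Object TimeCreated)) {
        $latest[$e.HealthSet] = $e
    }

    # A health set is still failed if its newest event is ID 4 (failed)
    # rather than ID 1 (all monitors healthy again).
    $latest.Values | Where-Object { $_.Id -eq 4 } | Sort-Object HealthSet
}

# Made-up sample data standing in for events read from the Monitoring channel.
$sample = @(
    [pscustomobject]@{ TimeCreated = [datetime]'2013-06-01T00:30:00'; Id = 4; HealthSet = 'ECP' }
    [pscustomobject]@{ TimeCreated = [datetime]'2013-06-01T00:45:00'; Id = 1; HealthSet = 'ECP' }
    [pscustomobject]@{ TimeCreated = [datetime]'2013-06-01T01:00:00'; Id = 4; HealthSet = 'OWA' }
    [pscustomobject]@{ TimeCreated = [datetime]'2013-06-01T01:30:00'; Id = 4; HealthSet = 'ActiveSync' }
    [pscustomobject]@{ TimeCreated = [datetime]'2013-06-01T02:00:00'; Id = 1; HealthSet = 'OWA' }
    [pscustomobject]@{ TimeCreated = [datetime]'2013-06-01T03:00:00'; Id = 4; HealthSet = 'ECP' }
)

# Returns the ActiveSync and ECP events; OWA recovered at 02:00.
Get-CurrentlyFailedHealthSet -AlertEvent $sample
```

Sorting by time before filling the hashtable keeps the sketch independent of the order in which the events were read, which matters because Get-WinEvent returns newest events first by default.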
Either this method or Event Viewer will work pretty well for a handful of servers. If you have tens or hundreds of servers, we really recommend investing in SCOM or another robust and scalable event-collection system.
My other posts have dug deeply into troubleshooting difficult problems, and into how Managed Availability provides an immense amount of information about a server’s health. We rarely need to use these troubleshooting methods when running our datacenters. In fact, the only thing you need to resolve Exchange problems the way we do in Office 365 is a little bit of Event Viewer or a scheduled script.
Abram Jackson
Program Manager, Exchange Server