Hi, I’m Charlie Satterfield, a Sr. Program Manager on the Management & Monitoring team here at Microsoft. A bit of background on the team: we run a management and monitoring service for 5,000 servers and receive about 20,000 alerts a week. Our team provides this monitoring service to eight internal customers across Microsoft, including some of the major global online properties, such as www.microsoft.com, TechNet, Windows Update, and MSDN. The purpose of this blog post is to share, at a high level, our overall monitoring strategy.

First off, to level set: if you had talked with me a year ago, I’d have said that our management and monitoring service was built mostly on System Center Operations Manager 2007 R2 and Service Manager 2010. But we were running that stack on physical hardware with minimal redundancy and failover capabilities, a legacy limitation that came purely from our own architecture decisions. Today I can say we’re in the process of upgrading both our monitoring service and our ticketing service to System Center 2012, and we’re also moving much of our infrastructure to virtual machines and adding full redundancy in the stack. We’re able to design a highly available infrastructure with System Center 2012 thanks to the removal of the RMS role and the introduction of management pools in Operations Manager. Finally, one of the changes we’re seeing in our Microsoft customers is the move from traditional application hosting models to private and public clouds, as well as hybrids that combine all three. The applications we now monitor are across the board, and our monitoring implementation has changed a bit to accommodate this evolution. We’ll go through how we approach monitoring and the changes we make to monitor an application regardless of where it resides, and I’ll include more details on how we’re using System Center 2012.

Before we look at the way we monitor traditional, public cloud, and hybrid applications, I want to provide some information on our overall monitoring strategy. It consists of two separate approaches:

  • Service perspective application monitoring (Inside – Out)
    • Monitoring the actual code that is executed and delivered by the application, through custom MPs that leverage application events, instrumentation, and performance counters, as well as the performance of the underlying platform subsystems. Monitoring from the service perspective allows the application owner to track leading indicators that can predict future issues or the need to increase capacity. New in System Center 2012 is our ability to offer Application Performance Monitoring (APM) to our customers, which enables detailed application performance monitoring and exception tracking without instrumentation.
  • Client perspective application monitoring (Outside – In)
    • End-user experiences related to application availability, response times, and page load times, derived from web application availability monitors and synthetic transaction testing that uses both simple and more complex user experiences. Client perspective monitoring is the ultimate validation of application availability and performance as seen by the end user.

As we implement this strategy, the areas we are concerned with are:

  • Monitoring at the hardware level
  • Monitoring at the OS and subsystem health level
  • Monitoring the application components on premises
  • Monitoring the application components in the cloud

As we move from network to cloud and implement monitoring of application components, we focus more on the application and less on the network/hardware layer. From an implementation standpoint, the scope of monitoring decreases in the cloud, and we become more of a consumer of the cloud service and its monitors. The slide below shows how the monitoring priority shifts as we move across the application platforms.

[Slide: monitoring priority across application platforms]

For traditionally hosted applications, we have to take into consideration the entire scope of the application and the infrastructure. Infrastructure monitoring covers the hardware, operating system, SQL, and IIS. For this we leverage the base management packs, with a few exceptions. To monitor the application that runs on these traditional platforms, we leverage custom MPs, synthetic transactions to test websites, and HTTP probes to test web services, ensuring that the outside-in functionality of the application is available. By HTTP probe I mean a synthetic transaction that interrogates a test web page for status codes. The test web page actually exercises the functionality of the web service itself and returns success or error codes depending on the results.
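To make the HTTP probe idea concrete, here is a minimal Python sketch. It is an illustration only and is not how Operations Manager implements synthetic transactions. The function names, the three result labels, and the rule that any 2xx status counts as success are all assumptions made for the example.

```python
from typing import Callable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to the test page."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is still the probe result.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = fetch_status) -> str:
    """Classify the test page as 'healthy', 'error', or 'unreachable'.

    Assumption for this sketch: any 2xx status means the test page
    exercised the web service successfully.
    """
    try:
        status = fetch(url)
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return "unreachable"
    return "healthy" if 200 <= status < 300 else "error"
```

The `fetch` parameter exists so the classification logic can be exercised without a network call. In real use you would call `probe()` with the URL of your own test page on a schedule and alert on anything other than "healthy".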

For public cloud monitoring, the scope of monitoring focuses almost entirely on the application. To monitor a public cloud application hosted in Windows Azure, we use a management server that sits on the edge and communicates with Windows Azure. Using this management server, we’re able to monitor the application with the Azure MP and SQL Azure MP. In addition, much like the on-premises application, we continue to leverage synthetic transactions and HTTP probes. Unlike traditional on-premises monitoring, we don’t use an Operations Manager agent on Azure; instead we proxy through to Azure blob storage and leverage the Azure MP. For the Azure application to be monitored by the Operations Manager Azure MP, Windows Azure Diagnostics must be enabled and configured to forward diagnostic data to a Windows Azure storage account. For more information on how to configure the Azure MP and create rules and event monitors for Windows Azure applications, please refer to this article.

For hybrid application monitoring, the scope of monitoring includes aspects of both traditional and public cloud based application monitoring. We leverage the same models as traditional application monitoring, including the base hardware, operating system, and IIS management packs. We also continue to rely on synthetic transactions and HTTP probes to monitor the availability of the application from the end user’s perspective. In addition, the Azure and SQL Azure MPs are used to monitor the public cloud specific portion of the application. You might wonder how we’re able to understand the overall health of an application stretched across these platforms. We get a single view of the health of these types of applications using the Operations Manager Distributed Application Model. Leveraging the Distributed Application Model, we’re able to diagram out the subservices and roles of the application and assign unique health aggregations for each portion of the model. For example, one of the many applications we monitor for our customers includes Windows Azure based web and web service roles, but the database for the application is located on premises on traditional hosting. In this case we would create a Distributed Application Model with three subservices. Two of the subservices would include monitors specific to Azure MP monitoring, synthetic transactions, and HTTP probes. The third subservice would include monitors focused not only on SQL Server but also on the underlying operating system and hardware.
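The three-subservice example above can be sketched as a small health rollup in Python. This is a conceptual model only and does not use any Operations Manager API. The class names, the monitor names, and the worst-of aggregation rule are assumptions picked to keep the illustration simple.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Subservice:
    name: str
    monitors: Dict[str, Health] = field(default_factory=dict)

    def health(self) -> Health:
        # Worst-of rollup (assumed): the subservice is as unhealthy
        # as its least healthy monitor.
        return max(self.monitors.values(), default=Health.HEALTHY)


@dataclass
class DistributedApplication:
    name: str
    subservices: List[Subservice]

    def health(self) -> Health:
        # The application view is the worst state across all subservices.
        return max((s.health() for s in self.subservices), default=Health.HEALTHY)


# Hypothetical hybrid application: two Azure roles plus an on-premises database.
app = DistributedApplication("example-app", [
    Subservice("web role (Azure)", {
        "azure_mp": Health.HEALTHY,
        "synthetic_transaction": Health.HEALTHY,
    }),
    Subservice("web service role (Azure)", {
        "azure_mp": Health.HEALTHY,
        "http_probe": Health.WARNING,
    }),
    Subservice("database (on premises)", {
        "sql_server": Health.HEALTHY,
        "operating_system": Health.HEALTHY,
        "hardware": Health.HEALTHY,
    }),
])

print(app.health().name)
```

With these sample states, the single warning on the web service role's HTTP probe surfaces at the top of the model, which is the kind of single view described above. A different aggregation rule per subservice would only change the body of the `health()` methods.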

Additional Monitoring Features We are Starting to Leverage in System Center 2012

In System Center 2012 we get a number of additional capabilities that give us the flexibility to monitor at the infrastructure and application layers without compromising the service or subservice layers. One of the most valuable for our business is the ability to use APM to monitor the performance of .NET applications running on IIS 7 and to capture exceptions without having to instrument the code. APM provides graphical representations of the application’s performance, including a breakdown of performance events by duration. In addition, APM highlights the top failures in the application and lets us drill into each failure and display the stack trace down to the method call if symbols are available. This functionality reduces the need for the application development team to reproduce an issue in order to generate the data needed to triage and fix it. It is because of this functionality that all of our internal Microsoft customers are eager to get their hands on APM for their applications.

On top of all the architecture and monitoring improvements we’re taking advantage of in System Center 2012, the new dashboard capabilities let us create, in a few clicks of the mouse, dashboards that were much more difficult to build in Operations Manager 2007 R2. We plan to create dashboards initially to view the availability and health of our own servers before offering dashboards to our internal customers.

We hope you’ve found this blog posting helpful when you look at monitoring in the world of 2012.

Charlie Satterfield
Sr. Program Manager
Management & Monitoring