With Operations Manager 2007, monitoring Web-based applications with synthetic transactions became much easier thanks to the Web Application monitoring template. The template replaced the Web Sites and Services Management Pack from MOM 2005 and took advantage of the model-based approach and state-centric monitoring available in OpsMgr. One side effect, however, is that the alerts raised when the watcher node detects a problem have a blank description. In this article, I will describe some tips to help you make these alerts more meaningful and actionable.
Because I work on monitoring Web-based applications, I often hear from customers using the Web Application template that the default alert descriptions are useless. When a synthetic transaction fails, whether because of an error status code or a failed DNS lookup, the resulting Web application alert gives no useful indication of what went wrong.
To reproduce the problem, I created a simple Web application pointing at a non-existent URL, accepting all the default settings. Within a couple of minutes, two alerts showed up in the Alerts view: one for the Web request and one for the Web application.
Neither of them had a useful description. So what was the real cause of the problem? To find out, I had to open Health Explorer and review the monitor tree.
The monitor tree shows that the alert was raised because the HTTP status code was 404 (Not Found). As a user, I have to look through the alerts, open Health Explorer, and then identify the unit monitor that caused the alert. That's too many steps. And if I am forwarding this alert to a separate system from which I cannot launch Health Explorer, I am out of luck.
The problem is detected by a specific unit monitor that does not generate alerts by default; in the case above, it is the Status Code monitor. You can make it generate an actionable alert in a few simple steps:
1. In Health Explorer, right-click the monitor and open its properties.
2. Enable the alert for that monitor.
3. Set the alert description as follows:
Status code is $Data/Context/RequestResults/RequestResult["1"]/BasePageData/StatusCode$
4. Reset the monitor, close the existing alerts, and wait for the alert to fire again.
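To make the parameter a little less magical, here is a minimal Python sketch of how a `$Data/...$` token can be read as a path into the alert's context XML. It is only an illustration of the idea, not how OpsMgr itself evaluates parameters: the sample XML is a hypothetical, heavily trimmed context, treating `["1"]` as "the first RequestResult element" is an assumption, and `resolve_token` and `render_description` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, trimmed context data item. The real context produced by the
# watcher node contains many more elements; this layout is an assumption.
SAMPLE_DATA = """
<DataItem>
  <Context>
    <RequestResults>
      <RequestResult Id="1">
        <BasePageData>
          <StatusCode>404</StatusCode>
        </BasePageData>
      </RequestResult>
    </RequestResults>
  </Context>
</DataItem>
"""

# Matches one $Data/...$ token inside a description template.
TOKEN = re.compile(r"\$Data/[^$]+\$")


def resolve_token(token: str, data_xml: str) -> Optional[str]:
    """Resolve a single $Data/...$ token against an XML string (illustration only)."""
    path = token.strip("$")
    if not path.startswith("Data/"):
        raise ValueError("expected a $Data/...$ token")
    path = path[len("Data/"):]
    # Assumption: ["1"] means the 1-based position, which ElementTree writes as [1].
    path = re.sub(r'\["(\d+)"\]', r"[\1]", path)
    root = ET.fromstring(data_xml.strip())  # root plays the role of "Data"
    node = root.find(path)
    return node.text if node is not None else None


def render_description(template: str, data_xml: str) -> str:
    """Replace every $Data/...$ token in the template; unresolved tokens become empty."""
    def substitute(match: "re.Match") -> str:
        value = resolve_token(match.group(0), data_xml)
        return value if value is not None else ""
    return TOKEN.sub(substitute, template)


if __name__ == "__main__":
    template = 'Status code is $Data/Context/RequestResults/RequestResult["1"]/BasePageData/StatusCode$'
    print(render_description(template, SAMPLE_DATA))
```

Run against the sample, the status code template above renders as "Status code is 404". The point of the sketch is simply that each segment of the parameter names an element in the context XML, which is why the Raw view described next is the place to look for field names.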
Now I get a new alert from the Status Code monitor with details of the failure that I can act on. That is much more useful.
How did I find the magic description string for the alert? There is no magic to it: the information is in the alert context, which you can see in the state change event in Health Explorer, and it includes the status code. (For more details on authoring alert descriptions, please refer to the Authoring Guide.) To find the exact path of a field in the context, use the following steps:
1. Go to the Authoring space and edit the settings for the Web application. In the Web Application Editor, click the Run Test link.
2. Once the test has run, click View Full Results.
3. Click the Raw tab to see the full details.
4. Identify the StatusCode field (or any other field you are interested in) that you want to see in the alert description, and note the field's name in the XML.
5. Construct the parameterized alert description from that field, following the same pattern as the status code parameter:
Status code is $Data/Context/RequestResults/RequestResult["1"]/BasePageData/StatusCode$
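Steps 4 and 5 amount to walking the raw XML and turning the path to a field into a parameter string. The following Python sketch does that mechanically, under several assumptions: the raw XML is a hypothetical, trimmed sample (`ResponseTime` is an invented field name); its root is assumed to be the `RequestResults` element that sits under `$Data/Context` in the status code parameter; and only `RequestResult` is indexed, using the `["n"]` form from the post. `leaf_fields` and `candidate_parameters` are hypothetical helper names. Compare every generated string against your own Raw tab before using it.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterator, List, Tuple

# Hypothetical, trimmed raw test output. ResponseTime is an invented field
# used only to show a second leaf; check names against your own Raw tab.
RAW_XML = """
<RequestResults>
  <RequestResult Id="1">
    <BasePageData>
      <StatusCode>404</StatusCode>
      <ResponseTime>0.12</ResponseTime>
    </BasePageData>
  </RequestResult>
</RequestResults>
"""

# Assumption: only RequestResult is indexed, using the ["n"] form from the post.
INDEXED_TAGS = {"RequestResult"}


def leaf_fields(elem: ET.Element, path: str) -> Iterator[Tuple[str, str]]:
    """Yield (path, text) for every leaf element below elem."""
    children = list(elem)
    if not children:
        yield path, (elem.text or "").strip()
        return
    position: Counter = Counter()
    for child in children:
        position[child.tag] += 1
        step = child.tag
        if child.tag in INDEXED_TAGS:
            # 1-based position among siblings that share the same tag.
            step += f'["{position[child.tag]}"]'
        yield from leaf_fields(child, f"{path}/{step}")


def candidate_parameters(raw_xml: str, prefix: str = "$Data/Context") -> List[Tuple[str, str]]:
    """Return (alert parameter, sample value) pairs for each leaf field."""
    root = ET.fromstring(raw_xml.strip())
    return [(f"{path}$", text) for path, text in leaf_fields(root, f"{prefix}/{root.tag}")]


if __name__ == "__main__":
    for parameter, value in candidate_parameters(RAW_XML):
        print(f"{parameter}  (sample value: {value})")
```

For the sample, the first candidate is the same status code parameter used in step 3 of the previous procedure, paired with its sample value of 404; you would then paste the parameter you want into the alert description.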
Now you know a little-documented trick for creating actionable alert descriptions. Did you find this useful?
If we know this, why didn't we make the Web application alert description reflect this information by default? For the answer, see my next blog post.
I've found that alerts from the Availability aggregate rollups for the Web application monitors are pretty useless. I would rather have an alert that tells me which individual request failed (and now, with this description information available to me, *why* it failed) than just a blanket "hi, this web app is down." The problem with turning on the unit monitor in your example is that now two alerts will be generated, so I always turn off alerting for the aggregate monitor and let the individual requests do the talking.
John, excellent point.
Our long-term approach for OpsMgr 2007 is to help minimize alerts while keeping them meaningful. In my next blog post, I intend to discuss the pros and cons of that approach and would love to hear your thoughts in that discussion.
We have multiple Web application monitors and would like to bulk-enable "content match". Is this possible?
Really good to see that this problem is finally getting some attention.
The instructions for enabling the monitor with a description would be OK for a few websites, but in larger environments this is a lot of work. And having more than just the status code monitor active would mean doing all this work maybe five times for a single website, or am I wrong?
Looking forward to more posts on this subject.
Another problem we have is with the content match test: when it fails, the alert doesn't show what content it was looking for. I have looked all over for the parameter for this. Any ideas?
I will try to address as many of these concerns as possible in upcoming posts on this topic. Please keep the feedback coming.
I am trying to log in to a website, perform some actions, and log out again. The problem is that I cannot simply click a logout button, because the logout button's URL contains a session ID.
Is there any way to extract a string from the response body and pass it to the next request? Is there any documentation on the parameters?
Great information, but it's painful! By default you're already getting two alerts: one from the aggregate rollup, which tells you "hey, website is down", and one from the action step that failed. Adding this would create a THIRD alert whose name is just "Base page status code", which is useless if you are monitoring multiple Web applications.
What would actually be useful is if the aggregate rollup did its job and "rolled up" all the alerts: take the base page status code and the action step that failed and combine them into ONE alert!
Thank you for taking the time to do this. A question: it appears that if you have multiple requests in one Web application, only the first request provides details in Health Explorer. I confirmed this by reviewing the raw data returned from a test. Only one <StatusCode> element existed, so the first one returned is reused if you use the StatusCode parameter in the alert description of subsequent URL requests that fail. Is there any way to work around this, or is it by design?
It has been a while since I posted to the blog. The long break gave me a chance to work on some of the