In my last post, I described how to create an actionable alert for a specific unit monitor, the status code monitor. You can do the same for all the other unit monitors. In response to that post, John Curtiss wrote, 'Availability aggregate rollups for the web application monitors are pretty useless'. John is right. The rollup simply says 'something is wrong in this web app' and that it is down. To understand why, let us look at the monitor tree. Below is most of the monitor tree that forms the health of a web application. The leaf nodes are the unit monitors, and their health is rolled up to the aggregate monitors. Unit monitors can be status code, content match, numeric or security certificate related.
The alert generated by the aggregate monitor of the web application (Web app- URL) does not contain a precise description that identifies the exact cause of the problem. A web application alert can result from multiple failures. An alert is raised to indicate a problem, and ideally there is one alert per problem. For example, if a status code error puts the status code monitor into the Error state, and that in turn puts the Web app monitor into the Error state, the alert is generated because of the status code error. Suppose the status code problem then gets fixed, but in the meantime another failure occurs, say an expired certificate. The Web app monitor is still in the Error state, but because of a different problem. The alert remains in the same resolution state (New), and no new alert is generated, because the Web app monitor never left the Error state. A user who reads an alert description that mentions the first problem, the status code, may be misled into thinking that the status code is at fault rather than the certificate expiration.
An alert is only an indication of a problem; it does not assist in diagnosing it. Diagnosis is a complex process that may require additional data collection, which is why connecting to the health explorer is the preferable method. At the aggregate level, a problem may have been triggered by multiple causes, whereas at the unit monitor level we have a precise indication of the problem. Hence unit monitors can carry precise descriptions of the problem, whereas for aggregate monitors a precise description is harder to create. If you think that the majority of your problems are due to status codes, I would recommend using the alert description stated in the feedback thread, but it is hard to generalize an alert description at the aggregate level. An alert is also not intended to be a mechanism for live problem diagnostics.
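To make the scenario concrete, here is a minimal Python sketch of it. It is a toy model and not Operations Manager code: the `Health`, `Alert` and `AggregateMonitor` names are hypothetical, and I am assuming a simple worst-of rollup in which the alert is created when the aggregate leaves the healthy state and is dropped when it returns to healthy.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional


class Health(IntEnum):
    SUCCESS = 0
    WARNING = 1
    ERROR = 2


@dataclass
class Alert:
    description: str
    resolution_state: str = "New"


class AggregateMonitor:
    """Toy worst-of rollup over named unit monitors (hypothetical model)."""

    def __init__(self, unit_names: List[str]) -> None:
        self.units: Dict[str, Health] = {n: Health.SUCCESS for n in unit_names}
        self.state: Health = Health.SUCCESS
        self.alert: Optional[Alert] = None

    def report(self, unit: str, health: Health) -> None:
        self.units[unit] = health
        new_state = max(self.units.values())  # worst child state wins
        if new_state == Health.SUCCESS:
            self.alert = None  # aggregate is healthy again, alert goes away
        elif self.alert is None:
            # The alert is created once, by whichever unit failed first.
            self.alert = Alert(f"{unit} went {health.name}")
        # Otherwise the aggregate was already unhealthy: the existing
        # alert and its description are left untouched.
        self.state = new_state


web_app = AggregateMonitor(["status code", "certificate"])
web_app.report("status code", Health.ERROR)    # first failure creates the alert
web_app.report("certificate", Health.ERROR)    # second failure, no new alert
web_app.report("status code", Health.SUCCESS)  # first failure is fixed
print(web_app.state.name, "-", web_app.alert.description)
```

At the end, the aggregate is still in Error because of the certificate, but the alert description still names the status code.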
Another factor to take into consideration is reducing the number of outstanding alerts in the system. Alerting at the aggregate level is meant to generate one alert at the application level instead of a separate alert for each problem. Constant generation of alerts is undesirable in most cases. Hence, by default, we have disabled alerting at the unit monitor level. Users still have the option of enabling the alert on every unit monitor that needs it. Alerts for monitors in sealed management packs can be enabled using overrides. One could develop a tool using the SDK that automatically applies the appropriate overrides for a large number of web applications.
At the implementation level, there are optimizations in the monitoring infrastructure that intentionally avoid updating the monitor state for every state change notification unless the state is truly going to change from one state to another. In the example above, if the monitor goes to Error because of the status code and then remains in Error because of another problem, there is no need to update the state from Error to Error or to generate another alert at the aggregate level. If we did that for every event, it would generate a lot of state update notifications, which could create other performance and scalability problems.
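The idea behind that optimization can be sketched as follows. This is only an illustration in Python under my own assumptions, not the actual infrastructure code; `StateChangeFilter` and its fields are hypothetical names.

```python
from typing import List, Tuple


class StateChangeFilter:
    """Forwards a notification only when the state really changes."""

    def __init__(self, initial: str = "Success") -> None:
        self.state = initial
        self.forwarded: List[Tuple[str, str]] = []  # (old, new) transitions
        self.suppressed = 0                         # no-op notifications dropped

    def notify(self, new_state: str) -> bool:
        if new_state == self.state:
            # Error -> Error (or Success -> Success): nothing to update.
            self.suppressed += 1
            return False
        self.forwarded.append((self.state, new_state))
        self.state = new_state
        return True


monitor = StateChangeFilter()
# Three failures in a row (for example status code, certificate, content
# match), followed by a recovery.
for incoming in ["Error", "Error", "Error", "Success"]:
    monitor.notify(incoming)

print(monitor.forwarded)   # [('Success', 'Error'), ('Error', 'Success')]
print(monitor.suppressed)  # 2
```

Only two of the four notifications result in a state update. The price is that the second and third failures leave no trace at the aggregate level.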
We are looking into options for making the aggregate monitor alerts usable in one of our next releases. Answers to the following questions may help me refine the proposal:
- Would it be okay if the alert description indicated the first error condition, the one that put the monitor into Error/Warning and created the alert, and was not updated for subsequent state change events?
- What if the alert description is not updated after the alert is created, but the alert history is updated with the subsequent changes?
- What does the user need in order to determine the cause of an error after the error has gone away and the alert has been resolved?
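To make the first two questions concrete, here is a small Python sketch of what such an alert could look like. It is purely illustrative and does not describe a committed design; `AlertWithHistory` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertWithHistory:
    # Fixed at creation: the first error condition (first question).
    description: str
    # Later causes are appended here instead (second question).
    history: List[str] = field(default_factory=list)

    def record(self, event: str) -> None:
        """Append a later state change without touching the description."""
        self.history.append(event)

    def latest_cause(self) -> str:
        """Most recent known cause, falling back to the original one."""
        return self.history[-1] if self.history else self.description


alert = AlertWithHistory("status code monitor went to Error")
alert.record("certificate monitor went to Error")
alert.record("status code monitor returned to Success")

print(alert.description)     # still the first error condition
print(alert.latest_cause())  # the most recent entry in the history
```

In this shape the description stays stable for anyone who has already read it, while the history keeps the later causes available after the fact.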
I would like to hear thoughts from the readers.
Next, I will look into bulk editing of the monitor configuration.
There are more problems with alerts than unintelligible descriptions. First of all, there is the fact that alert changes get lost when the health state is reset, but that is another story.
IMO alerts should report the last error message (not the first). This is especially useful when the alert is threshold related: I need to know the last value, not the first one. I fear it is too expensive on the SQL side to keep all the alert contexts in the alert history, but if that is not the case (on the SQL box), this could be a viable alternative. The point is that if I must choose between the first alert context and the last one, the latter is what I need. In this process, any change the operators made to the alert should be preserved.
And let’s implement alert history the proper way; “alert modified by” is not enough. MOM2005 alert history was a lot more useful and was the first source of auditing for alert management.
I would much appreciate it if there were a way to bulk edit all of my web monitors. I have about 150 individual web monitors. I looked at editing them with .NET by making the changes and then exporting the XML to see what had changed, but it was much too complex.
I agree with Tim McFadden that bulk editing would be very useful. Ideally one would also have the ability to describe the expected state of one's environment. To be more specific, there are certain clustered web sites that may be stopped. Some sites due to be retired may be stopped. Finally, our standard is to stop the default web site for security purposes. An easier way to convey this to Ops Mgr without using extensive overrides would be very helpful.
We would like to see if there is a way to bulk enable the child monitors (i.e. Content Match) and also populate them with alert descriptions.
Could I suggest that the aggregate monitors have other problems, not limited to what you describe for web app availability?
Say I build a distributed app, and this app represents a place (say a branch office).
I add my router via SNMP; the switch depends on the router, and the server(s), UPS and printer depend on the switch.
I would like to be able to say: if the server failed but the router didn't, advise a certain group (say the Server team), but if the router or switch failed (which in turn makes everything else fail), alert the Comms team.
The same would be needed for a LOB app: if the SQL component fails, you want the DB team informed along with the Web team, and perhaps the application owner. If the Web part dies, you might alert the Web team only.
I am not sure if this is possible already....