Operations teams often run into problems with the OpsMgr2007 alert views.
When an alert is raised, an operator acknowledges the alert, manages the issue, and closes the alert if not already done.
The following chapters explain why this can be an issue, and how to manage it.
OpsMgr2007 manages two types of alert:
· Alert generated by a rule
· Alert generated by a monitor
An alert generated by a rule comes from an OpsMgr2007 rule, which does not affect the health of the target object (the alert source).
In the alert details window, the Alert Rule field is shown as a link.
An alert generated by a rule can be configured to consolidate repeated occurrences of the same problem. In that case, the Repeat Count field is updated each time the problem is raised again.
The Alert Context tab contains the last event that was used to generate the alert.
The alert auto-resolve process closes an alert generated by a rule only if the alert is still in resolution state New.
Rules are generally used to collect information such as events or performance counters. This information is used for troubleshooting, analysis, capacity planning, reporting, and so on.
Rules are also sometimes used for proactive monitoring; in this case the rule is configured to generate an alert.
Converted management packs can also contain rules that generate reactive alerts, because MOM2005 had no monitor concept.
This type of alert is resolved automatically only by the OpsMgr2007 auto-resolve process, either when the alert is still in resolution state “New” or when the alert source is healthy.
· The health of the target object is not affected
· The alert context contains the last event
· Auto-resolved if the resolution state is still New, or if the alert source is healthy
· Not auto-resolved when the issue is solved
An alert generated by a monitor comes from an OpsMgr2007 monitor, which affects the health of the target object (the alert source).
In the alert details window, the Alert Monitor field is shown as a link.
The Repeat Count field is never updated.
The alert is used as a notification when the monitor changes the health state of an object from Healthy to Critical (or Warning).
If the alert is closed manually, the monitor health state of the related object is not reset to Healthy, and if the problem persists, the monitor will never generate a new alert.
Therefore, an alert generated by a monitor, unlike one generated by a rule, should not be closed manually; it should be managed through the health of the target object. When the health returns to Healthy, the alert closes automatically.
An alert generated by a monitor is also closed by the OpsMgr2007 auto-resolve process if its resolution state is still New; however, the health state of the alert source is not updated.
The Alert Context tab contains the event that was used to change the monitor health state and generate the alert.
· Never manually close an alert generated by a monitor
· Manage the problem by using Health Explorer
· The alert is closed automatically once the problem is solved and the monitor has received the configured healthy event
· Alerts closed automatically by an OpsMgr connector do not reset the monitor health state
· The alert is also closed by the OpsMgr2007 auto-resolve process if the resolution state is still New; however, the health state of the alert source is not updated
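The behavior summarized above can be illustrated with a small toy model. This is only a sketch of the logic described in this article, not OpsMgr code: the `Monitor` class and its method names are hypothetical, and it assumes that a monitor raises an alert only when its state leaves Healthy.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Toy model (hypothetical): an alert is raised only on a Healthy -> unhealthy change."""
    state: str = "Healthy"
    open_alerts: list = field(default_factory=list)
    next_alert_id: int = 1

    def observe(self, new_state: str) -> None:
        # A new alert is created only when the state leaves Healthy.
        if self.state == "Healthy" and new_state != "Healthy":
            self.open_alerts.append(self.next_alert_id)
            self.next_alert_id += 1
        # Going back to Healthy closes the open alert automatically.
        if new_state == "Healthy":
            self.open_alerts.clear()
        self.state = new_state

    def close_alert_manually(self) -> None:
        # Closing the alert does NOT change the monitor state.
        self.open_alerts.clear()

    def reset(self) -> None:
        # Resetting the monitor forces the state back to Healthy.
        self.state = "Healthy"
        self.open_alerts.clear()


m = Monitor()
m.observe("Critical")      # state change: an alert is raised
m.close_alert_manually()   # alert gone, monitor still Critical
m.observe("Critical")      # no state change: no new alert
print(m.state, m.open_alerts)
m.reset()
m.observe("Critical")      # state change again: a new alert is raised
print(m.state, m.open_alerts)
```

In this model, the manually closed alert leaves the monitor stuck in Critical with nothing visible in the alert view, which is the situation a monitor reset is meant to fix.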
An operator cannot be prevented from closing an alert generated by a monitor, so there is no guarantee that the health state of the monitor has been reset to Healthy as well. I have developed two tools to manage this behavior:
The first tool, ResetMonitorFromAllClosedAlerts, scans all closed monitor alerts and checks the state of the related monitor; if the monitor is not healthy, its state is reset. If the issue is still occurring, a new alert is raised the next time the monitor detects it after the reset.
To make sure that the scanned alert corresponds to the most recent state change of the monitor, the tool compares the time the alert was added with the time of the monitor's last health state change. The difference must be less than 90 seconds, which is a reasonable indicator that the alert and the health state change are related.
(Alert.TimeAdded - (DateTime)monitoringObject.GetMonitoringStates(monitors).LastTimeModified).TotalSeconds < 90
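The 90-second check can be sketched independently of the OpsMgr SDK. The following is a minimal Python illustration, not the tool's actual code: the function names are hypothetical, and it assumes that the absolute difference between the two timestamps is compared and that any state other than "Healthy" counts as not healthy.

```python
from datetime import datetime, timedelta

# Maximum gap between alert creation and the monitor's last state change.
MAX_GAP = timedelta(seconds=90)


def alert_matches_last_state_change(alert_time_added, monitor_last_modified,
                                    max_gap=MAX_GAP):
    """Return True when the alert was added within max_gap of the last state change."""
    return abs(alert_time_added - monitor_last_modified) < max_gap


def should_reset(alert_is_closed, monitor_state, alert_time_added,
                 monitor_last_modified):
    """Combine the three conditions: closed alert, unhealthy monitor, matching times."""
    return (alert_is_closed
            and monitor_state != "Healthy"
            and alert_matches_last_state_change(alert_time_added,
                                                monitor_last_modified))


added = datetime(2009, 7, 13, 16, 14, 15)
modified = datetime(2009, 7, 13, 16, 14, 15)
print(should_reset(True, "Error", added, modified))
```

An alert that was added long before the monitor's last state change belongs to an earlier occurrence, so it is skipped rather than used as a reason to reset the monitor.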
This tool can be launched from the command line on the RMS server.
Without any option, the tool does not reset any monitor; it only lists the monitors that should be reset.
==== Reset Monitor!!!
Monitoring ObjectPath: OM2007R2.dom02.com
Alert Name: SPEC - Monitor Object from Syslog Event (critical/information)
Alert ResolvedBy: DOM02\Administrator
Alert TimeAdded: 13.07.2009 16:14:15
Monitor DisplayName: SPEC - Monitor Object from Syslog Event (critical/information)
Monitor HealthState: Error
Monitor LastTimeModified: 13.07.2009 16:14:15
With the –r option, all detected monitors are reset.
The second tool, ResetMonitorFromAlertId, uses the same principle. It takes an alert ID as its argument; if the alert was raised by a monitor and has since been closed, the tool checks the state of the related monitor and resets it if the monitor is not healthy.
This tool can be launched automatically by creating a notification channel.
The details of this implementation are explained in the chapters below.
Create a new command notification channel
Log Name: Operations Manager
Source: OpsMgr2007 ResetMonitorFromAlertId
Date: 13.07.2009 16:12:46
Event ID: 1000
Task Category: None
<Provider Name="OpsMgr2007 ResetMonitorFromAlertId" />
<TimeCreated SystemTime="2009-07-13T14:12:46.000Z" />
Date: 13.07.2009 16:12:49
Manage Alert with GUID: 5fd07143-a3ac-4eb2-8897-b73b6a80fa6e
<TimeCreated SystemTime="2009-07-13T14:12:49.000Z" />
<Data>Manage Alert with GUID: 5fd07143-a3ac-4eb2-8897-b73b6a80fa6e</Data>
Date: 13.07.2009 16:12:55
Event ID: 1010
Monitor resets by ResetMonitorfromAlertId
MonitorDisplayName: SPEC - Monitor Object from Syslog Event (critical/information)
AlertName: SPEC - Monitor Object from Syslog Event (critical/information)
<TimeCreated SystemTime="2009-07-13T14:12:55.000Z" />
<Data>Monitor resets by ResetMonitorfromAlertId
A rule can be created to collect the following event from RMS.
Root Management Server
As described in the following blog article, OpsMgr2007 SP1 has a hardcoded limit of 5 asynchronous responses.
So if more than 5 alerts are closed at the same time, the following event appears in the Operations Manager event log on the RMS server.
Alerts “Script or Executable was Dropped”
“The process could not be created because the maximum number of asynchronous responses (5) has been reached, and it will be dropped. Command executed: ………”
The following event should be monitored:
Windows event log (event collected by OpsMgr2007):
Event Log: Operations Manager
Source: Health Service Modules
This limit has been removed in OpsMgr2007 R2, but for performance reasons it can still be set as follows.
This limit can be modified by changing the following registry key:
“HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Modules\Global\Command Executer”
· Create Keys: Global\Command Executer
· Create a DWORD value called “AsyncProcessLimit” and set it between 1 and 100.
Any value outside of this range defaults back to 5.
This modification can affect RMS performance, so it is important not to increase the value too much and to check performance after changing it.
The value can be set to 20, and event ID 21410 can then be monitored to see whether that is enough or whether the value should be increased.
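The range rule for “AsyncProcessLimit” can be summarized in a few lines. This is only a sketch of the rule as stated in this article, assuming that a missing or out-of-range value falls back to 5; the function name is hypothetical and nothing here reads or writes the registry.

```python
DEFAULT_ASYNC_LIMIT = 5  # fallback described above


def effective_async_limit(registry_value=None):
    """Return the limit that applies for a given AsyncProcessLimit value.

    Values from 1 to 100 are used as-is; anything else (including a
    missing value) falls back to the default of 5.
    """
    if isinstance(registry_value, int) and 1 <= registry_value <= 100:
        return registry_value
    return DEFAULT_ASYNC_LIMIT


for candidate in (None, 0, 20, 100, 101):
    print(candidate, effective_async_limit(candidate))
```

One consequence of this rule is that setting a very large value in the hope of removing the limit has the opposite effect, since the limit goes back to 5.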
When I ran ResetMonitorFromAllClosedAlerts.exe from RMS I got an error. We have SCOM 2007 R2 running on Win 2K8 server.
error message :
Unhandled Exception: System.Security.SecurityException: The source was not found, but some or all event logs could not be searched. Inaccessible logs: Security.
The Zone of the assembly that failed was :