This is more of a philosophical post I wanted to write based on some recent conversations I have had in my team when trying to educate some people on SCOM and management packs. There are different opinions on this and these are only mine. Feel free to disagree and feel free to post some comments if you do.
Within SCOM you have two overarching choices for monitoring – state based and stateless. Let's look at each and the benefits / drawbacks each one has, and then I will get to the crux of things: why I think you should not try to force state based monitoring on something whose state you cannot really model.
Both of these approaches rely on some model of an application and discovery of that application and its components. I am not going to focus on this end to end except where it can impact your choice of stateless or state based monitoring. Also, since notification can be used in both cases, I will leave that out of the argument.
I will start with stateless. In this model you are going to be using rules to generate alerts. The model here is that a rule responds to an interesting event and raises an alert to an operator. The rule can optionally:
Rules are simple and have been with us a long time before any type of state based monitoring was provided (and I am sure will be with us a lot longer). It is simple – I see something I care about I raise an alert.
As with all monitoring in SCOM, stateless monitoring is targeted at a class. You may or may not go to a granular level of monitoring with stateless monitoring, and in many ways deep monitoring is less interesting since you are not going to be controlling state. For example, suppose I have a simple management pack and I model an application X with one or more components. My instance space may look like this.
Now suppose I want to target some stateless monitoring at the App X class. This is simple: I create some rules and target them at the App X class. This ensures that they only run where I have discovered an instance of App X.
For the App X component class I have a couple of choices with stateless monitoring. I could either model things as above and target my rules at App X component. I would then use some property of the App X component class as criteria for my monitoring so I can tell which instrumentation is for which instance e.g.
I will have at most one alert active for any instance of App X component although I could choose to have more by changing the suppression logic in my rule.
However since I am not monitoring the state of the Application X component you could ask what this deeper modeling is actually buying you other than more logic in your discovery. You could equally simplify your model to this and do away with your component class:
Now you could create the same rule as follows:
As with the previous case I will still have at most one alert active for any instance of App X component at a time because I handled this in my suppression. However, I have not modeled the App X component.
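To make the suppression behavior concrete, here is a minimal Python sketch of a rule targeted at App X that suppresses on the component name carried in the event. This is a toy simulation, not a SCOM API: the `RuleSim` and `Alert` names, the event ID 102 as the "bad" event, and the component names are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    rule: str
    suppression_key: tuple   # (rule name, component name)
    repeat_count: int = 0
    resolved: bool = False


@dataclass
class RuleSim:
    """Toy stand-in for an alert-generating rule (hypothetical, not a SCOM API)."""
    name: str
    bad_event_id: int
    alerts: list = field(default_factory=list)

    def on_event(self, event_id, component):
        # A rule only reacts to the event it cares about; it never tracks "good" events.
        if event_id != self.bad_event_id:
            return None
        key = (self.name, component)
        # Suppression: an open alert with the same key absorbs the new event.
        for alert in self.alerts:
            if alert.suppression_key == key and not alert.resolved:
                alert.repeat_count += 1
                return alert
        alert = Alert(self.name, key)
        self.alerts.append(alert)
        return alert


rule = RuleSim(name="App X component failure", bad_event_id=102)
rule.on_event(102, "Comp1")      # new alert for Comp1
rule.on_event(102, "Comp1")      # suppressed, repeat count goes up
rule.on_event(102, "Comp2")      # new alert for Comp2
rule.alerts[0].resolved = True   # an operator closes the Comp1 alert
rule.on_event(102, "Comp1")      # a fresh alert, because the old one is closed

open_alerts = [a for a in rule.alerts if not a.resolved]
print(len(rule.alerts), len(open_alerts))
```

Note that nothing in the sketch needs a component class: the component name is just a value taken from the event and used as part of the suppression key.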
While the monitoring is effectively the same for this case there are other benefits to modeling to a deeper level. Some of these are as follows:
So you have the option of what you want to do and achieve. The point about stateless monitoring is that you are not forced to monitor to a deep level. Hold this thought as I will show you that with state based monitoring you may actually be forced to model deeper than you want.
With stateless monitoring, once an alert is generated, there is no understanding of when the problem is resolved. A user (or an automated system like a connector) must close the alert. Depending on the suppression settings of individual rules, new alerts may or may not get generated until this is done. You are pushing the diagnosis and resolution state of problems onto the operator of the system, which may indeed be valid if your application has no understanding of its own state.
With state based monitoring you are using monitors as your primary method. As you likely know monitors are very different to rules. Some basic characteristics of normal monitors:
When thinking about state based and stateless monitoring choices there are a few areas I want to focus on.
The idea of a monitor is that at any point of time it should know the state of a part of your model. There should be no doubt. When a monitor shows red in the console it should mean there is a problem right now. This is very different to an alert from a rule that shows at some point there was a problem. While this idea sounds great, in practice there are a number of monitor types that break this concept:
To me this breaks the whole concept of state based monitoring, and you should have a very good reason for using these types of monitors (there are a few). You don’t understand the state of the system at any given time and instead you are making a prediction that the problem may have gone away in one case and in the other you are forcing a user to manually intervene. Logically both of these should probably have been done by a stateless rule. Monitors do have some benefits over rules though:
So when you think about monitors, think about whether the benefits above outweigh the fact that you are not accurately representing the state, and make sure customer expectations are set on the behavior of these monitors.
To be honest I have seen lots of examples of state based monitors that use manual / timer reset monitors that could actually be rewritten to properly determine the good state. This may involve some sort of polling of system state for the good state. While this may require the definition of a new monitor type and a bit more development time, the benefits to the customer definitely outweigh the effort it will take you to do this.
In terms of modeling, using monitors may force you to model classes deeper than you want. This may not be a bad thing but at some point you will stop. For example when monitoring a database server you may want to model down to the database level but modeling to the table or even row level with classes is not where you want to go. Using the example above, let’s consider the simple model:
Now I want to alert as before on the App X component, but assume I do not want to model this level in my application. If I know 101 is the good event and 102 is the bad event, I might assume I can do the following and create a monitor:
Monitors cannot define alert suppression since you get a maximum of one alert from a given monitor.
This monitor will have a major problem. If you have two components (Comp1 and Comp2) consider this:
In this flow we are broken. Comp1 is still in a bad state and we are showing healthy for Application X. You may think this is obvious but I have seen this many times in MPs and this is what drove me to write this post.
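The failure is easy to reproduce in a small Python simulation. This is only a sketch: the `NaiveAppMonitor` class is hypothetical, and the three-event flow is an assumed sequence consistent with the outcome described above, with 102 as the bad event and 101 as the good event.

```python
class NaiveAppMonitor:
    """Toy two-state monitor targeted at App X (hypothetical, not a SCOM API)."""

    def __init__(self):
        self.state = "Healthy"

    def on_event(self, event_id, component):
        # The monitor ignores which component raised the event,
        # because the component is not modeled as its own class.
        if event_id == 102:
            self.state = "Error"
        elif event_id == 101:
            self.state = "Healthy"
        return self.state


monitor = NaiveAppMonitor()
actually_bad = set()   # ground truth, tracked separately for comparison

flow = [(102, "Comp1"), (102, "Comp2"), (101, "Comp2")]
for event_id, component in flow:
    monitor.on_event(event_id, component)
    if event_id == 102:
        actually_bad.add(component)
    elif event_id == 101:
        actually_bad.discard(component)

print(monitor.state)   # state the console would show for App X
print(actually_bad)    # components that are really still broken
```

After the last event the monitor reports App X as healthy while Comp1 is still in the ground-truth set of broken components, which is exactly the mismatch described above.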
If you are using a monitor you have to be able to aggregate what you are doing to the class the monitor targets. In the above example you would want to consider how you would know across all components whether App X is healthy or not. You may not be able to do this using events. The other option is of course to model deeper:
Now the monitor would be defined as follows:
Optionally you might define a dependency monitor on the App X class to roll up health from the components. Now the monitor targeted at the component class above is just responsible for a single component and uses event parameters to filter out events about other components.
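As a rough illustration of the deeper model, the following Python sketch gives each component its own monitor that filters on the component name, and rolls health up to App X. It is a simulation under assumptions, not SCOM code: `ComponentMonitor`, `rollup`, and the worst-state rollup policy are choices made for the example.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    ERROR = 1


class ComponentMonitor:
    """Toy monitor targeted at a single App X component (hypothetical)."""

    def __init__(self, component):
        self.component = component
        self.state = Health.HEALTHY

    def on_event(self, event_id, component):
        # Filter on the event parameter: ignore events about other components.
        if component != self.component:
            return
        if event_id == 102:
            self.state = Health.ERROR
        elif event_id == 101:
            self.state = Health.HEALTHY


def rollup(monitors):
    """Assumed worst-state rollup from the components to App X."""
    return max((m.state for m in monitors), default=Health.HEALTHY)


monitors = {name: ComponentMonitor(name) for name in ("Comp1", "Comp2")}


def deliver(event_id, component):
    # Every monitor instance sees the event and decides whether it applies.
    for m in monitors.values():
        m.on_event(event_id, component)


for event_id, component in [(102, "Comp1"), (102, "Comp2"), (101, "Comp2")]:
    deliver(event_id, component)
before_fix = rollup(monitors.values())   # Comp1 is still bad

deliver(101, "Comp1")
after_fix = rollup(monitors.values())    # every component has reported good

print(before_fix.name, after_fix.name)
```

With the same event flow that fooled the single App X monitor, the rolled-up state stays in error until Comp1 itself reports the good event.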
Note that monitors have another characteristic to be very aware of when you are choosing between a monitor and a rule. The alert generated from a monitor is effectively separated from the monitor state. Subsequent changes to the alert will not affect the monitor state. The major concern here is when the alert is resolved in the console. This act will not currently reset the monitor at all. If the monitor was unhealthy and the alert is resolved, the monitor will still be unhealthy until the healthy event is seen. Critically, this means no future alerts will be generated from this monitor instance until either the monitor is manually reset using Health Explorer or the object returns to a healthy state by itself and then goes back to a bad state. Using manual reset monitors makes this worse – once a user resolves the alert it is never coming back until someone goes and resets the monitor, which is a very bad experience.
So should you use state based or stateless monitoring? The answer is both. There are reasons to use both of them, and the day we can get away from stateless monitoring is a long way away. As a result you are very likely to be using both in your management packs. My advice would be:
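The separation between alert and monitor state can be sketched in a few lines of Python. This is a simplified simulation with hypothetical names (`MonitorSim`, `resolve_alert`, `reset`), assuming 102 is the bad event and 101 the good event; it deliberately leaves out everything else a real monitor does.

```python
class MonitorSim:
    """Toy monitor whose alerts are decoupled from its state (hypothetical)."""

    def __init__(self):
        self.healthy = True
        self.alerts = []   # each alert is a small dict

    def on_event(self, event_id):
        if event_id == 102 and self.healthy:
            # An alert is raised only on the healthy -> unhealthy transition.
            self.healthy = False
            self.alerts.append({"resolved": False})
        elif event_id == 101:
            self.healthy = True

    def resolve_alert(self, index):
        # Resolving the alert does NOT touch the monitor state.
        self.alerts[index]["resolved"] = True

    def reset(self):
        # Stand-in for a manual reset of the monitor.
        self.healthy = True


monitor = MonitorSim()
monitor.on_event(102)        # goes unhealthy, first alert raised
monitor.resolve_alert(0)     # operator resolves the alert in the console
monitor.on_event(102)        # problem recurs: no transition, so no new alert
alerts_while_stuck = len(monitor.alerts)
still_unhealthy = not monitor.healthy

monitor.on_event(101)        # the good event finally arrives
monitor.on_event(102)        # now a real transition, so a second alert
alerts_after_recovery = len(monitor.alerts)

print(alerts_while_stuck, still_unhealthy, alerts_after_recovery)
```

Between resolving the alert and the next healthy event, the monitor stays unhealthy and silent, which is the window where problems go unnoticed.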
You should definitely strive for state based monitoring where possible but my key message out of all of this is not to be afraid of using rules still. Do not force state on something that is inherently stateless.
Finally, if you own the application you are building a management pack for then think about how you can improve your instrumentation to move towards being able to use monitors more by understanding and exposing application state better.