IMPORTANT: Always perform a FULL Backup of the database before doing anything to it !!!

ALSO: It is advised that you open a case at Microsoft before doing this - directly editing the database is not supported and you may find yourself in an unsupported state if anything goes wrong.

 

So I have encountered another interesting issue where we had a bad performance in SC Operations Manager 2012 (also applies to 2007 R2) because of the fact that the StateChangeEvent table was very large - the largest table actually in the database.

First of all every investigation should start with tuning the "flip-flopping" Monitors and manually grooming the "old" entries: http://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx

The important thing to know here from the start is the following: the "standard" OM Grooming Job p_StateChangeEventGrooming will *only* delete the StateChangeEvent entries which are older than the configured State Grooming Threshold *AND* for which the corresponding Monitors are currently in Healthy state

The modified procedure from Kevin's Article is actually the "same" procedure p_StateChangeEventGrooming *but* it has a small, but big impact change, to delete *ALL* StateChangeEvents which are older that the configured State Grooming Threshold regardless of what state the corresponding Monitors are in.

 

Ok, should be basically pretty straight forward right? Good! But there are some situations when you tune and you tune and then all of the sudden you can observe that you have no more "noisy" Monitors and we have a "stable" normal number of StateChangeEvents per day. However, after a couple of days we can see that even with these tuning actions which we have done, we can still see that the StateChangeEvent table is again the biggest or one of the biggest top tables in the database. But why is that?!

1. Let's start our investigation by running this SQL Query on the OperationsManager database to see how many StateChangeEvent entries we have, how many should get cleaned out by the "standard" Grooming and how many will *not* be cleaned out by the "standard" Grooming even if they are older than the State Grooming Threshold:

       NOTE: Don't worry if you also get results for StateChangeEvents that should get groomed out by the standard Grooming - this means that on the next standard Grooming run, these *will* get cleaned out - you can even force this by running this Query: EXEC p_StateChangeEventGrooming.

 DECLARE
   @GroomingThresholdLocal DATETIME,
   @GroomingThresholdUTC DATETIME
SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
FROM dbo.PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'
EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
SELECT 'Should be cleaned', COUNT(*) AS 'Count'
FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
INNER JOIN dbo.State AS S WITH(NOLOCK)
   ON SCE.StateId = S.StateId
WHERE
   SCE.TimeGenerated < @GroomingThresholdUTC AND
   S.HealthState = 1
UNION
SELECT 'Should NOT be cleaned', COUNT(*) AS 'Count'
FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
INNER JOIN dbo.State AS S WITH(NOLOCK)
   ON SCE.StateId = S.StateId
WHERE
   SCE.TimeGenerated < @GroomingThresholdUTC AND
   S.HealthState != 1
UNION
SELECT 'Total StateChanges', COUNT(*) AS 'Count'
FROM StateChangeEvent

 

2. WOW! So from the above we could see that about 80% (2.255.838 to be exact) of the existing StateChangeEvent entries are *older* than the configured State Grooming Threshold but are still not cleaned-out by the standard Grooming and will not be - the only reason for this is that it seams that the corresponding Monitors are in Warning or Critical state - to get an idea of which these Monitors are, please run this Query:

 DECLARE
   @GroomingThresholdLocal DATETIME,
   @GroomingThresholdUTC DATETIME
SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
FROM dbo.PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'
EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
SELECT
   DSV.DisplayName
FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
INNER JOIN dbo.State AS S WITH(NOLOCK)
   ON SCE.StateId = S.StateId
INNER JOIN Monitor AS M
   ON S.MonitorId = M.MonitorId
INNER JOIN BaseManagedEntity AS BME
   ON S.BaseManagedEntityId = BME.BaseManagedEntityId
INNER JOIN DisplayStringView AS DSV
   ON M.MonitorId = DSV.LTStringId
WHERE
   SCE.TimeGenerated < @GroomingThresholdUTC AND
   S.HealthState != 1 AND
   DSV.LanguageCode = 'ENU' AND
   M.IsUnitMonitor = 1
GROUP BY DSV.DisplayName

 

3. Now we encounter a pretty common issue - we should check for this - usually I bet that with this very high number of "unhealthy" Monitors we should have a looooot of open Alerts which in a production environment would not look pretty good so we can bet here that the corresponding Alerts were closed manually (or automated somehow) *BUT* the corresponding Monitors were *NOT* - to check how many these are, run this Query:

 DECLARE
   @GroomingThresholdLocal DATETIME,
   @GroomingThresholdUTC DATETIME
SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
FROM dbo.PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'
EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
SELECT COUNT(*)
FROM Alert AS A
WHERE A.RuleId IN (
   SELECT
      M.MonitorId
   FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
   INNER JOIN dbo.State AS S WITH(NOLOCK)
   ON SCE.StateId = S.StateId
   INNER JOIN Monitor AS M
   ON S.MonitorId = M.MonitorId
   INNER JOIN BaseManagedEntity AS BME
      ON S.BaseManagedEntityId = BME.BaseManagedEntityId
   WHERE
      SCE.TimeGenerated < @GroomingThresholdUTC AND
      S.HealthState != 1 AND
   M.IsUnitMonitor = 1
) AND A.ResolutionState = 255

 

4. WOW! So in this case we had about 2000 results! So to solve this issue, we can execute this PowerShell Script on a Management Server under OM PowerShell:

       NOTE: Keep in mind that after running this PS Script you will have get additional new StateChangeEvent entries because a Monitor Reset is also a state change, these will be groomed out as soon as they are older than the configured State Grooming Threshold. 

 $alerts = Get-SCOMAlert | where { $_.ResolutionState -eq 255 -and $_.IsMonitorAlert -eq $true };
foreach ($alert in $alerts) {
   $mrid = $alert.MonitoringRuleId;
   $mcid = $alert.MonitoringClassId;
   $moid = $alert.MonitoringObjectId;
   $monitor = Get-SCOMMonitor | where {$_.Id -eq $mrid};
   $monitoringclass = Get-SCOMClass | where {$_.Id -eq $mcid};
   $monitoringobject = Get-SCOMMonitoringobject -class $monitoringclass | where {$_.Id -eq $moid -and $_.HealthState -ne 1};
   $monitoringobject | foreach {$_.ResetMonitoringState($monitor)};
};

 

5. After this, we should also run Kevin's modified Grooming procedure again to clean this up:

       NOTE: If you are curious about the change in the original, standard Grooming procedure, I have highlighted the change in the Query below.

 SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
BEGIN
   SET NOCOUNT ON
   DECLARE
   @Err INT,
   @Ret INT,
   @DaysToKeep TINYINT,
   @GroomingThresholdLocal DATETIME,
   @GroomingThresholdUTC DATETIME,
   @TimeGroomingRan DATETIME,
   @MaxTimeGroomed DATETIME,
   @RowCount INT
SET @TimeGroomingRan = GETUTCDATE()
SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
FROM dbo.PartitionAndGroomingSettings
WHERE ObjectName = 'StateChangeEvent'
EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
SET @Err = @@ERROR
IF (@Err <> 0) BEGIN
   GOTO Error_Exit
END
SET @RowCount = 1
SELECT @MaxTimeGroomed = MAX(TimeGenerated)
FROM dbo.StateChangeEvent
WHERE TimeGenerated < @GroomingThresholdUTC
IF @MaxTimeGroomed IS NULL
   GOTO Success_Exit
DELETE MJS
FROM dbo.MonitoringJobStatus MJS
JOIN dbo.StateChangeEvent SCE
   ON SCE.StateChangeEventId = MJS.StateChangeEventId
JOIN dbo.State S WITH(NOLOCK)
   ON SCE.[StateId] = S.[StateId]
WHERE SCE.TimeGenerated < @GroomingThresholdUTC
AND S.[HealthState] in (0, 1, 2, 3)
SELECT @Err = @@ERROR
IF (@Err <> 0) BEGIN
   GOTO Error_Exit
END
WHILE (@RowCount > 0) BEGIN
   DELETE TOP (10000) SCE
   FROM dbo.StateChangeEvent SCE
   JOIN dbo.State S WITH(NOLOCK)
      ON SCE.[StateId] = S.[StateId]
   WHERE TimeGenerated < @GroomingThresholdUTC
   AND S.[HealthState] in (0,1,2,3)
   SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT
   IF (@Err <> 0) BEGIN
      GOTO Error_Exit
   END
END
UPDATE dbo.PartitionAndGroomingSettings
SET GroomingRunTime = @TimeGroomingRan,
   DataGroomedMaxTime = @MaxTimeGroomed
WHERE ObjectName = 'StateChangeEvent'
SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT
IF (@Err <> 0) BEGIN
   GOTO Error_Exit
END
   GOTO Success_Exit
Error_Exit:
   PRINT 'ERROR!!!'
Success_Exit:
END

 

To avoid this in the future:

  • Make sure that you always solve a problem before "closing" the issue (close Alert and Reset Monitor) because if you don't, then you will end up with another Alert and another unhealthy state of that Monitor.
  • If you don't want to solve an issue and usually just "ignore" the state of a Monitor or it's Alert, then it would be best to disable that Monitor
  • For Monitors which don't have AutoResolve option or which don't have a "Healthy State Check" always make sure that you reset the Monitor properly: http://technet.microsoft.com/en-us/library/hh212816.aspx
  • Basically make sure somehow, that you *ALWAYS* also reset a Monitor if you close it's Alert - a good idea to constantly monitor this and make sure you don't have to worry about that, is to implement some kind of automation for this, like in this great "AS IS" example using SC Orchestratorhttp://blog.scomfaq.ch/2012/05/05/reset-monitor-using-scom-2012-and-orchestrator-a-must-have-runbook/

 

Try to keep your environment healthy and always be with your eyes open for new tuning! ;)