Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

The 31552 event, or “why is my data warehouse server consuming so much CPU?”


A very common customer scenario:  all of a sudden you start getting these 31552 events on the RMS, every 10 minutes.  This drives a monitor state and generates an alert when the monitor goes red.


 

However – most of the time, my experience is that this alert gets “missed” among all the other alerts that OpsMgr raises throughout the day.  Eventually, customers notice the state of the RMS is critical, or their availability reports take forever or start timing out, or they notice that CPU on the data warehouse server is pegged or very high.  It may be several days before they are even aware of the condition.

 


 

 

The 31552 event looks similar to this:

Date and Time: 8/26/2010 11:10:10 AM
Log Name: Operations Manager
Source: Health Service Modules
Event Number: 31552
Level: 1
Logging Computer: OMRMS.opsmgr.net
User: N/A
Description:
Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance Instance name: State data set Instance ID: {50F43FBB-3F59-10DA-AD1F-77E61C831E36} Management group: PROD1
 

The alert is:

Data Warehouse object health state data dedicated maintenance process failed to perform maintenance operation

Data Warehouse object health state data dedicated maintenance process failed to perform maintenance operation. Failed to store data in the Data Warehouse.
Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

One or more workflows were affected by this.

Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {50F43FBB-3F59-10DA-AD1F-77E61C831E36}
Management group: PROD1

 

 

Now – there can be MANY causes of this 31552 event and monitor state.  There is NO SINGLE diagnosis or solution.  Generally, we recommend you call into Microsoft Support when impacted, so your specific issue can be evaluated.

 

The most common issues causing the 31552 events are:

  • A sudden flood (or excessive sustained amount) of data to the warehouse that is causing aggregations to fail going forward.
  • The Exchange 2010 MP is imported into an environment with lots of statechanges happening.
  • An excessively large ManagedEntityProperty table, causing maintenance to fail because the table cannot be processed in the time allotted.
  • Too many tables joined in a view or query (>256 tables) when using SQL 2005 as the DB engine.
  • Poor SQL performance (typically disk I/O latency).
  • Using SQL Standard edition – you might see these randomly at night during maintenance, because online indexing is not supported in Standard edition.
  • Misconfigured SQL permissions.
  • Too much data in the warehouse staging tables, which was not processed due to an issue and is now too much to process at one time.
  • Random 31552’s caused by DBA maintenance, backup operations, etc.

If you think you are impacted with this, and have an excessively large ManagedEntityProperty table – the best bet is to open a support case.  This requires careful diagnosis and involves manually deleting data from the database which is only supported when directed by a Microsoft Support Professional.

 

The “too many tables” issue is EASY to diagnose – because the text of the 31552 event will state exactly that.  It is easily fixed by reducing data warehouse retention for the affected dataset type.
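
If you want to see where retention is currently set before changing it, you can read it straight from the warehouse.  The query below is a read-only sketch using the StandardDataset and StandardDatasetAggregation tables (the same tables Cory’s backlog query further down reads MaxDataAgeDays from); use a supported method, such as the dwdatarp.exe tool, if you decide to actually change retention:

USE OperationsManagerDW
-- Current retention in days per dataset and aggregation type
-- (AggregationTypeId: 0 = raw, 20 = hourly, 30 = daily)
SELECT sds.SchemaName, sda.AggregationTypeId, sda.MaxDataAgeDays
FROM StandardDataset sds WITH(NOLOCK)
JOIN StandardDatasetAggregation sda WITH(NOLOCK) ON sda.DatasetId = sds.DatasetId
ORDER BY sds.SchemaName, sda.AggregationTypeId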

 

 

Now – the MOST common scenario I seem to run into actually just happened to me in my lab environment, which prompted this article.  I see this happen in customer environments all too often.

I had a monitor which was based on Windows Events.  There was a “bad” event and a “good” event.  However – something broke in the application, and caused BOTH events to be written to the application log multiple times a second.  We could argue this is a bad monitor, or a defective logging module for the application…. but regardless, the condition is that a monitor of ANY type starts flapping, changing from good to bad to good WAY too many times.

What resulted – was 21,000 state changes for my monitor, within a 15 MINUTE period.


 

At the same time, all the aggregate rollup and dependency monitors also had to process these statechanges…. each of which is also recorded as a statechange event in the database.  So you can see – a SINGLE bad monitor can wreak havoc on the entire system, affecting many more monitors in the health state rollup.
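
To get a feel for the total statechange volume (unit monitors plus all the rollups they drive), a simple per-day count against the OperationsManager database helps.  This is a sketch, assuming the standard StateChangeEvent schema:

USE OperationsManager
-- Statechange events written per day, most recent day first
SELECT CONVERT(VARCHAR(10), TimeGenerated, 102) AS DayGenerated,
       COUNT(*) AS NumStateChanges
FROM StateChangeEvent WITH(NOLOCK)
GROUP BY CONVERT(VARCHAR(10), TimeGenerated, 102)
ORDER BY DayGenerated DESC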

 

The Operations Database handles these inserts quite well, but the Data Warehouse does not.  Each statechange event is written to both databases.  The standard dataset maintenance job is kicked off every 60 seconds on the warehouse.  It is called by a rule (Standard Data Warehouse Data Set maintenance rule) which targets the “Standard Data Set” class and executes a specialized write action to start maintenance on the warehouse.

What is failing here is that the maintenance operation (which also handles the hourly and daily dataset aggregations for reports) cannot complete in the default time allotted.  Essentially – there are SO many statechanges in a given hour that the maintenance operation times out, rolling back the transaction.  This becomes a never-ending loop, which is why it never seems to “catch up”… a single large transaction that cannot complete blocks anything from being committed to the database.  Under normal circumstances, 10 minutes is plenty of time to complete these aggregations, but under a flood condition there are too many statechanges to calculate the time-in-state for every monitor and instance before the timeout hits.
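
One quick way to gauge how far inserts have outrun maintenance is to count the rows waiting in the dataset’s staging table, which maintenance drains as it runs.  A sketch, assuming the State dataset’s staging table is named State.StateStaging:

USE OperationsManagerDW
-- Rows waiting to be processed into the State dataset.  A count that keeps
-- climbing means maintenance is not keeping up with the incoming flood.
SELECT COUNT(*) AS PendingStateRows
FROM State.StateStaging WITH(NOLOCK)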

So – the solution here is fairly simple: 

  • First – solve the initial problem that caused the flood.  Ensure you don’t have too many statechanges constantly coming in that are contributing to this.  I discuss how to detect this condition and rectify it HERE, and a query sketch follows this list.
  • Second – we need to disable the standard built-in maintenance that is failing, and run it manually, so it can complete successfully.
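
For the first step, the usual approach is to rank monitors by statechange count in the OperationsManager database.  The sketch below assumes the standard schema (StateChangeEvent, State, MonitorView, ManagedType); the 7-day window is arbitrary, so adjust it to your flood:

USE OperationsManager
-- Top 50 noisiest unit monitors by statechange count over the last 7 days
SELECT DISTINCT TOP 50
       COUNT(sce.StateId) AS NumStateChanges,
       m.DisplayName AS MonitorDisplayName,
       m.Name AS MonitorIdName,
       mt.TypeName AS TargetClass
FROM StateChangeEvent sce WITH(NOLOCK)
JOIN State s WITH(NOLOCK) ON sce.StateId = s.StateId
JOIN MonitorView m WITH(NOLOCK) ON s.MonitorId = m.Id
JOIN ManagedType mt WITH(NOLOCK) ON m.TargetMonitoringClassId = mt.ManagedTypeId
WHERE m.IsUnitMonitor = 1
  AND sce.TimeGenerated > DATEADD(dd, -7, GETUTCDATE())
GROUP BY m.DisplayName, m.Name, mt.TypeName
ORDER BY NumStateChanges DESC

Anything with an outsized count at the top is your flapping monitor; fix or disable it before trying to catch up on maintenance.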

For the second step above – here is the process:

1.  Using the Instance name line in the 31552 event, find the dataset that is causing the timeout (see the Instance name in the event below).

Date and Time: 8/26/2010 11:10:10 AM
Log Name: Operations Manager
Source: Health Service Modules
Event Number: 31552
Level: 1
Logging Computer: OMRMS.opsmgr.net
User: N/A
Description:
Failed to store data in the Data Warehouse. Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. One or more workflows were affected by this.

Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {50F43FBB-3F59-10DA-AD1F-77E61C831E36}
Management group: PROD1
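
If the friendly “Instance name” does not make the schema name obvious, you can list how the warehouse maps the two.  This is a sketch, assuming the dbo.Dataset table carries DatasetDefaultName alongside the StandardDataset table this article already queries:

USE OperationsManagerDW
-- Map the friendly dataset name (as shown in the 31552 event) to its SchemaName
SELECT ds.DatasetDefaultName, sds.SchemaName, sds.DatasetId
FROM Dataset ds WITH(NOLOCK)
JOIN StandardDataset sds WITH(NOLOCK) ON sds.DatasetId = ds.DatasetId

For the event above, “State data set” corresponds to SchemaName 'State'.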

2.  Create an override to disable the maintenance procedure for this data set:

  • In the OpsMgr console go to Authoring-> Rules-> Change Scope to “Standard Data Set”
  • Right click the rule “Standard Data Warehouse Data Set maintenance rule” > Overrides > Override the rule > For a specific object of class: Standard Data Set
  • Select the data set that you found from the event in step 1.
  • Check the box next to Enabled and change the override value to “False”, and then apply the changes.
  • This will disable dataset maintenance from running automatically for the given dataset type.

3.  Restart the “System Center Management” service on the RMS.  This is done to kill any maintenance already running, and ensure the override is applied immediately.

4.  Wait 10 minutes and then connect to the SQL server that hosts the OperationsManagerDW database and open SQL Management Studio.

5. Run the query below, replacing 'State' with the SchemaName of the dataset from step 1.

**Note: This query could take several hours to complete.  This depends on how much data has been flooded to the warehouse, and how far behind it is in processing.  Do not stop the query prior to completion.

USE [OperationsManagerDW]
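-- 'State' is the SchemaName from step 1; substitute it if a different dataset is timing out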
DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
EXEC StandardDatasetMaintenance @DataSet

6. Once the query finishes, delete the override configured in step 2.

7. Monitor the event log for any further timeout events.

**Note:  One of the biggest challenges is often getting maintenance to stop running on its own, so you can run it manually.  This matters because one of the first things StandardDatasetMaintenance does is check whether it is already running.  If it is, your manual run will simply start and complete in about a second, doing nothing.  If you are having trouble getting it to run manually, make sure you have created the override, and have waited at least 10 minutes for the previous job to time out.  Some customers elect to restart the healthservice on all management servers to kill the previous job, or go as far as restarting the SQL service.  Another tactic is to disable maintenance for ALL objects of class “Dataset”, then wait 10 minutes.

 

In my case – my maintenance task ran for 25 minutes and then completed.  In most customer environments, this can take several hours, depending on how powerful the SQL servers are and how big the backlog is.  If the maintenance task returns immediately and does not appear to run, ensure your override is set correctly, and try again after 10 minutes.  Maintenance will not run if the warehouse thinks it is already running.

***Note:  This seemed to clear up my issue, as the 31552’s were immediately gone.  However, at 2am they came back, every 10 minutes again, and my warehouse CPU was spiked again.  My assumption is that it got through the hourly aggregation flood, and was now trying to get through the daily aggregation work and hit the same issue.  So – when I discovered this was sick again, I used the same procedure above, and this time the job again took 25 minutes.  I have seen this same behavior with a customer – where it took several days to “plow through” the flood of data to finally get to a state where maintenance would always complete in the 10 minute time period.

 

Cory Delamarter has written a good query to figure out how far behind you are:  http://blogs.technet.com/b/corydelamarter/archive/2011/02/17/my-opsmgr-data-warehouse-is-running-behind.aspx   I will include it below:

USE OperationsManagerDW;
WITH AggregationInfo AS (
    SELECT AggregationType = CASE
               WHEN AggregationTypeId = 0 THEN 'Raw'
               WHEN AggregationTypeId = 20 THEN 'Hourly'
               WHEN AggregationTypeId = 30 THEN 'Daily'
               ELSE NULL
           END
          ,AggregationTypeId
          ,MIN(AggregationDateTime) AS 'TimeUTC_NextToAggregate'
          ,COUNT(AggregationDateTime) AS 'Count_OutstandingAggregations'
          ,DatasetId
    FROM StandardDatasetAggregationHistory
    WHERE LastAggregationDurationSeconds IS NULL
    GROUP BY DatasetId, AggregationTypeId
)
SELECT SDS.SchemaName
      ,AI.AggregationType
      ,AI.TimeUTC_NextToAggregate
      ,Count_OutstandingAggregations
      ,SDA.MaxDataAgeDays
      ,SDA.LastGroomingDateTime
      ,SDS.DebugLevel
      ,AI.DataSetId
FROM StandardDataSet AS SDS WITH(NOLOCK)
JOIN AggregationInfo AS AI WITH(NOLOCK) ON SDS.DatasetId = AI.DatasetId
JOIN dbo.StandardDatasetAggregation AS SDA WITH(NOLOCK)
    ON SDA.DatasetId = SDS.DatasetId
   AND SDA.AggregationTypeID = AI.AggregationTypeID
ORDER BY SchemaName DESC

This will output how many aggregations you are behind on, and of which type.  Very handy to help calculate how long it will take to get this cleaned up!

Additionally – Cory has written a loop script to keep running maintenance.  This is important because each run of maintenance will only process a single hourly or daily aggregation.  Once you have figured out how long each run takes, you can estimate how long it will take to catch up – and kick off this looping script:

USE OperationsManagerDW
DECLARE @DataSetName varchar(50)
-- Set this string to the "SchemaName" value that is behind
SET @DataSetName = 'State'
WHILE ( (SELECT COUNT(AggregationDateTime)
         FROM StandardDatasetAggregationHistory AS ah WITH(NOLOCK)
         INNER JOIN StandardDataSet AS ds WITH(NOLOCK) ON ah.DatasetId = ds.DatasetId
         WHERE ds.SchemaName = @DataSetName
           AND LastAggregationDurationSeconds IS NULL) > 1 )
BEGIN
    BEGIN TRANSACTION;
    USE [OperationsManagerDW]
    DECLARE @DataSet uniqueidentifier
    SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = @DataSetName)
    EXEC StandardDatasetMaintenance @DataSet
    COMMIT TRANSACTION;
END

Another step in the process is to increase the time allowed for aggregations to complete.  If you had a terrible state storm, it might take HOURS for your aggregations to process.  However, if you are reaching a level where your state changes are simply too heavy, or your SQL I/O is not fast enough, you might consider increasing the timeout for aggregations to run.

The dataset maintenance runs every 60 seconds.  For the State dataset, we delay starting maintenance for 5 minutes, and then we have a 5 minute default timeout.  This is why you might see a 31552 event every 10 minutes like clockwork.

What we can do is create a new registry key and value on EVERY management server, with a new timeout value.

Open the registry and locate HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0.  Create a new key under “3.0” named “Data Warehouse”.  Then create a new DWORD value named “Command Timeout Seconds” with a decimal value of 900.  This raises the default 5 minute timeout to 15 minutes.
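
For reference, here is the same change expressed as a .reg file you could import on each management server.  This is a sketch of the manual steps above (900 decimal = 0x384 hex):

Windows Registry Editor Version 5.00

; Raises the data warehouse command timeout to 900 seconds (15 minutes)
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse]
"Command Timeout Seconds"=dword:00000384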


If you cannot complete your aggregations any faster than 15 minutes under normal conditions, something is very wrong and you need to resolve the root cause.

 

 

This is a good, simple process to try to resolve the issue yourself, without having to log a call with Microsoft first.  There is no risk in attempting this process yourself – to see if it can resolve your issue.

If you are still seeing timeout events, there are other issues involved.  I’d recommend opening a call with Microsoft at that point.

Again – this is just ONE TYPE of (very common) 31552 issue.  There are many others, and careful diagnosis is needed.  Never assume someone else's fix will resolve your specific problem, and NEVER edit an OpsMgr database directly unless under the direct support of a Microsoft support engineer.

 

 

(***Special thanks to Chris Wallen, a Sr. Support Escalation Engineer in Microsoft Support for assisting with the data for this article)

Comments
  • Kevin,

    You mentioned under the "The most common issues causing the 31552 events seen are:" section that

    "Using SQL standard edition, you might see these randomly at night, during maintenance as online indexing is not supported using SQL standard edition."

    can cause this "31552" error message, so are you saying that people shouldn't run SCOM 2007 R2 on the Standard SQL edition?  If yes, how come the Microsoft SCOM supported configuration document doesn't say that?  If Standard SQL is supported and we know that SCOM relies on online indexing, then how can I remove this dependency to eliminate these errors?

    Thanks

    Murad

  • Yes - using SQL standard edition can cause an occasional 31552 during large maintenance operations.

    I did NOT say "people shouldn't run SQL standard edition".  It is supported.... I am just saying that while index operations occur, changes to the database can be blocked, which can cause this error to randomly occur.  It isn't an issue.  It is an explanation.

    You have to ignore these errors if they are random and don't repeat.  It is only a cause for concern if the environment does not recover from the situation.  If it recovers, it will change state back to healthy and close the alert.

  • Hi Kevin

    I really need your help – right now we are facing the problem described in this blog (we have had the problem before and used your blog with success), but this time there seems to be too much data to process: when I run the “StandardDatasetMaintenance” job, it runs for around 3 hours and then SQL returns the following error:

    Msg 777971002, Level 16, State 1, Procedure StandardDatasetMaintenance, Line 142

    Sql execution failed. Error 777971002, Level 16, State 1, Procedure StandardDatasetAggregate, Line 424, Message: Sql execution failed. Error 777971002, Level 16, State 1, Procedure StateAggregate, Line 341, Message: Sql execution failed. Error 666, Level 16, State 2, Procedure StateAggregate, Line 196, Message: The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 422213163220992. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.

    Have you seen this error before, or do you have any ideas on what it could be?

    Do you know if there is a way to run the “StandardDatasetMaintenance” so it only processes smaller chunks of data?

    Right now I’m running a case with Microsoft on this issue, but it would be nice if you have some input for me  :o)

  • @HC -

    I have not seen that condition, and I would recommend opening a case with Microsoft for that.  Please keep us posted on what the issue was.

  • Hi Kevin

    I already have a Microsoft case running – I will let you know as soon as we solve the problem.

  • Hi Kevin

    Now we have solved the problem – I found this blog post that solved it for me: nocentdocent.wordpress.com/.../troubleshooting-statechangeevent-grow

    The problem:

    On the 4th of December we had a problem with an ACS collector – that problem generated ~160,000 state changes between 03:00 AM and 04:00 AM.

    At that point we were unable to process all those state changes even when running the “StandardDatasetMaintenance” described in your blog here.  It ran for around 3 hours (the TempDB grew to ~150 GB while it was running) and then SQL returned the error I posted earlier.

    How we solved it:

    By reading the blog I posted at the beginning of this comment, I was able to find the ~160,000 rows that were generated on the 4th between 03:00 AM and 04:00 AM, and after that I was able to delete them manually (totally unsupported – but it worked).

    After I deleted all the rows, we were able to run the “StandardDatasetMaintenance” manually a few times and then enable the rule again – now everything is running smoothly again.

    Thanks to Vikram Sahay from Microsoft support – he really helped with this problem.

  • We have too much state data (852 days) and I am concerned that if we just manually run the StandardDatasetMaintenance, our transaction log or TempDB will fill up.  Is there any way to chunk up the grooming on our data warehouse?  I tried to change the grooming period in the OpsMgr UI to 750 days, so that when I run the StandardDatasetMaintenance it would groom out all data older than 750 days and then work backwards.  Unfortunately the UI doesn't allow me to set the grooming period longer than 60 days.

  • Hey Keith - shoot me an email contact via the blog with your email, and let's have a chat.  :-)

  • Thank you! You saved my day!!!

    Running SCOM 2007 R2 / SQL 2008 R2 Enterprise w/ CU5

  • I had the same issue due to the ACS collector.  I used Daniele's fix and it worked, but now the manual SQL run of the Standard dataset maintenance has been running for over 4 hrs!!  I am worried!

  • @Gene -

    I have seen the manual runs take over 24 hours.  Just let it go.  It all depends on how much we have to process, and how far behind we are.  You just keep running it, over and over, until it is caught up.  It is painful, but it is handled once it is done.

  • Thx Kevin.. will do!  It's at 5 hrs now and still going.  Will post again tomorrow morning and hope it's done.

  • Dear Kevin,

    the query has been running for over 24 hours now and it is filling the TempDB log, which has grown beyond 100 GB, and we are running out of space.  Can we stop the query?  Will that have any effect?

    We want to stop it to clear the TempDB space and then start the query again, but I am afraid of what the query did for the past 24 hours.

  • Dear Kevin,

    The query has been running for 40 GB; can we stop and restart the query?  Also, would you consider 5.5 GB of data in the ManagedEntityProperty table excessive?

  • Dear Kevin,

    My Data Warehouse database is pretty new, i.e. 60 days old and only partly in production, and I'm facing a similar issue.

    When I run this query, it has been going for more than 24 hrs and still hasn't completed....

    The database is hosted on a 3 node cluster with hex-core, 2-processor nodes and 48 GB RAM.

    I'm not sure if that is fine or something else is wrong.

    Regards

    Mahmood
