Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Tweaking SCOM 2012 Management Servers for large environments


There are many articles on tweaking certain registry settings for SCOM agents, Gateways, and Management Servers, for many reasons: large deployments, custom 3rd party MPs, and monitoring Exchange 2010, to name a few.  Matt Goedtel has a good list on his blog.


Below, I’d like to post some settings that I change on Management Servers when monitoring large environments.  What does “large” mean?  Well, I’d characterize that as a management group with a significant agent count (>1000), or a very large instance space (lots of Management Packs deployed, both MS and 3rd party, plus custom MPs which don’t always behave well).  Perhaps you have a very large number of groups, or groups with complex expressions.  It could be that you are monitoring a large number of “agentless” items, such as Linux servers, Network Devices, or URLs.

These settings are very common, and I recommend them for all environments, with documented caveats below.


1.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        Persistence Checkpoint Depth Maximum = 104857600
SCOM 2012 default existing registry value = 20971520

All management servers that host a large number of agentless objects, which results in the MS running a large number of workflows (network/URL/Linux/3rd party/VEEAM):  This is an ESE DB setting which controls how often ESE writes to disk.  A larger value will decrease disk I/O caused by the SCOM healthservice, but increase ESE recovery time in the case of a healthservice crash.

2.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        State Queue Items = 20480
SCOM 2012 default existing registry value: not present.  Value must be created.  Default code value = 10240

All management servers in a large management group:  This sets the maximum size of the healthservice internal state queue.  It should be equal to or larger than the number of monitor-based workflows running in a healthservice.  Too small a value, or too many workflows, will cause state change loss.

3.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Value: 
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120
SCOM 2012 existing registry value:  not present (must create PoolManager key and both values).  Default code values = 120 and 30 seconds, respectively

All management servers that participate in any resource pools and run a large number of workflows:  This is VERY RARE to change, and in general I only recommend changing this under advisement from a support case.  The resource pools work quite well on their own, and I have worked with very large environments that did not need these to be modified.  This is more common when you are dealing with a rare condition, such as a management group spread across datacenters with high latency links, DR sites, a MASSIVE number of workflows running on management servers, etc.

4.  Key:     HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\
REG_DWORD Decimal Value:       GroupCalcPollingIntervalMilliseconds = 900000
SCOM 2012 existing registry value:  not present (must create value).  Default code value = 30000 (30 seconds)

All management servers that participate in the All Management Servers resource pool, that have a large agent count or large number of groups:  This setting will slow down how often group calculation runs to find changes in group memberships.  Group calculation can be very expensive, especially with a large number of groups, large agent count, or complex group membership expressions.  Slowing this down will help keep groupcalc from consuming all the healthservice and database I/O.

5.  Key:    HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Command Timeout Seconds = 1200
SCOM 2012 existing registry value: not present (must create "Data Warehouse" key and value)  Default in code value = 300

All management servers in a management group:  This helps with dataset maintenance, as the default timeout is often too short.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance.  This is a very common issue.

6.  Key:    HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Deployment Command Timeout Seconds = 86400
SCOM 2012 existing registry value: not present (must create "Data Warehouse" key and value)  Default in code value = 10800 seconds (3 hours)

All management servers in a management group:  This helps with deployment of heavy-handed scripts that are applied during version upgrades and cumulative updates.  Customers often see blocking on the DW database while creating indexes, and this prevents the script from being deployed within the default of 3 hours.  Setting this value to allow one full day to deploy the script resolves most customer issues.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance after a version upgrade or UR deployment.  This is a very common issue in large environments with very large warehouse databases.


7.  Key:    HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
    DALInitiateClearPool = 1
    DALInitiateClearPoolSeconds = 60
SCOM 2012 existing registry value:   not present - code default - 30 seconds?

All management servers in ANY management group:  This setting configures the SDK service to attempt a reconnection to SQL server upon disconnection, on a regular basis.  Without these settings, an extended SQL outage can cause a management server to never reconnect to SQL when SQL comes back online.  All management servers in a management group should get the registry change.
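To put setting 4 in perspective, here is a rough calculation of how often group calculation would poll per day at the default interval versus the recommended one. This is only a sketch: it assumes polling fires exactly once per interval around the clock, and the function name is made up for illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def polls_per_day(interval_ms):
    """Whole polling cycles that fit in a day (assumes one poll per interval)."""
    return MS_PER_DAY // interval_ms


default_polls = polls_per_day(30000)    # code default, 30 seconds
tweaked_polls = polls_per_day(900000)   # recommended value, 15 minutes

# Polls per day at each interval, and the reduction factor between them.
print(default_polls, tweaked_polls, default_polls // tweaked_polls)
```

Under that assumption, the change takes groupcalc polling from 2880 cycles per day down to 96, a 30x reduction.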


To summarize:

Registry Key
    Reg DWORD Value Name = Reg DWORD Decimal Value

HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
    Persistence Checkpoint Depth Maximum = 104857600
    State Queue Items = 20480

HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120

HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\
    GroupCalcPollingIntervalMilliseconds = 900000

HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
    Command Timeout Seconds = 1200
    Deployment Command Timeout Seconds = 86400

HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
    DALInitiateClearPool = 1
    DALInitiateClearPoolSeconds = 60
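The raw decimal values are easier to sanity-check once converted to friendlier units. Below is a small sketch that does the arithmetic. The unit attached to each value is my reading of its name and description (in particular, treating the checkpoint depth as bytes is an assumption), and the helper name is made up for illustration.

```python
# Recommended values from the summary, paired with the unit assumed for each.
SETTINGS = {
    "Persistence Checkpoint Depth Maximum": (104857600, "bytes"),  # assumed bytes
    "GroupCalcPollingIntervalMilliseconds": (900000, "ms"),
    "Command Timeout Seconds": (1200, "s"),
    "Deployment Command Timeout Seconds": (86400, "s"),
    "DALInitiateClearPoolSeconds": (60, "s"),
}


def humanize(value, unit):
    """Render a raw registry value in a friendlier unit ("bytes", "ms" or "s").

    Uses integer division, so it assumes whole MiB and whole seconds,
    which holds for every value listed above.
    """
    if unit == "bytes":
        return f"{value // (1024 * 1024)} MiB"
    seconds = value // 1000 if unit == "ms" else value
    if seconds % 3600 == 0:
        return f"{seconds // 3600} h"
    if seconds % 60 == 0:
        return f"{seconds // 60} min"
    return f"{seconds} s"


for name, (value, unit) in SETTINGS.items():
    print(f"{name}: {value} = {humanize(value, unit)}")
```

With those assumptions, the values work out to 100 MiB, 15 min, 20 min, 24 h, and 1 min respectively.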



On modifying the following:

REG_DWORD Decimal Value: 
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120

This should NOT be done unless you are guided to by Microsoft support, generally speaking.  If you make changes to this setting, the same change must be made on ALL management servers, otherwise the resource pools will constantly fail.  All management servers must have identical settings here.  If you add a management server in the future, this setting must be applied immediately if you modified it on other management servers, or you will see your resource pools constantly committing suicide and failing over to other management servers, reinitializing all workflows in a loop.   All the other settings in this article are generally beneficial.  This specific one for PoolManager should receive great scrutiny before changing, due to the risks.
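Since identical PoolManager values on every management server are the critical requirement here, a simple consistency check can help catch a server that was added later with defaults. The sketch below only does the comparison. How you collect the values from each server is up to you, and the server names and function name are hypothetical.

```python
def find_pool_mismatches(servers):
    """Return {value_name: {server: value}} for every value that differs.

    `servers` maps a server name to the PoolManager values found on it.
    A value that is absent on a server is recorded as None (meaning the
    code default applies) and counts as different from any explicit value.
    """
    names = set()
    for values in servers.values():
        names.update(values)

    mismatches = {}
    for name in sorted(names):
        seen = {server: values.get(name) for server, values in servers.items()}
        if len(set(seen.values())) > 1:
            mismatches[name] = seen
    return mismatches


# Hypothetical inventory: MS03 was installed later and never got the values.
servers = {
    "MS01": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS02": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS03": {},
}

for name, seen in find_pool_mismatches(servers).items():
    print(name, seen)
```

In this made-up inventory both values are reported, because MS03 is still running on defaults while the other two servers are not.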



Below are some simple reg add statements you can run to make setting these easy:

reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "State Queue Items" /t REG_DWORD /d 20480 /f
reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Persistence Checkpoint Depth Maximum" /t REG_DWORD /d 104857600 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 900000 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Command Timeout Seconds" /t REG_DWORD /d 1200 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Deployment Command Timeout Seconds" /t REG_DWORD /d 86400 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPool" /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPoolSeconds" /t REG_DWORD /d 60 /f
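If you keep these tweaks in a script or runbook, the same statements can be generated from a small table rather than maintained by hand. This is a minimal sketch, not a deployment tool: it only builds the command text, and the helper name and DWORD range check are my own additions.

```python
# (key, value name, decimal value) for each reg add statement above.
# The PoolManager values are deliberately left out, per the caveat in this article.
TWEAKS = [
    (r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters", "State Queue Items", 20480),
    (r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters", "Persistence Checkpoint Depth Maximum", 104857600),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0", "GroupCalcPollingIntervalMilliseconds", 900000),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse", "Command Timeout Seconds", 1200),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse", "Deployment Command Timeout Seconds", 86400),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL", "DALInitiateClearPool", 1),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL", "DALInitiateClearPoolSeconds", 60),
]

DWORD_MAX = 0xFFFFFFFF  # largest value a REG_DWORD can hold


def reg_add_command(key, name, value):
    """Build one reg add statement; reject values that cannot fit in a REG_DWORD."""
    if not 0 <= value <= DWORD_MAX:
        raise ValueError(f"{name}: {value} does not fit in a REG_DWORD")
    return f'reg add "{key}" /v "{name}" /t REG_DWORD /d {value} /f'


for tweak in TWEAKS:
    print(reg_add_command(*tweak))
```

Keeping the values in one table makes it easier to review what will be applied before anything touches the registry.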

  • Hi, Kevin,
    I have seen a lot of blogs that recommend against setting the PoolManager key. They indicate that this was required when SCOM 2012 was first released but has since been fixed - implementing it now can actually degrade performance. Can you confirm this is still required for large environments?

  • Hi Diane -

    We had an issue in SCOM 2012 RC - where this was recommended as the fix.... then once RTM shipped this was no longer needed to be adjusted. Then the blogs started posting NEVER to change this and it was not recommended.... which is absolutely true for 98% of the deployments out there.

    In general, you should never change these settings unless advised to by Microsoft support, or without fully understanding the ramifications. However, this IS a valid setting, and IS recommended in VERY SPECIFIC cases where the default settings are not long enough and the result is resource pool suicide. If you aren't experiencing this problem, then in general it doesn't need to be changed.

    You can have the same conversation about the default observer, which is the database. In large environments, it is possible the default observer will be very slow due to I/O load, and there are specific scenarios where it makes sense to remove the default observer and let the management servers make the decisions for resource pool quorum. HOWEVER, it should not be removed (generally speaking) unless the customer experiences this specific issue which can only be determined via tracelogs. There are tradeoffs to making this process longer. The primary tradeoff is this increases the chances of duplication of workflows on different management servers, and potentially longer recovery times in the event of a real outage. Another tradeoff, is that ALL management servers MUST have the same settings. If any MS gets installed with the default settings, you will have constant resource pool flapping because the communication expectations are different across MS's.

    So no - I don't recommend making changes to the pool manager registry, unless you have a large environment, and you are experiencing resource pool failure far too frequently. And in those cases, we should examine the default observer behavior as well. But saying "never" change it? I disagree.

  • Hey Kevin,

    Good stuff, as always! :)

    It would be of great value if, for each of the above registry settings, associated monitoring instrumentation could be identified in order to help administrators determine whether the corresponding registry setting update should be considered for their environment.

    Examples might include:
    1) evaluating a particular PerfMon counter against a specific threshold, or
    2) the presence of specific event log entries

    One might even wonder if/why this is not already included as monitors in the OpsMgr based MPs...

  • Very helpful, good article. I wonder why Microsoft does not publish an article with those registry settings.

  • Kevin,

    Is there a difference between the HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
    "Command Timeout Seconds"
    and "Deployment Command Timeout Seconds" values? An MS Engineer during an incident advised creating the second value. Do these values conflict with each other or are they complementary? We have the "Deployment Command Timeout Seconds" set to 86400 (1 day). At that point we were having problems upgrading from 2012 SP1 to R2. Thanks. Ted.

  • @ Ted T Hacker -

    Great question. Yes, there is. "Command Timeout Seconds" has to do with regular stored procedure calls from a SCOM workflow to the DW, such as maintenance operations/aggregations. "Deployment Command Timeout Seconds" is different - this value has to do with scripts that are called during a major update, such as a version update, service pack, or update rollup. Changing the latter is more rare, however I have seen issues reported where these scripts got caught up in blocking, and took a LONG time to complete, so rather than fail due to a timeout - we had the customer set a very long time to get them to complete. It isn't a common occurrence and generally I'd only change that one under advisement from support, like you did. All good.

  • Is the "Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange" workflow affected by the "Command Timeout Seconds" value? How can I tell if the workflow is timing out based on the registry value or is running to completion. Should the workflow always write an event (31572?) whenever it finished without issue? The monitor for this named "Data Warehouse Object Health State Data Collection Writer Periodic Data Maintenance Recovery State" has an override for the "Interval". Does changing the monitor override only affect how long the monitor waits for a 31572 before alerting?

    What is the downside to setting the "Command Timeout Seconds" to a longer time frame? I suppose at some value you just want to be alerted that it is taking a long time to process lots of data. I assume you don't want to mask the fact that there may be a monitor state change or collection rule going nuts.

  • SCOM 2007 recommended not having more than 500 groups. Does SCOM 2012 have a recommended limit?

  • Brett - where did you get this 500 group limitation? The product group tested up to 1000 groups when performance testing SCOM 2007. It was recommended not to go over 1000 groups simply because we didn't test beyond that. However, using groups that don't rollup health state, and using simple group memberships heavily affected this scalability concern. Now that SCOM 2012 has a distributed model for config and group population, I have not heard any limitations such as this, nor have I heard what we test up to, I'd assume likely the same 1000 groups for testing. However, I have customers beyond this and they don't have any issues with group population.

  • UR5 introduces a new registry value: Bulk Insert Command Timeout. Do you have any guidance around using this value as well?

  • @Jesse - Actually - Bulk Insert Command Timeout was a new registry control available with UR1. It wasn't added in UR5. I don't have a recommendation for adjusting this - it simply opened the capability to adjust this if needed. I only recommend changing that one if directed to by Microsoft Support to resolve a problem with bulk inserts to the warehouse, which is a rare condition. I have never worked with a customer who needed this modified from the default.

  • You mention that these changes are recommended on management servers but make no mention if they are required on a gateway server. What is Microsoft's stance on tweaking registry settings on them?

  • @JT - you are correct, I do not dictate any changes to gateways because to date, I have not experienced any changes that are proactively needed on the gateway role, regardless of management group size. They seem to handle things quite well out of the box.

  • Thanks for the quick response. I didn't think so either but I wanted to be 100% certain. Thanks again for taking the time to respond to everyone's questions and keep this site updated. It is appreciated!
