PLEASE WATCH: http://channel9.msdn.com/Events/MMS/2013/SD-B312

 

A lot of Service Manager environments (2010 or 2012) suffer from bad performance because of too many entity changes. Let me explain how this works:

To best explain this, let's take an out-of-the-box relationship: System.WorkItemAffectedUser

<RelationshipType id="System.WorkItemAffectedUser" accessibility="Public" abstract="false" base="System!System.Reference">
   <Source id="IsCreatedForUser" MinCardinality="0" MaxCardinality="2147483647" type="System.WorkItem" />
   <Target id="RequestedWorkItem" MinCardinality="0" MaxCardinality="1" type="System!System.User" />
</RelationshipType>

From the relationship definition above, we can see that the target type is System.User and MaxCardinality is 1 (more on cardinality: http://en.wikipedia.org/wiki/Cardinality), which means that the source System.WorkItem (for example an Incident) can have only ONE affected user through the System.WorkItemAffectedUser relationship type.
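If you want to confirm how a relationship type's cardinality is stored in the CMDB, you can look it up directly - a minimal sketch, using only the RelationshipType table and columns that the exclusion-list query later in this post uses as well:

SELECT
   RelationshipTypeName,
   TargetMaxCardinality
FROM RelationshipType WITH(NOLOCK)
WHERE RelationshipTypeName = 'System.WorkItemAffectedUser'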

In SM 2010, when a target object with a relationship gets updated, this is ALSO treated as an update to the source object, even with MaxCardinality = 1. In other words, for our example, when the User object in this relationship is updated, it is also considered an update to the Incident object itself (to which the user is related).

This meant it would trigger workflows/notifications subscribed to updates on Incidents. This is all good when you have one Incident linked to the user. BUT consider a scenario where several thousand Incidents are linked to that User: if that User somehow gets updated, we now have to track all these changes and check whether any workflows/notifications are subscribed to these updates.

So think about this - not only do we have constant updates on those tables, but also think about a notification subscription which should perform some workflow on Incident update ... if a User gets updated, all its related Incidents (even closed ones which have not been groomed out yet) get updated, which in turn fires up a workflow for each Incident update. And what if we have a LOT of notifications configured for Incident update actions? Then we get a LOT of workflows for each Incident update, all tied back to the initial update of the User object ... nice, huh? :D

The way these updates are tracked is like this:

  • each "directly" updated object (in our example the User) will get an entry logged for each change in the EntityChangeLog (ECL) table
  • each object related to the "directly" updated object (in our example the Incidents where this User is the affected user) gets an entry logged for each change in the RelatedEntityChangeLog (RECL) table

In situations where there was a high volume of changes, there was a lot of impact on the SM database (CMDB), and in turn other SM functionality and performance was impacted.

In SM 2012, this was changed by introducing an exclusion mechanism, so that we do not log all the related object changes in all situations. By excluding a relationship, we are essentially saying: do not track a change on the target object as a change on the source object, and do not log anything to the RelatedEntityChangeLog (RECL) table - which in turn no longer triggers any unnecessary workflows/subscriptions. All the out-of-the-box RelationshipTypes where the target object has MaxCardinality = 1 were excluded, to reduce the impact of the out-of-the-box relationship types.

However, this mechanism does not guard the system against a partner/custom-introduced RelationshipType, which could still result in a high volume of changes in the RelatedEntityChangeLog (RECL) table and severely impact the SM environment. The only reason not to exclude a relationship is when you actually want to treat an update on the target object as an update on the source object and trigger any workflows subscribed to the source object's changes.
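Conversely, if you want to see which RelationshipTypes are already ON the exclusion list, you can flip the logic - a sketch using only the same tables and columns as the next query:

SELECT
   RT.RelationshipTypeName,
   MP.MPName
FROM ExcludedRelatedEntityChangeLog AS EX WITH(NOLOCK)
INNER JOIN RelationshipType AS RT WITH(NOLOCK)
   ON EX.RelationshipTypeId = RT.RelationshipTypeId
INNER JOIN ManagementPack AS MP WITH(NOLOCK)
   ON RT.ManagementPackId = MP.ManagementPackId
ORDER BY MP.MPName, RT.RelationshipTypeName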

To get the list of RelationshipTypes with MaxCardinality = 1 which are NOT on this exclusion list (and could therefore cause this kind of impact), execute the following query on the CMDB:

 SELECT
   R.RelationshipTypeId,
   R.TargetManagedTypeId,
   R.RelationshipTypeName,
   R.TargetMaxCardinality,
   R.ManagementPackId,
   MP.MPName
FROM RelationshipType AS R WITH(NOLOCK)
INNER JOIN ManagementPack AS MP WITH(NOLOCK)
   ON R.ManagementPackId = MP.ManagementPackId
WHERE
   R.TargetMaxCardinality = 1 AND
   R.RelationshipTypeId NOT IN (
      SELECT RT.RelationshipTypeId
      FROM ExcludedRelatedEntityChangeLog AS EX WITH(NOLOCK)
      INNER JOIN RelationshipType AS RT WITH(NOLOCK)
         ON EX.RelationshipTypeId = RT.RelationshipTypeId
   )
ORDER BY MP.MPName, R.RelationshipTypeName

 

By the way, because this mechanism exists, some closed Incidents which should be groomed out are NOT groomed. This can happen like this:

           NOTE: this cannot happen in SM 2012 because of the exclusion list for this RelationshipType - but with the "right" customizations ... who knows?

  1. Incidents with a Closed status which are older than 90 days (the default) - meaning the LastModifiedDate in the database is older than 90 days - get deleted
  2. however, let's say we update a User object which is now the owner of many Incidents, including some which are closed
  3. what happens then (as described in the changes explanation above) is that all those Incidents also get updated (including the closed ones - so the LastModifiedDate field in the database is updated)
  4. this in turn means that some closed Incidents which should be deleted because they are really old will NEVER get deleted from the database, as long as the (ex.) owner User (or affected user, etc.) constantly gets updated - because the Incidents then also get updated and their LastModifiedDate field never gets older than 90 days ...
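To get a feel for whether this is happening in your environment, you could count the Incidents which were created more than 90 days ago but still show a recent modification date. This is a rough sketch with some assumptions on my part: it relies on the TimeAdded and LastModified columns of BaseManagedEntity and on the FullName starting with the type name, and it does not check for Closed status (which lives in the type-specific tables) - so treat the result only as an upper bound of candidates:

SELECT COUNT(1) AS 'OldButRecentlyTouched'
FROM BaseManagedEntity WITH(NOLOCK)
WHERE FullName LIKE 'System.WorkItem.Incident%'
   AND TimeAdded < DATEADD(dd, -90, GETUTCDATE())
   AND LastModified > DATEADD(dd, -90, GETUTCDATE())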

 

So let us continue with an example which will help you check whether you are impacted by this issue and how you could troubleshoot it.

1. To check whether you are impacted by this issue during the time you notice the bad performance, run this SQL query on the SQL Server hosting the CMDB:

 SELECT 'BME_Count', COUNT(1)
FROM BaseManagedEntity WITH(NOLOCK)
UNION SELECT 'TME_Count', COUNT(1)
FROM TypedManagedEntity WITH(NOLOCK)
UNION SELECT 'Rel_Count', COUNT(1)
FROM Relationship WITH(NOLOCK)
UNION SELECT 'ECL_Count', COUNT(1)
FROM EntityChangeLog WITH(NOLOCK)
UNION SELECT 'RECL_Count', COUNT(1)
FROM RelatedEntityChangeLog WITH(NOLOCK)
UNION SELECT 'ETL_Count', COUNT(1)
FROM EntityTransactionLog WITH(NOLOCK)

 

2. Now, in case you see any of those numbers being "too big" (this is relative to how big your environment is; usually anything over 12-15 million rows should at least be investigated), it would also be a good idea to check whether your workflows are behind (because of timeouts caused by performance issues, for example) or still running OK:

           NOTE: if you are using SM 2010, please delete this line from the following query:       AND W.IsPeriodicQueryEvent = 0

 SELECT 'SubETLId', MIN(W.State)
FROM CmdbInstanceSubscriptionState AS W WITH(NOLOCK)
JOIN Rules AS R
ON W.RuleId = R.RuleId
WHERE
   R.RuleEnabled <> 0
   AND W.IsPeriodicQueryEvent = 0
UNION SELECT 'ETLId', EntityTransactionLogId
FROM ImplicitUserRoleAdministratorState WITH(NOLOCK)
UNION SELECT 'MaxETLIdInECL', MAX(EntityTransactionLogId)
FROM EntityChangeLog WITH(NOLOCK)
UNION SELECT 'MaxETLIdInETL', MAX(EntityTransactionLogId)
FROM EntityTransactionLog WITH(NOLOCK)
UNION SELECT 'MaxETLIdInRECL', MAX(EntityTransactionLogId)
FROM RelatedEntityChangeLog WITH(NOLOCK)

The resulting 5 numbers should be very close to each other. If the first number (SubETLId) is significantly smaller than the others, that means your workflows are behind and possibly impacted by slow performance. The difference between the numbers is relative and depends on your environment - but a BIG difference clearly is an issue (maybe more than 20+ is worth investigating, but this really depends a lot on the environment and instance space, number of workflows, subscriptions, etc.).

 

3. Also, you could check whether the grooming jobs are impacted, i.e. whether they run successfully or not:

 SELECT
   InternalJobHistoryId,
   Command,
   TimeStarted
FROM InternalJobHistory WITH(NOLOCK)
WHERE
   TimeFinished IS NULL AND
   StatusCode <> 1

If you get ANY results here, it means those grooming jobs are failing for some reason, so it might be a good idea to check them out and see what is happening and why they are failing. You could then try to run them manually to see what error you get. To do that, just copy the SQL query you see in the Command field for an entry of a failed grooming job and execute it manually. One tip here: for most jobs, you will find two commas "," in the Command, one after another with a space between them - replace that space with the string N'' and execute the query like this.
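To illustrate that tip with a made-up example (the procedure name and parameter values here are purely hypothetical placeholders, NOT the actual grooming command - use whatever you actually find in the Command field):

-- Command as copied from InternalJobHistory (note the two commas with a space between them):
EXEC dbo.p_SomeGroomingProc @GroomingCriteria = N'<criteria>', , 2000

-- Edited for manual execution - the space between the two commas replaced with N'':
EXEC dbo.p_SomeGroomingProc @GroomingCriteria = N'<criteria>', N'', 2000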

Basically, the idea is that if you have many entries in the InternalJobHistory table which have TimeFinished = NULL *and* StatusCode = 0 *or* 2, then these jobs are failing in one of these situations:

  • for TimeFinished = NULL *and* StatusCode = 0 => this means the job had to process too many rows and could not do so within the 30-minute timeout interval
  • for TimeFinished = NULL *and* StatusCode = 2 => this means some other failure, which would definitely be logged in the Operations Manager event log on the SM Management Server which is running the workflows - it should be Event ID 10880

If there are no *actual* errors (TimeFinished = NULL *and* StatusCode = 2) and Grooming just can't keep up with the amount of data it has to delete, then you should definitely check this article out as well: http://blogs.technet.com/b/mihai/archive/2012/12/12/how-to-manually-execute-workitem-grooming-in-service-manager.aspx

 

4. Now, if the ECL and/or RECL numbers are "too big", then you should also run this query on the CMDB to check which entities (objects) contribute the most to the change log table:

 SELECT TOP 30 BME.FullName, COUNT(1)
FROM EntityChangeLog AS ECL WITH(NOLOCK)
JOIN BaseManagedEntity AS BME WITH(NOLOCK)
   ON ECL.EntityId = BME.BaseManagedEntityId
WHERE RelatedEntityId IS NULL
GROUP BY BME.FullName
ORDER BY COUNT(1) DESC

 

5. Now you should also check which Relationships contribute the most entries to the change log table:

SELECT TOP 30
   S.FullName,
   T.FullName,
   RT.RelationshipTypeName,
   COUNT(1)
FROM EntityChangeLog AS ECL WITH(NOLOCK)
JOIN BaseManagedEntity AS S WITH(NOLOCK)
   ON ECL.EntityId = S.BaseManagedEntityId
JOIN BaseManagedEntity AS T WITH(NOLOCK)
   ON ECL.RelatedEntityId = T.BaseManagedEntityId
JOIN RelationshipType AS RT WITH(NOLOCK)
   ON ECL.EntityTypeId = RT.RelationshipTypeId
WHERE RelatedEntityId IS NOT NULL
GROUP BY S.FullName, T.FullName, RT.RelationshipTypeName
ORDER BY COUNT(1) DESC

 

6. To check which rules are causing the other subscription rules to lag behind, run this query:

NOTE: if you are using SM 2010, please delete this line from the following query:       AND S.IsPeriodicQueryEvent = 0

 DECLARE
   @MaxState INT,
   @MaxStateDate DATETIME,
   @Delta INT,
   @Language NVARCHAR(3)
SET @Delta = 0
SET @Language = 'ENU'
SET @MaxState = (
   SELECT MAX(EntityTransactionLogId)
   FROM EntityChangeLog WITH(NOLOCK)
)
SET @MaxStateDate = (
   SELECT TimeAdded
   FROM EntityTransactionLog
   WHERE EntityTransactionLogId = @MaxState
)
SELECT
   R.RuleID,
   LT.LTValue AS 'Display Name',
   S.State AS 'Current Workflow Watermark',
   @MaxState AS 'Current Transaction Log Watermark',
   DATEDIFF(mi, (
      SELECT TimeAdded
      FROM EntityTransactionLog WITH(NOLOCK)
      WHERE EntityTransactionLogId = S.State
   ), @MaxStateDate) AS 'Minutes Behind',
   S.EventCount,
   S.LastNonZeroEventCount,
   R.RuleName AS 'MP Rule Name',
   MT.TypeName AS 'Source Class Name',
   S.LastModified AS 'Rule Last Modified',
   R.RuleEnabled AS 'Rule Enabled'
FROM CmdbInstanceSubscriptionState AS S WITH(NOLOCK)
LEFT OUTER JOIN Rules AS R
   ON S.RuleId = R.RuleId
LEFT OUTER JOIN ManagedType AS MT
   ON S.TypeId = MT.ManagedTypeId
LEFT OUTER JOIN LocalizedText AS LT
   ON R.RuleId = LT.MPElementId
WHERE
   S.State <= @MaxState - @Delta
   AND R.RuleEnabled <> 0
   AND LT.LTStringType = 1
   AND LT.LanguageCode = @Language
   AND S.IsPeriodicQueryEvent = 0
ORDER BY S.State ASC

 

7. At this point you can start your tuning and investigation. If one (or 2-3) of the results *really* stands out - for example there is a BIG difference in the change count compared to the other results in the lists - then you should start investigating and tuning those objects and/or relationships, using the guidance and (starting) queries which you will find here: http://blogs.technet.com/b/servicemanager/archive/2009/12/31/service-manager-database-tour-useful-queries.aspx

Also, if you see this rule anywhere in the results of the queries above - Incident_Desired_Configuration_Management_Custom_Rule.Update - or something related to "DCM", then also check whether you are impacted by this known issue or something similar (with a different Rule, maybe): http://blogs.technet.com/b/mihai/archive/2012/11/30/configuration-manager-connector-s-dcm-rule-can-cause-massive-performance-issues-in-service-manager.aspx

 

Let's take an example here. Let's say that in the list of results, the top 5 objects are these:

  • System.Domain.User:TestUser3 - 1.212.540
  • Microsoft.SystemCenter.Connector:InboxConnector.ff8da8335c464c9bbad2c712aee1420b - 676.416
  • System.Domain.User:TestUser14 - 312.614
  • System.Domain.User:TestUser53 - 265.837
  • System.Domain.User:TestUser34 - 228.801

So what we can see here is that we have a connector which generates twice as many changes as the next *top* object (a User in this case) - but if we look at the topmost object, we can see it generating about 5 times as many changes as the other objects - WOW! OK, so we *may* investigate the connector (the Exchange Connector in this case) - but it is not as "noisy" as the User object TestUser3, which generates 1.212.540 changes ...

So we first start by looking at our workflows that could be responsible for updating Users (most importantly this User - TestUser3). These can be Connectors (it is usually good/recommended to let connectors run during the night, not during business hours when consoles are open and in use), Subscriptions, custom Workflows, etc. => start tuning these workflows (e.g. run them less often) or disabling them if possible.

Next would be to check which objects this User (TestUser3) is related to - either "manually" or with queries from the article above. After you have found out which these are, check whether you can somehow get rid of the relationship (either delete the object if you do not need it, or assign another user here, etc.). For example, if you find out that the User is the owner of hundreds of Incidents, then delete (after closing) the unnecessary Incidents and try to assign the others to another user.
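For the "manual" route, one way to sketch this is to count the relationships pointing at the noisy object, grouped by relationship type. Note my assumptions here: this relies on the Relationship table's SourceEntityId/TargetEntityId columns and on the FullName format shown in the results above (adjust the filter to your own object), and it only looks at relationships where the noisy object is the target - swap in SourceEntityId to check the other direction:

SELECT
   RT.RelationshipTypeName,
   COUNT(1) AS 'RelCount'
FROM Relationship AS REL WITH(NOLOCK)
JOIN RelationshipType AS RT WITH(NOLOCK)
   ON REL.RelationshipTypeId = RT.RelationshipTypeId
JOIN BaseManagedEntity AS BME WITH(NOLOCK)
   ON REL.TargetEntityId = BME.BaseManagedEntityId
WHERE BME.FullName LIKE 'System.Domain.User:TestUser3%'
   AND REL.IsDeleted = 0
GROUP BY RT.RelationshipTypeName
ORDER BY COUNT(1) DESC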

 

8. After you have identified the Rule(s) which are lagging behind with the above queries, you might want to unblock the watermark and make the workflows skip the lagging Rules by running this SQL query (take the RuleId from the query results above and replace RULE_ID with it):

NOTE: more about this can be found here - it is *strongly* recommended to read this article: http://blogs.technet.com/b/servicemanager/archive/2013/01/14/troubleshooting-workflow-performance-and-delays.aspx

 DECLARE
   @MaxTransactionId INT,
   @RuleID UNIQUEIDENTIFIER
SET @RuleID = 'RULE_ID'
SET @MaxTransactionId = (
   SELECT MAX(EntityTransactionLogId)
   FROM EntityTransactionLog
)
UPDATE CmdbInstanceSubscriptionState
SET State = @MaxTransactionId
WHERE RuleId = @RuleID

 

Well, this being said - have fun tuning! ;)