One of the longest-standing asks on the Service Manager Team has been to improve the performance of the Active Directory (AD) Connector. The AD connector is the primary source for bringing in data about users, groups, printers, and computers as configuration items (CIs) from Active Directory Domain Services (AD DS) into the Service Manager database (CMDB). As you can imagine, proper functioning of the AD Connector is critical to many of the scenarios in Service Manager. It is also one of the most stressed components in Service Manager due to the volume of data synchronized, especially in large environments with 50,000+ users and/or computers.
This post will discuss a workaround which can help the AD connector scale and perform better, albeit with some tradeoffs which are discussed in detail below. Read on if you’re still interested :)
At a very high level the AD Connector works by doing a full sync of data from the domain or OU chosen by the administrator on its first run. Subsequent runs typically sync only the new information or changes made to the domain or OU. The connector first connects to AD DS using the LDAP protocol and then pulls in objects to the staging tables in the Service Manager database. Once the objects are staged, the connector then processes and transforms these objects and inserts them into the regular Service Manager tables. This information is then consumed by the Console, Workflows and various other solutions and sub-systems in SM. For more information on the AD connector click here.
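The sync flow described above can be sketched as a toy model. This is only an illustration of the two phases (stage, then transform) and of full versus delta sync. The class name, the `version` change marker, and the dictionary shapes are all assumptions made for the example, not how the connector is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyConnector:
    """Toy model of the two-phase sync: stage objects, then transform them."""

    staging: list = field(default_factory=list)  # stands in for the staging tables
    cmdb: dict = field(default_factory=dict)     # stands in for the regular tables
    last_seen: int = 0                           # hypothetical change marker

    def sync(self, directory):
        # Phase 1: pull objects changed since the last run into staging.
        # On the first run last_seen is 0, so everything is pulled (full sync).
        self.staging = [o for o in directory if o["version"] > self.last_seen]
        # Phase 2: transform staged objects and insert them into the CMDB.
        for obj in self.staging:
            self.cmdb[obj["dn"]] = {"name": obj["name"].strip()}
            self.last_seen = max(self.last_seen, obj["version"])
        processed = len(self.staging)
        self.staging = []
        return processed


directory = [
    {"dn": "CN=alice", "name": "Alice ", "version": 1},
    {"dn": "CN=bob", "name": "Bob", "version": 2},
]
connector = ToyConnector()
first_run = connector.sync(directory)   # first run: full sync of both objects
directory.append({"dn": "CN=carol", "name": "Carol", "version": 3})
second_run = connector.sync(directory)  # later run: only the new object
```

The second call processes far fewer objects than the first, which is why the cost of a sync is dominated by the initial run and by any per-object work done during processing.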
The design described above works well when operating within the supported limits on users and computer objects as published here. In truth, however, many Service Manager deployments operate far beyond the supported configuration, with a few at 5-7x the recommended limits. The large volume of data in such environments has a significant performance impact on the implied permissions security model in SM – a component which is called repeatedly during any AD connector sync. In our investigations we saw that the time spent in calculating these permissions accounts for a large percentage of the total sync time – so naturally we decided to take a deeper look.
The implied permissions security model in SM is responsible for handling the implicit mapping of user objects in the CMDB to the permissions which they may have on other objects in the system. For example, Service Manager implicitly allows all users to read and edit Incidents where they are the Affected User, or read and edit the review activities where they are the Assigned Reviewer. Similarly, all users are also implicitly granted the necessary permissions to edit the email address, time-zone, and locale information associated with themselves in the CMDB. To achieve this implicit mapping of permissions, Service Manager depends on Windows Authorization Manager (AzMan) to provide a framework for Role Based Access Control (RBAC).
So, during an AD Connector sync, for each user object brought in from AD, Service Manager makes multiple calls to AzMan to build the necessary ImpliedUserPreference permissions (see Table 1 below). These calls, however, can be fairly expensive because of scaling limitations inherent in AzMan, which, per our investigation, frequently queues up calls for extended periods of time. Collectively, these round trips to AzMan are the primary reason for the sluggish performance of the AD connector.
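To see why per-user round trips add up, here is a back-of-the-envelope sketch. It assumes the calls are effectively serialized. The 50,000 user count echoes the environment size mentioned earlier, while the number of calls per user and the time per call are hypothetical values picked only to show the arithmetic, not measurements.

```python
def azman_sync_seconds(users, calls_per_user, seconds_per_call):
    """Estimate the time a sync spends in serialized AzMan calls."""
    return users * calls_per_user * seconds_per_call


# calls_per_user and seconds_per_call are hypothetical, not measured values.
estimate = azman_sync_seconds(users=50_000, calls_per_user=3, seconds_per_call=0.05)
hours = estimate / 3600
print(f"{estimate:.0f} seconds, about {hours:.1f} hours")
```

Because the cost grows linearly with the number of users, any queuing delay per call is multiplied across the whole import.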
We had two options. One was to remove our dependency on AzMan entirely and rewire our security model; the other was to analyze AzMan interactions in the product code and opportunistically look for ways to improve performance. Because rewriting the security model is a much larger and far more invasive approach, we decided to explore the latter first. The remainder of this post will discuss one workaround which has shown promising results both internally and with a few large-scale customer deployments.
As discussed above, Service Manager provides an out of the box “implied permissions” security model. This system implicitly allows users that are related to objects by certain relationships to have the appropriate permissions on those objects. Let’s look at a few detailed examples of such permissions:
• A user related to a configuration item by the System.ConfigItemOwnedByUser (custodian) relationship is automatically granted read permission on that configuration item object and everything it contains so long as they are related to that configuration item
• A user related to an incident by the System.WorkItemAffectedUser relationship is automatically granted view and edit permissions on that work item and everything it contains
• A reviewer related to a review activity by the ImpliedReviewer relationship is automatically granted permission to vote on a review activity assigned to them
In all, there are six such implied permission relationship types. Each of these implied permissions is implemented using a workflow. The table below provides a brief description of each.
Table 1: Implied permission relationship types
• Grants users permission to view/edit those incidents for which they are the affected user
• Grants users the ability to view any CI for which they are the custodian
• Grants users the ability to view any computer for which they are the primary computer user
• Grants users the ability to vote on review activities for which they are a reviewer. This applies only to the individual user and not to reviewer groups of which the user is a member
• Grants users permission to view/edit the notification address and locale information associated with themselves
• Grants users permission to view/edit any manual activity for which they are the assigned user
To read more on implied permissions and user roles click here.
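The relationship-based examples above can be pictured with a small sketch. The two relationship names are taken from the examples in this post. The function, the permission labels, and the sample user and object IDs are assumptions made for illustration and are not Service Manager APIs.

```python
# Relationship type -> permissions implied for the related user.
# The permission labels are simplified stand-ins, not product identifiers.
IMPLIED = {
    "System.ConfigItemOwnedByUser": {"read"},
    "System.WorkItemAffectedUser": {"read", "edit"},
}


def implied_permissions(relationships, user, obj):
    """Return the permissions `user` holds on `obj` via current relationships."""
    granted = set()
    for rel_user, rel_type, rel_obj in relationships:
        if rel_user == user and rel_obj == obj:
            granted |= IMPLIED.get(rel_type, set())
    return granted


# Hypothetical sample data: (user, relationship type, object).
relationships = [
    ("alice", "System.WorkItemAffectedUser", "IR1001"),
    ("alice", "System.ConfigItemOwnedByUser", "PC-042"),
]
on_incident = implied_permissions(relationships, "alice", "IR1001")
on_computer = implied_permissions(relationships, "alice", "PC-042")
other_user = implied_permissions(relationships, "bob", "IR1001")
```

The permissions follow the relationship: remove the tuple and the access disappears, which matches the "so long as they are related" wording above. It also shows why the mapping has to be recalculated whenever users or relationships change.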
In the case of AD Connector the ImpliedUserPreference permission typically requires multiple round-trips to AzMan for each imported user. At scale this process takes a huge toll on performance of the system as a whole. So one clear way to improve performance would then be to just disable this permission. However, there is no easy way to disable just one permission, while keeping the others active.
So the other option is to disable implied permissions entirely and instead explicitly grant users access to objects they need via the use of user roles. Let’s look at this option and the alternate permissions model in more detail.
What do you do to compensate?
• Put NT Authority\Authenticated Users in the Incident Resolver* user role.
• Create a group in AD of people that are responsible for approving review activities. Put that AD user group in the Advanced Operator* user role.
• The ImpliedUserPreference permission is not necessary unless end users are somehow given a user experience for changing these settings. Service Manager does not provide a user experience for this out of the box.
• Create a group in AD of people that are responsible for implementing manual activities. Put that AD user group in the Activity Implementers* user role.
Table 2. *For a detailed description of the user roles, see Table 3 below.
Table 3 below lists the access privileges provided out of the box with the user roles discussed in Table 2 above.
User Role Profile Name: User Role Profile Description
• Incident Resolver: Incident Resolvers can edit and create incidents, problems, and manual activities that are in their queue scope. Incident Resolvers also have read-only access to other work items such as change requests that are in their queue scope and to configuration items that are in their group scope.
• Advanced Operator: Advanced Operators can create or edit any work items that are in their queue scope and any configuration items that are in their group scope. They can also create, edit, and delete the announcements that are displayed on the Service Manager Self-Service portal.
• Activity Implementers: Activity Implementers can edit only manual activities that are in their queue scope. They have read-only access to other work items that are in their queue scope and to configuration items that are in their group scope.
Note 1: Disabling implied permissions considerably loosens the security model. Please study the tables above carefully to decide whether this workaround works for you. It also mandates fine-grained control over the memberships of a couple of user groups.
Note 2: This workaround is designed to provide relief to customers operating at volumes significantly larger than the supported configurations. Please evaluate the security policies, processes, and best practices in place at your organization before applying this workaround.
After implementing the workaround you will see the following behavior.
It is highly recommended that you first implement this workaround in a lab environment, and not directly in production. Once it is implemented, validate that common scenarios work as expected. Additionally, test that granting users permissions via user roles is a workable option for you.
Once satisfied with the results in your test/lab environment, take a snapshot/backup of the production environment before proceeding further. This is especially important on mission-critical setups of the system. Also plan for some downtime (roughly 1-2 hours). Ensure that you read and understand the steps to perform a disaster recovery here.
Import the attached DisableImpliedPermissionsRuleMP.xml MP to override the implied permissions workflow. Importing this MP disables the implied permissions workflow immediately. As always, once you import the MP, please restart the SM services. Next, follow the verification steps below to make sure that the MP was imported successfully and that the workflow is disabled.
First, ensure that the MP was imported properly and is listed as an installed management pack in the console.
Next, to validate that the workflow has been disabled perform the following steps:
1. Ensure that you have at least one user in the “End Users” user group
2. Log into the SM Portal as an end user. You can use the command below to start Internet Explorer in the end user context: runas.exe /u:<DOMAIN\USERNAME> "C:\Program Files\Internet Explorer\iexplore.exe"
3. Create a generic incident from the portal and submit the ticket - by default the end user should be added as the “Affected User” on the ticket – you can confirm that by opening the ticket in the SM Console and checking the “Affected User” field.
4. Next, navigate to the “My Requests” view in the portal. You should not be able to see the ticket you just created, because the workflow which calculates the “ImpliedIncidentAffectedUser” permission is now disabled
5. Switch back to the SM console and add the end user from step 2 to the “Incident Resolver” user role
6. Refresh the “My Requests” view in portal and the ticket you created as the end user should now be visible again.
7. Ensure that you are only seeing the tickets which were created by you and not all tickets. If the ticket is not immediately visible, wait for a few minutes and refresh the portal.
Product upgrade is not impacted by the workaround. We validated this by upgrading our 2012 SP1 lab to 2012 R2 with the workaround installed. There were no upgrade problems. Additionally, the MP with the override was still in place and the implied permissions workflow was still disabled.
If for any reason you want to roll back this workaround, you will have to delete the imported MP and also roll back any explicit permissions provided to your users. Soon after the MP is deleted, the workflow will kick in again and will start processing from the last watermark. Depending on how old the watermark is, and how much work the workflow has to catch up on, the system might seem unresponsive and sluggish for an extended period of time. You can use the following queries to find out how many changes the workflow will have to catch up on when enabled:
select EntityTransactionLogID as Watermark from [dbo].[ImplicitUserRoleAdministratorState]
select max(entitytransactionlogID) as LatestChange from EntityChangeLog
The first query gives you the last change which was processed by the implied permissions workflow, and the second one gives you the ID of the last change in the EntityChangeLog table. The difference between these two values is a good indicator of the volume of changes the workflow might have to churn through when re-enabled.
Note: If a considerable amount of time has passed since you implemented the workaround, or you don’t want the performance of your system to be impacted, our recommendation is that you delete the watermark from SQL by running the following command before deleting the MP.
delete from [dbo].[ImplicitUserRoleAdministratorState]
This will reset the watermark and once the workflow is started back again, it will use the most recent ID as the new watermark.
Note: If you reset the watermark and roll back the MP, all the entries which were made since the workaround was implemented will not be processed, and thus some users might not have access to perform the related actions. To resolve this, you will have to delete and re-import the affected users from AD.
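The comparison of the two query results described above can be written as a small helper. This is only a sketch: the function name and the sample IDs are hypothetical, and you would feed it the values returned by the Watermark and LatestChange queries yourself.

```python
def pending_changes(watermark, latest_change):
    """Number of EntityChangeLog entries the workflow still has to process.

    watermark: value returned by the Watermark query, or None if the state
               table is empty (for example after the watermark was deleted).
    latest_change: value returned by the LatestChange query.
    """
    if watermark is None:
        # With no watermark, the workflow adopts the most recent ID as its
        # new watermark, so there is nothing to catch up on.
        return 0
    return max(latest_change - watermark, 0)


# Hypothetical IDs, used only to show the calculation.
backlog = pending_changes(watermark=1_200_000, latest_change=1_950_000)
after_reset = pending_changes(watermark=None, latest_change=1_950_000)
print(backlog, after_reset)
```

A large backlog suggests resetting the watermark before deleting the MP, at the cost of having to re-import the users whose changes were skipped.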
Based on our testing, this workaround goes a long way in alleviating AD connector performance concerns by bypassing AzMan. However, as mentioned earlier, this workaround has a significant impact on how the system works. It requires a fair bit of planning and analysis to understand how the business processes in place might be impacted by disabling the implied permissions calculation. It also requires a good understanding of the various implicit permission types and the alternate ways to explicitly grant the required permissions. We recommend that you test this thoroughly across your scenarios. If you have questions, reach us directly on this blog post. That’s the most efficient way to get our input on this topic.
This blog post would not have been possible without the contributions and many hours of investigation by Manoj Parvathaneni, Jay Pathak, and Mihai Sarbulescu. Special thanks to a couple of our customers who were patient enough to work with us on this investigation.
Please share your thoughts, views, and questions on this post in the comments section below.
Beautiful! Thank you so much guys :-)
Nice post guys
Thanks for the post. Just to be clear, do you mean the built-in User Roles already there, like the Incident Resolver. Or is it enough to make a new User Role based on the Incident Resolver. Otherwise it would be hard to make any kind of fine-grained control with the authenticated users in the out-of-the-box role
If I create a new user role for end users based off of the Advanced Operator Role, this should negate the need to add them to the Incident Resolver, and Activity Implementer roles correct?
Wouldn't changing the AD connector synch schedule to off peak hours like the weekend also address this issue?