Here is the next entry in the posts I’m migrating over. Again, this is intended to be a “snapshot in time” supplement to the Implementing System Center Operations Manager 2007 at Microsoft white paper.
The Root Management Server (RMS)
So what exactly does the RMS do?
The RMS server, by definition, is the first management server installed in a management group. The RMS is differentiated from other management servers (MS) by two distinct services and a host of distinct workflows that run as a part of the health service on the RMS.
The “SDK Service” (OMSDK): When I hear the term “SDK” I typically think of code libraries that I can write custom code against. With OpsMgr 2007 the SDK is really two things: 1) A software development kit - http://msdn2.microsoft.com/en-us/library/bb437575.aspx 2) A service running on the RMS, which is the single point of access for SDK connectivity. It is the latter part of that definition that is most relevant when thinking about deployment.
The “Config Service” (OMCFG): In MOM 2005 the centralized configuration store for the management group was the OpsDB. In turn, each MOM 2005 management server queried the OpsDB directly to get its configuration. This was a fairly costly process that placed constant resource overhead on the DB server, which was already busy enough processing operational data. The OpsDB is still the central store of configuration in OpsMgr 2007, but the RMS has taken on the role of being the single point of access for that configuration data from the DB via the Config Service. All other systems in the management group get their configuration (directly or indirectly) from the RMS.
Workflows under the “Health Service”: A number of distinct workflows, assigned to the RMS by rules in several of the out-of-the-box MPs, are run exclusively by the HealthService on the RMS. Examples of these workflows include “AD assignment rules”, “Notifications”, “Health Watcher Instances” and the “OpsDB partitioning and grooming processes”. In effect, many things that in MOM 2005 were scripts, SQL jobs or functionality written directly into product code now run as rules on the RMS.
The RMS Platform
Now that you have a basic idea of the role an RMS plays in a management group let’s talk a bit about how Microsoft IT deployed this role. Given all the distinct functions the RMS serves, and the scale of IT’s management groups, they opted for the same server platform as their Operational Database (OpsDB) servers:
o Server Model: HP ProLiant DL385 G1
o Processors: 2 x dual core (4 procs in the OS’ eyes) 2.2 GHz AMD Opteron Processors
o RAM: 8 GB
o Drives: 2 SAN drives; one for the cluster quorum and the other for storing the various OpsMgr 2007 service state directories that are shared between nodes of the RMS.
o Quorum drive: 2 GB RAID 5; nothing fancy here, less than 20 MB is actually in use.
o State drive: 10 GB RAID 0+1; typically less than 3 GB of actual data on this drive, but I/O is high at scale.
o OS: Windows Server 2003 Enterprise x64 Edition with SP1
Using that platform Microsoft IT has seen the average RMS at 38.6% CPU utilization and memory paging of ~86 pages/sec. The state drive is quite busy sustaining an average of ~1200 transfers/sec and an average data rate of 14.89MB per second. In both cases ~95% of the drive activity is writes.
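As a rough back-of-the-envelope check, the state drive figures above imply an average transfer size and a write rate. The short sketch below only does that arithmetic; it assumes 1 MB means 1024 x 1024 bytes (the numbers above don't say), and the variable names are mine.

```python
# Figures quoted above for the RMS state drive.
transfers_per_sec = 1200        # average disk transfers/sec
data_rate_mb_per_sec = 14.89    # average data rate in MB/sec
write_fraction = 0.95           # ~95% of the activity is writes

# Assumption: 1 MB = 1024 * 1024 bytes; the post does not specify.
BYTES_PER_MB = 1024 * 1024

# Average size of a single transfer, in KB.
avg_transfer_kb = data_rate_mb_per_sec * BYTES_PER_MB / transfers_per_sec / 1024

# Portion of the load that is writes.
writes_per_sec = transfers_per_sec * write_fraction
write_mb_per_sec = data_rate_mb_per_sec * write_fraction

print(f"average transfer size: {avg_transfer_kb:.1f} KB")
print(f"writes/sec: {writes_per_sec:.0f}")
print(f"write throughput: {write_mb_per_sec:.1f} MB/sec")
```

That works out to roughly 12.7 KB per transfer and about 1,140 writes/sec, which is why the state drive sits on RAID 0+1 rather than RAID 5: the workload is dominated by a steady stream of fairly small writes.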
If you take resource utilization down to the level of the OpsMgr 2007 specific processes that are running on the RMS, the top consumer in IT’s deployments is the config service (Microsoft.MOM.ConfigServiceHost.exe), followed by the Health Service (HealthService.exe) and then the SDK service (Microsoft.MOM.Sdk.ServiceHost.exe) and Monitoring Host (MonitoringHost.exe) processes. The following table shows the average “% Processor Time” and “Private Bytes” for the relevant processes on the RMS:
[Table: average “% Processor Time” and “Private Bytes” per process; values not available]
Given that the RMS is so vital to the functionality of a management group, IT planned from the earliest design phases to invest in high availability (HA) for this role. With that in mind, IT worked with the OpsMgr product group early on to test the setup and use of clustered RMSs. A clustered RMS consists of a resource group containing a network name, a dedicated IP, three shared services (HealthService, Config Service, SDK Service) and a shared drive for holding the central state files used by the shared services (referred to above as the state drive). With clustering of the RMS configured, automated failover can occur during an unexpected outage, and planned failovers can be performed during system upgrades or maintenance work. Microsoft IT’s experiences to date with RMS clustering have been very positive, but the key takeaway from the deployment of both the RMS and the OpsDB is that the monitoring team has built up its knowledge around configuring and working with clusters and clustered resources. The setup process is well documented in the OpsMgr 2007 deployment guide.
As with the OpsDB, IT relies on a different approach for business continuance/disaster recovery (BC/DR) than it does for HA. To achieve this, Microsoft IT deployed an additional management server in each management group, whose sole purpose is to await the day that a disaster occurs. Here are the general steps that were taken to set up this “RMS standby”:
o Build out the standard management group.
o Install the final management server on the geographically remote server.
o Back up the clustered RMS’ encryption keys with the “SecureStorageBackup.exe” tool from the \SupportTools directory (a one-time deal).
In the event of a disaster, the encryption keys can be restored to the remote management server, the OpsDB failed over, and the remote management server then promoted to be the RMS.
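The recovery sequence above can be sketched as a small checklist that reports the next action. This is purely illustrative: `StandbyState` and `next_step` are hypothetical names, not part of OpsMgr, and the ordering simply follows the steps as listed in this post rather than any documented requirement.

```python
from dataclasses import dataclass


@dataclass
class StandbyState:
    """Hypothetical readiness flags for the RMS standby; not an OpsMgr API."""
    keys_backed_up: bool = False     # SecureStorageBackup.exe backup taken earlier
    keys_restored: bool = False      # keys restored onto the standby server
    opsdb_failed_over: bool = False  # OpsDB is running at the DR site
    promoted: bool = False           # standby has been promoted to RMS


def next_step(state: StandbyState) -> str:
    """Return the next action, in the order the post lists the steps."""
    if state.promoted:
        return "done: standby is now the RMS"
    if not state.keys_backed_up:
        return "blocked: no encryption key backup exists"
    if not state.keys_restored:
        return "restore encryption keys to the standby management server"
    if not state.opsdb_failed_over:
        return "fail over the OpsDB"
    return "promote the standby management server to RMS"


# Example: the one-time key backup was done, and disaster has just struck.
state = StandbyState(keys_backed_up=True)
print(next_step(state))
```

The "blocked" branch is the point of the exercise: without the one-time key backup, none of the later steps can proceed, so that backup is worth verifying long before it is needed.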
The RMS golden rule: Location, location, location!
Choosing the right location (domain and network) for a root management server is very important when designing a deployment. Whether intentionally and proactively, or inadvertently and reactively, the IT monitoring team found themselves thinking a lot about the following features when choosing where they were going to set up their root management servers:
o Mutual authentication: In MOM 2005 it was optional, but in OpsMgr 2007 mutual authentication is required. This means that on every communication channel, both ends need to be able to confirm the identity of the other endpoint. This can be done via Kerberos (domain/forest trusts) or via certificate-based authentication. As such, IT intentionally joins every RMS to the Active Directory domain that has the greatest number of two-way trusts with the various domains where agents reside.
o AD assignment rules: One of the top 3 new features of OpsMgr 2007 (in my unofficial opinion) is the fact that agents can now be installed with zero configuration, and when they start up they can query AD for the management groups and servers they should be talking to. The configuration information in AD is maintained by rules running on the root management server. If AD assignment rules are to be used, the RMS must be able to communicate with the domains that the rules will be run against.
o Operations Console/SDK Access: The SDK service running on the RMS is the only point to which users’ consoles can connect. If the SDK service is inaccessible, then the monitoring data being collected by that management group is not visible, and in some senses the management group can be considered logically unavailable. As such, IT deliberately placed their RMSs in locations that are widely accessible.
o User Roles across tiers: Microsoft IT’s deployment is tiered, with three management groups as of the time of this post. Their users access a single management group, and from there they use the “Show Connected Alerts” feature to view alerts from all management groups within a single view. When a user turns on “Show Connected Alerts”, their credentials are passed to the various mid-tier management groups to get the alerts from each. During that authentication process their user role’s scope is applied and the resulting alert set is filtered appropriately. Ensuring that all RMS servers can communicate and authenticate with the same domains allows user roles to be defined once in a central MG and then replicated to the other MGs. This is by no means a requirement, but it can simplify administration.
o The Data Warehouse: While most write activity occurs between the management servers and the data warehouse DB server, the RMS does write some data to the DWH DB as well. This communication needs to be accounted for when designing a deployment. Further details on the ports and protocols required can be found in the Operations Manager 2007 Security Guide.
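To make the mutual authentication point concrete, here is a minimal sketch of picking the domain with the greatest two-way trust coverage of the agent domains. The domain names and the `pick_rms_domain` helper are made up for illustration; in practice the trust data would come from your own AD inventory, and the other considerations in the list above still apply.

```python
from typing import Dict, Set


def pick_rms_domain(two_way_trusts: Dict[str, Set[str]],
                    agent_domains: Set[str]) -> str:
    """Return the candidate domain that can use Kerberos with the most agent domains."""
    def coverage(candidate: str) -> int:
        # A domain can always authenticate its own members.
        reachable = two_way_trusts[candidate] | {candidate}
        return len(reachable & agent_domains)

    # Sorting first makes ties resolve alphabetically.
    return max(sorted(two_way_trusts), key=coverage)


# Hypothetical inventory: candidate domain -> domains it has two-way trusts with.
trusts = {
    "corp.example": {"emea.example", "apac.example", "lab.example"},
    "emea.example": {"corp.example"},
    "lab.example": {"corp.example"},
}
agents = {"corp.example", "emea.example", "apac.example", "dmz.example"}

best = pick_rms_domain(trusts, agents)
# Agent domains without a trust path would need certificate-based authentication.
needs_certificates = agents - (trusts[best] | {best})
print(best, sorted(needs_certificates))
```

With this sample data the choice is corp.example, and dmz.example is left over as the one agent domain that would have to fall back to certificates.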
Microsoft could do better in the business continuance/disaster recovery arena by providing a simple wizard to automate the promotion/demotion of the RMS.
Most DR scenarios involve a site failure (power or network) that simple clustering won't resolve. The steps required to fail over to a remote site (importing the RMS keys and updating the agents) currently require someone with sufficient rights to follow a separate DR procedure document. It would be nice if this could be done from the GUI (where the admins live).