I will be progressively moving these posts over from my previous blog, written while I worked for Microsoft IT. The information is somewhat dated, but I didn't want to see the content disappear completely :). You can think of this as a “snapshot in time” supplement to the Implementing System Center Operations Manager 2007 at Microsoft white paper.
If you’re reading this series of posts to help make procurement decisions and want even more detail, I suggest checking out the following resources:
Following is the average scale of one of Microsoft IT's OpsMgr 2007 management groups; it should be used as a point of reference for any performance figures provided in this series of posts:
· Agents: 3500
· Alerts per day: 3500
· Events per day: 453,000
· Perf Samples per day: 16.3 million
Before Microsoft IT purchased any hardware for OpsMgr 2007, they took a week-long performance baseline of their MOM 2005 deployments. Based on the numbers from the MOM 2005 SP1 database systems, they found that CPU utilization was fairly low (19% on average), as was memory utilization (an average of 65 pages per second). The area where resource utilization was relatively high was at the disk level. Following were the average data transfer rates (a query-based way to sample similar figures from inside SQL Server is sketched after the list):
· Drive holding DB data file: 0.8 MB/sec with an average of 30 transfers/sec
· Drive holding DB log file: 0.6 MB/sec with an average of 162 transfers/sec
· Drive holding TempDB files: 0.9 MB/sec with an average of 20 transfers/sec
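MSIT's baseline numbers came from the standard logical disk performance counters, but a similar per-file picture can be pulled from inside SQL Server itself. This is only a minimal sketch, assuming SQL Server 2005 or later and the sys.dm_io_virtual_file_stats DMV; the counters are cumulative since the last service restart, so two samples taken some minutes apart are needed to turn them into rates:

-- Cumulative I/O per database file since SQL Server last started. Sample
-- twice and diff the results to derive MB/sec and transfers/sec figures
-- comparable to the perfmon numbers above.
SELECT  DB_NAME(vfs.database_id)                                        AS database_name,
        mf.name                                                         AS logical_file_name,
        mf.physical_name,
        (vfs.num_of_bytes_read + vfs.num_of_bytes_written) / 1048576.0  AS total_mb,
        vfs.num_of_reads + vfs.num_of_writes                            AS total_transfers,
        vfs.io_stall                                                    AS total_io_stall_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY database_name, logical_file_name;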
So when it came time to make a purchasing decision for OpsMgr 2007 hardware, MSIT stuck fairly close to the platform configuration of their MOM 2005 OpsDB systems, working under the assumption that the newer hardware in the current SKUs of the day would proportionately handle whatever increased load OpsMgr 2007 would bring. Following is what they ultimately used; where changes were made, they are noted:
· Server Model: HP ProLiant DL385 G1
· Processors: 2 x dual-core (4 processors in the OS's eyes) 2.2 GHz AMD Opteron processors
· RAM: 8 GB - The MOM 2005 servers only had 4 GB of RAM, but with support for RAM above 4 GB in the 64-bit OSes, the increase was merited.
· Drives: 3 SAN drives for hosting the SQL data, log and TempDB files respectively.
o SQL Data drive: 130GB RAID 0+1 – In MOM 2005 this was a RAID5 30GB drive.
o SQL Log and TempDB Drive: 20GB RAID 1 – In MOM 2005 these were RAID5 8GB drives.
· OS: Windows Server 2003 Enterprise x64 Edition with SP1
With the platform listed, CPU utilization has been averaging 25%, and memory usage has resulted in an average of 1.4 pages/sec. Drive utilization has a bit more than doubled for the SQL data drive but has been cut roughly in half for the SQL log drive. Drive utilization for the TempDB drive has remained fairly flat.
o Drive holding DB data file: 1.98 MB/sec with an average of 151 transfers/sec
o Drive holding DB log file: 0.3 MB/sec with an average of 72 transfers/sec
o Drive holding TempDB files: 1.07 MB/sec with an average of 20 transfers/sec
Moving beyond the hardware itself, two of the most significant changes made in IT’s deployment designs for the OpsDB were implementing Clustering and SQL Log Shipping.
Over the lifespan of Microsoft IT’s MOM 2005 deployment they found that, in their environment, achieving 99.9% availability for the entire infrastructure for a single month was quite difficult. One of the major contributing factors was that the OpsDB was a single point of failure, and therefore every minute it was offline counted against the availability of the overall infrastructure. Given that experience, and the fact that customer requirements for availability of monitoring were only getting more stringent, the IT monitoring team decided to implement clustering for high availability of the OpsDB. Now work such as patching servers, repairing or upgrading hardware, etc. can be performed on a single node of the cluster at a time while maintaining the availability of the DB itself. Following are a couple of configuration side notes that relate to how MSIT configured clustering:
· The cluster model used is “single quorum device server clusters” and the quorum resource is stored on a 2GB shared drive.
· The MSDTC resource that is required for the clustered installation of SQL server is located in the same resource group as the quorum drive.
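As a quick post-build sanity check, the clustered state of the SQL instance and the node currently hosting it can be confirmed from T-SQL; a small sketch, assuming SQL Server 2005:

-- 1 = this instance is installed as a failover cluster instance.
SELECT  SERVERPROPERTY('IsClustered')                 AS is_clustered,
        SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS active_node;

-- Nodes that can own the SQL Server cluster resource.
SELECT NodeName FROM sys.dm_os_cluster_nodes;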
So while clustering provides high availability, it does not necessarily solve the problem of redundancy during a disaster. With MOM 2005, MSIT had implemented a derivation of the Service Continuity solution accelerator. While this solution worked well, it had the significant drawback of needing to “ship” a complete copy of the OpsDB at least once a day. Even with compressed backups this resulted in moving 8GB per management group every 6 hours. So in designing the OpsMgr 2007 deployment, IT needed something better for geo-redundancy of the DB. The solution was SQL log shipping. This configuration has allowed for geo-redundancy, but it is worth noting that it comes with the additional considerations of running the DB in the Full recovery model and needing to maintain DB and transaction log backups. Following are some relevant configuration side notes about the setup of log shipping (a T-SQL sketch of the core pieces follows the list):
· The SQL cluster and the failover SQL server each have a 30GB shared drive dedicated to storing SQL data file and log backups.
· A full DB backup is performed for the OpsDB data file every 24 hours.
· Log shipping is configured to back up and ship the log files every 15 minutes.
· Log file backups are retained for 2 days on the source DB server and for 3 days on the destination server.
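A minimal T-SQL sketch of the pieces described above: the recovery model change plus the full and log backups that feed log shipping. The OperationsManager database name and the schedule values come from the notes above; the backup paths are illustrative assumptions, and the actual copy/restore jobs were configured with the SQL Server 2005 log shipping feature (the SSMS wizard or the sp_add_log_shipping_* procedures) rather than hand-rolled commands:

-- Log shipping requires the OpsDB to run in the Full recovery model.
ALTER DATABASE OperationsManager SET RECOVERY FULL;

-- Full backup of the OpsDB, taken every 24 hours (scheduled through SQL Agent).
BACKUP DATABASE OperationsManager
    TO DISK = N'G:\SQLBackups\OperationsManager_full.bak'
    WITH INIT, CHECKSUM;

-- Transaction log backup; the log shipping backup job (and the matching
-- copy/restore jobs on the secondary) runs every 15 minutes per the
-- schedule above.
BACKUP LOG OperationsManager
    TO DISK = N'G:\SQLBackups\OperationsManager_log.trn'
    WITH INIT, CHECKSUM;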
Additional SQL Customizations
As mentioned above, Microsoft IT made some additional customizations, specifically around implementing the Full recovery model and log shipping for the OpsDB. Following are some additional customizations worth mentioning.
Trace Flag 1118
This has been found to be rather unique to IT's deployment, but based on their experiences with beta/RC versions of OpsMgr 2007 they did achieve some performance improvements for the OpsDB by running SQL with trace flag 1118 enabled. Trace flag 1118 is used, along with striping tempdb across multiple data files, to overcome allocation contention in tempdb. The following steps were taken to configure these optimizations:
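As a point of reference, a minimal sketch of the standard SQL Server 2005 approach to enabling trace flag 1118 and striping tempdb (not necessarily the exact steps MSIT took; the file names, sizes, and drive paths are illustrative assumptions):

-- Enable trace flag 1118 globally on the running instance. To make the
-- setting survive restarts, add -T1118 to the SQL Server service startup
-- parameters (via SQL Server Configuration Manager).
DBCC TRACEON (1118, -1);

-- Stripe tempdb across multiple, equally sized data files; one file per
-- CPU core was the common guidance of the era, so 4 files on the 4-core
-- OpsDB servers described above.
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = N'T:\TempDB\tempdev2.ndf', SIZE = 1GB, FILEGROWTH = 256MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = N'T:\TempDB\tempdev3.ndf', SIZE = 1GB, FILEGROWTH = 256MB);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev4, FILENAME = N'T:\TempDB\tempdev4.ndf', SIZE = 1GB, FILEGROWTH = 256MB);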
Grooming Settings
In MOM 2005, in order to better achieve performance objectives, Microsoft IT altered the out-of-the-box grooming policies on the operational DB to keep the overall DB size down. The experiences with OpsMgr 2007 have led them to the same conclusion. The following chart shows the results of a summation of the amount of space used by each data type (for which grooming settings exist) across all three production management groups:
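As a rough way to reproduce that kind of summation, reserved space per table in the OpsDB can be totaled from the catalog views. This is only a sketch; the OpsMgr table-name patterns mentioned in the comment (PerformanceData_*, Event_*, StateChangeEvent) are assumptions that may need adjusting:

-- Reserved space per user table in MB, largest first. Group the tables
-- backing each data type (e.g. the partitioned PerformanceData_* and
-- Event_* tables, plus StateChangeEvent) to arrive at per-data-type totals.
SELECT  o.name                                   AS table_name,
        SUM(ps.reserved_page_count) * 8 / 1024.0 AS reserved_mb
FROM    sys.dm_db_partition_stats AS ps
JOIN    sys.objects AS o
        ON o.object_id = ps.object_id
WHERE   o.type = 'U'
GROUP BY o.name
ORDER BY reserved_mb DESC;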
Considering the sum of all space consumption, Performance Data and Event Data were the two largest consumers in the OpsDB. Performance Signature Data was a close third behind Event Data, followed by State Change Events Data. Considering the average space consumed per day, Performance Data was still the largest user, but Performance Signature Data was the second largest, followed by Event Data and State Change Events Data. As such, Microsoft IT customized their grooming settings on the following data types:
· Performance Data: 3 days
· Event Data: 3 days
· State change events data: 5 days
The 3-day value for performance and event data was chosen as it's long enough to span a typical weekend, and 5 days for state data as it would span a typical work week. The updated grooming settings resulted in the following reduction in OpsDB size.
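Grooming retention is normally changed through the Operations Console (Administration, Settings, Database Grooming), but the values actually in effect can also be read straight from the OpsDB. A small sketch, assuming the dbo.PartitionAndGroomingSettings table that ships with OpsMgr 2007:

-- Retention settings currently stored in the OpsDB; DaysToKeep should
-- reflect the customized values above (3 days for performance and event
-- data, 5 days for state change events).
SELECT *
FROM   dbo.PartitionAndGroomingSettings
ORDER BY ObjectName;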
The following documents are a great point of reference for setting up clustering:
· SQL Server 2005 Books Online: Failover Clustering - http://msdn2.microsoft.com/en-us/library/ms189134(SQL.90).aspx
· “Server Clusters” topic on TechNet - http://technet2.microsoft.com/WindowsServer/en/library/32c40202-1043-4211-8dba-bb57356f46811033.mspx
Following are a few additional resources that were helpful in researching SQL log shipping.
Over on the MOMTeam blog Cory Delamarter posted a great write-up on Microsoft IT's deployment of System