In order to make any kind of performance tuning recommendations, it is first necessary to capture a baseline of normal operating conditions. What I’d like to demonstrate in this post is how to measure the performance of your Operations Manager database disks. The disks hosting the Operations Manager database must be able to comfortably sustain average read/write transactions, as well as handle the large bursts that occur frequently. As with any OLTP database, the Operations Manager database is no exception in that there is a high volume of smaller, random read/write transactions.
The question that has come up with many customers is:
Is this configuration, in fact, enough to support my Management Group? Or, perhaps, is this over-built?
Before I spill these beans, I want to throw a disclaimer out there stating that I am not a hardware expert, and my expertise is not in SAN or RAID configuration. I am by no means an authoritative subject matter expert in this area. I just happened to know a little math, and have experience in collecting performance data and crunching those numbers. Plus, I have some very smart colleagues (Michael Pearson from the Performance Test Team). I also have to give credit to my wife, for helping me on the RAID 5 calculations. :-)
By the way, I’m not doing rocket science here. Anyone can come up with these formulas, and run through a similar process. I’m sharing this now because I haven’t found a single good resource regarding this. I hope this post serves the community well.
Here’s what I use to gauge the performance of a Operations Manager database server role.
Given the disk configuration above, we can calculate what we expect to get for IOPS. Looking at the “assumptions” in the above table, we assume the disks comprising E and F are rated at 175 IOPS per disk, and the disks comprising G are rated at 125 IOPS. So, for each array, the math looks like this.
Prelude to Measuring IOPS
When gauging performance, it’s always a good idea to calculate both average samples and maximum samples. The reason for this is because intensive Read/Write applications need to have a sufficient disk subsystem to handle significant bursts of both Read and Write type IO, and Operations Manager is no exception. If your disk subsystem is not built to handle these bursts, this can be the origin of a bottleneck in overall Management Group performance.
A quick note about capacity planning, and how it relates to estimating Read or Write type IO bursts. This is difficult to determine without first knowing which type of IO a unique Management Group will produce more of. Cumulative knowledge about making these types of estimates during a capacity planning and design stage can be applied, and a fair recommendation can be made. But, again, I’m not going into capacity planning in this post, so I won’t wander off too far.
However, the only point I want to make is, until the Management Group is built out and all agents and Management Packs are installed, and the Operations Console has been delivered to your operations staff, a precise number for types of IO (Read or Write) and recommendations for Read / Write Cache ratio on your RAID controller cannot be made.
This is why it is important to gauge the performance of your Management Group after everything is in place, to verify the measured IO meets the expected IO. If things don’t match up, further configuration “tweaks” can be made.
If we sample Disk Reads/sec and Disk Writes/sec, for a period of time, we can calculate average IOPS as follows. Remember that we’ll also be figuring Maximum into our review.
I recommend sampling these counters for each instance, for a 24 hour period, with at least 1 minute interval. Of course, the more samples you collect, the more accurate the calculations will be. However, I don’t think it’s necessary to sample more than once every 15 seconds.
This is the average IOPS you are currently getting out of your configuration. Using the formulas in the table above, the math looks like this.
NOTE: Maximum is calculated using top IO sampled for each type during the sampling period. So this is the theoretical maximum IOPS for the array.
Looking at the average and maximum IOPS and comparing them to the expected IOPS in the previous section, we can see that we are well covered for the array hosting the OperationsManager data file. In fact, we aren’t seeing IO bursts that even come close to the expected IO capabilities for the array.
The array hosting the OperationsManager log file appears to be busier, as far as a consistent average goes. But it’s still well within reason when comparing the expected IO for the array. However, there is reason for concern with the Write bursts that are occurring on that array.
A major concern here is the array hosting the TempDB. The total measured IOPS activity on that array exceeds the expected IO for the array. Given this, and the fact that the Write IO burst that are double the expected IO capacity for the array, the array hosting the TempDB needs some serious attention. This is likely causing quite a bottleneck on the Operations Manager database server role.
I would certainly add additional spindles to the array hosting the TempDB. If, for some reason, this wasn’t possible due to budget constraints, I see another option is available.
Since the array hosting the OperationsManager database is humming along without coming close to consuming the available IO afforded by that array, we could borrow 2 disks from that array and add those to the array hosting the TempDB.
With this configuration, the F drive will still have plenty of room to perform all IO’s, including the IO bursts we’ve observed. This will lend more performance to the E drive, where it is needed most. Giving the E drive the additional 2 disks will allow TempDB to perform it’s average IO’s comfortably at ~592 IOPS. The high Write bursts will still exceed maximum achievable IO for the array, but this is much better than the current configuration.
Measuring Types of IO
From the numbers in our measured ROIPS calculations, we can determine what type of IO transaction is happening more often. This can help in determining what the Read / Write Cache Ratio should be adjusted to on the controller. We can find this by calculating the Read or Write IO, against the total IO. We’ll do this for both Average and Maximum.
Taking these percentages of IO types, we can determine a feasible Read / Write Cache Ratio on the controller. As you can see, most IO is Write. There’s only one exception. And that is the Read type IO bursts on the array hosting the OperationsManager database. This Read type IO is likely due to a high number of Operations Console sessions, navigating to multiple different views.
Because console sessions submit nothing but Read type IO to the OperationsManager database, I would surmise that this particular Management Group employs the use of the Operations Console quite heavily.
One of the toughest things I've found is that when I'm sizing a new install and I want to figure out IOPS there doesn't seem to be any definitive way or method to do this. I've read the few post regarding DB and DW sizing but what I'd like to do is size those two based data from the System Center Capacity Planner and then figure out the IOPS so I can recommend a disk sub-system to my clients or what they could do in order to improve DB/DW performance based on the sizes of the the databases.
What you've written here I think will help but it's hard to make predictive call on these sorts of things. :P
Bryan - thanks for the feedback. Capacity planning and performance tuning in Operations Manager is just as much an art as it is a science :)
I’m hesitant to write any articles with hard facts about capacity planning because of this – there are just too many factors involved that can potentially change the rules significantly.
I’ll continue searching for ways to get this information across, but it will probably end up being the reader that determines their own path in the end.
I completely agree, [others] have posted stuff on the subject and the information is helpful it's unfortunate that there isn't more on IOPS for the databases even using the Capacity Planner and then figuring out IOPS would be great which is why your post was really helpful in explaining things. I wish the Capacity Planner could give folks a rough-educated-guess on possibly good IOPS we should look for in a disk system for SCOM deployments, sure it's a rough guess but it would be a great start so that we can than delve into the 'art' portion of the exercise. :)
I'd like to make a few comments about some of the points Jonathan makes here.
This blog post talks about doing performance monitoring of your OpsMgr databases after everything is setup and running and using that load to calculate IOPS requirements. He mentions that based on this data, you can calculate the theorical IOPS limits of your disk subsystem. That's not exactly true. That tell's you how fast its gone now how fast it could go. You car's speedometer goes to 120-180MPH, but most people have never OBSERVED this. So, how do you determine what the max is? By using SQLIO and following the SQL Server Pre-Deployment best practices. Only that will tell you the limits of what your particular disk susbstem can do. Once you have that data, you can compare that to the data Jonathan shows you hwo to collect above. Then, you can see how close to the "limit" are you running. That's going to give you and the customer a LOT better idea of if they need disks or not, and how many.
Another item that I want to make clear is the IO needs of the OpsConsole. based on my testing, we've seen that some views use TempDB more than others. So, while you'd think the OpsConsole is mostly reads, there are some views that do a good bit of writing to TempDb before the "read" can be done, so keep that in mind.
Jonathan, thanks for publishing this data and thanks for giving credit where credit is due here! Great Job!
Predeployment I/O Best Practices
One more thing....
This is a recently published article from teh SQL Team around Disk Alignment. This is another pre-deployment best practice. If you would like a 30%+ gain in disk perf, I suggest giving this a read :)