Yes it has been a VERY long while since I posted. The list of what I want to post keeps getting longer and longer, as does the queue of pending requests waiting for access to me. As a result of the long queue wait times the average access times for anything I want to do are unacceptably high. I need to figure out how to better parallelize what I need to do.
What I choose to post here is information that I had a lot of difficulty getting answers to, i.e., it isn't floating around out there at all, or what is out there doesn't answer some of the specific questions I needed answered. The challenge is that the format an answer works for me in is often not something I can just cut and paste here. So it has to wait until I have time to polish it, as I don't have Ralph Macchio hanging around to help (wax on, wax off).
Moving on to the topic of the post: if you didn't catch the double entendre in the title and the first paragraph, this post is about storage performance and design options. It comes from a scenario where fault tolerance was on each individual LUN and, for capacity management, the storage was being allocated in multiple small LUNs and concatenated at the server. This came about as a result of production outages due to poorly performing storage. In working with the storage and SQL teams, this test was run in response to the argument that “we don’t see performance improvements in striping over spanning”. That was absolutely true, because in the test environment they were measuring only response times as “performance”, not the total throughput needed. All us storage aficionados know that it is the throughput (IOPS) demanded vs. the ability of the storage to deliver it that drives response times.
As an analogy, think of trying to get the high school football team to a game. Let’s say it takes an hour to drive to the game. Whether one mom/dad/coach takes a car with 3 of the players or one bus takes all the players, it still takes an hour to drive to the game. With cars, this means multiple trips must be made or multiple drivers have to drive. Saying that the “trip” isn’t any faster doesn’t negate selection of the bus as the best option.
In short, the conclusions below are really a reiteration of what we already know: more spindles exercised equals more throughput. Spanning was essentially throttling the full throughput of the storage to just that of the LUN which the active data was on.
For the critics out there, I know this isn’t real world, and that read/write ratios, the higher cost of writes, and aggregation of writes at the array controller all impact total throughput. The goal here is to explore the relationship between the load driven, response times, and the point where throughput maximizes, while minimizing the complexity of the test harness. That relationship is what is important, and it will be consistent even if the storage configurations change. (Note: This is in the comments at the end; I put it here for all those who won’t read all the way to the end before posting feedback.)
Striping vs. Spanning
Striping!!!
In a spanned set, data is only read from or written to the subset of disks which hold the data needed. If all data is consumed all the time, this will eventually balance Input/Output (IO) as the storage fills. In the meantime, and for scenarios where only a subset of the data is active (think the most recent month of 5 years of historical data in a database), only the spindles containing that data will be used.
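To make the difference concrete, here is a minimal sketch of how a logical block address could map to a LUN under each layout. It is not how Disk Manager is actually implemented; the LUN size, the stripe unit, and the location of the "hot" range are all hypothetical numbers I picked for illustration.

```python
import random
from collections import Counter

NUM_LUNS = 3
LUN_BLOCKS = 1_000_000   # hypothetical LUN size, in blocks
STRIPE_BLOCKS = 128      # hypothetical stripe unit, in blocks


def spanned_lun(block):
    """Concatenation: LUN 0 fills first, then LUN 1, then LUN 2."""
    return block // LUN_BLOCKS


def striped_lun(block):
    """Striping: consecutive stripe units rotate across the LUNs."""
    return (block // STRIPE_BLOCKS) % NUM_LUNS


def io_distribution(mapper, hot_start, hot_end, samples=30_000, seed=0):
    """Count which LUN services each random read within the hot range."""
    rng = random.Random(seed)
    hits = Counter(
        mapper(rng.randrange(hot_start, hot_end)) for _ in range(samples)
    )
    return [hits.get(lun, 0) for lun in range(NUM_LUNS)]


# Hypothetical hot data set: 500,000 blocks that all live on LUN 1 when spanned.
HOT_RANGE = (1_200_000, 1_700_000)

spanned = io_distribution(spanned_lun, *HOT_RANGE)
striped = io_distribution(striped_lun, *HOT_RANGE)
print("IOs per LUN, spanned:", spanned)
print("IOs per LUN, striped:", striped)
```

With the spanned mapping every read in the hot range lands on a single LUN, while the striped mapping spreads the same reads roughly evenly across all three.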
For reference, a disk under anything from minimal load to fully loaded, but not overloaded, should respond to the operating system in 4 to 6 milliseconds (ms) on average, depending on the disk speed. Response times will not go below 4 to 6 milliseconds due to the physical limitations of the mechanical device. Therefore, as IO requests from the operating system and application arrive at a rate greater than the storage can service them, those requests begin to wait in the queue. Thus the more requests that cannot be serviced immediately, the greater the wait times. Degraded is considered to be in the 15 ms range, Critical in the 20 ms range. NOTE: Cache will lower disk times, but caches WILL become saturated under sustained load in excess of what the storage can support, and as such should not be included in planning for overall supported load. Instead they should be looked at as an accelerator under normal load conditions and a buffer for transient load conditions. These tests were done WITH a cache on the SAN, so even if the belief is that caches magically fix all evils, it can be observed here that there are still limits even with a SAN and cache.
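As a rough illustration of why response times climb once requests start queuing, here is a sketch using the textbook single-server (M/M/1) queueing formula, response = service time / (1 - utilization). This model is my assumption for illustration, not something measured in these tests. The 5 ms service time is simply the middle of the 4 to 6 ms range above, and treating each LUN as one queue is a simplification.

```python
import math

SERVICE_MS = 5.0  # assumed mean service time, middle of the 4 to 6 ms range


def response_ms(iops, service_ms=SERVICE_MS, luns=1):
    """Estimated mean response time when load is spread evenly over `luns`.

    Uses the M/M/1 formula per LUN; returns infinity once a LUN is asked
    for more IO than it can service (the queue grows without bound).
    """
    per_lun_iops = iops / luns
    utilization = per_lun_iops * service_ms / 1000.0
    if utilization >= 1.0:
        return math.inf
    return service_ms / (1.0 - utilization)


# luns=1 models hot data sitting on one LUN of a span;
# luns=3 models the same load striped across all three LUNs.
for iops in (50, 100, 150, 190, 250):
    one = response_ms(iops, luns=1)
    three = response_ms(iops, luns=3)
    print(f"{iops:>4} IOPS   1 LUN: {one:8.1f} ms   3 LUNs: {three:8.1f} ms")
```

Under these assumptions a single LUN crosses the 20 ms "Critical" mark at around 150 IOPS and saturates at 200 IOPS, while the same load spread across three LUNs stays close to the raw service time. Real arrays will not match these numbers; only the shape of the curve is the point.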
Note: ALL data below was collected on the same server on the same 3 LUNs; only the partition type was changed.
Below (Figure 1) is the overall picture of the performance of the spanned system. As it is quite small, a zoomed version will appear near the text wherever specific areas are called out.
Figure 1
Figure 3
These are the same 3 LUNs reconfigured as a stripe.
Figure 4
Figure 6
IOMeter – www.iometer.org
Perfmon – included in Windows OS
Disk Manager – included in Windows OS
This was done to demonstrate the change in performance between striping and spanning, as well as to illustrate the impact of changing load. As such, the IO profiles were simplified to be exclusively random read IO in order to present the worst case scenario (as random read IO has a very low cache hit rate) and to minimize variability in results due to a cache optimizing writes to the storage. Therefore, the maximum throughput demonstrated in this test does not reflect the impact of write IOs. As a result, total throughput numbers will not be accurate for a real world production scenario; however, the relationship between striping and spanning will remain similar. In short, the behavior pattern can be generalized, while the raw throughput numbers cannot.
Additionally, scoping the test file size to reside on only one "Physical Disk" is not applicable to all scenarios. However, there are many scenarios where, due to data locality, this can easily be highly representative of real-world access.