Yes, it has been a VERY long while since I posted.  The list of what I want to post keeps getting longer and longer, as does the queue of pending requests waiting for access to me.  As a result of the long queue wait times, the average access times for anything I want to do are unacceptably high.  I need to figure out how to better parallelize what I need to do.

What I choose to post here is information where I had a lot of difficulty getting the answers to.  I.e., it isn't floating around out there at all, or it doesn't answer some of the specific questions I needed answers to.  The challenge is that the format in which an answer works for me is not necessarily something I can just cut and paste here.  So it has to wait until I have time to polish it, as I don't have Ralph Macchio hanging around to help (wax on, wax off).

Moving on to the topic of the post, if you didn't catch the double entendre in the title and the first paragraph, this post is about storage performance and design options.  This is from a scenario where fault tolerance was on each individual LUN and, for capacity management, storage was being allocated in multiple small LUNs and concatenated at the server.  The test came about as a result of production outages due to poorly performing storage.  In working with the storage and SQL teams, this was a test that was run in response to the argument that “we don’t see performance improvements in striping over spanning”.  Which was absolutely true, as the test environment was measuring only response times as the measure of “performance”, not the total throughput needed.  All of us storage aficionados know that it is the throughput (IOPS) demanded versus the ability of the storage to deliver it that drives response times.

As an analogy, think of trying to get the high school football team to a game.  Let’s say it takes an hour to drive to the game.  Whether one mom/dad/coach takes a car with 3 of the players or one bus is taken with all the players, it still takes an hour to drive to the game.  This means either multiple trips must be made or multiple drivers have to drive.  Saying that the “trip” isn’t any faster doesn’t negate selecting the bus as the best option.

In short, the conclusions below are really a reiteration of what we already know: more spindles exercised equals more throughput.  Spanning was essentially throttling the full throughput of the storage to just the LUN holding the active data.


For the critics out there: I know this isn’t real world, and that read/write ratios, the higher cost of writes, and the aggregation of writes at the array controller all impact total throughput.  The goal here is to explore the relationship between the load driven, the response times, and the point where throughput maximizes, while minimizing the complexity of the test harness.  The relationship is what is important, and it will be consistent even if the storage configurations change.  (Note: This is in the comments at the end; I put it here for all those who won’t read all the way to the end before posting feedback.)


Striping vs. Spanning


Winner

Striping!!!

Return Of The Analysis

Overview

In a spanned set, data is only written to or read from the subset of disks which holds the data needed. If all data is consumed all the time, this will eventually balance Input/Output (IO) as the storage fills. In the meantime, and for scenarios where only a subset of the data is active (think the most recent month of 5 years of historical data in a database), only the spindles containing that data will be used.
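To make the difference concrete, below is a minimal sketch of how a logical block address could map to an underlying disk in each scheme. The disk and stripe sizes are hypothetical values picked for readability, not the configuration used in these tests.

```python
# Illustrative sketch: how a logical block address (LBA) maps to an underlying
# "physical disk" in a spanned vs. a striped set. Sizes are hypothetical.

DISK_BLOCKS = 1_000_000   # blocks per LUN (hypothetical)
NUM_DISKS = 3
STRIPE_BLOCKS = 128       # blocks per stripe unit (hypothetical)

def span_disk(lba: int) -> int:
    # Spanning concatenates the LUNs: the first disk fills completely
    # before any IO ever touches the second.
    return lba // DISK_BLOCKS

def stripe_disk(lba: int) -> int:
    # Striping rotates stripe-sized chunks across all LUNs, so adjacent
    # regions of the logical disk land on different physical disks.
    return (lba // STRIPE_BLOCKS) % NUM_DISKS

# A "hot" region the size of one LUN (like the test file used later) stays
# entirely on one disk when spanned, but spreads across all three when striped.
hot = range(0, DISK_BLOCKS, STRIPE_BLOCKS)
print({d: sum(1 for lba in hot if span_disk(lba) == d) for d in range(NUM_DISKS)})
print({d: sum(1 for lba in hot if stripe_disk(lba) == d) for d in range(NUM_DISKS)})
```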

For reference, a disk under anything from minimal load up to full (but not overloaded) load should respond to the operating system in 4 to 6 milliseconds (ms) on average, depending on the disk speed. Disk response times will not go below 4 to 6 ms due to the physical limitations of the mechanical device. Therefore, as IO requests from the operating system and application arrive at a rate greater than the storage can service them, those IO requests begin to wait in the queue. Thus, the more requests that cannot be serviced immediately, the greater the wait times. Degraded is considered to be in the 15 ms range, Critical in the 20 ms range.
NOTE: Cache will lower disk times, but caches WILL become saturated under sustained load in excess of what the storage can support, and as such should not be counted on when planning for the overall supported load. Instead, they should be looked at as an accelerator under normal load conditions and a buffer for transient load conditions. These tests were done WITH a cache on the SAN, so even if you believe that caches magically fix all evils, it can be observed here that there are still limits even with a SAN and cache.
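To illustrate the queueing behavior described above (this is a model, not the test itself), here is a rough sketch assuming a simple M/M/1-style queue and a hypothetical single disk capable of about 200 IOPS; the numbers are made up, but the shape of the curve is the point: response times stay near the mechanical floor until the offered load nears capacity, then climb rapidly.

```python
# Minimal queueing sketch: a disk that can service a fixed number of IOs per
# second. Once the offered load approaches that rate, requests pile up in the
# queue and the average response time climbs steeply.

SERVICE_MS = 5.0                  # per-IO service time, mid 4-6 ms range
CAPACITY = 1000.0 / SERVICE_MS    # ~200 IOPS for this hypothetical disk

def avg_response_ms(offered_iops: float) -> float:
    # Classic M/M/1 approximation: time in system W = 1 / (mu - lambda).
    # At or above capacity the wait is effectively unbounded.
    if offered_iops >= CAPACITY:
        return float("inf")
    return 1000.0 / (CAPACITY - offered_iops)

for load in (50, 100, 150, 180, 195, 199):
    print(f"{load:>4} IOPS offered -> {avg_response_ms(load):6.1f} ms average")
```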

Note: ALL data below was collected on the same server on the same 3 LUNs; only the partition type was changed.

Legend

  • Red Line is the disk latency discussed previously. (Avg sec/Read, Avg sec/Transfer, Avg sec/Write from PhysicalDisk or LogicalDisk performance counters)
  • Blue Line is the number of operations per second performed by the storage (Reads/sec, Transfers/sec, Writes/sec from PhysicalDisk or LogicalDisk performance counters)
  • Green Line is the number of operations outstanding. This is the independent variable and is controlled by the test harness (IOMeter). ("Current Disk Queue Length" from PhysicalDisk or LogicalDisk performance counters)
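For anyone wanting to reproduce these charts, here is a hedged sketch of one way to plot the three lines from a perfmon log converted to CSV (e.g. via relog). The file name and the substring matching of counter columns are assumptions on my part; exact column names vary by machine and disk instance.

```python
# Hedged sketch: plot the legend's three lines from a perfmon log exported to
# CSV (e.g. "relog disklog.blg -f csv -o disklog.csv"). Column matching below
# is an assumption; adjust the substrings to your counter instances.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("disklog.csv", na_values=" ")
df.columns = [c.strip() for c in df.columns]

def col(substring: str) -> str:
    # Find the first counter column whose name contains the substring.
    return next(c for c in df.columns if substring in c)

fig, ax1 = plt.subplots()
ax1.plot(df[col("Avg. Disk sec/Transfer")].astype(float) * 1000, "r", label="Latency (ms)")
ax1.plot(df[col("Current Disk Queue Length")].astype(float), "g", label="Outstanding IOs")
ax2 = ax1.twinx()  # IOPS on a second axis, as in the charts here
ax2.plot(df[col("Disk Transfers/sec")].astype(float), "b", label="IOPS")
ax1.legend(loc="upper left")
ax2.legend(loc="upper right")
plt.show()
```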

Spanning

Below (Figure 1) is the overall performance picture of the spanned system. As it is quite small, a zoomed version will be placed near the text where specific areas are called out.

Figure 1

Observation #1

  • Red line (latency) and green line (load generated) increase at the same rate.
  • This supports the previous statement that disk wait times are correlated with the amount of IO being asked of the underlying storage.
  • Real world application and idiosyncrasies:
    • This is why "Current Disk Queue Length" is suggested as a counter to gauge whether or not storage is performing well. "Current Disk Queue Length" falls short in that it is a point-in-time counter and cannot accurately represent the median load over a given time period. Thus, when processing hundreds of IOs per second (IOPS), one sample every second or greater doesn't give a very good picture of the aggregate trend (see the sketch after this list).
      NOTE: This problem scenario can be observed in the drops in the green line. Even under structured loads this data is skewed.
    • One idiosyncrasy is that certain scenarios can cause the IO to be delayed "in flight" (somewhere between it exiting the queue and returning from the underlying storage). Low "Current Disk Queue Lengths" and high latencies can hint at this scenario. Due to the inaccuracies in "Current Disk Queue Length", the correct tools to confirm this scenario are native ETW tracing and tools, such as XPerf, that consume said data.
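The sampling problem can be demonstrated with a toy example. The burst pattern below is entirely made up; the point is that point-in-time samples taken once a second lurch between extremes and never show the true time-weighted average.

```python
# Sketch of why "Current Disk Queue Length" misleads: it is a point-in-time
# value, so 1-second samples alias against bursty IO. Numbers are made up.

# Hypothetical queue depth recorded every millisecond: 700 ms bursts of 32
# outstanding IOs alternating with 700 ms idle gaps (true average ~16).
queue = [32 if (t // 700) % 2 == 0 else 0 for t in range(60_000)]

true_avg = sum(queue) / len(queue)
samples = [queue[t] for t in range(0, 60_000, 1000)]  # perfmon-style 1 s samples

print(f"time-weighted average depth: {true_avg:.1f}")
print(f"1-second samples           : {samples[:12]} ...")  # only 0s and 32s, never ~16
```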

Observation #2

  • Blue Line (IOPS serviced) peaks and stays flat regardless of how much more load the test harness attempts to push.
  • Real world application:
    • Once you're done, you're done.
    • Can't squeeze blood from a stone.

Observation #3

  • The Red Line (latency) in the picture to the right (Figure 2) is scaled differently (1000x) so the actual values can be seen more clearly.
  • As the latencies approach 20 ms, the throughput approaches its absolute maximum.
  • Reference the previous picture, where the Blue Line (throughput) maxes out pretty close to the left-hand side of the chart at about 800 IOPS (see the cross-check after this list).
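As a sanity check, Little's Law (outstanding IOs = throughput × response time) ties these chart values together. The arithmetic below uses rounded values read off the charts:

```python
# Little's Law cross-check for the spanned scenario:
#   outstanding IOs = throughput (IOPS) * response time (seconds)

iops = 800           # observed plateau (Figure 1 / Figure 3)
latency_s = 0.020    # ~20 ms, the "Critical" range where throughput tops out

print(f"implied queue depth at saturation: {iops * latency_s:.0f}")  # ~16
```

That implied depth of ~16 outstanding IOs lines up with where saturation is observed in the spanned charts, discussed further below.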

Observation #4

  • From the pictures (Figure 1 and Figure 3) below, performance maximizes at an average of about 800 IOPS.

Figure 3


Striping

This is the same 3 LUNs reconfigured as a stripe.

Figure 4

Observation #1

  • Red Line (latency) increases at a rate roughly equivalent to one-third the rate of increase of the Green Line (load).
  • Again, this supports the previous statement that disk wait times are correlated with the amount of IO being asked of the underlying storage. The correlation isn't one to one because the OS sees 3 "physical disks" (each LUN is presented as a physical disk from the perspective of the OS) under this logical disk, and the load is distributed across said disks. Thus each "physical disk" sees only one-third of the load, in turn suffering only one-third of the degradation (see the sketch after this list).
  • Real world application and idiosyncrasies:
    • In addition to previously mentioned…
    • The performance at the logical disk level can be very different than at the OS "physical disk" level. By spreading load across multiple "physical disks", the logical disk gains the advantages of the best and minimizes the consequences of the worst.
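A trivial sketch of the arithmetic: under an idealized uniform random-IO spread (an assumption, since real distributions vary), each of the 3 underlying disks carries about one-third of the logical disk's outstanding IOs.

```python
# Why per-disk latency grows at roughly one-third the rate: the logical disk's
# outstanding IOs are spread across the 3 underlying "physical disks".

NUM_DISKS = 3

def per_disk_queue(logical_outstanding: int, disks: int = NUM_DISKS) -> float:
    # Idealized uniform spread: each disk sees ~1/disks of the load.
    return logical_outstanding / disks

for q in (12, 24, 48):
    print(f"{q:>2} outstanding at the logical disk -> "
          f"~{per_disk_queue(q):.0f} per physical disk")
```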

Observation #2

  • Blue Line (IOPS serviced) climbs more slowly, but still eventually plateaus regardless of how much more load the test harness attempts to push.
  • Real world application:
    • Still can't get blood from a stone.

Observation #3

  • The storage has to be pushed much harder to saturate it. In the spanned scenario, saturation was reached at about 16 pending IOs outstanding. In the striped scenario, saturation wasn't reached until about 48 pending IOs outstanding.
    Notice this is a factor of 3 greater than the spanned scenario. This should not be a surprise.
  • There appear to be 2 levels of saturation. One from 10 ms to 20 ms latencies and one from 20 ms and up. However, the higher level is much more volatile and "fails" down to the lower level of saturation quite often. This artifact should not be factored into scaling decisions.
  • Same as above, the scaling was changed on this picture (Figure 5) so the latency value is easier to read.

Observation #4

  • From the picture below (Figure 6), the throughput maxes out at about 2500 IOPS.
  • This is a little more than 3x the spanned scenario and should not be a surprise (see the cross-check after this list).
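The same Little's Law sanity check applies here, again using rounded chart values: ~2500 IOPS at ~20 ms implies roughly 50 outstanding IOs, in line with the ~48 where saturation was observed.

```python
# Little's Law cross-check for the striped scenario.
iops = 2500          # observed plateau (Figure 4 / Figure 6)
latency_s = 0.020    # ~20 ms

print(f"implied queue depth at saturation: {iops * latency_s:.0f}")  # ~50
```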

Figure 6

Testing strategy

Tools

IOMeter – www.iometer.org

Perfmon – included in Windows OS

Disk Manager – included in Windows OS

Configuration

  • All Microsoft best practices were followed for storage configuration.
    • Partitions were aligned
    • File system used 64K clusters (as per SQL storage best practices)
  • 3 LUNs were configured in A) Span and B) Stripe
  • IOMeter
    • Access configuration – 64 KB IO sizes, 100% Random Read IO
    • Test Setup
      • Run Time – 60 seconds
      • Cycling Options – "Cycle # Outstanding I/Os – run step outstanding I/Os on all disks at a time"
      • # of Outstanding I/Os – Start – 8, End – 256, Step 8, Linear Stepping
    • Iobw.tst (test file) was intentionally created to mostly fill one LUN (~32 GB) worth of space. This was selected to focus the analysis on the scalability of using one "physical disk" vs. using multiple "physical disks".
  • Perfmon – The above IOMeter configuration creates a test run approximately 35 minutes in length (see the arithmetic sketch after this list). Thus a performance counter log was created to automatically stop after a similar period (so the test could run unattended).
    • Collect all PhysicalDisk and LogicalDisk counters.
    • Sample in 10-second intervals – thus there are multiple data points for each IOMeter step, and the steps can be observed.
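For reference, the ~35 minute estimate follows from the cycling options above; the arithmetic below accounts for the stepped run time, with ramp-up between steps presumably covering the remainder.

```python
# Run-time arithmetic for the cycling options: 8..256 outstanding IOs in
# steps of 8 gives 32 steps of 60 seconds each.

start, end, step = 8, 256, 8
run_time_s = 60

steps = (end - start) // step + 1
print(f"{steps} steps x {run_time_s}s = {steps * run_time_s / 60:.0f} minutes")
```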

Comments

This was done to demonstrate the change in performance between striping and spanning, as well as to illustrate the impact of changing load. As such, the IO profiles were simplified to be exclusively random read IO in order to present the worst-case scenario (as random read IO has a very low cache hit rate) and to minimize variability in results due to a cache optimizing writes to/from the storage. Therefore, the maximum throughput demonstrated in this test does not reflect the impact of write IOs. As a result, the total throughput numbers will not be accurate for a real-world production scenario; however, the relationship between striping and spanning will remain similar. In short, the behavior pattern can be generalized, while the raw throughput numbers cannot.

Additionally, scoping the test file size to reside on only one "physical disk" is not applicable to all scenarios. However, there are many scenarios where, due to data locality, this can easily be highly representative of real-world access.