• Using your iPad, iPhone or Windows Phone with Lync 2010 and Office 365

    I have had a few questions recently from customers wanting to use Lync 2010 Mobile on their shiny mobile devices. I thought I would go through the steps here to connect iOS devices, such as the iPad and iPhone, as well as Windows Phone 7 devices to Lync Online in Office 365.

    Prerequisites

    This post assumes that you have configured your tenant correctly and have all of the necessary DNS records in place.

    Check that each and every one of the DNS records required for Lync Online is present and configured correctly. It is not worth continuing unless this has been done (because it won't work!).
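    If you want a quick checklist to work from, the records Lync Online typically expects for a custom SIP domain can be sketched in a few lines of Python. The targets below are the usual Lync Online endpoints, but treat them as an assumption and confirm the exact values shown for your domain in your own Office 365 admin portal.

```python
def lync_online_dns_records(domain):
    """The DNS records Lync Online typically expects for a custom SIP domain.

    Targets are the usual Lync Online endpoints; confirm the exact values
    against your own Office 365 admin portal before creating the records."""
    return [
        ("CNAME", "sip." + domain, "sipdir.online.lync.com"),
        ("CNAME", "lyncdiscover." + domain, "webdir.online.lync.com"),
        ("SRV", "_sip._tls." + domain, "sipdir.online.lync.com:443"),
        ("SRV", "_sipfederationtls._tcp." + domain, "sipfed.online.lync.com:5061"),
    ]

# Print the checklist for an example domain.
for rtype, name, target in lync_online_dns_records("contoso.com"):
    print(rtype, name, "->", target)
```

    Walk this list against the output of nslookup for your domain; if any record is missing or pointing at the wrong target, fix that before touching the mobile clients.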

    iPad and iPhone

    I am going to use my iPad here but the same approach should work for your iPhone.

    First we need to download the Lync 2010 App from the App Store. The easiest way is to search for it in the App Store on your device, but you can do it via iTunes as well if you need to.

    Once downloaded you should now have the Lync 2010 App icon on your device…

    When you first start the app it will ask you to log in. You will need to know the following information:

    • Sign-in Address: This is your UserPrincipalName
    • Password: Yeah, this is your password
    • User Name <Click More Details>: This is your UserPrincipalName

    Your UserPrincipalName (UPN) is usually the same as your email address, but it may not be. It is the name you use to log in to the Office 365 portal.

    Enter the login details at the Sign-in page and tap Sign-in…

    Lync will then validate your credentials and attempt to log in…

    Once logged in you should see the contacts page – as you can see I don't have many friends in my lab

    There is an info screen for settings and general configuration…

    Plus a screen showing your current Chats…

    I really just wanted to show how to connect in this post, so I am not going to dig into any of the detailed usage here.

    Windows Phone

    Firstly we need to download the Lync 2010 app from the Windows Phone Marketplace. The easiest way to do this is to search for it on your WP device. Once installed, you should have an icon for Lync 2010 – I have pinned mine to the start screen for easy access and the live tiles.

    Note: WP7 does not have the ability to take screenshots, so I am attempting to do my best with my HTC Mozart and my Canon DSLR – I will apologise for the quality before we go on.

    On opening, Lync 2010 will request your credentials; these are exactly the same as for the iPad and iPhone versions…

    • Sign-in Address: This is your UserPrincipalName
    • Password: Yeah, this is your password
    • User name <Click More Details>: This is your UserPrincipalName

    Your UserPrincipalName (UPN) is usually the same as your email address, but it may not be. It is the name you use to log in to the Office 365 portal.

    Tap the circle with a tick inside at the bottom and Lync 2010 will attempt to log you in to Office 365…

    Once connected for the first time Lync 2010 will run through a few initial start-up screens…

    Conclusion

    Hopefully, this post shows that it's pretty easy to connect your mobile device to Lync 2010 in Microsoft Office 365. The most common causes of connection problems that I see are either incorrect DNS records or ADFS authentication problems (for federated identities obviously!).

    I would like to dig into these mobile clients further, but this post is already much longer than I anticipated, so I am going to leave it as just a connection post and do something in the future about the features and functionality that these clients provide.

  • Office 365 Proxy Server Exclusion List – Office 365 Service URLs

    I am just posting this here since it's a bit of information I regularly provide to customers and I always have trouble remembering where I found it (yes I know I should create a browser bookmark, but I thought I would share!).

    When you deploy Office 365, one of the most difficult (and poorly documented) areas is network connectivity for clients and servers on terra firma connecting to the cloud. My recommendation is to bypass your normal internet proxy servers for all Office 365 services, to provide the best experience for your earthlings trying to access their cloud-based email. This obviously raises the question "Which names do I need to exclude?", and that's where the following TechNet pages come in…

    Office 365 URLS and IP Address Ranges

    http://onlinehelp.microsoft.com/en-us/office365-enterprises/hh373144.aspx

    Exchange Online URLs and IP Address Ranges

    http://technet.microsoft.com/en-us/exchangelabshelp/gg263350

    RSS Updates for URL and IP Address Range Changes

    http://go.microsoft.com/fwlink/?linkid=236301

     

    Exactly how you engineer your bypass list is specific to your network topology and technology.
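    As a rough illustration of the kind of logic a PAC file or bypass list implements, here is a minimal Python sketch. The suffixes below are placeholders only – build the real list from the TechNet pages above and keep it current via the RSS feed.

```python
# Illustrative suffixes only -- build the real list from the Office 365
# URL documentation and keep it current via the RSS feed for changes.
BYPASS_SUFFIXES = (".microsoftonline.com", ".outlook.com", ".lync.com")

def should_bypass_proxy(hostname):
    """Mimics PAC-file logic: Office 365 hosts go direct, all else via the proxy."""
    return hostname.lower().endswith(BYPASS_SUFFIXES)

print(should_bypass_proxy("pod51004.outlook.com"))  # True -> go direct
print(should_bypass_proxy("www.example.com"))       # False -> use the proxy
```

    Whether you express this as a PAC file, firewall rules, or proxy configuration depends entirely on your topology, but the matching logic is the same suffix test.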

  • Outlook Performance Troubleshooting including Office 365

    I have been involved in a number of discussions recently regarding Outlook performance troubleshooting in the cloud. Mostly these discussions were in the context of why the customer didn't want to move to the cloud, since they figured it would be impossible to troubleshoot Outlook performance afterwards.

    Discussion Summary:

    When we have clients and Exchange on terra firma we can monitor performance counters such as RPC Averaged Latency on Exchange, and use the Outlook and client performance counters to establish whether a poor end-user experience is being caused by the Exchange server, the network or the client machine. If we move the messaging service out to the Office 365 cloud we can no longer monitor RPC Averaged Latency, so we don't know whether poor performance at the client is being caused by the network or the Exchange server.

    Outlook Performance

    This started me thinking about how to deal with this situation and what items make up the client experience from an Outlook performance perspective.

    The following items can each have a fairly dramatic effect on Outlook client performance, and either could cause the end customer to pick up the phone to support and say that "E-mail is slow".

    • MAPI RPC Latency
    • Client system performance

    If we make the assumption that our service is running in the Office 365 cloud, how do we go about determining the actual cause of Outlook performance problems?

    MAPI RPC Latency

    RPC latency is made up of two parts:

    • Server side RPC processing
    • Round-trip-time Network Latency

    Network latency is probably the easiest to examine on the surface, since we really just need to use ping.exe to find out what our network round-trip-time (RTT) value is to the target server. There is a snag though, as you might expect…

    Here is the ping response from my Office 365 server…

    Not exactly useful since ICMP Echo is blocked at the external firewall. So, if we can't use ping.exe how do we determine our Network RTT latency? Well, luckily Outlook has us covered here and keeps a track of some stuff that can help us out…

    In your task tray you should see an Outlook icon. If you hold CTRL and right-click this icon it will show the Outlook context menu…

    From the Outlook Context menu select "Connection Status"

    In the Connection Status dialog box find the columns called Avg Resp and Avg Proc. The difference between these two values represents the network latency for each connection.

    In this example you can see that I have two logical connections listed as Mail (to see the physical TCP connections use TCPView). This is normal for a cached mode Outlook 2007+ client. One connection is used for item synchronisation and the other is reserved for sending new messages. This architecture prevents the sending of a large message from blocking Outlook from receiving new items, as could happen in Outlook 2003.

    Generally speaking the connection with the larger Req count is for synchronisation which is the one we will use in this example.

    • Avg Resp is 87ms
    • Avg Proc is 10ms

    This means that my network RTT is 77ms and the server-side RPC processing latency is 10ms.

    In my example this makes perfect sense since I am based in the UK and my mailbox is hosted on Office 365 in North America. It also shows that my Network and Server latency are within acceptable limits.

    Generally speaking I use the following recommendations to maintain a good client experience in Cached mode for Outlook 2007 and later.

    • Max Avg Proc Time (Exchange RPC Latency) = 25ms
    • Max Network RTT Time (Network Ping Time) = 300ms
    • Max Avg Resp Time (Exchange RPC Latency + Network Latency) = 325ms

    Once armed with these values it is possible to direct troubleshooting more specifically. For example, if Network RTT is high you could look at your network links or firewalls. If the Avg Proc time is high then a call to Office 365 support might be in order.
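    To make the arithmetic concrete, the triage logic above can be sketched like this (thresholds are taken from the recommendations above; the function name is just for illustration):

```python
# Thresholds from the recommendations above (cached mode, Outlook 2007+).
MAX_AVG_PROC_MS = 25    # Exchange RPC processing latency
MAX_NET_RTT_MS = 300    # network round-trip time

def triage_outlook_latency(avg_resp_ms, avg_proc_ms):
    """Split Avg Resp into its network and server components and flag the culprit."""
    net_rtt = avg_resp_ms - avg_proc_ms
    suspects = []
    if avg_proc_ms > MAX_AVG_PROC_MS:
        suspects.append("server")    # a call to Office 365 support might be in order
    if net_rtt > MAX_NET_RTT_MS:
        suspects.append("network")   # look at your network links or firewalls
    return net_rtt, suspects

# The Connection Status example above: Avg Resp 87ms, Avg Proc 10ms.
print(triage_outlook_latency(87, 10))   # (77, []) -- both within limits
```

    Feed in the Avg Resp and Avg Proc values from the synchronisation connection and the empty (or not) suspects list points you at the next troubleshooting step.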

    One final point here is to check the Req/Fail column. A high value for Fail represents a high number of network disconnection events. If this is combined with a high Avg Proc time it potentially points to a service issue in Office 365; however, if Avg Proc is good then it suggests that you may have a network connectivity problem between the client and the service. A common cause of this is source port exhaustion in environments with more than 2,000 users.

    Client System Performance

    So what happens if the Network and Exchange RPC metrics are all good but the end customer is still experiencing poor Outlook performance? Since we have ruled out Network and Exchange performance the most likely culprit is the client workstation.

    So what could be causing Outlook performance problems on the local workstation?

    For this we need to look at the usual trinity of performance areas within the operating system

    • CPU
    • RAM
    • DISK I/O

    To take a look at these further I am going to use Process Explorer.

    Client CPU

    Outlook is generally not that CPU intensive, however if your CPU is flat out doing other stuff then Outlook will respond slowly. To check this, open Process Explorer and arrange the table in descending CPU order.

    We are looking for a few things here. Firstly what is our System Idle Process value? This tells us how much CPU time we have spare. Generally speaking if this value is less than 20% the system will feel sluggish. In my example you can see that I have plenty of CPU time available and so it is unlikely CPU is an issue here.

    If Outlook appears at or near the top of this list then the most likely culprits are a faulty add-in (try running Outlook in safe mode) or a damaged OST file (try running the Inbox Repair Tool).

    To get a better idea of how Outlook is consuming resources, find OUTLOOK.EXE in the Process list, double click it and then open the Performance Graph tab in the properties dialogue box.

    This will show some historical values for CPU Usage for the Outlook process. Even a large Mailbox (mine is 10GB) shouldn't require Outlook to take up a large amount of CPU time.

    Client RAM

    Insufficient RAM has a number of effects on Outlook. Firstly the process can be starved of physical RAM and so run slowly; secondly the operating system will have to page large chunks of memory to disk which will cause disk I/O problems. Since we are going to look at disk I/O in the next section, I will just look at identifying client memory problems here.

    It is important to realise that Windows will page out an amount of memory to the page file, and this is both normal and advantageous. However, where the system has significantly more committed memory than it can accommodate in physical RAM, we may run into performance problems: when a process accesses data in virtual memory that has been paged out, it has to wait for that data to come back from disk. This is known as hard paging. A sustained high level (>5 per second) of hard page faults is a strong indicator that there is not enough RAM in the system.

    Unfortunately Process Explorer can't help us here since it shows a combination of hard and soft paging. For this task we are going to need to break out Performance Monitor (perfmon.exe).

    Open Perfmon and then add the Memory\Page Reads/sec counter.

    You can see clearly that this system has to retrieve data from the page file frequently. In fact this is from a virtual machine running Windows 7 and Outlook 2010 in 512MB RAM with a 5GB mailbox. Almost every single action performed within Outlook triggers a spike in Page Reads/sec. The user experience is very slow, however if we look at the Exchange Connection status for this client…

    • Avg Resp = 4ms
    • Avg Proc = 0ms

    This clearly shows that the poor user experience is being driven by the client and not by the server or network, and more importantly that a bit of extra RAM is likely to make this customer happy. Clever, right? :)
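    The hard-paging rule of thumb above can be sketched in a few lines of Python. The sample values are invented for illustration; in practice you would feed in Memory\Page Reads/sec samples collected by Perfmon.

```python
def ram_pressure(page_reads_per_sec):
    """True if sampled Memory\\Page Reads/sec values indicate sustained hard paging."""
    avg = sum(page_reads_per_sec) / len(page_reads_per_sec)
    return avg > 5   # a sustained average above ~5/sec suggests too little RAM

print(ram_pressure([0, 1, 0, 2, 0]))       # False -- normal background paging
print(ram_pressure([40, 55, 38, 60, 47]))  # True -- every action hits the page file
```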

    Client Disk I/O

    This is a bit of a soapbox of mine at the moment. As Exchange professionals we go to great lengths to monitor our messaging service on the basis that we want to provide the best user experience. However, the reality is that the most likely cause of poor user experience is accessing a large Outlook OST file stored on an underperforming client system. Over time these OST files generally reach roughly double the size of your mailbox. If we take Office 365 as an example with a 25GB quota, this means that it's not impossible for a user to have a 50GB OST file on their laptop HDD. Let's think about that for a minute… we have a 50GB file, with data in it that we need to access quickly – if that was a Word document or Access database most users would accept a minute or two's delay as it was opened and yet we expect Outlook to open it in 5 seconds or we think something is broken :)

    Speedy access to a large OST file relies on two things…

    • A fast HDD that isn't dealing with too much other stuff
    • A mostly contiguous OST file

    HDD Usage

    If the hard disk drive is busy doing other things then Outlook in cached mode will perform slowly and deliver a poor user experience.

    Physical Disk\Disk Queue Length

    This is a measure of how many requests are waiting for the disk. Ideally the disk queue length should be no more than 1.5-2 times the number of disks that make up the volume. For most client workstations this means that the disk queue length should be <2. As a general rule, don't give older laptops with 5400rpm HDDs very large mailboxes – i.e. don't just give everyone a 25GB mailbox without checking whether their hardware is going to cope :)
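    As a quick sketch of that rule of thumb (assuming the 2x multiplier; adjust the factor to taste):

```python
def disk_queue_ok(avg_queue_length, spindles=1):
    """Rule of thumb above: queue length should stay below ~2x the spindle count."""
    return avg_queue_length < 2 * spindles

print(disk_queue_ok(1.2))              # True  -- a single laptop HDD keeping up
print(disk_queue_ok(5.0))              # False -- requests are backing up
print(disk_queue_ok(5.0, spindles=4))  # True  -- fine for a 4-disk volume
```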

    Contiguous OST File

    Since Outlook makes frequent reads and writes to the OST file it can become very fragmented over time. A heavily fragmented OST file (>1000 frags/file) can lead to poor Outlook client performance.

    The easiest way to check and defragment the OST file is via the Sysinternals tool contig.exe. First close any applications that may be accessing the OST file, such as Outlook or Lync.

    To check the OST file for fragmentation use the command:

    • contig.exe -a <path to OST file>

    Note: You must be running contig.exe with elevated privileges to perform defragmentation.

    To defragment the OST file use the command:

    • contig.exe <path to OST file>

    Conclusion and further reading

    Outlook performance is just like any other application. It relies on CPU, Memory, Disk and Network. If any of those resources are performing badly then the end customer is likely to experience poor performance. Exactly the same troubleshooting processes apply on-premises or for Office 365 users. The only real difference that applies to Office 365 performance troubleshooting is that we cannot directly observe Exchange performance counters and so we need to rely on the data that Outlook provides us.

  • Performance Monitor Tips and Tricks with John Rodriguez

    Hello everyone … My name is John Rodriguez, and I'm a Principal PFE based in Minneapolis, MN. I'm also one of Neil's frequent collaborators, particularly on load simulation tools. When he wrote his recent article on performance analysis of Jetstress data, I suggested that he include some of our time-saving tips. He's a busy chap, and asked me to write the article instead. So, to that end, I'm here as a guest contributor to share with you some of the things we as performance specialists do within Perfmon. [Neil: Actually John taught me Exchange perf when I first joined Microsoft in 2007 and again when I went through MCM; it is a great honour to have him write stuff for my blog, so thanks again John :)].

    If you spend a lot of time in Perfmon, whether for general performance analysis, reviewing Jetstress data, or trying to identify the source of a bottleneck, you tend to look for ways to automate tasks – to simplify things. Thankfully Perfmon provides some extremely useful interfaces which provide us with our time-saving opportunities.

    When you first launch Performance Monitor, and switch to the Performance Monitor node, you should see the general screen with a single counter (usually Processor(_Total)\% Processor Time). In the example below, I've deleted the default counter and added four specific ones of great Exchange importance: MSExchangeIS\RPC Averaged Latency, \RPC Operations/sec, \RPC Packets/sec, and \RPC Requests.

    How did I add those counters? Believe it or not, I pasted them. Pasted? Into Perfmon? Really? Yes, really. Here's how.

    When you first launch Perfmon and want to add counters, you can click the green Plus symbol, or right-click the display and select "Add Counters…" as shown below.

    Now, if you look at the context menu, you can see an option to "Save Settings As…" This is one of the most underappreciated items in Perfmon, and enables a lot of incredibly useful behavior. "Save Settings As" predictably enough saves the existing set of counters to an HTML file which you can then open and edit to your heart's content. If you right-click the file and select Open With > Notepad, you should see a bunch of entries beginning with <PARAM NAME="Counter#" and then details on the counters themselves, like so:

    [It's important to select Open With > Notepad, rather than double-click on the file. We'll see why in a minute.]

    Notice that there's no server name in this data – the value is just the counter name – "\MSExchangeIS\RPC Averaged Latency", "\MSExchangeIS\RPC Packets/sec", and so on. This means that this set is rather portable – to our great advantage. I can use this information to automatically add a whole group of counters to Perfmon at once. But we're not going to load them from the file – we're going to paste them into Perfmon! This may sound odd, but Perfmon actually supports copy-and-paste. If I want to add all of those counters at once, I can simply open that HTML file with Notepad, copy the contents (select all, copy), open Perfmon, click anywhere in the right-hand pane, and then press Ctrl+V, and Perfmon will add all of those counters at once. In other words, if you add the counters to one server and save the settings to HTML, you can quickly add all of those counters for any other server! This is extremely useful, but we're only just getting started.

    Now that we have our settings file (in my case, "rpc counters.htm"), we can perform a few little tricks. First, since it's HTML, we can actually open this settings file with Internet Explorer. Unless you've weakened the security settings for IE, you'll need to click "Allow blocked content", but do that and you see Perfmon embedded within Internet Explorer:

    Notice that Perfmon includes not just the counter list I added before, but visible data points as well (which was saved in the settings file). But more importantly, this is a fully functioning instance of Perfmon. Notice the two green buttons in the menu bar at the top of the Perfmon instance – the one on the left (similar to the "Play" button on a CD or DVD player) switches the display to live data, while the second (the "Skip Forward" button) leaves the display paused but updates the data to the present moment.

    The green "Play" button has changed to a blue "Pause" button, and many of the other buttons are now enabled as well (including add and remove counters).

    "This is nice," you may say to yourself, "but what good does it do?" Well, for one thing, you can select a specific set of counters, save the settings file, and then distribute the settings file to your teammates so that they can open that same set of counters. This is very useful if you have a large set of counters and want to make sure that everyone's looking at the same data.

    Where the "Save Settings As" option really becomes useful is when you're working with performance logs (BLG files). In this case, I've captured performance data using a custom Data Collector Set, and the results of the data are saved in C:\Perflogs\Admin\New Data Collector Set\DC1_datetime.log. I switched back to the Performance Monitor tab, and selected View Log Data from the action menu (you can also press Ctrl+L to do the same thing).

    On the Data tab, I select all four counters from the log file and add them into the list to display.

    Because I don't have very much data in the BLG, the resulting display isn't terribly exciting. But it's not the display that I'm concerned with here – it's what I can do with the data I selected. Again, I right-click and select "Save Settings As" and save the list of counters to another HTML file. [Side note: when you select "Save Settings As" when viewing data from a BLG, other options become available, including Save Data As, which gives you the option to reduce the size of an existing BLG by saving only a subset of data points.]

    However, when you open this second HTML file, you'll notice that things look a little different. The counter names are prefaced by the server name:

    My lab machine is named DC1, but if I want to use this on a performance log I collected from DC2, all I need to do is a simple find-and-replace in the HTML file, open the appropriate saved log (View Log Data), and then perform the same copy/paste trick listed above. Again, Perfmon will add all of the relevant counters directly into the display. That means that you can create the list of counters once and use that same list for every single BLG you want to review. Just change the name of the server within the HTML file for each server, then paste the contents.
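    The find-and-replace itself is trivial to script if you review a lot of BLGs. As a sketch (the PARAM line below is a simplified stand-in for the real settings file content, not its exact format):

```python
def retarget_counters(settings_html, old_server, new_server):
    r"""Rewrite \\OLDSERVER\ counter path prefixes in a saved Perfmon settings file."""
    return settings_html.replace("\\\\" + old_server + "\\",
                                 "\\\\" + new_server + "\\")

# Simplified stand-in for a counter entry in the settings file.
line = r'<PARAM NAME="Counter00001.Path" VALUE="\\DC1\MSExchangeIS\RPC Averaged Latency"/>'
print(retarget_counters(line, "DC1", "DC2"))
```

    Run your whole settings file through this, paste the result into Perfmon against the DC2 log, and you have the same counter set with zero manual editing.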

    To summarize, Perfmon allows you to save and load sets of performance counters, and you can use this functionality to make your performance life a little easier. By saving the settings to disk, you can ensure that you use the same list every time, even if you've closed Perfmon. You can launch Perfmon from within Internet Explorer using a saved settings file, which helps ensure consistency so that everyone uses the same counters. Last, you can use a settings file as a template to view the same set of counters from different BLG files from different servers.

    Hopefully these tricks help you become more efficient in your use of Perfmon. Let us know in the comments!

  • Analysing Exchange Server 2010 Jetstress BLG Files By Hand

    Quite often when I am working through Jetstress escalations I will ask to see the BLG files from the test. These files contain performance counter information that was logged during the test run and they show us a lot more about what is really going on than the data in the report HTML files.

    To keep this post short, let's assume that you have already read through the Jetstress Field guide and move on to the fun stuff straight away :)

    Why Analyse the BLG file by hand?

    Given that Jetstress already parses the BLG files and searches for various counters and values, why on Earth would we ever need to look in the BLG file ourselves? There are a number of reasons…

    • Establish failure severity

      Was the failure caused by a single event, or do the performance logs show prolonged and repetitive issues? This is often useful when trying to put together a resolution plan – can we fix this by adding a few extra spindles or are we going to need a significantly bigger storage solution?

    • Analyse failure mode scenarios

      When we test for failure modes, such as disk controller failure, the test may fail due to one very high latency spike; however, as long as the test resumes in an acceptable timeframe and performance is acceptable, we would conclude that the test passed even though Jetstress reported it as failed.

    Finding the Jetstress test BLG File

    Jetstress defaults to storing all test data in the folder in which it was installed. When you look in that folder you are looking for files of type Performance Monitor File. Given that you are probably not going to perform this analysis on the Exchange server, make a copy of the file to your workstation.

    Hint:

    BLG files are generally very compressible, so if you need to copy them over a WAN it is worth compressing them first.

    Opening the Jetstress test BLG File in Performance Monitor

    For this section I am going to assume that you have access to a Windows 7 workstation. One of the most frequent mistakes that I see people make at this stage is to simply double-click on the BLG file… this will automatically open the trace file in Perfmon and load the top 50 counters… firstly this takes ages, and secondly it looks like this… which as you can see is pretty busy and pretty useless…

    So, instead we are going to open Performance monitor and then open our BLG files. Performance monitor is stored in the Administrative Tools section of your start menu…

    • Start -> All Programs -> Administrative Tools -> Performance Monitor

    Once Performance Monitor has started…

    • Select Performance Monitor under the Monitoring Tools section
    • Click on the View Log Data icon

    • Check the Log Files radio button, then click Add
    • Select the Jetstress BLG that you want to analyse
    • Click OK

    You should now be looking at a totally empty Performance Monitor page…

    Analysing the Jetstress Performance Data

    Now we have Performance Monitor open and our Jetstress BLG file attached, the next step is to begin looking at the data. Before we begin this it is worth a quick recap of the counters and objects that we are interested in during an Exchange 2010 Jetstress test.

    The best place to read about Exchange 2010 performance counter values is here:

    The following values are of specific interest when analysing Jetstress performance files.

    Counter: MSExchange Database Instances(*)\I/O Database Reads (Attached) Average Latency
    Description: Shows the average length of time, in ms, per database read operation.
    Threshold: Should be <20ms on average, with fewer than 6 spikes >100ms.

    Counter: MSExchange Database Instances(*)\I/O Log Writes Average Latency
    Description: Shows the average length of time, in ms, per log file write operation.
    Threshold: Should be <10ms on average, with fewer than 6 spikes >50ms.

    Counter: MSExchange Database(JetstressWin)\Database Page Fault Stalls/sec
    Description: Shows the rate at which database file page requests require the database cache manager to allocate a new page from the database cache.
    Threshold: Should be zero. A nonzero value indicates that the database isn't able to flush dirty pages to the database file fast enough to make pages free for new page allocations.

    Counter: MSExchange Database(JetstressWin)\Log Record Stalls/sec
    Description: Shows the number of log records that can't be added to the log buffers per second because the log buffers are full. If this counter is nonzero for a long period of time, the log buffer size may be a bottleneck.
    Threshold: The average value should be below 10 per second, and spikes (maximum values) shouldn't be higher than 100 per second.

    Examples

    The following are from some recent example tests.

    Normal Jetstress test results

    Before we move on to more interesting data I thought it would be useful to show what good test data looks like…

    Database Read Latency

    The following chart shows the MSExchange Database Instances (*)\I/O Database Reads (Attached) Average Latency values for the test. In this instance it is clear that all instances average is below 20ms and there are no read latency spikes. I have discarded checksum instances since they are not required. The _Total instance is highlighted in black and is discussed below.

    I often see people quoting the _total instance for database read latencies. The _total instance is simply an average across all observations for that point in time and so serves no purpose other than to obscure the latency peak values.

    The default Jetstress sample time is 15s, which is already too large in my opinion (I recommend dropping this down to 2s for manual analysis – edit the XML to do this), so using an average value across all instances usually serves to make the results look better than they really are.

    In the example above I have highlighted the _Total instance to show how it flattens out the results. Do not use the _total instance; it will often hide latency spikes from your results.
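    A toy example makes the flattening effect obvious. The numbers below are made up, but the arithmetic is exactly what an averaged cross-instance series does to a spike:

```python
# Toy latency samples (ms): instance db2 has a 120ms spike at the third interval.
samples = {
    "db1": [5, 6, 5, 7],
    "db2": [6, 5, 120, 6],
    "db3": [4, 5, 6, 5],
}

# A _Total-style series: the mean across instances at each point in time.
total = [sum(point) / len(samples) for point in zip(*samples.values())]

print(max(max(s) for s in samples.values()))  # 120 -- per-instance data keeps the spike
print(max(total))                             # ~43.7 -- averaged below the 100ms threshold
```

    A 120ms spike that would fail the test per instance shows up as roughly 44ms in the averaged series, comfortably under the 100ms spike threshold.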

    Log Write Latency

    This chart shows the MSExchange Database Instances (*)\I/O Log Writes Average Latency. As you can see the results show that the write latency values are way below the 10ms average and there are no spikes.

    Log Record and Page Fault Stalls

    This chart shows both Log Record Stalls/Sec and Database Page Fault Stalls/Sec. As you can see both counters recorded 0 stalls/sec during the test which shows that the storage was able to meet the demands required by the database.

    Failure Mode Test – Disk Controller Failure

    The following test data came from an Exchange 2007 deployment where the customer was performing failure mode analysis of their storage. This specific test simulated a failure of a disk controller within their storage subsystem. The expected behaviour is to experience a brief storage outage and then for I/O to continue as normal. Manual analysis was required for this test since Jetstress just shows a failure due to average latency being >20ms.

    The following charts show the recorded BLG data; they clearly show that the test was actually in good shape apart from a single event caused by the simulated failure of the disk controller module. This event caused an I/O outage of 60 seconds, which was recorded as 4 x 20,000ms spikes on the chart. These few, very high values were enough to skew the average latency values for the test. In reality this test showed that the storage is capable of recovering from a disk controller failure in an acceptable time, and that performance after the failure is the same as it was before.

    Although Jetstress reports this as a failed run, I classified this as a test pass since the storage solution recovered from the failure quickly and resumed operations at the same level of performance. The operations team are now aware of what a failure of this module will look like for Exchange and how long it will take to recover.

    Figure 1: Database Read Latency

    Figure 2: Log File Write Latency

     

    Jetstress Log Interval Time Case

    This is an interesting example since the initial Jetstress test actually passed the storage solution, however the team involved were suspicious since they had previous experience with this deployment and knew that it reported storage problems when analysed via SCOM or the PAL tool. When Jetstress passed the storage they began further analysis…

    This is the DB read latency from their first test with the Jetstress log interval set to the default of 15s. Nothing unusual to report and Jetstress passed the test. The chart looks good and the maximum latency values are all below 100ms.

    The team were concerned by this and decided to reduce the log interval time within the Jetstress test XML file.

    <LogInterval>15000</LogInterval>

    Was changed to

    <LogInterval>1000</LogInterval>

    This reduced the sample interval time from 15s to 1s. The team then re-ran the test.

    This chart shows the test data for the run with the LogInterval reduced to 1s. It is clear that something is different – the average values seem fine, but the maximum values are way over 100ms, suggesting latency spikes. Jetstress failed this run due to disk latency spikes.

    If we zoom in on this chart to get a closer look, it is evident that there are a significant number of read latency spikes over 100ms during this 15-minute window. Further analysis suggests that these spikes occur throughout the test. The initial run, with the log interval set to 15s, had totally missed them.

    Just in case my point about using the _Total instance wasn't clear earlier in this post, I decided to add it in and highlight it on this example. Even though it is clear that this test is suffering from significant read latency spikes, the _Total instance (highlighted in black) smooths this out and hides those 100ms+ spikes. If you were only looking at the _Total instance you would have missed this issue, even with the reduced LogInterval time.

    Conclusion and summary

    Manual analysis of Jetstress test data is not always required, however it is often useful, especially if the storage platform is new to your organisation and you want to get a better understanding of how it is performing under load.

    Manual analysis of test data during failure mode testing is very highly recommended. It is critically important that you understand how component failure will affect your storage performance under normal working conditions. The only way to do this is to look at the BLG data that was logged during these tests and assess whether the behaviour is acceptable for your deployment.

    The default log interval time of 15s can mask instantaneous latency spikes. In most test cases this is not a problem; however, I recommend that you reduce the log interval time to 2-5s where more granular data logging is desired. This is especially useful if you are using a shared storage platform or some form of advanced storage solution (direct attached storage rarely requires this reduction in sample interval).

    Note:

    Reducing the LogInterval time in your Jetstress configuration XML file will significantly increase the size of your BLG file. A 2hr test at 2s interval will create an 800MB BLG file. This also has an impact on your ability to work with the data. Performance monitor can suffer from significant slowdown with very large BLG files.
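    If you want a rough feel for the trade-off, BLG size scales roughly linearly with the number of samples, so you can extrapolate from the 2hr/2s/800MB figure above. This ignores differences in the counter set being logged, so treat it as a ballpark only:

```python
def estimate_blg_mb(test_hours, interval_s):
    """Scale from the figure above: a 2hr test at a 2s interval produces ~800MB.

    Assumes the same counter set as the reference test; sizes scale
    roughly linearly with the number of samples logged."""
    ref_samples = 2 * 3600 / 2              # samples in the 2hr @ 2s reference test
    samples = test_hours * 3600 / interval_s
    return 800 * samples / ref_samples

print(round(estimate_blg_mb(2, 15)))   # ~107MB at the default 15s interval
print(round(estimate_blg_mb(2, 2)))    # 800MB, the reference case
```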