• How to analyse Application level performance for Outlook and SharePoint online

    If we've stepped through all the network level checks and all looks good from that perspective, then we need to move up the stack to the application itself and see if something above the network is causing performance issues.

    This can prove tricky with Office 365 as the information is almost always encrypted within an SSL session, however there are a number of methods we can use to look at how the application itself is performing and how long the requests we send to the remote server are taking to get a response. From this we can see if we've got a problem with the client or on the Datacenter side.

    When we're working on-prem this is a lot easier. We can normally match up a request to a response as long as it's in the clear by using a network packet sniffer such as Netmon or WireShark, so we'll be able to see an RPC call to its response in Outlook and a HTTP GET to its response in SharePoint. When these requests are encrypted then that becomes impossible. So how do we do it? Well the methods vary with each product:

    I've tried my best to make the images viewable without clicking out, any smaller ones should have a link to view full page if you need though.

    Outlook

    Outlook performance can be a tricky one at the best of times, especially so when using HTTPS.

    We can however use the inbuilt connection status tool to look at the performance. Ctrl + Right click the Outlook icon in the bottom right of your task bar and click on 'Connection Status'. This will give you a whole heap of information on Outlook's connectivity.

    Here you can see the Outlook 2013 output (the format and content varies with versions of Office)

    From here I can see how many connections to Exchange I've got, the type, and some information on the RTT and processing time (Avg Resp & Avg Proc respectively). We can use these two values together to see the RTT as measured by Outlook. If we take the cached connection with 4963 requests then we have the following:

    • Avg Resp: 29
    • Avg Proc: 6
    • Avg Resp shows us the RTT measured by Outlook.
    • Avg Proc shows us the processing time, how long the RPC processing latency is, how long the server took to construct the response. If this is high it indicates a problem on the server side.

    By subtracting 6 from 29 here, we get the network latency, which is 23ms.
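
    As a minimal sketch of that arithmetic (the function names are my own and purely illustrative, not anything Outlook exposes), the two Connection Status values can be split like this:

```python
def network_latency_ms(avg_resp_ms: float, avg_proc_ms: float) -> float:
    """Estimate network latency as Avg Resp minus Avg Proc (both in ms)."""
    if avg_proc_ms > avg_resp_ms:
        raise ValueError("Avg Proc should not exceed Avg Resp")
    return avg_resp_ms - avg_proc_ms


def dominant_delay(avg_resp_ms: float, avg_proc_ms: float) -> str:
    """Say whether server processing or the network accounts for more of the time."""
    latency = network_latency_ms(avg_resp_ms, avg_proc_ms)
    return "server" if avg_proc_ms > latency else "network"


# Values from the cached connection in the example above.
print(network_latency_ms(29, 6))  # 23
print(dominant_delay(29, 6))      # network
```

    A high Avg Proc relative to the remainder would point at the server side, as described in the bullets above.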

    To confirm this I can use PSPING to connect to the mailbox and this shows an average of 20ms.

    A great blog by a Microsoft colleague here describes these tests in more (and better) detail, but this is a good test with inbuilt tools to show if we have latency or a delay on the Exchange side. The blog also outlines some great steps for looking at client and Outlook performance in other ways to see if there are any issues there.

    I normally also take a network trace whilst starting Outlook, then whilst performing actions such as opening a new mailbox, switching calendars, sending a large mail and then analysing the traffic for symptoms described in my blog post.

    If you're running a newer version of Windows, you can also use Resource monitor to get a view on your round trip time for Outlook connections.

     

    SharePoint:

    There are a number of tools we can use to look at the page load performance within the browser.

    If you're using IE then the inbuilt tools are a good starting point, especially with the newer versions of the browser.

    By hitting F12 and using the inbuilt tools to trace the page load, we get information on each element of the page and how long it took to load and how big it is.

    Here we see the URLS opened and the HTTP response code.

    IE F12 Tools:

     

     

    And over to the left of the same screen we get more information on how long it took for this to complete.

     

    Fiddler:

    Fiddler is a tool which inserts itself in front of the browser and allows us to capture encrypted requests in the clear. It shows us the time it takes for each request to complete and allows us to spot any problem elements of the page which are slowing us down.

    Here you can see information similar to that of the F12 tool but with more data.

     

    Over to the right, when we click on a URL, we see detailed information on how long each stage of this connection took. For example we have information on how long it took to get the server response, and to complete this response. On a slow loading SharePoint page, this sort of information is enough to give us an indication of how quickly we got the initial response and subsequently all the data.

     

    If we select multiple URLs and click on Timeline we can see a graphical view of how long each stage took.

     

    This is just intended as an introduction to the tool, the help file is pretty good as is the support community on the website, and there is a book if you're keen. As this is a third party tool (i.e. non Microsoft) I can't vouch for it but I know we use elements of the tool in our new Message Analyzer tool.

    HTTPWatch

    My personal tool of choice, however, is HTTPWatch. The basic edition is free and works with most browsers; you'll have to buy the full version if you want the extended features, but it is well worth it if you do this on a regular basis and your boss will stump up for it!

    This essentially acts as a proxy in front of your browser and allows it to see the elements of the page as they load and for me is the easiest to use and understand whilst giving me some great information on what the performance is like. Again, this is a third party tool and I can't vouch for it but we do use the full version within Microsoft and I personally use it extensively.

    I'll use IE as the example browser here but the tool also works with Firefox on Windows. Once installed, if you hit F4 to get the menu up, you should see HTTPWatch as an option. Click that and a window should open up at the bottom of the tab.

    Hit record then enter your URL. Here I opened my test SharePoint page in Office 365. You can see clearly in the timechart which section took the longest to load (the one highlighted took 0.9 seconds) and if you had a poorly performing section of the page, this would be as clear as day in this timechart. I've also hovered over the green line which indicates when the page's rendering started in IE. So I can clearly see here that at 1.4 seconds after entering the URL the page was visible to the user (although some elements were still coming in, in the background).

    In addition to troubleshooting slow elements of a page, you can also use this green line to measure a baseline of page load times, either for comparative purposes to an on-prem solution or perhaps before and after a network or page structure change. It's also useful to be able to compare page load performance from different sites.

    If I then click on the time chart for that URL that took 0.9 seconds I can see where that time was spent in more detail. Here we can see the connection and SSL handshake took no time at all, but we spend most of our time waiting for a response from the server. Once we get the response, we receive the data in 0.1 seconds. In this example, 0.7 seconds waiting isn't too long a time, but this information gives us some great ideas on where the problem is. If Receive was longer than expected, then perhaps we've got a slow network, or one of the other network tuning issues in my blogs is causing it to take a long time. Let's imagine Wait is the longest (as in this example) but taking 10 seconds. This would indicate to me that perhaps the SharePoint server is taking a long time to construct a response. Have a look at the URL: what is it doing? Is it a poorly performing script or similar?
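
    To make that reasoning a bit more concrete, here is a rough sketch of how you might triage per-request timings once you've copied them out of the tool. The phase names and the dictionary layout are my own assumptions for illustration; this is not an HTTPWatch API or export format.

```python
# Hints taken from the reasoning in the paragraph above; other phases get no hint.
HINTS = {
    "wait": "server may be slow to construct the response, or the request was retransmitted",
    "receive": "possible slow network or a network tuning issue",
}


def slowest_phase(timings: dict) -> tuple:
    """Return (phase, hint) for the phase with the largest time in seconds."""
    phase = max(timings, key=timings.get)
    return phase, HINTS.get(phase, "no specific hint")


# Illustrative values based on the example request: 0.7s waiting, 0.1s receiving.
example = {"connect": 0.0, "ssl": 0.0, "wait": 0.7, "receive": 0.1}
phase, hint = slowest_phase(example)
print(phase, "-", hint)
```

    Treat the output as a pointer for where to look next rather than a diagnosis.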

    Alternatively this could have been caused by packet loss, perhaps the server didn't get my request for 9 seconds as we had to retransmit it? As we're using the professional edition we can see the local TCP port used for this connection in the columns and thus, we can (and I often do) take a simultaneous network trace and I can use this port information to isolate the TCP session that this GET request correlates to and look at the network performance. If there are retransmits, even with SSL they will be visible in the network trace, you just won't be able to see what call it was which was retransmitted.

    So, there is an array of tools which enable us to troubleshoot and baseline the application layer. I've only scratched the surface of what the tools can do, but hopefully it gives you at least a starting point to look at application layer performance with O365 (or on prem for that matter).

    Paul

  • DNS geolocation for Office 365, connecting you to your nearest Datacenter for the fastest connectivity

    One of the main things we need to get right to ensure the most efficient and speedy connectivity to O365 is where in the world your DNS call is being completed. You'd think this wouldn't matter, you do a DNS lookup for your O365 tenant, get the address then connect right? Well, normally yes, but with O365, especially with Outlook, we do some pretty clever stuff to utilise our worldwide array of datacenters to ensure you get connected to your data as efficiently as possible.

    Your Outlook connection will do a DNS lookup and we use the location of that lookup to connect you to your nearest Microsoft Datacenter. With Outlook we'll connect to a CAS server there and use our fast Datacenter to Datacenter backbone network to connect you to the Datacenter where your Exchange servers (and data) are located. This generally works much quicker than a direct connection to the Datacenter where your tenant is located due to the speed of the interconnecting networks we have.

    http://technet.microsoft.com/en-us/library/dn741250.aspx outlines this in more detail but a diagram nicked from this post shows how this works for Outlook/Exchange connectivity when the Exchange mailbox is located in a NA datacenter but the user is physically located in EMEA. Therefore the DNS lookup is performed in EMEA, we connect to the nearest EMEA datacenter, which then routes the connection through to your mailbox over our backbone network, all in the background and your Outlook client knows nothing about this magic going on behind the scenes.

     

    If your environment is making its DNS calls in a location on a different continent to where the user is physically located then you are going to get really bad performance with O365. Take an example where the user and Mailbox are located in EMEA. Your company uses DNS servers located in the USA for all calls, or the user is incorrectly set to use a proxy server in the USA, thus we're given the IP address of a USA based datacenter as that's where we think your user is located. The client will then connect to the USA based datacenter which will route the traffic to the EMEA datacenter which will then send the response back to the USA based datacenter which will then respond to the client back in EMEA. So with this scenario we've got several unnecessary trips across the pond with our data.

    It is therefore vitally important to get the DNS lookup right for when you move to Outlook on Office 365.

    So how do you check this? Well it could be a bit tricky as although we release a list of IP addresses used for O365, we don't tell you which ones map to where, for many reasons including the fact they change regularly. Thankfully one of my Microsoft colleagues has shown me an easy way to check you're connecting to a local datacenter.

    All you need to do is open a command prompt on the client and ping outlook.office365.com, and the response will tell you where the datacenter you'll connect to is located. So sat here in the UK at home, I get EMEAWEST.

     

    If I connect to our Singapore VPN endpoint and turn off split tunnelling and force the DNS call down the VPN link (our Internal IT do a great job of making these things configurable for us techies) then I get directed to apacsouth.

    And if I connect via VPN to the mothership in Seattle, my DNS call is completed there and thus I get directed to namnorthwest.

    So it's a quick and easy check, just make sure the datacenter returned is in the same region as you're physically located in.
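
    If you'd rather script this than eyeball a ping, something along these lines might do as a starting point. It's only a sketch: it assumes the region shows up somewhere in the canonical name your resolver returns (as it does in the ping output above), and the sample hostnames in the comments are hypothetical, not a documented naming scheme.

```python
import socket


def canonical_name(host: str) -> str:
    """Return the primary host name the local resolver reports for host."""
    name, _aliases, _addresses = socket.gethostbyname_ex(host)
    return name


def region_looks_local(name: str, expected_tokens: list) -> bool:
    """True if any expected region token (e.g. 'emea') appears in the name."""
    lowered = name.lower()
    return any(token.lower() in lowered for token in expected_tokens)


# Offline examples with hypothetical names:
print(region_looks_local("outlook-emeawest.example.com", ["emea"]))      # True
print(region_looks_local("outlook-namnorthwest.example.com", ["emea"]))  # False

# Live check (needs network access), run from the client you want to test:
# print(region_looks_local(canonical_name("outlook.office365.com"), ["emea"]))
```

    Run it from the client itself, since the whole point is to see what that client's DNS path returns.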

    SharePoint is currently directed to the datacenter where your tenant is located so it doesn't matter so much where the call is made for this (although it should still preferably be local to the user for the portal connection). Lync is slightly different and is outlined in this article in more detail.

    It's also worth ensuring all your clients are using a proxy in the same region as where they are located, as if not, they could hit the problem outlined above and thus be getting unnecessarily poor O365 performance.

  • Checking your TCP Packets are pulling their weight (TCP Max Segment Size or MSS)

    This is a quick one to check to ensure your TCP packets are able to contain the maximum amount of data possible, low values in this area will severely affect network performance.

    Maximum Segment size or MSS is a TCP level value which is the largest segment which can be sent on the link minus the headers. To obtain this value take the IP level Maximum Transmission Unit (MTU) and subtract the IP and TCP header size.

    So for a standard Ethernet connection with minimum size IP and TCP headers we subtract 40 bytes from the 1500 byte standard packet size (minus the Ethernet Header) leaving us with an MSS of 1460 bytes for data transmission.

    So to get the most efficient use of a standard Ethernet connection we want to see an MSS of 1460 bytes being used on our TCP sessions.

    This setting is agreed in the TCP 3-way handshake when a TCP session is set up. Both sides send an MSS value and the lower of the two is used for the connection.
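
    The sums above can be sketched in a few lines. The 20 byte header sizes are the minimum IP and TCP header sizes mentioned earlier, and the 1380 value is just a made-up example of a lower MSS offered by one side:

```python
ETHERNET_MTU = 1500   # standard Ethernet MTU in bytes
MIN_IP_HEADER = 20    # minimum IPv4 header size in bytes
MIN_TCP_HEADER = 20   # minimum TCP header size in bytes


def mss_from_mtu(mtu: int, ip_header: int = MIN_IP_HEADER,
                 tcp_header: int = MIN_TCP_HEADER) -> int:
    """MSS is the MTU minus the IP and TCP header sizes."""
    return mtu - ip_header - tcp_header


def negotiated_mss(syn_mss: int, syn_ack_mss: int) -> int:
    """The lower of the two values sent in the handshake is used."""
    return min(syn_mss, syn_ack_mss)


print(mss_from_mtu(ETHERNET_MTU))  # 1460
print(negotiated_mss(1460, 1380))  # 1380
```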

    It's easy to check this, take a Netmon or Wireshark trace and find the connection you're interested in, Netmon will filter the connections by process on the left hand side for you.

    Once you've found the connection (ensuring you've started tracing before initiating the connection) then you just need to open the first two frames of the connection, the SYN & SYN ACK, indicated by an S followed by an A..S in the description of the frame. To capture the 3-way handshake make sure you start tracing, then start Outlook, or connect to your SharePoint site in a new Browser window.

    Once you've clicked on the first packet, the SYN, then in the frame details down on the bottom, open up TCP Options and the MSS can be clearly seen.

    Here we see the MaxSegmentSize shown as 1460.

     

    Repeat this with the SYN ACK which should be the second frame if you've filtered the connection away from other traffic. The lower of the two values will be your MSS. If it's 1460 then you're configured to use a full sized data payload.

    One caveat to this, it doesn't mean that this value can actually be used, it's possible a network segment along the route has a lower MTU than we're aware of. If this is the case, if all is well we'll get an ICMP message back from the router at the edge of this link when we send a 1460 byte packet with the do not fragment bit set. This packet will tell us what the MTU is on the link and we'll adjust accordingly. However it's always worth checking this value is set to a high value and we can see the TCP payload throughout the trace is at 1460 (on full packets) and hasn't dropped down to a lower value.

    It's common to see this value lower than the maximum of 1460 (for an Ethernet network), if for example we know a network segment along the route has a lower MTU, one with an encryption overhead for example, but the value shouldn't be significantly lower. 576 Byte packets are a sure sign we've hit problems and dropped down to the minimum packet size so keep an eye out for those.

    Also, remember, if you're using a proxy, you'll have to check this both on the client, and a trace on the proxy or NAT device if used as there will be two distinct TCP sessions in use and you won't see the problem if it is beyond the proxy/NAT unless you trace there for that second TCP connection.

    It's rare to see an issue with this, but it's always worth a quick check to ensure it's working as expected.

  • Ensuring your TCP stack isn’t throwing data away

    In my previous blog post, I discussed checking the MSS to ensure full sized packets are used. Well, whilst you're digging around in the TCP Options of the SYN-SYN/ACK packets, it's worth checking another option: SACK, or Selective Acknowledgement.

    As you most likely know, TCP is a reliable protocol, in that it ensures delivery of all data. It does this by the ACK's indicating it's received up to a certain point in the data stream. This data stream is essentially a sequence of numbers, called….the sequence numbers.

    As an example, if we send 1460 bytes and our last sequence number was 40000 then the ack sent back to the machine which sent those 1460 bytes, will be 41460 and so on, as the sequence number is incremented by the byte size received and thus the sender knows the data arrived safely.

    However, we generally send a burst of these packets and the receiver acks every other one. What happens if we send 6 packets and packet 3 goes missing en route? Let's call these packets 1,2,3,4,5 & 6. If we receive packets 1,2,4,5,6 without SACK we'd have to drop packets 4,5 & 6 and ack 2 to indicate to the sender that that's the point we'd got up to until we'd noticed a packet missing. The sender would then have to retransmit packet #3 followed by 4,5 & 6 which obviously isn't efficient as we'd already received them but had to drop them. This also takes time and thus slows data transfer.

    With SACK enabled we're able to tell the sender we're missing a packet and also what other packets we've got. So in essence we can say to the sender, "Hey, I've got packets 1-2, and also 4,5 & 6" the sender can therefore retransmit just packet #3 and thus we save having to retransmit 4,5,6 (and any other subsequent packets which arrived before the retransmission of 3 arrived).
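
    Here's a toy model of the six packet example, just to show the difference in what gets resent. It uses packet numbers rather than real byte sequence numbers and ignores timers and windows, so it is a simplification of the explanation above and not how a TCP stack is actually written:

```python
def cumulative_ack(received: set) -> int:
    """Highest packet number n such that packets 1..n have all arrived."""
    n = 0
    while n + 1 in received:
        n += 1
    return n


def retransmit_without_sack(sent: list, received: set) -> list:
    """Without SACK, everything after the cumulative ack point is resent."""
    acked = cumulative_ack(received)
    return [p for p in sent if p > acked]


def retransmit_with_sack(sent: list, received: set) -> list:
    """With SACK, only the packets the receiver reports missing are resent."""
    return [p for p in sent if p not in received]


sent = [1, 2, 3, 4, 5, 6]
received = {1, 2, 4, 5, 6}  # packet 3 went missing en route

print(cumulative_ack(received))                 # 2
print(retransmit_without_sack(sent, received))  # [3, 4, 5, 6]
print(retransmit_with_sack(sent, received))     # [3]
```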

    Hope that explanation makes sense for the purposes of this, obviously the real implementation is a little more detailed, if you can't sleep then the detailed RFC is here

    This greatly increases the efficiency of the TCP protocol and is therefore enabled by default in Windows and most other TCP implementations. However, there can be occasions where devices are disabling this feature so it's always worth a quick check.

    As with MSS and the Scale Factor, this setting is negotiated in the SYN and SYN/ACK packets and can be found in the TCP options area of the packets. If you're using a proxy or NAT device, it's worth tracing on the egress point to ensure the TCP connection outside your environment also has the setting enabled.

    Ensure this is enabled on both the SYN & SYN ACK, and you're good to go!

  • Ensuring your Proxy server can scale to handle Office 365 traffic

    Proxy servers are often in place at customer sites, happily ticking away handling Internet traffic for years before Office 365 came along. As Office 365 generally travels over port 443 (for Outlook and SharePoint at least) then what's to think about? Your proxy can handle this like any other SSL traffic right?

    Well, yes technically speaking this is indeed the case, but one thing you need to consider is the way Office 365 connects, it uses multiple, long life connections. This is not the same as normal web browsing as these sessions tend to be multiple yes, but not long life, they are generally torn down after the page is loaded/finished with. Also they aren't all going to the same remote IP address. So we've got to take into account both that each user will be using more, multiple TCP sessions than previously and that those sessions will in some cases be kept open for an extensive period of time (i.e. Outlook connections).

    This Article outlines the expected number of TCP connections for older versions of Outlook. You can see in the table below, in Cached mode 8 connections per client is possible. I've seen more than this when you add multiple mailboxes and calendars (think your Exec PA's). Generally the newer versions of Outlook use a lower number of connections as they are designed with the Cloud in mind, but again, power users can push the number of connections up above the norm.

     

    Let's take an example, Contoso has a single Proxy with a single IP, which has been working fine for years. They introduce Office 365 gradually for 6000 clients, including Outlook and SharePoint

    Whilst the proxy server is able to cope with the load at present, it is presenting itself to Office 365 via a single IP address.

    Using the calculations outlined in this article we believe an absolute maximum of 6000 clients can be supported by the current setup, although I would err on the side of caution and estimate this to be nearer 4000. This issue stems from the number of ephemeral ports available to connect to Office 365. Outlook can, and does, open many connections per user.

     

    • Maximum supported devices behind a single public IP address = (64,000 – restricted ports)/(Peak port consumption + peak factor)
    • For instance, if 4,000 ports were restricted for use by Windows and 6 ports were needed per device with a peak factor of 4:
    • Maximum supported devices behind a single public IP address = (64,000 – 4,000)/(6 + 4)= 6,000
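
    The formula above is easy to play with in a few lines of code. The function names are mine, and the second example (8 ports per device, borrowed from the Cached mode connection count mentioned earlier) is a hypothetical scenario rather than a figure from the article:

```python
import math


def max_devices_per_ip(restricted_ports: int, ports_per_device: int,
                       peak_factor: int, total_ports: int = 64000) -> int:
    """(total ports - restricted ports) / (peak port consumption + peak factor)."""
    return (total_ports - restricted_ports) // (ports_per_device + peak_factor)


def public_ips_needed(clients: int, devices_per_ip: int) -> int:
    """Number of public IP addresses needed to serve the given client count."""
    return math.ceil(clients / devices_per_ip)


# The worked example from the formula above.
print(max_devices_per_ip(4000, 6, 4))  # 6000

# Hypothetical: 6000 clients each needing 8 ports at peak.
per_ip = max_devices_per_ip(4000, 8, 4)
print(per_ip)                           # 5000
print(public_ips_needed(6000, per_ip))  # 2
```

    Plugging in your own peak port consumption gives a quick feel for how close a single IP is to its limit.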

       

    So Contoso here would find that with 6000 clients running Outlook 2007, not only would Office 365 connections start to fail at random as we approached the limit, general Internet connections would start to fail as there are no resources available, and the proxy would be under enormous load. This is because the normal internet traffic is going through the proxy and we're using many thousands of long lasting connections to Office 365, from a single IP. Using a more modern Outlook client may give you some more leeway in this scenario but you're still sailing close to the wind with the proxy's limitations when handling Outlook, SharePoint plus normal web traffic.

    Although Microsoft recommends that a proxy is not used and that traffic for Office 365 is sent direct, due to this and performance concerns, we are aware this is not an easy solution for many customers who prefer to use a proxy.

    The article below outlines a solution to this problem by segmenting the network to multiple proxies. Another might be to load balance multiple proxies, however the load balancer would have to ensure stickiness to the client as every connection from Outlook to Office 365 needs to come from a single IP.

    http://technet.microsoft.com/en-us/library/hh852542.aspx

    So in summary, it's wise to check how many clients you've got connecting to Office 365 and ensure you have enough proxies, and IP addresses on those proxies, to be able to scale to the number of ports required whilst still efficiently serving normal internet traffic. Don't presume your faithful old proxy is going to be able to handle the load, and the new type of long-standing TCP connections that Office 365 uses, alongside its normal handling of other web traffic.