On The Wire

A detailed look inside the Ethernet Cable

Posts
  • How to analyse application-level performance for Outlook and SharePoint Online

    If we've stepped through all the network level checks and all looks good from that perspective, then we need to move up the stack to the application itself and see if something above the network is causing performance issues.

    This can prove tricky with Office 365 as the information is almost always encrypted within an SSL session, however there are a number of methods we can use to look at how the application itself is performing and how long the requests we send to the remote server are taking to get a response. From this we can see if we've got a problem with the client or on the Datacenter side.

    When we're working on-prem this is a lot easier. As long as the traffic is in the clear, we can normally match up a request to its response using a network packet sniffer such as Netmon or Wireshark, so we'll be able to see an RPC call and its response in Outlook, and an HTTP GET and its response in SharePoint. When these requests are encrypted, that becomes impossible. So how do we do it? Well, the methods vary with each product:

    I've tried my best to make the images viewable without clicking out; any smaller ones should have a link to view them full page if you need it.

    Outlook

    Outlook performance can be a tricky one at the best of times, especially so when using HTTPS.

    We can however use the inbuilt connection status tool to look at the performance. Ctrl + Right click the Outlook icon in the bottom right of your task bar and click on 'Connection Status'. This will give you a whole heap of information on Outlook's connectivity.

    Here you can see the Outlook 2013 output (the format and content varies with versions of Office)

    From here I can see how many connections to Exchange I've got, the type, and some information on the RTT and processing time (Avg Resp & Avg Proc respectively). We can use these two values together to see the RTT as measured by Outlook. If we take the cached connection with 4963 requests then we have the following:

    • Avg Resp: 29
    • Avg Proc: 6
    • Avg Resp shows us the RTT as measured by Outlook.
    • Avg Proc shows us the processing time: how long the server took to construct the response (the RPC processing latency). If this is high, it indicates a problem on the server side.

    By subtracting 6 from 29 here, we get the network latency, which is 23ms.

    To confirm this I can use PSPING to connect to the mailbox and this shows an average of 20ms.
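    The subtraction above is trivial, but worth wrapping in a helper if you're collecting these figures regularly. A minimal sketch (the column names are Outlook's; the function name is mine):

```python
def network_latency_ms(avg_resp_ms: int, avg_proc_ms: int) -> int:
    """Avg Resp is the round trip as measured by Outlook; Avg Proc is the
    time the server spent building the response. The difference is the
    time spent purely on the network."""
    return avg_resp_ms - avg_proc_ms

# Figures from the cached connection above: Avg Resp 29, Avg Proc 6
print(network_latency_ms(29, 6))  # -> 23
```

    A large Avg Proc relative to Avg Resp points at the server side; a large remainder points at the network.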

    A great blog by a Microsoft colleague here describes these tests in more (and better) detail, but this is a good test with inbuilt tools to show whether we have latency or a delay on the Exchange side. The blog also outlines some great steps for looking at the client and Outlook performance in other ways to see if there are any issues there.

    I normally also take a network trace whilst starting Outlook, then whilst performing actions such as opening a new mailbox, switching calendars and sending a large mail, and then analyse the traffic for the symptoms described in my blog post.

    If you're running a newer version of Windows, you can also use Resource monitor to get a view on your round trip time for Outlook connections.

     

    SharePoint:

    There are a number of tools we can use to look at the page load performance within the browser.

    If you're using IE then the inbuilt tools are a good starting point, especially with the newer versions of the browser.

    By hitting F12 and using the inbuilt tools to trace the page load, we get information on each element of the page and how long it took to load and how big it is.

    Here we see the URLs opened and the HTTP response code.

    IE F12 Tools:


    And over to the left of the same screen we get more information on how long it took for this to complete.

     

    Fiddler:

    Fiddler is a tool which inserts itself in front of the browser, allowing us to capture encrypted requests in the clear. It shows us the time each request takes to complete, letting us spot any problem elements of the page that are slowing us up.

    Here you can see information similar to that of the F12 tool but with more data.

     

    Over to the right, when we click on a URL, we see detailed information on how long each stage of the connection took. For example, we have information on how long it took to get the server response, and to complete that response. For a slow loading SharePoint page, this sort of information is enough to indicate how quickly we got the initial response and, subsequently, all the data.

     

    If we select multiple URLs and click on Timeline we can see a graphical view of how long each stage took.

     

    This is just intended as an introduction to the tool; the help file is pretty good, as is the support community on the website, and there is a book if you're keen. As this is a third party tool (i.e. non-Microsoft) I can't vouch for it, but I know we use elements of the tool in our new Message Analyzer tool.

    HTTPWatch

    My personal tool of choice however is HTTPWatch. The basic version is free and works with most browsers, but you'll have to buy the full version if you want the extended features. It's well worth it if you do this on a regular basis and your boss will stump up for it!

    This essentially acts as a proxy in front of your browser, allowing us to see the elements of the page as they load. For me it's the easiest to use and understand, whilst giving me some great information on what the performance is like. Again, this is a third party tool and I can't vouch for it, but we do use the full version within Microsoft and I personally use it extensively.

    I'll use IE as the example browser here, but the tool also works with Firefox on Windows. Once installed, hit F4 to bring the menu up and you should see HTTPWatch as an option; click that and a window should open up at the bottom of the tab.


    Hit record then enter your URL. Here I opened my test SharePoint page in Office 365. You can see clearly in the timechart which section took the longest to load (the one highlighted took 0.9 seconds), and if you had a poorly performing section of the page, it would be as clear as day in this timechart. I've also hovered over the green line, which indicates when the page's rendering started in IE. So I can clearly see here that 1.4 seconds after entering the URL, the page was visible to the user (although some elements were still coming in, in the background).

    In addition to troubleshooting slow elements of a page, you can also use this green line to measure a baseline of page load times, either for comparison to an on-prem solution or perhaps before and after a network or page structure change. It's also useful for comparing page load performance from different sites.

    If I then click on the time chart for that URL that took 0.9 seconds, I can see where that time was spent in more detail. Here we can see the connection and SSL handshake took no time at all, but we spend most of our time waiting for a response from the server. Once we get the response, we receive the data in 0.1 seconds. In this example, 0.7 seconds of waiting isn't too long, but this information gives us some great ideas on where a problem would lie. If Receive were longer than expected, then perhaps we've got a slow network, or one of the other network tuning issues in my blogs is causing it to take a long time. Let's imagine Wait is the longest (like below) but taking 10 seconds. That would indicate to me that perhaps the SharePoint server is taking a long time to construct a response; have a look at the URL, what is it doing? Is it a poorly performing script or similar?

    Alternatively, this could have been caused by packet loss; perhaps the server didn't get my request for 9 seconds as we had to retransmit it? As we're using the professional edition, we can see the local TCP port used for this connection in the columns. We can therefore (and I often do) take a simultaneous network trace and use this port information to isolate the TCP session this GET request correlates to and look at the network performance. If there are retransmits, they will be visible in the network trace even with SSL; you just won't be able to see which call it was that was retransmitted.
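    Which stage dominates (Connect, SSL, Wait or Receive) tells you where to look first. A minimal sketch of that triage, with stage names mirroring the HTTPWatch breakdown and the figures from the 0.9 second example above (the helper itself is mine):

```python
def dominant_stage(timings):
    """Given per-stage timings in seconds, return the stage that took
    the longest - the first place to look when a request is slow."""
    return max(timings, key=timings.get)

# The 0.9 second request above: no connect/SSL cost, mostly server wait
request = {"connect": 0.0, "ssl": 0.0, "wait": 0.7, "receive": 0.1}
print(dominant_stage(request))  # -> wait
```

    A dominant Wait suggests server-side delay or a lost request; a dominant Receive suggests a slow or untuned network path.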

    So, there is an array of tools which enable us to troubleshoot and baseline the application layer. I've only scratched the surface of what they can do, but hopefully this gives you at least a starting point for looking at application layer performance with O365 (or on-prem for that matter).

    Paul

  • DNS geolocation for Office 365, connecting you to your nearest Datacenter for the fastest connectivity

    One of the main things we need to get right to ensure the most efficient and speedy connectivity to O365 is where in the world your DNS call is being completed. You'd think this wouldn't matter, you do a DNS lookup for your O365 tenant, get the address then connect right? Well, normally yes, but with O365, especially with Outlook, we do some pretty clever stuff to utilise our worldwide array of datacenters to ensure you get connected to your data as efficiently as possible.

    Your Outlook connection will do a DNS lookup, and we use the location of that lookup to connect you to your nearest Microsoft datacenter. With Outlook we'll connect to a CAS server there and use our fast datacenter-to-datacenter backbone network to connect you to the datacenter where your Exchange servers (and data) are located. This is generally much quicker than a direct connection to the datacenter where your tenant is located, due to the speed of the interconnecting networks we have.

    http://technet.microsoft.com/en-us/library/dn741250.aspx outlines this in more detail, but a diagram nicked from this post shows how this works for Outlook/Exchange connectivity when the Exchange mailbox is located in a NA datacenter but the user is physically located in EMEA. The DNS lookup is therefore performed in EMEA, and we connect to the nearest EMEA datacenter, which routes the connection through to your mailbox over our backbone network, all in the background; your Outlook client knows nothing about this magic going on behind the scenes.

     

    If your environment is making its DNS calls in a location on a different continent to where the user is physically located, then you are going to get really bad performance with O365. Take an example where the user and mailbox are located in EMEA. Your company uses DNS servers located in the USA for all calls, or the user is incorrectly set to use a proxy server in the USA, so we're given the IP address of a USA-based datacenter as that's where we think your user is located. The client will then connect to the USA-based datacenter, which will route the traffic to the EMEA datacenter, which will then send the response back to the USA-based datacenter, which will then respond to the client back in EMEA. So in this scenario we've got several unnecessary trips across the pond with our data.

    It is therefore vitally important to get the DNS lookup right when you move to Outlook on Office 365.

    So how do you check this? Well, it could be a bit tricky, as although we release a list of IP addresses used for O365, we don't tell you which ones map to where, for many reasons, including the fact they change regularly. Thankfully, one of my Microsoft colleagues has shown me an easy way to check you're connecting to a local datacenter.

    All you need to do is open a command prompt on the client and ping outlook.office365.com; the response will tell you where the datacenter is that you'll connect to. So sat here in the UK at home, I get EMEAWEST.

     

    If I connect to our Singapore VPN endpoint and turn off split tunnelling and force the DNS call down the VPN link (our Internal IT do a great job of making these things configurable for us techies) then I get directed to apacsouth.

    And if I connect via VPN to the mothership in Seattle, my DNS call is completed there and thus I get directed to namnorthwest.

    So it's a quick and easy check, just make sure the datacenter returned is in the same region as you're physically located in.
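    If you'd rather script this check than eyeball the ping output, a sketch like the following works. Note the region tokens and hostname shape here are just the examples seen above, not an official or exhaustive list:

```python
# Region tokens mentioned in this post - illustrative, not an official list
REGION_TOKENS = ("emeawest", "apacsouth", "namnorthwest")

def find_region(names, tokens=REGION_TOKENS):
    """Return the first region token found in any resolved hostname,
    or None if no known token appears."""
    for name in names:
        for token in tokens:
            if token in name.lower():
                return token
    return None

# Offline example, shaped like the names the ping resolves to
print(find_region(["outlook-EMEAWEST.office365.com"]))  # -> emeawest
```

    Live, you could feed it the canonical name and aliases returned by `socket.gethostbyname_ex("outlook.office365.com")`, then confirm the region found matches where the user physically sits.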

    SharePoint is currently directed to the datacenter where your tenant is located so it doesn't matter so much where the call is made for this (although it should still preferably be local to the user for the portal connection). Lync is slightly different and is outlined in this article in more detail.

    It's also worth ensuring all your clients are using a proxy in the same region as they are located; if not, they could hit the problem outlined above and be getting unnecessarily poor O365 performance.

  • Checking your TCP Packets are pulling their weight (TCP Max Segment Size or MSS)

    This is a quick one to check to ensure your TCP packets are able to carry the maximum amount of data possible; low values here will severely affect network performance.

    Maximum Segment Size (MSS) is a TCP-level value: the largest segment which can be sent on the link, minus the headers. To obtain this value, take the IP-level Maximum Transmission Unit (MTU) and subtract the IP and TCP header sizes.

    So for a standard Ethernet connection with minimum size IP and TCP headers we subtract 40 bytes from the 1500 byte standard packet size (minus the Ethernet Header) leaving us with an MSS of 1460 bytes for data transmission.

    So to get the most efficient use of a standard Ethernet connection we want to see an MSS of 1460 bytes being used on our TCP sessions.

    This setting is agreed in the TCP 3-way handshake when a TCP session is set up. Both sides send an MSS value and the lower of the two is used for the connection.
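    The arithmetic above can be sketched in a couple of helper functions; the defaults assume minimum 20-byte IP and TCP headers with no options (the function names are mine):

```python
def mss_from_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS is the MTU minus the IP and TCP header sizes."""
    return mtu - ip_header - tcp_header

def negotiated_mss(client_mss: int, server_mss: int) -> int:
    """Each side advertises an MSS in the 3-way handshake; the lower wins."""
    return min(client_mss, server_mss)

print(mss_from_mtu(1500))          # standard Ethernet -> 1460
print(negotiated_mss(1460, 1380))  # e.g. a peer behind an encrypted link -> 1380
```

    The 1380 figure is purely illustrative of a peer whose path has some header overhead; the point is simply that the connection runs at the lower of the two advertised values.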

    It's easy to check this, take a Netmon or Wireshark trace and find the connection you're interested in, Netmon will filter the connections by process on the left hand side for you.

    Once you've found the connection (ensuring you started tracing before initiating the connection), you just need to open the first two frames of the connection, the SYN & SYN ACK, indicated by an S followed by an A..S in the description of the frame. To capture the 3-way handshake, make sure you start tracing, then start Outlook, or connect to your SharePoint site in a new browser window.

    Once you've clicked on the first packet, the SYN, then in the frame details at the bottom, open up TCP Options and the MSS can be clearly seen.

    Here we see the MaxSegmentSize shown as 1460.

     

    Repeat this with the SYN ACK which should be the second frame if you've filtered the connection away from other traffic. The lower of the two values will be your MSS. If it's 1460 then you're configured to use a full sized data payload.

    One caveat to this: it doesn't mean this value can actually be used, as it's possible a network segment along the route has a lower MTU than we're aware of. If this is the case, and all is well, we'll get an ICMP message back from the router at the edge of that link when we send a full-sized packet with the do-not-fragment bit set. This message will tell us what the MTU is on the link and we'll adjust accordingly. However, it's always worth checking this value is set high and that the TCP payload throughout the trace is at 1460 (on full packets) and hasn't dropped down to a lower value.

    It's common to see this value lower than the maximum of 1460 (for an Ethernet network) if, for example, we know a network segment along the route has a lower MTU, one with an encryption overhead perhaps, but the value shouldn't be significantly lower. 576-byte packets are a sure sign we've hit problems and dropped down to the minimum packet size, so keep an eye out for those.

    Also, remember: if you're using a proxy or NAT device, you'll have to check this both on the client and in a trace on the proxy or NAT device itself, as there will be two distinct TCP sessions in use, and you won't see a problem beyond the proxy/NAT unless you trace that second TCP connection there.

    It's rare to see an issue with this, but it's always worth a quick check to ensure it's working as expected.

  • Ensuring your TCP stack isn’t throwing data away

    In my previous blog post, I discussed checking the MSS to ensure full sized packets are used. Well, whilst you're digging around in the TCP Options of the SYN-SYN/ACK packets, it's worth checking another option SACK or Selective Acknowledgement.

    As you most likely know, TCP is a reliable protocol, in that it ensures delivery of all data. It does this with ACKs indicating it has received up to a certain point in the data stream. This data stream is essentially a sequence of numbers called, you guessed it, the sequence numbers.

    As an example, if we send 1460 bytes and our last sequence number was 40000, then the ACK sent back to the machine which sent those 1460 bytes will be 41460, and so on; the sequence number is incremented by the number of bytes received, and thus the sender knows the data arrived safely.

    However, we generally send a burst of these packets and the receiver ACKs every other one. What happens if we send 6 packets and packet 3 goes missing en route? Let's call these packets 1, 2, 3, 4, 5 & 6. If we receive packets 1, 2, 4, 5, 6 without SACK, we'd have to drop packets 4, 5 & 6 and ACK 2 to indicate to the sender that that's the point we'd got up to when we noticed a packet missing. The sender would then have to retransmit packet #3 followed by 4, 5 & 6, which obviously isn't efficient as we'd already received them but had to drop them. This also takes time and thus slows data transfer.

    With SACK enabled, we're able to tell the sender we're missing a packet and also which other packets we've got. In essence we can say to the sender, "Hey, I've got packets 1-2, and also 4, 5 & 6", and the sender can therefore retransmit just packet #3, saving us having to retransmit 4, 5, 6 (and any subsequent packets which arrived before the retransmission of 3 arrived).
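    The retransmission behaviour described above can be sketched as follows. The packet numbers stand in for real sequence numbers, and the function is mine, a toy model rather than a real TCP implementation:

```python
def retransmits(sent, received, sack_enabled):
    """Return the packets the sender must resend.

    Without SACK the receiver can only cumulatively ACK up to the first
    gap, so everything from the gap onwards has to be resent. With SACK
    only the genuinely missing packets are resent.
    """
    missing = [p for p in sent if p not in received]
    if not missing:
        return []
    if sack_enabled:
        return missing
    first_gap = missing[0]
    return [p for p in sent if p >= first_gap]

sent = [1, 2, 3, 4, 5, 6]
received = [1, 2, 4, 5, 6]          # packet 3 lost en route
print(retransmits(sent, received, sack_enabled=False))  # [3, 4, 5, 6]
print(retransmits(sent, received, sack_enabled=True))   # [3]
```

    The difference in the two lists is exactly the wasted retransmission (and time) that SACK saves.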

    Hope that explanation makes sense for the purposes of this; obviously the real implementation is a little more detailed. If you can't sleep, the detailed RFC is here.

    This greatly increases the efficiency of the TCP protocol and is therefore enabled by default in Windows and most other TCP implementations. However, there can be occasions where devices are disabling this feature so it's always worth a quick check.

    As with the MSS and Scale Factor, this setting is negotiated in the SYN and SYN/ACK packets and can be found in the TCP options area of those packets. If you're using a proxy or NAT device, it's worth tracing at the egress point to ensure the TCP connection outside your environment also has the setting enabled.

    Ensure this is enabled on both the SYN & SYN ACK, and you're good to go!

  • Ensuring your Proxy server can scale to handle Office 365 traffic

    Proxy servers are often in place at customer sites, happily ticking away handling Internet traffic for years before Office 365 comes along. As Office 365 generally travels over port 443 (for Outlook and SharePoint at least), what's to think about? Your proxy can handle this like any other SSL traffic, right?

    Well, yes, technically speaking this is indeed the case, but one thing you need to consider is the way Office 365 connects: it uses multiple, long-lived connections. This is not the same as normal web browsing, where sessions also tend to be multiple but are not long-lived; they are generally torn down once the page is loaded or finished with, and they aren't all going to the same remote IP address. So we've got to take into account both that each user will be using more TCP sessions than previously, and that those sessions will in some cases be kept open for an extended period of time (i.e. Outlook connections).

    This article outlines the expected number of TCP connections for older versions of Outlook. You can see in the table below that in Cached mode, 8 connections per client is possible. I've seen more than this when you add multiple mailboxes and calendars (think of your exec PAs). Generally the newer versions of Outlook use a lower number of connections, as they are designed with the cloud in mind, but again, power users can push the number of connections up above the norm.

     

    Let's take an example. Contoso has a single proxy with a single IP, which has been working fine for years. They introduce Office 365 gradually for 6000 clients, including Outlook and SharePoint.

    Whilst the proxy server is able to cope with the load at present, it is presenting itself to Office 365 via a single IP address.

    Using the calculations outlined in this article, we believe an absolute maximum of 6000 clients can be supported by the current setup, although I would err on the side of caution and estimate this to be nearer 4000. The issue stems from the ephemeral ports available to connect to Office 365; Outlook can, and does, open many connections per user.

     

    • Maximum supported devices behind a single public IP address = (64,000 – restricted ports)/(Peak port consumption + peak factor)
    • For instance, if 4,000 ports were restricted for use by Windows and 6 ports were needed per device with a peak factor of 4:
    • Maximum supported devices behind a single public IP address = (64,000 – 4,000)/(6 + 4) = 6,000
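    The capacity formula above drops straight into code; the defaults are the worked figures from the article:

```python
def max_devices_per_ip(total_ports: int = 64_000,
                       restricted_ports: int = 4_000,
                       peak_ports_per_device: int = 6,
                       peak_factor: int = 4) -> int:
    """Maximum supported devices behind a single public IP address."""
    return (total_ports - restricted_ports) // (peak_ports_per_device + peak_factor)

print(max_devices_per_ip())  # -> 6000
```

    Plugging in your own peak port consumption per device (8 for older Outlook clients in Cached mode, per the table above, plus browsing) quickly shows whether a single egress IP is enough.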

       

    So Contoso here would find that with 6000 clients running Outlook 2007, not only would Office 365 connections start to fail at random as they approached the limit, general Internet connections would start to fail as there are no resources available, and the proxy would be under enormous load. This is because the normal Internet traffic is going through the proxy while we're also using many thousands of long-lasting connections to Office 365, all from a single IP. Using a more modern Outlook client may give you some more leeway in this scenario, but you're still sailing close to the wind with the proxy's limitations when handling Outlook and SharePoint plus normal web traffic.

    Although Microsoft recommend that a proxy is not used and that traffic for Office 365 is sent direct, due to this and other performance concerns, we are aware this is not an easy solution for many customers who prefer to use a proxy.

    The article below outlines a solution to this problem by segmenting the network across multiple proxies. Another option might be to load balance multiple proxies; however, the load balancer would have to ensure stickiness to the client, as every connection from Outlook to Office 365 needs to come from a single IP.

    http://technet.microsoft.com/en-us/library/hh852542.aspx

    So in summary, it's wise to check how many clients you've got connecting to Office 365 and ensure you have enough proxies, and IP addresses on those proxies, to scale to the number of ports required whilst still efficiently serving normal Internet traffic. Don't presume your faithful old proxy is going to be able to handle the load, and the new type of long-standing TCP connections Office 365 uses, alongside its normal handling of other web traffic.

  • Top 10 Tips for Optimising & Troubleshooting your Office 365 Network Connectivity

    Having performed numerous Office 365 network assessments and reactive visits to resolve issues for customers, it's apparent that the vast majority of issues are seen time and time again. So from this experience, here are my top 10 tips to optimise your O365 network performance and prevent issues occurring in future.

    Some of these issues, if occurring, will cost you seconds, others more, but by eliminating them all and getting your proverbial network ducks in a row, you'll ensure you are providing the best possible Office 365 experience for your users.

    As some of these issues are complex, rather than provide a detailed explanation for each scenario in this blog post, I've linked to separate blog posts which cover each in detail as and when you need them.

    1. TCP Window Scaling

    This tops my list of things to check due to the impact it can have on performance and the number of times I see it disabled on legacy network equipment. Unfortunately there is no simple way of checking this from a client without taking a network trace, but I've outlined how to do this, and the setting itself, in much more detail in a separate blog post here.

    If you check only one thing on your O365 network link, then I'd advise it's this!

    2. TCP Idle time settings

    This issue is another very common one and is caused by settings on egress points of corporate networks not being adjusted for the Outlook traffic running through them. Problems caused by this include hangs within Outlook, especially when switching mailboxes/calendars, and unexpected auth prompts. It's relatively easy to fix, and I've outlined the problem and solution in much more detail here.

    3. Latency/Round Trip Time (RTT)

    Network latency has the ability to cause real issues with O365 and its usability. Checking your RTT to O365 is a worthwhile task regardless of whether you're having issues, as it provides you with a great baseline should performance issues occur in future, allowing you to isolate where the delay is occurring.

    The detailed guide on how to do this is here

    4. Proxy Authentication

    On numerous occasions I've run into unnecessary delays in connecting to O365 caused by proxy authentication. With Outlook this can cause a delay on start-up or when switching mailboxes/calendars etc.; anything that requires a new TCP connection to be spun up. With SharePoint this will manifest itself as slow initial page loads.

    The detailed guide on how to check this is here

    5. DNS performance

    A simple one but often forgotten. DNS performance should be checked to ensure it isn't adding additional delays to your Office 365 connections. A detailed guide to checking DNS performance can be found here.

    6. Proxy Scalability

    This issue can affect performance and cause problems further down the line when you least expect it. Proxies are invariably in place before the move to Office 365 and are often used without much reconfiguration for the Office 365 traffic. It's worth checking your numbers here, as you may find you're sailing closer to the wind than you realise.

    The more detailed description of the problem and guidance can be found here

    7. TCP Max Segment size

    A simple one to check but worth a look nonetheless. To ensure maximum throughput on the link between yourself and Office 365, we should be using as close as possible to the maximum TCP segment size for transferring data.

    More detail on how to check this is here

    8. Selective Acknowledgement

    Whilst you're digging around in the TCP Options in your 3-way handshake, it's worth checking Selective Acknowledgement (SACK) is enabled. This feature enables your TCP stack to deal with dropped packets more efficiently.

    A slightly more detailed explanation of SACK and how to check it can be found here.

    9. DNS Geo location

    One of the most important checks you can make, and one that can make a big difference to the performance of O365, is ensuring your DNS calls are made in the same geographic location the user is actually in. Getting this wrong means the routing of your traffic to O365 could be sub-optimal and thus affect performance. It's thankfully an easy one to check, and is outlined further here.

    10. Application Level troubleshooting

    My final tip isn't so much network troubleshooting as application layer. This blog post will give you some tips on how to look at Outlook and SharePoint in conjunction with network tracing, to both baseline and troubleshoot application level issues even when the traffic is encrypted in an SSL session.

    The blog post is here

    I'll add more as and when I find them/get time, hope the first ten are of help!

  • Checking your DNS performance isn’t delaying your O365 connections

     

    One of the initial things which should be checked is name resolution, a point often forgotten when doing performance tests. If name resolution takes time, this will manifest itself as slow initial page loads in SharePoint. It's less visible with Outlook, but little delays in DNS, proxy authentication etc., when added up, can mean a poorly performing O365 infrastructure.

    Checking this is easy, and again involves a quick network capture on a test client

    It is always advisable to flush the DNS cache by running ipconfig /flushdns before taking any traces. The steps are:

     

    • Install Netmon or Wireshark on a test client
    • Start tracing
    • Run ipconfig /flushdns to clear the DNS cache
    • Start Outlook or connect to your SharePoint site
    • Once connected stop the trace
    • Use the filter 'DNS' to show all DNS traffic in the capture tool.

     

    Netmon handily gives each DNS call and response (and any other protocol for that matter) a unique ID number which we can use to filter if we wish. Here it's DNS conversation ID 124, so I'd write conversation.DNS.ID==124 in the Netmon filter to see just this DNS call and its response. Alternatively, you can right click over a frame of interest in a saved trace and click Find Conversations > DNS. Or we could use the DNS query ID; in this case it'd be 'DNS.QueryIdentifier == 0x5b9f'


    In the following example we can see the Contoso DNS servers taking up to 3.7 seconds to respond to a DNS call. This would undoubtedly manifest itself as slowness in an initial connection to Office 365.

     

    13:52:52 16/04/2013 31.2765664 0.0000000 10.200.30.40 10.214.2.129 DNS:QueryId = 0xE41, QUERY (Standard query), Query for Contosoemeamicrosoftonlinecom-3.sharepoint.emea.microsoftonline.com of type A on class Internet

    13:52:56 16/04/2013 35.0579179 3.7813515 10.214.2.129 10.200.30.40 DNS:QueryId = 0xE41, QUERY (Standard query), Response - Success, 10.123.123.124 ...

    Another DNS server can be seen here responding in a slow manner

    13:52:54 16/04/2013 33.3042446 0.0000000 10.200.30.40 uk1.headoffdom.uk.Contoso.com DNS:QueryId = 0xE41, QUERY (Standard query), Query for Contosoemeamicrosoftonlinecom-3.sharepoint.emea.microsoftonline.com of type A on class Internet

    13:52:56 16/04/2013 35.0583415 1.7540969 uk1.headoffdom.uk.Contoso.com 10.200.30.40 DNS:QueryId = 0xE41, QUERY (Standard query), Response - Success, 10.123.123.124

    However, some other queries can be seen answered by the DNS server in a much faster manner:

    13:52:57 16/04/2013 35.6045648 0.0000000 10.200.30.40 10.214.2.129 DNS:QueryId = 0xE77C, QUERY (Standard query), Query for login.microsoftonline.com of type A on class Internet

    13:52:57 16/04/2013 35.6049028 0.0003380 10.214.2.129 10.200.30.40 DNS:QueryId = 0xE77C, QUERY (Standard query), Response - Success, 49, 0

    Under optimal conditions I would expect a DNS call to return in less than 100ms, ideally much less. Any delay in this phase will manifest itself as poor initial performance when loading a page. In theory (presuming we don't need to resolve any further addresses), the connectivity should be quicker once the initial page is loaded.
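    The response time for each query can be pulled straight from the relative timestamps in the trace. The figures below are the slow Contoso lookup and the fast login.microsoftonline.com lookup from the capture above (the helper names and 100ms threshold are mine, matching the rule of thumb just stated):

```python
def dns_response_ms(query_ts: float, response_ts: float) -> float:
    """Delta between a DNS query and its matching response, in milliseconds.
    Timestamps are the relative-time column from the trace, in seconds."""
    return (response_ts - query_ts) * 1000.0

def is_slow(delta_ms: float, threshold_ms: float = 100.0) -> bool:
    """Anything over ~100ms is worth investigating."""
    return delta_ms > threshold_ms

# Relative timestamps from the two query/response pairs above
slow = dns_response_ms(31.2765664, 35.0579179)
fast = dns_response_ms(35.6045648, 35.6049028)
print(f"{slow:.1f} ms, slow={is_slow(slow)}")   # 3781.4 ms, slow=True
print(f"{fast:.2f} ms, slow={is_slow(fast)}")   # 0.34 ms, slow=False
```

    The same subtraction is what Netmon's Time Delta column is doing for you on the response frame.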

    If you see a slow response like the one above, it's worth first checking the psping times to the DNS server on TCP port 53 (most calls will be over UDP, but the server should be listening on TCP 53 too). The method to do this is outlined here. If the PSPING time is similar to that seen in the DNS response, then it's possibly a network delay between you and the server. If it's consistently much quicker, it's more likely an application (DNS) level issue you should investigate on the server and any forwarders, if used.

  • Preventing proxy authentication from delaying your O365 connection

    A quick and easy check you can do to ensure your O365 connections complete quickly is to check proxy authentication is completing quickly, or better still not being done at all. If you're not using a proxy and are going direct, then you can move along…nothing to see here!

    It's surprisingly common: I've run into numerous customers experiencing unnecessary delays on connecting to O365, caused by proxy authentication. With Outlook this can cause a delay on start-up, or the 'polo mint' hang when switching mailboxes, calendars etc.; anything that requires a new TCP connection to be spun up. With SharePoint it will manifest itself as slow initial page loads.

    To view this proxy authentication stage of a TCP connection to Office 365 (if indeed there is one) I'd recommend using a packet capture tool on the client, such as Netmon or Wireshark. If enabled, proxy authentication needs to occur with every TCP session setup and is the first thing which has to complete after the TCP 3 way handshake, usually triggered with the first GET or CONNECT request. We should expect this process to complete in milliseconds. This is what it looks like in Netmon:

    In Netmon it's wise to add the 'NTLM SSP Summary' column to show what stage of authentication we're at; also add the 'Time Delta' column to show the time delay from the previous packet shown.

    The easiest way to find the session is to

    • Close all browser windows and open a single one on a new tab.
    • For Outlook close the application completely
    • Start Netmon
    • Connect to your SharePoint page or start Outlook
    • On the left hand side, Netmon should show your browser (or Outlook) with its TCP connections to your site as follows:

     

    For each of these examples you'll see multiple TCP sessions per process, and each will perform proxy authentication if it's enabled. The number will differ depending on the version of Outlook or browser you are using.

    Initially we'll connect with no authentication and be told 'proxy authentication required'. In this example the response takes 0.02 seconds to come back, indicating no network performance issue and no proxy performance issue per se.

    Here is an example of this problem occurring, showing the stages following the TCP 3-way handshake. To see just these packets and ignore the pure TCP ones, use the filter 'HTTP'.

     

    Initial connect:

    14:12:24.6483418 19.0046514 0.0003578 iexplore.exe 10.200.30.40 MyProxy-01.Contoso.sig HTTP:Request, CONNECT Contosoemeamicrosoftonlinecom-3.sharepoint.emea.microsoftonline.com:443 , Using NTLM Authorization NTLM NEGOTIATE MESSAGE

    Proxy Response:

    14:12:24.6876389 19.0439485 0.0283000 iexplore.exe MyProxy-01.Contoso.sig 10.200.30.40 HTTP:Response, HTTP/1.1, Status: Proxy authentication required, URL: Contosoemeamicrosoftonlinecom-3.sharepoint.emea.microsoftonline.com:443 NTLM CHALLENGE MESSAGE

    We then send the request again, this time with NTLM authentication for the proxy as requested:

    Second request with NTLM Auth:

    14:12:24.6883198 19.0446294 0.0004838 iexplore.exe 10.200.30.40 MyProxy-01.Contoso.sig HTTP HTTP:Request, CONNECT Contosoemeamicrosoftonlinecom-3.sharepoint.emea.microsoftonline.com:443 , Using NTLM Authorization NTLM AUTHENTICATE MESSAGE Version:NTLM v2, Domain: headoffdom, User: paul.collinge, Workstation: W7TEST20

    A 200 OK response then comes back from the proxy, but it takes 3 seconds:

    14:12:27.7859643 22.1422739 3.0878394 iexplore.exe MyProxy-01.Contoso.sig 10.200.30.40 HTTP HTTP:Response, HTTP/1.1, Status: Ok, URL: Contosoemeamicrosoftonlinecom-3.sharepoint.emea.microsoftonline.com:443

    Subsequent call:

    Once the above is complete, subsequent calls are back to millisecond response times.

    14:12:27.7868062 22.1431158 0.0008419 iexplore.exe 10.200.30.40 MyProxy-01.Contoso.sig TLS TLS:TLS Rec Layer-1 HandShake: Client Hello.

    14:12:27.8445642 22.2008738 0.0485304 iexplore.exe MyProxy-01.Contoso.sig 10.200.30.40 TLS TLS:TLS Rec Layer-1 HandShake: Server Hello.; TLS Rec Layer-2 Cipher Change Spec; TLS Rec Layer-3 HandShake: Encrypted Handshake Message

     

    As the delay is only seen during the proxy authentication stage of the session setup, this indicates the delay is caused by the authentication process itself.
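    You can quantify this directly from the capture timestamps. Here's a small Python sketch using the relative frame times from the Netmon trace above to compute the gap at each stage of the exchange:

    ```python
    # Relative capture timestamps (seconds) for each stage of the proxy
    # authentication exchange, taken from the Netmon trace above.
    events = [
        ("CONNECT + NTLM NEGOTIATE",    19.0046514),
        ("407 + NTLM CHALLENGE",        19.0439485),
        ("CONNECT + NTLM AUTHENTICATE", 19.0446294),
        ("200 OK",                      22.1422739),
    ]

    def stage_deltas(events):
        # Delta in seconds between each packet and the one before it.
        return [(name, t - events[i - 1][1] if i else 0.0)
                for i, (name, t) in enumerate(events)]

    for name, delta in stage_deltas(events):
        print(f"{delta:9.4f}s  {name}")
    # Only the final 200 OK shows a multi-second gap, so the time is going
    # into the authentication step itself rather than the network.
    ```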

    As we're using NTLM in this example, and in this instance the proxy is in a different domain to the users, it's feasible this delay could be caused by congestion on the secure channels between the proxy and its DC, or between its DC and the users' DC.

    This would have to be investigated separately to see whether this, or something else, is causing the delay, but here is some information on the issue and the fix (the MaxConcurrentApi registry key):

    http://blogs.technet.com/b/ad/archive/2008/09/23/ntlm-and-maxconcurrentapi-concerns.aspx

    http://support.microsoft.com/kb/975363

    Regardless of the root cause, this behaviour, if occurring, will cause intermittent performance issues with both Office 365 and internet browsing. When the above is the cause, you may find that response times are good at times and very bad at others. The slow times may coincide with high utilisation periods, such as first thing in the morning and after lunch, so you should run these tests at various times of the day.

    The issue could also be occurring due to load on the proxy itself; however, in that case delays would be apparent in more than just the authentication packets.

     

    From an Office 365 perspective, we can easily remove this problem by following the recommended setup and making an exception in the proxy for authentication on the Office 365 urls as per: http://support.microsoft.com/kb/2637629

     

    Firewall or proxy servers require additional authentication

    To resolve this issue, configure an exception for Microsoft Office 365 URLs and applications from the authentication proxy. For example, if you are running Microsoft Internet Security and Acceleration Server (ISA) 2006, create an "allow" rule that meets the following criteria:

    • Allow outbound connections to the following destination: *.microsoftonline.com
    • Allow outbound connections to the following destination: *.microsoftonline-p.com
    • Allow outbound connections to the following destination: *.sharepoint.com
    • Allow outbound connections to the following destination: *.outlook.com
    • Allow outbound connections to the following destination: *.lync.com
    • Allow outbound connections to the following destination: osub.microsoft.com
    • Ports 80/443
    • Protocols TCP and HTTPS
    • Rule must apply to all users.

    HTTPS/SSL time-out set to 8 hours

    With these bypassed, we've removed one possible cause of a delay on your Office 365 connections, and also taken away some load from both your proxies, and your DCs.

    So in summary: if you're using a proxy, ensure no authentication is performed on the TCP sessions out to Office 365, by whitelisting the URLs above. If you must use proxy auth, make sure it completes quickly, especially at peak times. It should complete in milliseconds, i.e. not much more than the time between the initial SYN and SYN/ACK. Even a delay as small as 2-3 seconds, like the one demonstrated, will have a noticeable impact for your users.

  • How to measure the Network Round Trip Time to Office 365

    One of the jobs I do fairly often is to help customers work out the performance of their network connection to Office 365 from various sites, to ensure it's within the limits that will give good user performance with Outlook, SharePoint etc.

    We have various tools available that do some level of network check for you, such as http://em1-fasttrack.cloudapp.net/o365nwtest (for EMEA), but I tend to do my checks manually using a variety of tools as it gives me more granular detail, so here is how I do it. Beware: that tool uses Java, which may not be permitted on some customer sites.

     

    How do I find the IP address I need to connect to so as to check this?

    There are multiple ways we can get this information:

     

    • Ping the name you are trying to connect to.

     

    ping mytennant.sharepoint.com

    Pinging prodnet47-48ipv4a.sharepointonline.com.akadns.net [157.55.232.50] with 32 bytes of data:

    Alternatively, for Exchange, you can ping outlook.office365.com. Not only will this tell you the IP you'll hit, it'll also tell you which datacenter you are connecting to.

    This is a good check to do to ensure efficiency. We're pretty clever with connecting Outlook traffic: we use geo-DNS to work out where in the world you are, then provide you with the DNS address of your nearest datacenter. You connect to a CAS server there, which then routes you through our high speed backbone between the datacenters to wherever your mailbox is located.

    This means, for example, that if your mailbox is located in an EMEA tenant and you travel to the US, you'll be connected to a CAS server in a US datacenter, which will then pull the data you need for Outlook over our network, which should mean a much better service. http://technet.microsoft.com/en-us/library/dn741250.aspx outlines this in more detail.

    The problem is, if your DNS server (or your proxy for that matter) is located in a different region to you, you may be routed inefficiently, so it's always a good check to run.

    For example, if I ping from Seattle I get

    Ping outlook.office365.com

    Pinging outlook-namnorthwest.office365.com [132.245.92.41] with 32 bytes of data

    Whereas here at home in the UK I get one of the EMEA datacenters.

    ping outlook.office365.com

    Pinging outlook-emeawest.office365.com [132.245.229.114] with 32 bytes of data:

     

    • Network Sniffing

       

      The other method is to use a packet sniffer to trace a connection occurring and this will show you the IP. This method also gives me the advantage of seeing if we're going via a proxy to connect out.

      Personally I use Netmon but other tools such as Wireshark will do the job well depending on your preference.

      Start the tool tracing (run as admin and attach to the correct NIC first) then either launch Outlook or open the SharePoint site you want to access.

      Once you've accessed the data you want, stop the trace and have a look in Netmon for the process you used. In this case it was IE connecting to my O365 SharePoint site.

      Firstly I can see the DNS call go out (use "DNS" in the filter column).

      189    15:56:56 08/05/2014    15:56:56 08/05/2014    11.6029790    8.3358920        192.168.0.8    192.168.0.1    DNS    DNS:QueryId = 0xA8BD, QUERY (Standard query), Query for mytennant.sharepoint.com of type A on class Internet    {DNS:55, UDP:54, IPv4:8}    

      203    15:56:57 08/05/2014    15:56:57 08/05/2014    12.1115268    0.0006276        192.168.0.1    192.168.0.8    DNS    DNS:QueryId = 0xA8BD, QUERY (Standard query), Response - Success, 49, 0 ...     {DNS:55, UDP:54, IPv4:8}

      ARecord: prodnet47-48ipv4a.sharepointonline.com.akadns.net of type A on class Internet: 157.55.232.50    

      Netmon handily breaks down the traffic per process, then per connection made by that process. Here we can see IE is connecting to 157.55.232.50 on port 443, which is the IP address we got back for my SharePoint site. If you close all IE windows and use just one to connect, you know the IP address used is the one for SharePoint.

       

      So now you have the IP address and port used, what do you do next?

       

      Measuring the Round Trip Time to the address

      As ICMP is often blocked at firewalls, it's not an effective tool to use for this task.

      Thankfully we have an alternative, I use a great tool by Mark Russinovich in the Sysinternals Suite, PSPING. http://technet.microsoft.com/en-us/sysinternals/jj729731.aspx

      We can use this tool to measure latency internally (by running it in server mode on the remote end). As this is O365 we can't do that, but it still does a great job of measuring the RTT of the three-way handshake.

      What it does is complete a TCP connection to the IP address and port provided, so we can accurately measure the time between the SYN and SYN/ACK on that connection. And as it uses a port we know is open, it won't be blocked by the firewall.

      The syntax for this is:

      psping -n 20 157.55.232.50:443

      So we're doing 20 pings, to the IP address derived above, on TCP port 443.

      PsPing v2.01 - PsPing - ping, latency, bandwidth measurement utility

      Copyright (C) 2012-2014 Mark Russinovich

      Sysinternals - www.sysinternals.com

       

      TCP connect to 157.55.232.50:443:

      21 iterations (warmup 1) connecting test:

      Connecting to 157.55.232.50:443 (warmup): 15.01ms
      Connecting to 157.55.232.50:443: 15.44ms
      Connecting to 157.55.232.50:443: 15.73ms
      Connecting to 157.55.232.50:443: 15.55ms
      Connecting to 157.55.232.50:443: 15.54ms
      Connecting to 157.55.232.50:443: 15.70ms
      Connecting to 157.55.232.50:443: 14.97ms
      Connecting to 157.55.232.50:443: 14.70ms
      Connecting to 157.55.232.50:443: 16.02ms
      Connecting to 157.55.232.50:443: 16.53ms
      Connecting to 157.55.232.50:443: 15.39ms
      Connecting to 157.55.232.50:443: 15.38ms
      Connecting to 157.55.232.50:443: 15.95ms
      Connecting to 157.55.232.50:443: 15.99ms
      Connecting to 157.55.232.50:443: 16.82ms
      Connecting to 157.55.232.50:443: 16.10ms
      Connecting to 157.55.232.50:443: 15.55ms
      Connecting to 157.55.232.50:443: 16.30ms
      Connecting to 157.55.232.50:443: 16.03ms
      Connecting to 157.55.232.50:443: 15.55ms
      Connecting to 157.55.232.50:443: 14.81ms

       

      TCP connect statistics for 157.55.232.50:443:

      Sent = 20, Received = 20, Lost = 0 (0% loss),

      Minimum = 14.70ms, Maximum = 16.82ms, Average = 15.70ms

      So we can see the RTT to my SharePoint site averages 15.70ms with a maximum of 16.82ms, meaning there is no real fluctuation in the RTT.
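      If you can't run psping (unsigned tools aren't always permitted on customer machines), the same measurement can be sketched in a few lines of Python; this times full TCP connects, so each sample is roughly one SYN/SYN-ACK round trip. The IP and port in the commented example are the values derived above.

      ```python
      import socket
      import statistics
      import time

      def tcp_ping(host, port, count=20, timeout=5.0):
          # Time 'count' full TCP connects in milliseconds; each connect
          # costs roughly one SYN -> SYN/ACK round trip, like psping's
          # TCP connect test.
          times = []
          for _ in range(count):
              start = time.perf_counter()
              with socket.create_connection((host, port), timeout=timeout):
                  times.append((time.perf_counter() - start) * 1000.0)
          return times

      # Example, using the IP/port derived earlier:
      # times = tcp_ping("157.55.232.50", 443)
      # print(f"min={min(times):.2f}ms max={max(times):.2f}ms "
      #       f"avg={statistics.mean(times):.2f}ms")
      ```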

      Compare this to the network tool at http://em1-fasttrack.cloudapp.net/o365nwtest and it's almost identical. However, using psping makes it easier to combine this with a network trace so we can check other network level settings such as MTU and TCP window scaling.

       

       

      So what if I'm not going Direct and using a proxy?

      When using a proxy, this particular method comes into its own and it's a common scenario I see at customer sites.

      If we are connecting via a proxy server then we have to change tack a little, as we're not communicating directly with the O365 node. We have a TCP connection to the proxy, and another one is set up from the proxy out to the O365 endpoint.

      Firstly, we repeat the test we did above, but to the address of our proxy server. Once we have the RTT to the proxy, which is invariably at the network perimeter, we then need to get a reading from that point out to O365.

      We therefore run PSPING on the proxy, or a machine in front of the proxy which has a direct internet connection, to the IP address of the O365 endpoint. From these two figures we have an overall RTT and also, an idea where the majority of our latency is occurring.

      If you have the ability to bypass the proxy, repeat the test on the same client but with a direct connection. Then subtract the RTT you got to the proxy from the direct RTT (presuming the exit point and routing are similar) and you'll have your internal and external RTTs.
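      The arithmetic is trivial, but worth being explicit about. A sketch with hypothetical figures (180ms end to end, 12ms client to proxy):

      ```python
      def split_rtt(rtt_direct_ms, rtt_to_proxy_ms):
          # Split an end-to-end RTT into the internal (client -> proxy) and
          # external (proxy -> Office 365) legs, assuming the exit point and
          # routing are similar for both measurements.
          internal = rtt_to_proxy_ms
          external = rtt_direct_ms - rtt_to_proxy_ms
          return internal, external

      internal, external = split_rtt(180.0, 12.0)  # hypothetical figures
      print(f"internal={internal}ms external={external}ms")
      ```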

      If it's not possible to run psping on the proxy, most proxies allow a packet capture. We can use this to get the RTT by looking at the time delta between the SYN and SYN/ACK, or any other packet with a response we can match to its request, such as the SSL handshake.

       

       

       

      Here we can see clearly that the poor RTT is outside the customer's environment, on the ISP link to Office 365. If this RTT is unexpected, the customer can engage their ISP to investigate.

      We've tested SharePoint on Office 365 to play nicely up to 300ms, and I've personally seen it work well at higher latency levels than this, as long as all other network settings (tcp window scaling, packet sizes, proxy performance etc) are all optimal. With Outlook the effect of latency is less pronounced as the software does a good job of masking it, but if it's too high, actions such as switching calendars/mailboxes may show a delay.

       

      So now I have my RTT, but what is a good one and bad?

      That's always the key question, but the answer is subjective; in short, it depends on the scenario. The longer the distance, the higher the RTT will be.

      • Internal to your environment look for <100ms, ideally much less.
      • From UK site to EMEA Datacenter <100ms total should be the aim. Ideally much less than that. For example my home connection above is showing 16ms (admittedly there is very little of the network equipment in the way which you'll see on a corporate network)
      • Australia <-> EMEA can be done in 300ms, as a reference
      • Verizon have a handy table of latency between various endpoints on their network here http://www.verizonenterprise.com/about/network/latency/ which will give you a good idea of the latency you can expect between these points with such a carrier, and for you to compare yours to.
      • Having both internal and external RTT allows us to accurately identify if a network latency issue is inside your environment or outside.
      • It's useful to do this test as a baseline during normal operation so it can be referred to if issues occur we can then work out if any segment has increased RTT.

       

      What about application level latency?

      With this method we're only looking at network level latency. As the traffic to O365 will be encrypted you'll need other methods to measure application level latency.

      Outlook:

      For Outlook, the software itself measures RPC latency, and my Microsoft colleague Neil Johnson has a great blog here which describes how to look at this:

      http://blogs.technet.com/b/neiljohn/archive/2012/01/23/outlook-performance-troubleshooting-including-office-365.aspx

      SharePoint:

      For SharePoint, we can utilise a couple of 3rd party tools to look at the encrypted application level requests in the clear.

      My tool of choice is the great HTTPWatch, which plugs into your browser and shows page load times in great detail, helping you troubleshoot which element of a page is taking a long time, and showing you clearly the timing between a GET request and its response, plus the rendering time in the browser. I'll write a more detailed blog post on this tool when I get time.

      Another tool is Fiddler, which also does a great job of showing you inside encrypted traffic. I tend not to use it often as I find it can alter the behaviour of the receiving application. For example, if we're doing byte range requests for a file (which allows us to pause and resume downloads), it can disable that feature and just send one GET request for the file, which changes the behaviour I'm trying to baseline/troubleshoot.

       

      That's all for now. I hope this helps with making sure your O365 solution is connecting as quickly as possible!

  • Ensuring your Office 365 network connection isn’t throttled by your Proxy

    One of the things I regularly run into when looking at performance issues with Office 365 customers is a network restricted by having TCP window scaling disabled on the network egress point, such as the proxy used to connect to Office 365. This affects all connections through the device, be it to Azure, bbc.co.uk or Bing. We'll concentrate on Office 365 here, but the advice applies regardless of where you're connecting to.

    I'll explain shortly what this is and why it's so important, but to give you an idea of the impact: one customer I worked with recently saw the download time of a 14MB PDF, from an EMEA tenant to an Australia (Sydney) based client, improve from 500 seconds to 32 seconds after properly enabling this setting. Imagine that kind of impact on the performance of your SharePoint pages, or on pulling large emails in from Exchange Online; it's a noticeable improvement which can change Office 365, in some circumstances, from being frustrating to use to being comparable to on-prem.

    Below is a visual representation of a real world example of the impact of TCP Window scaling on Office 365 use.

    The premise is quite complicated, hence the very wordy blog for those of you interested in the detail, but the resolution is easy: it's normally a single setting on a proxy or NAT device, so just skip to the end if you want the solution.

    What is TCP Window scaling?

    When the TCP RFC was first designed, a 16-bit field in the TCP header was reserved for the TCP window. This is essentially a receive buffer: you can send data to another machine up to the limit of this buffer without waiting for an acknowledgement from the receiver to say it has received the data.

    This 16-bit field means the maximum value it can hold is 2^16 - 1, or 65535 bytes. I'm sure when this was envisaged, it was thought it would be difficult to send such an amount of data so quickly that we could saturate the buffer.

    However, as we now know, computing and networks have developed at such a pace that this value is now relatively tiny and it's entirely feasible to fill this buffer in a matter of milliseconds. When this occurs it causes the sender to back off sending until it receives an acknowledgement from the receiving machine which has the obvious effect of causing slow throughput.

    A solution was provided in RFC 1323, which describes the TCP window scaling mechanism. This allows us to carry a shift count in a three-byte TCP option, which is in essence a multiplier of the TCP window size described above.

    It's referred to as the shift count (the advertised window field is left-shifted by this number of bits) and has a valid range of 0-14. With this we can theoretically increase the TCP window size to around 1GB.

    The way we get to the actual TCP window size is (2^ScaleFactor) * window field, so taking the maximum possible values: (2^14) * 65535 = 1073725440 bytes, or just over 1GB.
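    As a quick sanity check of that arithmetic, in Python:

    ```python
    def scaled_window(window_field, shift_count):
        # Effective receive window: the 16-bit window field left-shifted by
        # the scale factor negotiated in the handshake (RFC 1323).
        return window_field << shift_count  # == window_field * 2**shift_count

    print(scaled_window(65535, 14))  # maximum possible: 1073725440 bytes
    print(scaled_window(65535, 0))   # scaling off: 65535 bytes
    ```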

     

    TCP Window Scaling enabled?    Maximum TCP receive buffer (bytes)
    No                             65535 (64KB)
    Yes                            1073725440 (1GB)

     

    What is the impact of having this disabled?

    The obvious issue is that if we fill this buffer, the sender has to back off sending any more data until it receives an acknowledgement from the receiving node so we end up with a stop start data transfer which will be slow. However, there is a bigger issue and it's most apparent on high bandwidth, high latency connections, for example, an intercontinental link (from Australia to an EMEA datacentre for example) or a satellite link.

    To use a high bandwidth link efficiently we want to fill the connection with as much data as possible, as quickly as possible. With the TCP window size limited to 64k when window scaling is disabled, we can't get anywhere near filling the pipe and using all the available bandwidth.

    With a couple of bits of data we can work out exactly what the mathematical maximum throughput can be on the link.

    Let's assume we've got a round trip time (RTT) from Australia to the EMEA datacentre of 300ms (the RTT being the time it takes to get a packet to the other end of the link and back), and TCP window scaling disabled, giving a maximum TCP window of 65535 bytes.

    So with these two figures, and an assumption the link is 1000Mbit/sec we can work out the maximum throughput by using the following calculation:

     

    Throughput = TCP maximum receive window size / RTT

     

    There are various calculators available online to help you check your maths with this, a good example which I use is http://www.switch.ch/network/tools/tcp_throughput/

     

    • TCP buffer required to reach 1000 Mbps with RTT of 300.0 ms >= 38400.0 KByte
    • Maximum throughput with a TCP window of 64 KByte and RTT of 300.0 ms <= 1.71 Mbit/sec.

     

    So we need a TCP window size of 38400 KByte to saturate the pipe and use the full bandwidth; instead we're limited to <=1.71 Mbit/sec on this link due to the 64k window, which is a fraction of the possible throughput.

    The higher the round trip time, the more obvious this problem becomes, but as the table below shows, even with an RTT of 1ms (which is extremely good and unlikely to occur other than on the local network segment) we cannot fully utilise the 1000 Mbps of the link.

    Presuming a 1000 Mbps link, here is the maximum throughput we can get with TCP window scaling disabled:

    RTT (ms)    Maximum Throughput (Mbit/sec)
    300         1.71
    200         2.56
    100         5.12
    50          10.24
    25          20.48
    10          51.20
    5           102.40
    1           512.00

    So it's clear from this table that even with an extremely low RTT of 1ms, we simply cannot use this link to full capacity due to having window scaling disabled.
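    The table can be reproduced with the throughput formula above. Here's a short sketch which, to match the figures, assumes a 64 KByte window and the mixed units the online calculator appears to use (1 KByte = 1024 bytes, 1 Mbit = 1000 Kbit):

    ```python
    def max_throughput_mbit(window_bytes, rtt_ms):
        # Maximum TCP throughput for a fixed receive window:
        # throughput = window size / RTT.
        window_kbit = window_bytes * 8 / 1024.0
        return window_kbit / (rtt_ms / 1000.0) / 1000.0

    for rtt in (300, 200, 100, 50, 25, 10, 5, 1):
        print(f"{rtt:4d}ms -> {max_throughput_mbit(64 * 1024, rtt):7.2f} Mbit/sec")
    ```

    This reproduces the 1.71, 2.56, 5.12 … 512.00 figures in the table.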

     

    What is the impact of enabling this setting?

    We used to see the occasional issue when older networking equipment didn't understand the scale factor, which caused connectivity problems. However, it's very rare to see issues like this nowadays, so it's extremely safe to have the setting enabled. Windows has had it enabled by default since Windows 2000 (if my memory serves).

    As we're increasing the window size beyond 64k, the sending machine can push more data onto the network before having to stop and wait for an acknowledgement from the receiver.

    To compare the maximum throughput with the table above, the following table assumes a modest scale factor of 8 and a maximum TCP window field of 64k.

    (2^8)*65535 = Maximum window size of 16776960 bytes

    The following table shows we can saturate the 1000 Mbps link at a 100ms RTT when we have window scaling enabled. Even on a 300ms RTT we're still able to achieve a massively greater throughput than with window scaling disabled.

    RTT (ms)    Maximum Throughput (Mbit/sec)
    300         447.36
    200         655.32
    100         1310.64
    50          2684.16
    25          5368.32
    10          13420.80
    5           26841.60
    1           134208.00

     

    In reality you'll probably not see such high window sizes, as the window is generally increased slowly as the transfer progresses; it's not good behaviour to instantly flood the link. However, with window scaling enabled you should see the window quickly rise above that 64k mark, and the throughput rises with it. Windows has sophisticated algorithms (from Vista onwards) which work out the optimum TCP window size on a per-link basis.

    For example, Windows might start with an 8k window (multiplied by the scale factor) and slowly increase it over the length of the transfer, all being well.

    How do I check if it's enabled?

    So we've seen how essential this setting is, but how do you know if it's disabled in your environment? Well, there are a number of ways.

    Windows should have this enabled by default but you can check by running (on a modern OS version)

    netsh int tcp show global

    "Receive Window Auto-Tuning level" should be "normal" by default.

    On a proxy, the name of this setting varies by device but may be referred to as RFC1323 such as is the case with Bluecoat.

    https://kb.bluecoat.com/index?page=content&id=FAQ1006

    This article suggests Bluecoat needs the TCP window value set above 64k and the RFC1323 setting enabled for it to use window scaling. A lot of the older Bluecoat devices I've seen at customer sites have the window set at 64k, which means scaling isn't used.

    However, we can quickly check that it is being used by taking a packet trace on the proxy itself. Bluecoat and Cisco proxies have built in mechanisms to take a packet capture. You can then open the trace in Netmon or Wireshark and look for the following.

    Firstly you need to find the session you are interested in and look at the TCP 3-way handshake. The scale factor is negotiated in the TCP options of the handshake; Netmon shows it in the description field, as shown below.

    Here you can see the proxy connecting to Office 365 and offering a scaling factor of zero, meaning we have a 64k window for Office 365 to send us data.

     

    7692       12:28:03 14/03/2014        12:28:03.8450000              0.0000000            100.8450000                        MyProxy             contoso.sharepointonline.com                TCP         TCP: [Bad CheckSum]Flags=......S., SrcPort=43210, DstPort=HTTPS(443), PayloadLen=0, Seq=3807440828, Ack=0, Win=65535 ( Negotiating scale factor 0x0 ) = 65535 

    7740       12:28:04 14/03/2014        12:28:04.1440000              0.2990000            101.1440000                        contoso.sharepointonline.com MyProxy TCP         TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=43210, PayloadLen=0, Seq=3293427307, Ack=3807440829, Win=4380 ( Negotiated scale factor 0x2 ) = 17520 

     

    In the TCP options of the packet you'll see this.

     

    - TCPOptions:

    + MaxSegmentSize: 1

    + NoOption:

    + WindowsScaleFactor: ShiftCount: 0

    + NoOption:

    + NoOption:

    + TimeStamp:

    + SACKPermitted:

    + EndofOptionList:

     

    If you're using Wireshark then it'll look something like this.

     

    1748    2014-03-14 12:56:40.277292    My Proxy    contoso.sharepointonline.com TCP    78    39747 > https [SYN] Seq=0 Win=65535 Len=0 MSS=1370 WS=1 TSval=653970932 TSecr=0 SACK_PERM=1    

     

    In the TCP options of the SYN/SYN-ACK packet, the following is shown if window scaling is disabled:

    Window scale: 0 (multiply by 1)

    If you can't take a trace on the proxy itself, try taking one on a client connecting to the proxy and look at the SYN/ACK coming back from the proxy. If scaling is disabled there, it's highly likely the proxy is also disabling it on its outbound connections.

    It's worth pointing out here that the scale factor is negotiated in the three-way handshake and is then fixed for the lifetime of the TCP session. The window size, however, is variable: the 16-bit field can be increased to a maximum of 65535 (which is then multiplied by the scale factor) or decreased during the lifetime of the session. (This is how autotuning is managed in Windows.)
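    To make that concrete with the values from the handshake capture above (Win=4380, negotiated scale factor 0x2):

    ```python
    shift_count = 2      # fixed at the handshake for the session's lifetime
    window_field = 4380  # variable; re-advertised in every segment

    # Effective receive window = field value << negotiated shift count
    print(window_field << shift_count)  # 17520 bytes, matching the trace
    ```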

     

    Summary:

    As we've seen here, TCP window scaling is an important feature, and it's highly recommended that you ensure it is used on your network perimeter devices if you want to fully utilise the available bandwidth and have optimum performance for network traffic. Enabling the setting is normally very simple, but the method varies by device; sometimes it'll be referred to as RFC1323 and sometimes TCP Window Scaling, so refer to the vendor supplying the device for specific instructions.

    Having this disabled can have quite a performance impact on Office 365 and any other traffic flowing through the device. The impact grows with the bandwidth and the RTT, so it is most apparent on high speed, long distance WAN links.

    http://technet.microsoft.com/en-us/magazine/2007.01.cableguy.aspx is a good article with a bit more detail on areas of this subject I haven't covered here.

    This setting is generally safe to use and it's rare that we see issues with it nowadays. Just keep an eye on the traffic after enabling it and look out for dropped TCP sessions or poor performance, which may indicate there is a device on the route that doesn't understand or handle window scaling well.

    Ensuring this setting is enabled should be one of the first steps you take when assessing whether your network is optimised for Office 365, as having it disabled can have an enormous impact on the performance of the service and can be a very tricky problem to track down until you know how to look for it. And now you do!

     

  • Network Perimeters & TCP Idle session settings for Outlook on Office 365

    The Problem:

    One issue I run into time and time again when I do a network assessment for customers using Office 365, or troubleshoot issues with performance and connectivity with O365, is one with TCP Idle session settings on perimeter devices.

    Perimeter networks, including proxies and firewalls, are normally designed for internet access to web pages, which by its very nature tends to be transient. This means we don't expect TCP sessions to be idle for a long time, they make a request, get the response, and close.

    These perimeter networks are therefore often configured with this in mind, and any idle TCP session, that is, one which has seen no traffic for a period of time, is forcibly closed, or more commonly, simply dropped at the network edge device.

    When using a web page for example, the user wouldn't notice any issue with this, and if they refresh the page, or click on a link within that page, a new TCP connection will be fired up, unbeknown to the user.

    As we move into a Cloud-connected world, this model, which worked well for years, needs to be revisited as it doesn't fit well with the way we now connect.

    Rather than being transient, Outlook connecting to Exchange (be it on-prem or Cloud based) opens up TCP connections and leaves them open for the length of time the application is open.

    Under most circumstances, these connections will see traffic on very regular intervals and thus any idle timeouts won't be an issue. However, it is a fact, and one I've seen occur many times, that Outlook, if not performing any actions, may not send any traffic on an open TCP connection for a long period of time.

    We saw this as an issue regularly when on-prem was prevalent. Firewalls would kill idle TCP sessions after a period of time, causing disconnect pop-ups in Outlook, hangs, or other problems within the application such as password prompts as it reconnected. These were often due to the firewall not informing the client of the disconnect by sending a reset. Thus when the client tried to use the connection again, it would send a packet, get no response, then retransmit five times, exponentially backing off each time until it gave up and fired up a new connection.

    This could take 30 seconds or more for the TCP retransmits to time out, causing hangs within the application whilst the retransmissions take place.
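    The 30-second figure can be sketched with some simple arithmetic. Assuming an initial RTO of 1 second (the real value is derived from the measured RTT, so this is illustrative only) and five retransmits, each waiting twice as long as the last:

```python
# Back-of-envelope arithmetic for the hang described above: assuming an
# initial RTO of 1 second (the real value is derived from the measured RTT)
# and five retransmits, the RTO doubles before each successive attempt.
def total_retransmit_wait(initial_rto_s=1.0, retransmits=5):
    return sum(initial_rto_s * (2 ** i) for i in range(retransmits))

print(total_retransmit_wait())  # 1 + 2 + 4 + 8 + 16 = 31.0 seconds
```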

    We used to fix this problem by either setting Windows to send a KeepAlive packet at an interval lower than the Firewall's idle timeout value, or adjust the firewall settings.

    http://support.microsoft.com/kb/2535656 describes this in more detail.

    KeepAlive packets are small packets which are sent at an interval to ensure the other end is still listening. The default for Windows Applications, when enabled on the socket, is 2 hours.
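    As a hedged illustration, enabling keepalives on a socket looks like this in Python; the host-wide 2-hour default applies unless it's tuned in the registry or per socket (the 25-minute figure used here is an example, not a mandated value):

```python
import socket

# A minimal sketch: turn on TCP keepalives for a client socket so an idle
# connection still sends periodic probes.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# On Windows only, the idle time and probe interval can be overridden per
# socket with SIO_KEEPALIVE_VALS: (on/off, idle ms, interval ms)
if hasattr(socket, "SIO_KEEPALIVE_VALS"):
    sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, 25 * 60 * 1000, 1000))

print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
```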

    With Office 365, we need to be even more wary of this problem. Unlike on-premises deployments, where we're mainly connected over a LAN which doesn't often employ firewalls, we're now punching out to the Internet through a perimeter network awash with proxies and firewalls. These proxies and firewalls are all configured, quite understandably, for transient internet traffic.

    As Outlook keeps TCP connections open for a long time, and may not send data on those sessions for extended periods, we need to revisit the design of our perimeter networks to encompass and reflect this new method of connecting through them.

    A common example I've seen with these settings is where they are set, at different values, on three perimeter devices that O365 traffic flows through:

     

    • NetScalers: HTTP/SSL timeouts 140 seconds
    • Proxies: 100 seconds
    • Firewalls: 300 seconds

     

    These values, whilst fine for transient network traffic, as they will clean up any orphaned, unused TCP sessions quickly, do not fit well with the way Outlook works.

     

    Identifying the Issue:

    Below, in my reproduction of this issue, you can see three examples of long idle times between packets on connections to Office 365 from a Windows client running Office 2010. Whilst none here exceed the 100 seconds at which the example proxies would kill the session, they come from only a small sample and show that it is normal for Outlook/Exchange TCP sessions to sit idle for extended periods.

    Example 1:

    2607    11:17:09 25/02/2014    599.3042852    0.0000209    OUTLOOK.EXE    Client    Proxy    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=51225, DstPort=HTTP (80), PayloadLen=0, Seq=2668778565, Ack=1520117633, Win=16318 (scale factor 0x2) = 65272    {TCP:27, IPv4:4}        

    2635    11:18:27 25/02/2014    677.3717532    78.0674680    OUTLOOK.EXE    Proxy    Client    TLS    TLS:TLS Rec Layer-1 SSL Application Data    {TLS:32, SSLVersionSelector:31, HTTP:28,

    Example 2:

    2083    11:07:49 25/02/2014    39.9143045    0.0000522    OUTLOOK.EXE    Client    Proxy    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=51256, DstPort=HTTP (80), PayloadLen=0, Seq=4013955568, Ack=1992955982, Win=16141 (scale factor 0x2) = 64564    {TCP:29, IPv4:4}        

    2373    11:09:11 25/02/2014    121.8444019    81.9300974    OUTLOOK.EXE    Proxy    Client    TLS    TLS:TLS Rec Layer-1 SSL Application Data    {TLS:34, SSLVersionSelector:33, HTTP:30,

    Example 3:

    2643    11:47:38 25/02/2014    652.2202260    0.1199924    OUTLOOK.EXE    Proxy    Client    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=HTTP (80), DstPort=51405, PayloadLen=0, Seq=3293529079, Ack=944389853, Win=4095 (scale factor 0x4) = 65520    {TCP:70, IPv4:4}        

    2669    11:48:36 25/02/2014    710.7751792    58.5549532    OUTLOOK.EXE    Client    Proxy    TLS    TLS:TLS Rec Layer-1 SSL Application Data; TLS Rec Layer-2 SSL Application Data    

     

    With this current setup, TCP sessions from Outlook clients to Office 365 in the environment will be prematurely closed on occasion by these idle timeouts on the perimeter network, when the gap between packets is greater than 100 seconds.

    When troubleshooting this issue, use Netmon or Wireshark on the client machine and filter on the connections from Outlook to the proxy/Office 365. If you are hitting this issue you'll see a long delay between packets, as seen below, then most likely a series of five retransmits after the delay and eventually a reset as the client closes the connection due to no response. I'll try and reproduce this issue with netmon running and update this blog with an example trace at a future point in time.

    The above scenario presumes the perimeter device hasn't informed the client it's closed the connection by sending a reset (silently dropping the connection). If a reset is sent at the idle timeout, you'll simply see a time gap, and a reset then arrive from the Proxy/Firewall.
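    If you've exported per-packet timestamps from such a trace, a small helper can flag the suspect gaps. The first example below uses the real timestamps from Example 1 above; the second pair of timestamps is hypothetical, to show a gap that would exceed a 100-second proxy timeout:

```python
# A hedged helper for this step: given per-packet timestamps (seconds from
# trace start, as in the frame summaries above) for a single conversation,
# flag any inter-packet gap longer than the perimeter's idle timeout.
def idle_gaps(timestamps, idle_timeout_s=100):
    return [(round(b - a, 1), a, b)
            for a, b in zip(timestamps, timestamps[1:])
            if b - a > idle_timeout_s]

# The ~78 s gap from Example 1 survives a 100 s timeout...
print(idle_gaps([599.3, 677.4]))  # []
# ...but this hypothetical 121.9 s gap would see the session dropped
print(idle_gaps([39.9, 161.8]))   # [(121.9, 39.9, 161.8)]
```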

     

    Symptoms:

    In my experience this behaviour is likely to cause the following problems, amongst others:

    • Disconnect pop ups in Outlook
    • Unexpected authentication prompts
    • Hangs within Outlook where we get a 'polo mint' (the spinning wait cursor), especially when switching mailboxes/calendars.
    • Performance problems
    • Mail stuck in outbox for an extended period

    When I encounter this issue, it's most commonly the power users who experience the problem, people like Exec PAs who switch between mailboxes and calendars on a fairly regular basis. The reason they see it more often is down to Outlook opening more TCP connections for those extra mailboxes, and the fact those connections aren't open/used all the time so are more at risk of being timed out at the firewall/proxy. When the user then switches mailboxes after this has happened, they run into issues/hangs.

    You may also see the issue occur more regularly after breaks or at lunchtime when you or your users leave the computer unused for an extended period.

     

    Resolution Advice:

    I have searched both internally and externally for 'official' advice around this and found none, so I have devised the following based on my experience of fixing similar issues in customer environments, both on-premises and Office 365.

     

    1. Bring the SSL/TCP idle session timeouts on all perimeter devices into line with each other. Ideally, and if feasible, keep a low setting of around 2-3 minutes for normal internet traffic, but create a separate rule for Office 365 traffic and increase its value as high as possible, ideally over 2 hours (as Windows will send a keepalive by default at 2 hours).

       

    2. If the above isn't possible, then we can attack the problem in a combined way.

      Set the idle timeout on all perimeter devices to 30 minutes, and on the Windows clients edit the KeepAliveTime value in the registry to 25 minutes (1500000 milliseconds). This will cause any application (including Outlook) which utilises keepalives to send a packet every 25 minutes, which will prevent the perimeter devices, with a timeout of 30 minutes, from closing a connection which is still in use. However, it will still enforce a clean-up of genuinely orphaned TCP sessions, albeit every 30 minutes instead of every 2-6 minutes.

      If this is thought to be too long a time for the perimeter, then these values can be reduced to suit the desired state by your network team. For example, Perimeter idle session timeout at 5 minutes and keep alive packets on Windows set at 4 minutes. Although I think this may be a little too aggressive, there is scope for a happy medium.

      http://blogs.technet.com/b/nettracer/archive/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives.aspx describes the keepalive key in more detail.
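      For reference, a sketch of what option 2's client-side change amounts to: setting the KeepAliveTime DWORD under the Tcpip\Parameters key to 25 minutes, expressed in milliseconds. The script below is illustrative only (Windows-only, requires admin rights, and a reboot for the change to take effect):

```python
# A hedged sketch of the registry change described above (option 2).
# Key path and value name are from the linked KB article.
KEEP_ALIVE_MS = 25 * 60 * 1000  # 1,500,000 ms = 25 minutes

def set_keepalive_time(milliseconds=KEEP_ALIVE_MS):
    import winreg  # Windows-only module; imported here so the sketch loads anywhere
    with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE,
            r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
            0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "KeepAliveTime", 0, winreg.REG_DWORD, milliseconds)

print(KEEP_ALIVE_MS)  # 1500000
```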

     

    *Update

    I'm aware of several articles such as http://support.microsoft.com/kb/2637629 which refer to an 8-hour HTTP/SSL timeout. I suspect the author is referring to an 8-hour timeout on active TCP sessions, i.e. the firewall may have a setting to only allow TCP sessions going through it to be open for x hours, regardless of whether they're used or not.

    There wouldn't be much point setting this value for 8 hours for Idle TCP sessions as the default keepalive value is 2 hours so we'd never get anywhere near 8 hours. I doubt any network/security team would allow such a high value on idle TCP sessions anyway. So if your firewall or proxy has a setting to kill TCP sessions after they have been open for a set period, make sure this is set to 8 hours minimum for O365 traffic then look at setting the idle timeout as recommended above.

  • TCP Offloading/Chimney & RSS…What is it and should I disable it?

    Having decided to start this blog to convey my experience with network analysis and troubleshooting, one subject instantly sprang to mind for my first post, TCP Chimney/Offloading.

    I get asked about this so often I have a ready email of advice around what it is, and what I (note not Microsoft, although I think our official recommendation would likely mirror mine) recommend to customers around the use of it. My advice is based on thousands of customer cases I've handled over the years around this feature. I've therefore compiled what's hopefully a one-stop shop for all your TCP offloading needs. Apologies in advance if it's a bit wordy, but I've tried to convey everything I can around the subject for you.

    So why do I get asked about it all the time? Well, let's start with what it is and what it does.

     

    What is TCP Offloading/Chimney?

    Starting when Windows Server 2003 SP1 was the current server OS, Microsoft released the Scalable Networking Pack http://support.microsoft.com/kb/912222/en-us

    This turned on three distinct features in the OS:

     

    1.) RSS

    Where multiple CPUs reside in a single computer, the Windows networking stack limits "receive" protocol processing to a single CPU. In essence, the old design of dealing with all incoming network traffic on a single processor core was starting to cause a bottleneck on newer multiprocessor systems. RSS resolves this by enabling the packets received from a network adapter to be balanced across multiple CPUs: each incoming TCP connection is load balanced over the available cores, spreading the load and preventing a bottleneck. This is becoming a necessity as servers have to handle increasingly high volumes of network traffic.
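    Conceptually (this is an illustrative sketch only, not the Toeplitz hash and indirection table a real NIC uses), RSS amounts to hashing each connection's 4-tuple to pick a CPU, so a given flow always lands on the same core while distinct flows spread out:

```python
import hashlib

# Illustrative sketch: each connection's 4-tuple hashes to a CPU, so one
# flow is always processed on the same core (preserving packet ordering)
# while distinct flows spread across the available cores.
NUM_CPUS = 4

def rss_cpu(src_ip, src_port, dst_ip, dst_port, num_cpus=NUM_CPUS):
    four_tuple = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return hashlib.sha256(four_tuple).digest()[0] % num_cpus

# The same flow always lands on the same CPU...
print(rss_cpu("10.0.0.1", 50001, "10.0.0.2", 443)
      == rss_cpu("10.0.0.1", 50001, "10.0.0.2", 443))  # True
# ...while a batch of flows spreads over the cores
cpus = {rss_cpu("10.0.0.1", port, "10.0.0.2", 443) for port in range(50000, 50100)}
print(len(cpus))  # almost certainly 4: the 100 flows cover every CPU
```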

     

    2.) TCP Chimney (sometimes referred to as TCP Offloading)

    This feature is designed to move network processing tasks, such as packet segmentation and reassembly, from the computer's CPU to a network adapter that supports TCP Chimney Offload. This reduces the workload on the host CPU by shifting it to the NIC, allowing the host OS to perform more quickly while also speeding up the processing of network traffic.

     

    3.) Network Direct Memory Access (NetDMA)

    http://technet.microsoft.com/en-us/library/gg162716(WS.10).aspx

    The NetDMA interface provides generic access to direct memory access (DMA) engines that can perform memory-to-memory data transfers with little CPU involvement. Again, this is designed to take work away from the CPU by allowing the NIC to move data from receive buffers without using the CPU as much.

    Why would I want to disable it then?

    All these features sound brilliant, and they were only enabled with the installation of the Scalable Networking Pack, so why would you want to disable them?

    Well, with the release of Service Pack 2 for Windows Server 2003, Microsoft decided to include this Scalable Networking Pack and thus turn the features on. If a server has a NIC which supports these features, and it's enabled in the NIC properties (more on this later), then we'll use them. The problem with this approach was that many NICs reported to the OS that they supported these features, which they indeed did, but many didn't perform these functions very well at all in reality. With 2003 this was an all-or-nothing affair: if offloading was turned on, we offloaded all TCP connections on a supported NIC, regardless of whether they would benefit or not.

    We had a great number of issues through Microsoft support over the course of the next year or so caused by drivers misbehaving with these features, which had a knock-on effect on network traffic. This in turn caused a wide range of weird and wonderful symptoms seen across the board, from the Exchange team to SQL to BizTalk to IIS to ISA. Most of these landed in the lap of my colleagues and I in the Networking support team, and as such we've probably seen issues numbering in the thousands caused by these features where turning them off resolved the problem; so much so that turning off Offloading/RSS became an almost standard troubleshooting step with 2003 cases.

    Most NIC vendors have released numerous updates over the years to resolve these issues, as has Microsoft to improve the feature in the OS; however, my general advice with Server 2003 is to disable these features.

    In essence, with Server 2003 this feature tends to cause more problems than it solves, so I tend to recommend it is turned off. It doesn't really become effective until you get high-speed (10 Gbps), low-latency networks and, as mentioned, many older NIC drivers don't implement it well, and thus we get a lot of issues. If enabled, the bare minimum recommendation would be that the NIC and teaming drivers and firmware are the latest available and http://support.microsoft.com/kb/950224/en-us is applied.

     

    How do I turn off these features in Server 2003?

    If the features aren't needed then they should be turned off; we can do this in one of two ways.

    The safest way is to do this in the registry by using method 3 in http://support.microsoft.com/kb/948496/en-us which sets the registry to configure these features to be off.

     

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

     

    • Right-click EnableTCPChimney, and then click Modify.
    • In the Value data box, type 0, and then click OK.
    • Right-click EnableRSS, and then click Modify.
    • In the Value data box, type 0, and then click OK.
    • Right-click EnableTCPA, and then click Modify.
    • In the Value data box, type 0, and then click OK.
    • Exit Registry Editor, and then restart the computer.

     

    Alternatively you can disable the features in the NIC properties of most NICs, however the naming convention and exposure of these settings varies from NIC to NIC and also from driver to driver.

     

    If the feature is disabled in either the NIC properties or the registry, it's off, regardless of whether the other is on. This is why I recommend you use the registry, as it won't be affected by driver updates etc. and is much easier to control centrally.

     

    So what about newer OS versions?

    The first point of note is that Microsoft made a lot of effort in the newer OS versions to ensure the drivers were up to the job, and also a lot of improvements in the implementation of the features and TCP stack in the OS which makes the enabling of the features much safer in post 2003 OS versions.

     

    Server 2008

    In Windows Server 2008 (not R2) Offloading is turned off by default anyway. http://support.microsoft.com/kb/951037/en-us

    You can enable it using the NIC properties or by using Netsh, which is outlined in the link above. The offloading capabilities are more granular in 2008 than they were in 2003; we only offload suitable network connections.

    As before, if using this feature it's always wise to ensure the latest NIC drivers and firmware are installed so that the NIC manufacturer's latest updates are in place.

    Ensure you have http://support.microsoft.com/kb/976035/en-us installed on top of SP2 to prevent an unexpected restart scenario.

    If the above steps are done, in my experience it's very safe to turn the feature on in 2008 if you feel it is needed.

    To check the state of offloading you can run the following steps:

     

    Run Netstat -t in a command prompt and you'll get the following output:

     

    Active Connections

     

    Proto Local Address Foreign Address State Offload State

    TCP 127.0.0.1:52613 computer_name:52614 ESTABLISHED InHost

    TCP 192.168.1.103:52614 computer_name:52613 ESTABLISHED Offloaded

     

    InHost shows the connection is not offloaded and is thus handled by the OS; Offloaded means exactly that.
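    If you need to check this at scale, the Offload State column is easy to parse programmatically. A hedged sketch, using the sample output above (in practice you would feed in the stdout of netstat -t):

```python
# Count offloaded vs in-host connections by parsing the "Offload State"
# column of `netstat -t` (Vista/2008 onwards). The sample string mirrors
# the output shown above.
sample = """\
Active Connections

  Proto  Local Address         Foreign Address        State        Offload State
  TCP    127.0.0.1:52613       computer_name:52614    ESTABLISHED  InHost
  TCP    192.168.1.103:52614   computer_name:52613    ESTABLISHED  Offloaded
"""

def count_offload_states(text):
    counts = {"InHost": 0, "Offloaded": 0}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TCP" and fields[-1] in counts:
            counts[fields[-1]] += 1
    return counts

print(count_offload_states(sample))  # {'InHost': 1, 'Offloaded': 1}
```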

    For those of you looking in memory dumps and wondering if these features are in use, you should be able to dump the registry keys used to set this by running x tcpip!*disable* in windbg for server 2003.

    In this example, both RSS and TCP Chimney are disabled.

    x tcpip!*disable*

    b8f1a0d4 tcpip!DisableRSS = 1

    b8f1a360 tcpip!DisableUserTOSSetting = 1

    b8f1df34 tcpip!DisableMediaSenseEventLog = 0

    b8f1a0d0 tcpip!DisableTCPChimney = 1

    b8f1ae54 tcpip!DisableTaskOffload = 0

    b8f1cdc0 tcpip!DisableLargeSendOffload = 0

    b8f1a0b4 tcpip!DisableIPSourceRouting = 2

    b8f1ae4c tcpip!DisableMediaSense = 0

    b8f1a0ec tcpip!DisableUserTOS = 1

    b8f01ff3 tcpip!DisableRouter (void)

    b8f0c3b0 tcpip!IPDisableMediaSenseRequest (struct _IRP *, struct _IO_STACK_LOCATION *)

    b8f106d6 tcpip!OlmDisableOffloadOnInterface (unsigned int)

    b8f04d4b tcpip!IPDisableChimneyOffload (struct _IRP *, struct _IO_STACK_LOCATION *)

    b8f048bf tcpip!IPDisableSniffer (struct _UNICODE_STRING *)

     

    As these structures don't exist in 2008 and later, you'll need to use another command; I'm currently trying to confirm the best method and will update the blog with the info.

     

    Server 2008 R2

    With Server 2008 R2 this feature is much more intelligent; it'll only offload when the conditions are right, as per http://technet.microsoft.com/en-us/library/gg162709(WS.10).aspx

    Automatic. In automatic mode, TCP Chimney Offload considers offloading the processing for a connection only if the following criteria are met: the connection is established through a 10 Gbps Ethernet adapter, the mean round trip link latency is less than 20 milliseconds, and at least 130 KB of data has been exchanged over the connection. In automatic mode, the TCP receive window is set to 16 MB. Because the Windows stack has performance optimizations not found in Chimney-capable network adapters, automatic mode restricts offloads only to those connections that might receive the most benefit from it.
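    Expressed as a checklist (a paraphrase of the quoted criteria, not actual OS code), automatic mode only considers a connection for offload when all three conditions hold:

```python
# A paraphrase of the TechNet criteria quoted above, expressed as a checklist.
def chimney_offload_candidate(link_speed_gbps, mean_rtt_ms, bytes_exchanged):
    return (link_speed_gbps >= 10              # 10 Gbps Ethernet adapter
            and mean_rtt_ms < 20               # mean round-trip latency under 20 ms
            and bytes_exchanged >= 130 * 1024) # at least 130 KB exchanged

print(chimney_offload_candidate(10, 5, 200_000))  # True: fast LAN, bulk transfer
print(chimney_offload_candidate(1, 5, 200_000))   # False: only a 1 Gbps NIC
```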

    This is the default setting and I'd advise it's left as default. As always, ensure the latest NIC drivers/firmware are installed to remove the risk of any known issues, but in my experience, taken from many thousands of customers, this feature is a real benefit to the OS. In fact I've seen multiple customers who have gotten into the habit of disabling these features in their OS build following the issues they experienced with Server 2003. I've been called out to look at performance issues and, when we've re-enabled the features, noticed a massive performance improvement.

    If you are getting problems which are resolved by turning off TOE in 2008 R2, my first step would be to update the NIC driver and firmware as there are almost always updates for the NICs which resolve the majority of offloading issues I encounter.

    If the problem persists, turning off offloading is the wrong thing to do; raise a case with Microsoft and we'll help you get to the bottom of it. By having a policy of disabling these features, you are effectively restricting your Windows platform's network performance for the sake of one or two issues which could be investigated and resolved.

    The performance improvement on certain connections is enormous and shouldn't be thrown away due to habit (i.e. the 2003 behaviour) or a few issues which haven't been fully investigated; in my personal experience, quick fixes will eventually come round and bite you.

    To manage the settings in 2008 R2 the following KB gives more information on the Netsh commands available.

    http://technet.microsoft.com/en-us/library/gg162682(v=ws.10).aspx

    It's also advised to install http://support.microsoft.com/kb/2477730/en-us to resolve an issue with offloading in 2008 R2; this is non-urgent so could be planned into your next change window.

    Server 2012

    Offloading in Windows Server 2012 works much as it does in server 2008 R2 so the same advice applies. RSS however becomes more important in this OS due to the fact SMB Multichannel relies on it.

    http://support.microsoft.com/kb/2846837/en-us is a recommended hotfix for RSS in server 2012.

    Additional points of note around offloading:

    Large send offload and checksum offload

    I've seen many references on the internet pointing to things around TCP task offloading, such as checksum offloading and Large Send Offload, being related to TCP Chimney. It's important to note that these are not related to the TCP Chimney offload described above. Checksum offload is where we allow the NIC to set the checksum on a packet when it leaves the machine (which is why Netmon and Wireshark often show "incorrect checksum" on outgoing packets: the driver which captures them sits above the NIC, where the checksum is set). Large Send Offload (LSO) allows the application layer to hand down a packet which would be too big for transmission and lets the NIC chop it up into transmittable sizes (which is why you can see packets with more than 1460 bytes of payload in Netmon/Wireshark).

    These can be set in the NIC properties but are generally very safe to leave on. You may want to disable LSO if you're sniffing traffic, as you won't be seeing the packets as they are transmitted on the wire.
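    As an illustrative sketch of what LSO delegates to the hardware: the stack hands down one oversized buffer and the adapter slices it into MSS-sized segments for the wire (1460 bytes being a typical Ethernet TCP MSS):

```python
# Illustrative sketch of LSO: the stack passes one oversized buffer and the
# NIC segments it into MSS-sized packets for transmission.
MSS = 1460

def segment(payload, mss=MSS):
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

print([len(chunk) for chunk in segment(b"x" * 4000)])  # [1460, 1460, 1080]
```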

     

     

    Network tracing offloaded connections:

    Another reason you may want to disable TCP offloading is if you want to take a network trace. Both Netmon's filter driver and Wireshark's will show you only the three-way handshake and the session tear-down if offloading is being used. This is due to where the drivers sit: when offloading is used, the data bypasses these drivers, so you'll only see the part of the session the OS is responsible for, the session setup and tear-down.

    http://blogs.technet.com/b/messageanalyzer/archive/2012/09/17/meet-the-successor-to-microsoft-network-monitor.aspx is a new tool from Microsoft which allows us to trace at different layers other than NDIS (where netmon sits) and thus may allow you to work round this issue depending on the scenario.

    So for a short summary my recommendations for Offloading and RSS are:

    Server OS version | RSS/Chimney on by default | Recommended setting | Methods to disable | Additional recommendations
    --- | --- | --- | --- | ---
    2003 SP2 | Yes | Turn off unless needed | NIC properties or registry | Update NIC drivers and apply http://support.microsoft.com/kb/950224/en-us
    2008 | No | Turn on only if needed | NIC properties or Netsh | Update NIC drivers and, if enabled, apply http://support.microsoft.com/kb/979614/en-us and http://support.microsoft.com/kb/967224/en-us
    2008 R2 | Yes (only offloads suitable connections) | Leave enabled | NIC properties or Netsh | Update NIC drivers and apply http://support.microsoft.com/kb/2958399 and http://support.microsoft.com/kb/2511305
    2012 inc R2 | Yes (only offloads suitable connections) | Leave enabled | NIC properties or Netsh | Update NIC drivers and apply http://support.microsoft.com/kb/2885978

    Please note, the hotfixes may seem irrelevant but are the latest versions (as of 19 August 2014) of the relevant binaries which contain the code for handling RSS and Offloading and thus contain any hotfixes to this date which may help performance & functionality in this area.

    • Server 2003: Turn it off unless absolutely needed.
    • Server 2008: Off by default; turn on if needed after a NIC driver update and Windows hotfix.
    • Server 2008 R2/2012: On (automatic mode) by default; leave as default and update NIC drivers if possible. Install the hotfix on 2008 R2 in your next change window.

     

    *Update.

     

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2055853 is a recommended fix if you're using RSS and VMware.

     

    Hopefully this post has provided you with all you need to know around TCP offloading/RSS and has encouraged you to keep it turned on in newer OS versions.