On The Wire

A detailed look inside the Ethernet Cable

March, 2014

Posts
  • Network Perimeters & TCP Idle session settings for Outlook on Office 365

    The Problem:

    One issue I run into time and time again when I do a network assessment for customers using Office 365, or troubleshoot issues with performance and connectivity with O365, is one with TCP Idle session settings on perimeter devices.

    Perimeter networks, including proxies and firewalls, are normally designed for internet access to web pages, which by its very nature tends to be transient. This means we don't expect TCP sessions to be idle for a long time, they make a request, get the response, and close.

    These perimeter networks are therefore often configured with this in mind, and any idle TCP session, that is, one which has seen no traffic for a period of time, is forcibly closed, or more commonly, simply dropped at the network edge device.

    When using a web page for example, the user wouldn't notice any issue with this, and if they refresh the page, or click on a link within that page, a new TCP connection will be fired up, unbeknown to the user.

As we move into a Cloud connected world, this model, which worked well for years, needs to be revisited, as it doesn't fit the way we now connect.

    Rather than being transient, Outlook connecting to Exchange (be it on-prem or Cloud based) opens up TCP connections and leaves them open for the length of time the application is open.

    Under most circumstances, these connections will see traffic on very regular intervals and thus any idle timeouts won't be an issue. However, it is a fact, and one I've seen occur many times, that Outlook, if not performing any actions, may not send any traffic on an open TCP connection for a long period of time.

We saw this issue regularly when On-Prem was prevalent. Firewalls would kill idle TCP sessions after a period of time, causing disconnect pop-ups in Outlook, hangs, or other problems within the application such as password prompts as it reconnected. These were often due to the firewall not informing the client of the disconnect by sending a reset. Thus when the client tried to use the connection again, it would send a packet, get no response, then retransmit five times, backing off exponentially each time until it gave up and fired up a new connection.

This retransmission sequence could take 30 seconds or more to time out, causing hangs within the application whilst the retransmissions took place.
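The arithmetic behind that delay is simple exponential backoff. Here's a rough sketch of the idea; the 1 second initial retransmission timeout and the five-retry limit are illustrative assumptions (the real RTO is derived from the measured round trip time):

```python
# Illustrative only: total time TCP spends retransmitting before giving up,
# assuming an initial RTO of 1 second that doubles on each of 5 retries.
def total_retransmit_time(initial_rto=1.0, retries=5):
    total = 0.0
    rto = initial_rto
    for _ in range(retries):
        total += rto   # wait this long before the next retransmit
        rto *= 2       # exponential backoff: RTO doubles each attempt
    return total

print(total_retransmit_time())  # 1 + 2 + 4 + 8 + 16 = 31 seconds
```

That 31 seconds is in the "30 seconds or more" range during which the application appears hung.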

    We used to fix this problem by either setting Windows to send a KeepAlive packet at an interval lower than the Firewall's idle timeout value, or adjust the firewall settings.

    http://support.microsoft.com/kb/2535656 describes this in more detail.

KeepAlive packets are small packets sent at an interval to ensure the other end is still listening. For Windows applications, when keepalives are enabled on the socket, the default interval is 2 hours.
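Keepalives are opt-in per socket: the application has to request them. A minimal sketch in Python (the socket option names are standard; whether a given application enables this is down to the application itself):

```python
import socket

# Create a TCP socket and opt in to keepalive probes. With no further
# tuning, Windows sends the first probe after the system-wide
# KeepAliveTime, which defaults to 2 hours (7,200,000 ms).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Confirm the option took effect on this socket.
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print(enabled != 0)  # True
sock.close()
```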

With Office 365, we need to be even more wary of this problem. Unlike on-premises, where we're mainly connected over a LAN, which doesn't often employ firewalls, we're now punching out to the Internet through a perimeter network awash with proxies and firewalls. These proxies and firewalls are all configured, quite understandably, for transient internet traffic.

    As Outlook keeps TCP connections open for a long time, and may not send data on those sessions for extended periods, we need to revisit the design of our perimeter networks to encompass and reflect this new method of connecting through them.

    A common example I've seen with these settings is where they are set, at different values, on three perimeter devices O365 traffic flows through:

     

    • NetScalers: HTTP/SSL timeouts 140 seconds

    • Proxies: 100 seconds

    • Firewalls: 300 seconds

     

    These values, whilst fine for transient network traffic, as they will clean up any orphaned, unused TCP sessions quickly, do not fit well with the way Outlook works.

     

    Identifying the Issue:

Below, from my reproduction of this issue, you can see three examples of long idle gaps between packets on connections to Office 365 from a Windows client running Office 2010. Whilst none here exceed the 100 seconds at which the example proxies will kill the session, they are from only a small sample, and they show that it is normal for Outlook/Exchange to leave TCP sessions idle for extended periods.

    Example 1:

    2607    11:17:09 25/02/2014    599.3042852    0.0000209    OUTLOOK.EXE    Client    Proxy    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=51225, DstPort=HTTP (80), PayloadLen=0, Seq=2668778565, Ack=1520117633, Win=16318 (scale factor 0x2) = 65272    {TCP:27, IPv4:4}        

    2635    11:18:27 25/02/2014    677.3717532    78.0674680    OUTLOOK.EXE    Proxy    Client    TLS    TLS:TLS Rec Layer-1 SSL Application Data    {TLS:32, SSLVersionSelector:31, HTTP:28,

    Example 2:

    2083    11:07:49 25/02/2014    39.9143045    0.0000522    OUTLOOK.EXE    Client    Proxy    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=51256, DstPort=HTTP (80), PayloadLen=0, Seq=4013955568, Ack=1992955982, Win=16141 (scale factor 0x2) = 64564    {TCP:29, IPv4:4}        

    2373    11:09:11 25/02/2014    121.8444019    81.9300974    OUTLOOK.EXE    Proxy    Client    TLS    TLS:TLS Rec Layer-1 SSL Application Data    {TLS:34, SSLVersionSelector:33, HTTP:30,

    Example 3:

    2643    11:47:38 25/02/2014    652.2202260    0.1199924    OUTLOOK.EXE    Proxy    Client    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=HTTP (80), DstPort=51405, PayloadLen=0, Seq=3293529079, Ack=944389853, Win=4095 (scale factor 0x4) = 65520    {TCP:70, IPv4:4}        

    2669    11:48:36 25/02/2014    710.7751792    58.5549532    OUTLOOK.EXE    Client    Proxy    TLS    TLS:TLS Rec Layer-1 SSL Application Data; TLS Rec Layer-2 SSL Application Data    

     

    With this current setup, TCP sessions from Outlook clients to Office 365 in the environment will be prematurely closed on occasion by these idle timeouts on the perimeter network, when the gap between packets is greater than 100 seconds.

    When troubleshooting this issue, use Netmon or Wireshark on the client machine and filter on the connections from Outlook to the proxy/Office 365. If you are hitting this issue you'll see a long delay between packets, as seen below, then most likely a series of five retransmits after the delay and eventually a reset as the client closes the connection due to no response. I'll try and reproduce this issue with netmon running and update this blog with an example trace at a future point in time.

    The above scenario presumes the perimeter device hasn't informed the client it's closed the connection by sending a reset (silently dropping the connection). If a reset is sent at the idle timeout, you'll simply see a time gap, and a reset then arrive from the Proxy/Firewall.

     

    Symptoms:

In my experience, this behaviour is likely to cause the following problems, amongst others:

    • Disconnect pop ups in Outlook
    • Unexpected authentication prompts
• Hangs within Outlook where we get a 'polo mint' (the spinning wait cursor), especially when switching mailboxes/calendars.
    • Performance problems
    • Mail stuck in outbox for an extended period

When I encounter this issue, it's more commonly the power users who experience the problem the most: people like Exec PAs who switch between mailboxes and calendars on a fairly regular basis. The reason they see it more often is down to Outlook opening extra TCP connections for those additional mailboxes, and the fact these aren't open/used all the time, so are more at risk of being timed out at the firewall/proxy. When the user then switches mailboxes after this has happened, they run into issues/hangs.

    You may also see the issue occur more regularly after breaks or at lunchtime when you or your users leave the computer unused for an extended period.

     

    Resolution Advice:

I have searched both internally and externally for 'official' advice around this and found none, so have devised the following based on experience of fixing similar issues in customer environments, both on-premises and Office 365.

     

1. Bring the SSL/TCP idle session timeouts on all perimeter devices into line with each other. Ideally, and if feasible, keep a low setting for normal internet traffic of around 2-3 minutes, but create a separate rule for Office 365 traffic and increase its timeout to as high a value as possible, ideally over 2 hours (as Windows will send a keepalive by default at 2 hours).

       

    2. If the above isn't possible, then we can attack the problem in a combined way.

  Set the idle timeout on all perimeter devices to 30 minutes, and on the Windows clients edit the KeepAliveTime value in the registry to 25 minutes (1500000 milliseconds). This will cause any application (including Outlook) which utilises keepalives to send a packet every 25 minutes, which will prevent the perimeter devices, with their timeout of 30 minutes, from closing a connection which is still in use. However, it will still enforce a clean-up of genuinely orphaned TCP sessions, albeit every 30 minutes instead of every 2-6 minutes.
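  KeepAliveTime is expressed in milliseconds, so the 25-minute figure works out as below. This is just a sanity check on the arithmetic and the keepalive/timeout pairing, not an official formula:

```python
# KeepAliveTime (under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters)
# is a value in milliseconds.
keepalive_minutes = 25
keepalive_ms = keepalive_minutes * 60 * 1000
print(keepalive_ms)  # 1500000

# The keepalive interval must stay below the perimeter idle timeout,
# otherwise the firewall/proxy can still drop an in-use session first.
perimeter_timeout_minutes = 30
assert keepalive_minutes < perimeter_timeout_minutes
```

  The same check applies to the more aggressive pairing mentioned below (4-minute keepalives against a 5-minute perimeter timeout).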

      If this is thought to be too long a time for the perimeter, then these values can be reduced to suit the desired state by your network team. For example, Perimeter idle session timeout at 5 minutes and keep alive packets on Windows set at 4 minutes. Although I think this may be a little too aggressive, there is scope for a happy medium.

      http://blogs.technet.com/b/nettracer/archive/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives.aspx describes the keepalive key in more detail.

     

    *Update

I'm aware of several articles such as http://support.microsoft.com/kb/2637629 which refer to an 8 hour HTTP/SSL timeout. I suspect the author is referring to an 8 hour timeout on active TCP sessions, i.e. the firewall may have a setting to only allow TCP sessions going through it to be open for x hours, regardless of whether they're used or not.

    There wouldn't be much point setting this value for 8 hours for Idle TCP sessions as the default keepalive value is 2 hours so we'd never get anywhere near 8 hours. I doubt any network/security team would allow such a high value on idle TCP sessions anyway. So if your firewall or proxy has a setting to kill TCP sessions after they have been open for a set period, make sure this is set to 8 hours minimum for O365 traffic then look at setting the idle timeout as recommended above.

  • Ensuring your Office 365 network connection isn’t throttled by your Proxy

One of the things I regularly run into when looking at performance issues with Office 365 customers is a network restricted by TCP window scaling being disabled on the network egress point, such as the proxy used to connect to Office 365. This affects all connections through the device, be it to Azure, bbc.co.uk or Bing. We'll concentrate on Office 365 here, but the advice applies regardless of where you're connecting to.

I'll explain shortly what this is and why it's so important, but to give you an idea of its impact, one customer I worked with recently saw the download time of a 14 MB PDF from an EMEA tenant to an Australia (Sydney) based client improve from 500 seconds to 32 seconds after properly enabling this setting. Imagine that kind of impact on the performance of your SharePoint pages or when pulling large emails from Exchange Online; it's a noticeable improvement which can change Office 365, in some circumstances, from being frustrating to use to being comparable to on-prem.

    Below is a visual representation of a real world example of the impact of TCP Window scaling on Office 365 use.

The underlying mechanism is quite involved, hence the very wordy blog for those of you interested in the detail, but the resolution is easy: it's normally a single setting on a proxy or NAT device, so just skip to the end if you want the solution.

    What is TCP Window scaling?

When TCP was first designed, a 16-bit field in the TCP header was reserved for the TCP window. This is essentially a receive buffer: you can send data to another machine up to the limit of this buffer without waiting for an acknowledgement from the receiver to say it has received the data.

A 16-bit field means the maximum value is 2^16 - 1, or 65535 bytes. I'm sure when this was envisaged, it was thought it would be difficult to send such an amount of data quickly enough to saturate the buffer.

    However, as we now know, computing and networks have developed at such a pace that this value is now relatively tiny and it's entirely feasible to fill this buffer in a matter of milliseconds. When this occurs it causes the sender to back off sending until it receives an acknowledgement from the receiving machine which has the obvious effect of causing slow throughput.

A solution was provided in RFC 1323, which describes the TCP window scaling mechanism. This carries a one-byte shift count in a TCP option (three bytes in total, including the option kind and length), which is in essence a multiplier of the TCP window size described above.

It's referred to as the shift count (as the 16-bit window field is shifted by this many bits) and has a valid range of 0-14. With this value we can theoretically increase the TCP window size to around 1 GB.

The actual TCP window size is (2^ScaleFactor) * TCP window field, so taking the maximum possible values: (2^14) * 65535 = 1073725440 bytes, or just over 1 GB.
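That calculation is easy to verify:

```python
# Effective receive window = 16-bit window field shifted left by the
# scale factor. The shift count is negotiated once, in the SYN/SYN-ACK.
def effective_window(window_field, shift_count):
    assert 0 <= shift_count <= 14, "valid shift counts are 0-14 (RFC 1323)"
    return window_field << shift_count  # same as window_field * 2**shift_count

print(effective_window(65535, 14))  # 1073725440 bytes, just over 1 GB
print(effective_window(65535, 0))   # 65535 bytes, the unscaled maximum
```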

     

TCP Window Scaling enabled?    Maximum TCP receive buffer (bytes)

No                             65535 (64k)

Yes                            1073725440 (~1 GB)

     

    What is the impact of having this disabled?

    The obvious issue is that if we fill this buffer, the sender has to back off sending any more data until it receives an acknowledgement from the receiving node so we end up with a stop start data transfer which will be slow. However, there is a bigger issue and it's most apparent on high bandwidth, high latency connections, for example, an intercontinental link (from Australia to an EMEA datacentre for example) or a satellite link.

To use a high bandwidth link efficiently we want to fill the connection with as much data as possible, as quickly as possible. With the TCP window limited to 64k when window scaling is disabled, we can't get anywhere near filling this pipe, and thus can't use all the bandwidth available.

    With a couple of bits of data we can work out exactly what the mathematical maximum throughput can be on the link.

    Let's assume we've got a round trip time (RTT) from Australia to the EMEA datacentre of 300ms (RTT being the time it takes to get a packet to the other end of the link and back) and we've got TCP Window Scaling disabled so we've got a maximum TCP window of 65535 bytes.

    So with these two figures, and an assumption the link is 1000Mbit/sec we can work out the maximum throughput by using the following calculation:

     

Throughput = maximum TCP receive window size / RTT

     

    There are various calculators available online to help you check your maths with this, a good example which I use is http://www.switch.ch/network/tools/tcp_throughput/

     

    • TCP buffer required to reach 1000 Mbps with RTT of 300.0 ms >= 38400.0 KByte
    • Maximum throughput with a TCP window of 64 KByte and RTT of 300.0 ms <= 1.71 Mbit/sec.

     

    So we need a TCP Window size of 38400 Kbyte to saturate the pipe and use the full bandwidth, instead we're limited to <=1.71 Mbit/sec on this link due to the 64k window, which is a fraction of the possible throughput.
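The window size needed to fill a link is just the bandwidth-delay product: the amount of data that must be "in flight" for one full round trip. A quick sketch in plain decimal units (the calculator quoted above uses slightly different unit conventions, hence its marginally different figure):

```python
# Bandwidth-delay product: bytes that must be in flight to keep a link
# busy for one full round trip (decimal units: 1 Mbit = 1,000,000 bits).
def required_window_bytes(bandwidth_mbps, rtt_ms):
    bits_in_flight = bandwidth_mbps * 1_000_000 * (rtt_ms / 1000.0)
    return bits_in_flight / 8

needed = required_window_bytes(1000, 300)
print(needed)          # 37500000.0 bytes, i.e. ~37.5 MB
print(needed / 65535)  # ~572: the unscaled 64k window is ~572x too small
```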

The higher the round trip time, the more obvious this problem becomes, but as the table below shows, even with an RTT of 1 ms (which is extremely good and unlikely to occur other than on the local network segment) we cannot fully utilise the 1000 Mbps on the link.

    Presuming a 1000 Mbps link here is the maximum throughput we can get with TCP window scaling disabled:

RTT (ms)    Maximum Throughput (Mbit/sec)

300         1.71

200         2.56

100         5.12

50          10.24

25          20.48

10          51.20

5           102.40

1           512.00

    So it's clear from this table that even with an extremely low RTT of 1ms, we simply cannot use this link to full capacity due to having window scaling disabled.
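The table can be reproduced with the throughput formula above. The figures appear to treat 64 KByte as 64,000 bytes (decimal); with 65,535 bytes the numbers come out marginally higher:

```python
# Maximum throughput (Mbit/sec) for a fixed receive window over a given
# RTT: the sender can push at most one window per round trip.
def max_throughput_mbps(window_bytes, rtt_ms):
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1_000_000

for rtt in (300, 200, 100, 50, 25, 10, 5, 1):
    print(rtt, round(max_throughput_mbps(64_000, rtt), 2))
# 300 -> 1.71, 200 -> 2.56, 100 -> 5.12, ..., 1 -> 512.0
```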

     

    What is the impact of enabling this setting?

We used to see the occasional issue caused by this setting when older networking equipment didn't understand the scale factor, and this caused connectivity problems. However, it's very rare to see issues like this nowadays, and thus it's extremely safe to have the setting enabled. Windows has had this enabled by default since Windows 2000 (if my memory serves).

    As we're increasing the window size beyond 64k the sending machine can therefore push more data onto the network before having to stop and wait for an acknowledgement from the receiver.

To compare the maximum throughput with the table above, this table assumes a modest scale factor of 8 and a maximum TCP window field of 64k.

    (2^8)*65535 = Maximum window size of 16776960 bytes
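As a quick check of that figure, and of what it buys us on a 300 ms round trip (the result agrees with the first row of the table below to within rounding):

```python
# A shift count of 8 multiplies the 64k window field by 2^8 = 256.
scaled_window = (2 ** 8) * 65535
print(scaled_window)  # 16776960 bytes

# Throughput that window supports over a 300 ms round trip (decimal Mbit).
throughput_mbps = (scaled_window * 8) / 0.3 / 1_000_000
print(round(throughput_mbps, 2))  # ~447.39 Mbit/sec
```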

    The following table shows we can saturate the 1000 Mbps link at a 100ms RTT when we have window scaling enabled. Even on a 300ms RTT we're still able to achieve a massively greater throughput than with window scaling disabled.

    RTT (ms)

    Maximum Throughput (Mbit/sec)

    300

    447.36

    200

    655.32

    100

    1310.64

    50

    2684.16

    25

    5368.32

    10

    13420.80

    5

    26841.60

    1

    134208.00

     

In reality you'll probably not see such high window sizes, as the window is generally increased slowly as the transfer progresses; it's not good behaviour to instantly flood the link. However, with window scaling enabled you should see the window quickly rise above that 64k mark, and the throughput rate rises with it. Windows (from Vista onwards) has sophisticated algorithms which work out the optimum TCP window size on a per-connection basis.

    For example Windows might start with an 8k (multiplied by the scale factor) window and slowly increase it over the length of the transfer all being well.

    How do I check if it's enabled?

So we've seen how essential this setting is, but how do you know if it's disabled in your environment? Well, there are a number of ways.

Windows should have this enabled by default, but you can check by running the following on a modern OS version:

    Netsh int tcp show global

    "Receive Window Auto-Tuning level" should be "normal" by default.

    On a proxy, the name of this setting varies by device but may be referred to as RFC1323 such as is the case with Bluecoat.

    https://kb.bluecoat.com/index?page=content&id=FAQ1006

This article suggests Bluecoat needs the TCP window value set above 64k and the RFC1323 setting enabled for it to use window scaling. A lot of the older Bluecoat devices I've seen at customer sites have the window set at 64k, which means scaling isn't used.

    However, we can quickly check that it is being used by taking a packet trace on the proxy itself. Bluecoat and Cisco proxies have built in mechanisms to take a packet capture. You can then open the trace in Netmon or Wireshark and look for the following.

Firstly, find the session you are interested in and look at the TCP three-way handshake: the scaling factor is negotiated in the TCP options of the handshake. Netmon shows this in the description field, as shown below.

    Here you can see the proxy connecting to Office 365 and offering a scaling factor of zero, meaning we have a 64k window for Office 365 to send us data.

     

    7692       12:28:03 14/03/2014        12:28:03.8450000              0.0000000            100.8450000                        MyProxy             contoso.sharepointonline.com                TCP         TCP: [Bad CheckSum]Flags=......S., SrcPort=43210, DstPort=HTTPS(443), PayloadLen=0, Seq=3807440828, Ack=0, Win=65535 ( Negotiating scale factor 0x0 ) = 65535 

    7740       12:28:04 14/03/2014        12:28:04.1440000              0.2990000            101.1440000                        contoso.sharepointonline.com MyProxy TCP         TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=43210, PayloadLen=0, Seq=3293427307, Ack=3807440829, Win=4380 ( Negotiated scale factor 0x2 ) = 17520 

     

    In the TCP options of the packet you'll see this.

     

    - TCPOptions:

    + MaxSegmentSize: 1

    + NoOption:

    + WindowsScaleFactor: ShiftCount: 0

    + NoOption:

    + NoOption:

    + TimeStamp:

    + SACKPermitted:

    + EndofOptionList:

     

    If you're using Wireshark then it'll look something like this.

     

    1748    2014-03-14 12:56:40.277292    My Proxy    contoso.sharepointonline.com TCP    78    39747 > https [SYN] Seq=0 Win=65535 Len=0 MSS=1370 WS=1 TSval=653970932 TSecr=0 SACK_PERM=1    

     

In the TCP options of the SYN/SYN-ACK packet, the following is shown if window scaling is disabled.

    Window scale: 0 (multiply by 1)

    If you can't take a trace on the proxy itself, try taking one on a client connecting to the proxy and look at the syn/ack coming back from the proxy. If it is disabled there, then it's highly likely the proxy is disabling the setting on its outbound connections.

It's worth pointing out here that the scaling factor is negotiated in the three-way handshake and is then fixed for the lifetime of the TCP session. The window field, however, is variable: it can be increased to a maximum of 65535 (before scaling) or decreased during the lifetime of the session, which is how Windows manages receive window auto-tuning.
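In other words, the fixed shift count and the variable window field combine on every packet. The earlier Netmon traces show exactly this arithmetic, e.g. Win=16318 (scale factor 0x2) = 65272:

```python
# The shift count is fixed at the handshake; only the 16-bit window
# field changes packet to packet. Real window = field << shift count.
shift_count = 2  # negotiated once in the SYN/SYN-ACK

for advertised in (16318, 4095, 65535):
    print(advertised, "->", advertised << shift_count)
# 16318 -> 65272, 4095 -> 16380, 65535 -> 262140
```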

     

    Summary:

As we've seen here, TCP Window Scaling is an important feature, and it's highly recommended that you ensure it is used on your network perimeter devices if you want to fully utilise the available bandwidth and have optimum performance for network traffic. Enabling the setting is normally very simple, but the method varies by device: sometimes it'll be referred to as RFC 1323 and sometimes as TCP Window Scaling, so refer to the vendor supplying the device for specific instructions.

    Having this disabled can have quite a performance impact on Office 365 and also any other traffic flowing through the device. The impact will become greater the larger the bandwidth and larger the RTT, such as a high speed, long distance WAN link.

    http://technet.microsoft.com/en-us/magazine/2007.01.cableguy.aspx is a good article with a bit more detail on areas of this subject I haven't covered here.

This setting is generally safe to use, and it's rare that we see issues with it nowadays. Just keep an eye on the traffic after enabling it and look out for dropped TCP sessions or poor performance, which may indicate there is a device on the route which doesn't understand or deal with window scaling well.

Ensuring this setting is enabled should be one of the first steps when assessing whether your network is optimised for Office 365, as having it disabled can have an enormous impact on the performance of the service and can be a very tricky problem to track down until you know how to look for it. And now you do!