The Problem:

One issue I run into time and time again, when I do a network assessment for customers using Office 365 or troubleshoot performance and connectivity issues with O365, is TCP idle session settings on perimeter devices.

Perimeter networks, including proxies and firewalls, are normally designed for internet access to web pages, which by its very nature tends to be transient. This means we don't expect TCP sessions to be idle for a long time: they make a request, get the response, and close.

These perimeter networks are therefore often configured with this in mind, and any idle TCP session, that is, one which has seen no traffic for a period of time, is forcibly closed, or more commonly, simply dropped at the network edge device.

When using a web page for example, the user wouldn't notice any issue with this, and if they refresh the page, or click on a link within that page, a new TCP connection will be fired up, unbeknown to the user.

As we move into a Cloud connected world, this model, which worked well for years, needs to be revisited as it doesn't work well with the way we now connect.

Rather than being transient, Outlook connecting to Exchange (be it on-prem or Cloud based) opens up TCP connections and leaves them open for the length of time the application is open.

Under most circumstances, these connections will see traffic at very regular intervals and thus any idle timeouts won't be an issue. However, it is a fact, and one I've seen occur many times, that Outlook, if not performing any actions, may not send any traffic on an open TCP connection for a long period of time.

We saw this issue regularly when on-prem was prevalent. Firewalls would kill idle TCP sessions after a period of time, causing disconnect pop-ups in Outlook, hangs, or other problems within the application such as password prompts as it reconnected. These were often due to the firewall not informing the client of the disconnect by sending a reset. Thus when the client tried to use the connection again, it would send a packet, get no response, then retransmit five times, exponentially backing off each time, until it gave up and fired up a new connection.

It could take 30 seconds or more for the TCP retransmits to time out, causing hangs within the application whilst the retransmissions take place.

We used to fix this problem either by setting Windows to send a KeepAlive packet at an interval lower than the firewall's idle timeout value, or by adjusting the firewall settings.
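
To put rough numbers on that backoff, here is a minimal sketch of the arithmetic. It assumes a simple model where the wait doubles before each retransmit, and the 1-second starting timeout is a hypothetical value chosen for illustration; the real starting value depends on the measured round-trip time of the connection.

```python
def retransmit_schedule(initial_rto_s: float, retransmits: int = 5) -> list[float]:
    """Return the wait (in seconds) before each retransmit, doubling each time."""
    return [initial_rto_s * (2 ** attempt) for attempt in range(retransmits)]


# Hypothetical 1-second initial retransmission timeout (an assumption).
waits = retransmit_schedule(1.0)
print(waits)       # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(waits))  # 31.0 seconds elapse before the fifth retransmit is sent
```

Under these assumptions the application is stalled for around half a minute before the fifth retransmit even goes out, and the client still has to wait for that one to go unanswered before giving up.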

http://support.microsoft.com/kb/2535656 describes this in more detail.

KeepAlive packets are small packets which are sent at an interval to ensure the other end is still listening. The default interval for Windows applications, when keepalives are enabled on the socket, is 2 hours.

With Office 365, we need to be even more wary of this problem. Unlike on-premise, where we're mainly connected over a LAN, which doesn't often employ firewalls, we're now punching out to the Internet through a perimeter network awash with proxies and firewalls. These proxies and firewalls are all configured, quite understandably, for transient internet traffic.
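
To illustrate what "enabled on the socket" means, here is a small sketch of an application opting in to keepalives on its own socket. This is not how Outlook does it internally (I don't know that); the function name and the 90/10 second values are hypothetical, and the per-socket timer options differ by platform.

```python
import socket
import sys


def enable_keepalive(sock: socket.socket, idle_s: int, interval_s: int) -> None:
    """Turn on TCP keepalives and, where the platform allows, shorten the timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if sys.platform == "win32":
        # Windows: (on/off, idle time in ms, probe interval in ms)
        sock.ioctl(socket.SIO_KEEPALIVE_VALS, (1, idle_s * 1000, interval_s * 1000))
    elif hasattr(socket, "TCP_KEEPIDLE"):
        # Linux and similar: values are in seconds.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock, idle_s=90, interval_s=10)  # hypothetical values
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_on)
sock.close()
```

Without the platform-specific timer call, a socket with keepalives enabled falls back to the system-wide default, which is where the 2 hour figure comes from.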

As Outlook keeps TCP connections open for a long time, and may not send data on those sessions for extended periods, we need to revisit the design of our perimeter networks to encompass and reflect this new method of connecting through them.

A common example I've seen with these settings is where they are set, at different values, on three perimeter devices O365 traffic flows through:

 

  • NetScalers: HTTP/SSL timeouts 140 seconds
  • Proxies: 100 seconds
  • Firewalls: 300 seconds

 

These values, whilst fine for transient network traffic, as they will clean up any orphaned, unused TCP sessions quickly, do not fit well with the way Outlook works.
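
When the devices sit in series like this, the connection is only as patient as the least patient device in the path. A trivial sketch using the three example values above (the variable names are mine):

```python
# Idle timeouts in seconds, taken from the example devices above.
idle_timeouts_s = {"NetScaler": 140, "Proxy": 100, "Firewall": 300}

# The device with the lowest idle timeout decides when the session dies.
device, effective_timeout_s = min(idle_timeouts_s.items(), key=lambda item: item[1])
print(device, effective_timeout_s)  # Proxy 100
```

So in this example, any Outlook connection that stays quiet for more than 100 seconds is at risk, regardless of the more generous settings on the other two devices.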

 

Identifying the Issue:

Below, from my reproduction of this issue, are three examples of long idle gaps between packets on connections to Office 365 from a Windows client running Office 2010. Whilst none here exceed the 100 seconds at which the example proxies will kill the session, they are only a small sample and show that it is normal for Outlook/Exchange to sit idle on TCP sessions for extended periods.

Example 1:

2607    11:17:09 25/02/2014    599.3042852    0.0000209    OUTLOOK.EXE    Client    Proxy    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=51225, DstPort=HTTP (80), PayloadLen=0, Seq=2668778565, Ack=1520117633, Win=16318 (scale factor 0x2) = 65272    {TCP:27, IPv4:4}        

2635    11:18:27 25/02/2014    677.3717532    78.0674680    OUTLOOK.EXE    Proxy    Client    TLS    TLS:TLS Rec Layer-1 SSL Application Data    {TLS:32, SSLVersionSelector:31, HTTP:28,

Example 2:

2083    11:07:49 25/02/2014    39.9143045    0.0000522    OUTLOOK.EXE    Client    Proxy    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=51256, DstPort=HTTP (80), PayloadLen=0, Seq=4013955568, Ack=1992955982, Win=16141 (scale factor 0x2) = 64564    {TCP:29, IPv4:4}        

2373    11:09:11 25/02/2014    121.8444019    81.9300974    OUTLOOK.EXE    Proxy    Client    TLS    TLS:TLS Rec Layer-1 SSL Application Data    {TLS:34, SSLVersionSelector:33, HTTP:30,

Example 3:

2643    11:47:38 25/02/2014    652.2202260    0.1199924    OUTLOOK.EXE    Proxy    Client    TCP    TCP: [Bad CheckSum]Flags=...A...., SrcPort=HTTP (80), DstPort=51405, PayloadLen=0, Seq=3293529079, Ack=944389853, Win=4095 (scale factor 0x4) = 65520    {TCP:70, IPv4:4}        

2669    11:48:36 25/02/2014    710.7751792    58.5549532    OUTLOOK.EXE    Client    Proxy    TLS    TLS:TLS Rec Layer-1 SSL Application Data; TLS Rec Layer-2 SSL Application Data    

 

With this current setup, TCP sessions from Outlook clients to Office 365 in the environment will be prematurely closed on occasion by these idle timeouts on the perimeter network, when the gap between packets is greater than 100 seconds.

When troubleshooting this issue, use Netmon or Wireshark on the client machine and filter on the connections from Outlook to the proxy/Office 365. If you are hitting this issue you'll see a long delay between packets, as in the examples above, then most likely a series of five retransmits after the delay and eventually a reset as the client closes the connection due to no response. I'll try and reproduce this issue with Netmon running and update this blog with an example trace at a future point in time.

The above scenario presumes the perimeter device has silently dropped the connection, that is, it hasn't informed the client by sending a reset. If a reset is sent at the idle timeout, you'll simply see a time gap and then a reset arrive from the proxy/firewall.
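
If you export the packet time offsets for a single conversation from the trace, spotting the long gaps can be automated. The sketch below is only an illustration: the function name is hypothetical, it assumes the offsets are already filtered down to one TCP conversation, and the two sample values are the time offsets from Example 1 above.

```python
def idle_gaps(packet_times_s, threshold_s):
    """Return (start, end, gap) for consecutive packets more than threshold_s apart."""
    gaps = []
    for earlier, later in zip(packet_times_s, packet_times_s[1:]):
        gap = later - earlier
        if gap > threshold_s:
            gaps.append((earlier, later, round(gap, 3)))
    return gaps


# Time offsets (seconds) of the two packets shown in Example 1.
example_1 = [599.3042852, 677.3717532]

over_a_minute = idle_gaps(example_1, threshold_s=60)    # finds the gap of about 78 s
over_proxy_limit = idle_gaps(example_1, threshold_s=100)  # empty: under the proxy limit
print(over_a_minute)
print(over_proxy_limit)
```

Setting the threshold to the lowest idle timeout on your perimeter shows which gaps would have cost you the connection.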

 

Symptoms:

In my experience, this behaviour is likely to cause problems including, but not limited to, the following:

  • Disconnect pop ups in Outlook
  • Unexpected authentication prompts
  • Hangs within Outlook where we get a 'polo mint' (the spinning busy cursor), especially when switching mailboxes/calendars
  • Performance problems
  • Mail stuck in outbox for an extended period

When I encounter this issue, it's most commonly the power users who experience the problem, people like Exec PAs who switch between mailboxes and calendars on a fairly regular basis. The reason they see it more often is down to Outlook opening more TCP connections for those extra mailboxes, and the fact that these connections aren't used all the time, so they are more at risk of being timed out at the firewall/proxy. When the user then switches mailboxes after this has happened, they run into issues/hangs.

You may also see the issue occur more regularly after breaks or at lunchtime when you or your users leave the computer unused for an extended period.

 

Resolution Advice:

I have searched both internally and externally for 'official' advice around this and found none, so I have devised the following based on experience of fixing similar issues in customer environments, both on-premise and Office 365.

 

  1. Bring the SSL/TCP idle session timeout on all perimeter devices into line with each other. Ideally, and if feasible, keep a low setting of around 2-3 minutes for normal internet traffic, but create a separate rule for Office 365 traffic and increase its value as high as possible, to something over 2 hours (as Windows will send a keepalive by default at 2 hours).

     

  2. If the above isn't possible, then we can attack the problem in a combined way.

    Set the idle timeout on all perimeter devices to 30 minutes, and on the Windows clients edit the KeepAliveTime value in the registry to 25 minutes (1500000 milliseconds). This will cause any application which utilises keepalives (including Outlook) to send a packet every 25 minutes, which will prevent the perimeter devices, with a timeout of 30 minutes, from closing a connection which is still in use. However, it will still enforce a clean-up of genuinely orphaned TCP sessions, albeit every 30 minutes instead of every 2-6 minutes.

    If this is thought to be too long a time for the perimeter, then these values can be reduced to suit the desired state by your network team. For example, Perimeter idle session timeout at 5 minutes and keep alive packets on Windows set at 4 minutes. Although I think this may be a little too aggressive, there is scope for a happy medium.

    http://blogs.technet.com/b/nettracer/archive/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives.aspx describes the keepalive key in more detail.
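
As a sanity check on the numbers in option 2, here is a small sketch that converts minutes into the millisecond value the registry expects and checks that the keepalive fires before the perimeter gives up. The function names are hypothetical, and the registry path in the command is my assumption of the usual Tcpip parameters location rather than something quoted above, so verify it against the linked article before using it.

```python
def keepalive_time_ms(minutes: int) -> int:
    """Convert minutes to the millisecond value used for KeepAliveTime."""
    return minutes * 60 * 1000


def keepalive_beats_timeout(keepalive_min: int, idle_timeout_min: int) -> bool:
    """True if a keepalive is sent before the perimeter idle timeout expires."""
    return keepalive_min < idle_timeout_min


value = keepalive_time_ms(25)
print(value)  # 1500000
print(keepalive_beats_timeout(25, 30))  # True

# Assumed registry location (not stated in this post); check before applying.
reg_command = (
    r"reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters "
    f"/v KeepAliveTime /t REG_DWORD /d {value} /f"
)
print(reg_command)
```

The same two functions cover the more aggressive pairing as well: 4 minutes works out to 240000, which sits under a 5 minute idle timeout.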

 

*Update

I'm aware of several articles, such as http://support.microsoft.com/kb/2637629, which refer to an 8 hour HTTP/SSL timeout. I suspect the author is referring to an 8 hour limit on the total lifetime of a TCP session, i.e. the firewall may have a setting to only allow TCP sessions going through it to be open for x hours, regardless of whether they are being used or not.

There wouldn't be much point setting an 8 hour value for idle TCP sessions, as the default keepalive value is 2 hours, so we'd never get anywhere near 8 hours of idle time. I doubt any network/security team would allow such a high value on idle TCP sessions anyway. So if your firewall or proxy has a setting to kill TCP sessions after they have been open for a set period, make sure this is set to 8 hours minimum for O365 traffic, then look at setting the idle timeout as recommended above.