Recently we’ve received, by coincidence, several escalations to troubleshoot the same issue: .NET applications throwing the following exception:

 

SocketException::.ctor() 10061:No connection could be made because the target machine actively refused it

 

The cases I’ve been working on show different stack traces, but the main point is that the exception originates when the application tries to use a TCP socket to establish a connection with a remote machine.  In all the cases escalated to us so far, the problem has turned out to be exactly what the error message states, i.e. the remote machine, or something in between, refuses the connection attempt.  This post shows how we’ve been able to troubleshoot these problems and isolate them to network-related issues.
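To make the error concrete, here is a minimal Python sketch (Python is only a stand-in here; the applications in these cases are .NET) that attempts a TCP connection and reports how it ended.  The names classify_connect and unused_local_port are invented for this illustration.  A refused connection means a RST came back in answer to the SYN, which Windows reports as error 10061.

```python
import errno
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer (or a device in the path) answered the SYN with a RST.
        # On Windows this is WSAECONNREFUSED (10061).
        return "refused"
    except socket.timeout:
        # No answer at all: the SYN was most likely dropped silently.
        return "timeout"
    except OSError as exc:
        # Anything else (unreachable network, name resolution failure, ...).
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


def unused_local_port() -> int:
    """Ask the OS for a free port, then release it so nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # Connecting to a local port with no listener normally ends as a refusal.
    print(classify_connect("127.0.0.1", unused_local_port()))
```

Telling "refused" apart from "timeout" is useful early on: a refusal means something actively answered, while a timeout usually points at packets being dropped along the way.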

Let’s go through two different scenarios and see step by step how to troubleshoot them.  In the first scenario we have an IIS server (from now on called WEB1, IP: 10.0.0.1) hosting a .NET application which intermittently fails to make a web service call to another IIS server (let’s call it WEB2) located in a remote environment.  The problem has already been isolated to the point where the .NET app on WEB1 calls a web service hosted on the server WEB2.

The last two frames on the stack for this first scenario are the following:

 

System.Net.Sockets.TcpClient.Connect(String hostname, Int32 port)

System.Web.Services.Protocols.WebClientProtocol.GetWebResponse(WebRequest request)

 

The first step, as always, is to understand the infrastructure involved in the communication between WEB1 and WEB2.  In short, WEB2 is a virtual name that fronts a round robin DNS rotation across 10 other web servers, and between WEB1 and WEB2 there are at least two firewalls, one at the WEB1 location and another at the WEB2 location (the classic environment where the web servers sit in the DMZ provided by the firewall at each location).
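When a single name hides several servers, it helps to list every address the resolver returns for it.  The sketch below does that with Python’s standard socket.getaddrinfo; the function name resolve_all and the host name web2.example.com are placeholders invented for this example, not names from the real environment.

```python
import socket
from typing import List


def resolve_all(hostname: str, port: int = 80) -> List[str]:
    """Return every distinct IPv4 address the resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "web2.example.com" is a hypothetical name; substitute the real one.
    try:
        print(resolve_all("web2.example.com"))
    except socket.gaierror as exc:
        print("resolution failed:", exc)
```

Depending on how the DNS server and the local resolver cache behave, one query may return the whole rotation or only part of it, so treat the result as a starting point.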

The next step is to capture a network trace from the server WEB1.  (It would be nice to capture a trace from WEB2 at the same time.  However, in this scenario we don’t have access to WEB2, and even if we did, we would need a trace from all 10 web servers because at this point we don’t know which one WEB1 is trying to connect to when the problem happens.)  So we installed the Microsoft Network Monitor tool on WEB1, started collecting the network trace, and waited for the problem to happen.  In the meantime, we were given the IP addresses of the 10 web servers referred to here simply as WEB2.

OK, the problem happened, so now it’s time to dive into the network trace.

All we need now is a basic knowledge of how TCP connections are established.  At the moment of the problem the trace shows the following suspect frames:

 

2         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       TCP      TCP: Flags=.S......, SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621893, Ack=0, Win=16384 (scale factor 0) = 16384

3         {TCP:2, IPv4:1}            192.168.0.1       10.0.0.1            TCP      TCP: Flags=.S..A..., SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790416, Ack=3411621894, Win=0 (scale factor 0) = 0

4         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       TCP      TCP: Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621894, Ack=1096790417, Win=16616 (scale factor 0) = 16616

5         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       HTTP     HTTP: HTTP Payload

Tcp: Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616

6         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       HTTP     HTTP: HTTP Payload

Tcp: [ReTransmit #5]Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616

7         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       HTTP     HTTP: HTTP Payload

Tcp: [ReTransmit #5]Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616

8         {TCP:2, IPv4:1}            192.168.0.1       10.0.0.1            TCP      TCP: Flags=..R....., SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790417, Ack=843163868, Win=0 (scale factor 0) = 0

 

Frame #2 shows the server WEB1 starting a TCP three-way handshake with the server whose IP address is 192.168.0.1 on TCP port 80 (remember that this is a web service call).  At this point that address is the one representing our server WEB2 (there are 10 possible IP addresses, depending on which one the DNS round robin returns during name resolution).  The three-way handshake completes successfully in frames #2, #3 and #4 (the sequence SYN, SYN-ACK, ACK has occurred).  After that something interesting happens:  WEB1 starts sending packets (starting at frame #5) of a single byte (notice that the difference between the sequence numbers is 1: 3411621895 - 3411621894 = 1) instead of sending the complete POST to the web service.

WEB1 receives no acknowledgment from WEB2 for the packet sent in frame #5, so it sends it again in frame #6 (that’s why frames #6 and #7 are flagged as retransmissions).  Again WEB2 doesn’t ACK it, so WEB1 resends the 1-byte packet in frame #7, and then WEB2 finally responds, but with a RESET.  The packet sent by WEB2 in frame #8 is unexpected, and the application surfaces it as the exception mentioned at the beginning of this article.

The question is:  Why does the server WEB1 keep sending a 1-byte packet after the handshake?

Before answering this question, you might make the mistake of thinking that WEB2 is legitimately resetting the connection because WEB1 has not behaved as expected (it should have sent a POST to the web service, not 1-byte packets).  That would lead you to assume there is something wrong with the server WEB1 and send you down a completely wrong troubleshooting path…  Haven’t you seen what’s wrong yet?  OK, it’s a little tricky, but the problem here happens in the handshake (even though it was a successful one)…  Do you see it now?  No?  OK, look again at the three frames that make up the handshake, more precisely at frame #3.  Notice that the window size is 0 (Win=0), which means the server WEB2 is telling WEB1 that it cannot accept any data at all…  Interesting, isn’t it?  So after receiving the zero-byte window advertised by WEB2, WEB1 sends a 1-byte packet (a zero-window probe) in an attempt to get WEB2 to advertise a new, usable window size.  That doesn’t happen after the first attempt, so WEB1 keeps trying until WEB2 resets the connection.
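Spotting a zero window in a SYN-ACK by eye is easy to miss, so here is a small sketch that scans Network Monitor summary lines saved as text and reports the frames where it happens.  It assumes the summary format shown in this post (frame number first, then Flags= and Win= fields); the regular expression and the name zero_window_synacks are mine, not part of any tool.

```python
import re
from typing import List

# Frame number at the start, the 8-character flag field, then the window size.
FRAME_RE = re.compile(r"^\s*(\d+)\s.*?Flags=([.A-Z]{8}),.*?\bWin=(\d+)")


def zero_window_synacks(summary_lines: List[str]) -> List[int]:
    """Return the frame numbers of SYN-ACK segments that advertise Win=0."""
    hits = []
    for line in summary_lines:
        match = FRAME_RE.search(line)
        if not match:
            continue  # continuation lines carry no frame number
        frame = int(match.group(1))
        flags = match.group(2)
        window = int(match.group(3))
        if "S" in flags and "A" in flags and window == 0:
            hits.append(frame)
    return hits


# The three handshake frames from the failing trace, whitespace condensed.
TRACE = [
    "2 {TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 TCP TCP: Flags=.S......, "
    "SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621893, Ack=0, "
    "Win=16384 (scale factor 0) = 16384",
    "3 {TCP:2, IPv4:1} 192.168.0.1 10.0.0.1 TCP TCP: Flags=.S..A..., "
    "SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790416, Ack=3411621894, "
    "Win=0 (scale factor 0) = 0",
    "4 {TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 TCP TCP: Flags=....A..., "
    "SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621894, Ack=1096790417, "
    "Win=16616 (scale factor 0) = 16616",
]

if __name__ == "__main__":
    print(zero_window_synacks(TRACE))  # prints [3]
```

A RST segment also carries Win=0, which is normal; that is why the check only fires when both the SYN and ACK flags are set.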

So the problem has been isolated to the WEB2 end.  Another question remains:  Why is the problem intermittent?  After all, if WEB2 always advertised a zero window size, the problem would always happen.  It’s time to take a new network trace…

In the new network trace we captured both situations:  when it works and when it doesn’t.  We already know what happens when it doesn’t work, so now let’s see what happens when it works:

 

2         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.2       TCP      TCP: Flags=.S......, SrcPort=4961, DstPort=HTTP(80), Len=0, Seq=843591585, Ack=0, Win=16384 (scale factor 0) = 16384

3         {TCP:2, IPv4:1}            192.168.0.2       10.0.0.1            TCP      TCP: Flags=.S..A..., SrcPort=HTTP(80), DstPort=4961, Len=0, Seq=1446804074, Ack=843591586, Win=8760 (scale factor 0) = 8760

4         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.2       TCP      TCP: Flags=....A..., SrcPort=4961, DstPort=HTTP(80), Len=0, Seq=843591586, Ack=1446804075, Win=16560 (scale factor 0) = 16560

5         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.2       HTTP     HTTP: Request, POST /WebSrv/QueryWebSrv.asmx

6         {HTTP:3, TCP:2, IPv4:1}            192.168.0.2       10.0.0.1            HTTP     HTTP: Response, HTTP/1.1, Status Code = 100

 

OK, now we don’t see the window size problem: the three-way handshake is fine and the request goes through.  However, WEB1 is not connecting to the same server as before.  WEB2 is now represented by a different IP address, 192.168.0.2 as opposed to the 192.168.0.1 we saw before, because of the round robin rotation in place.  This answers our second question:  The problem always happens when WEB2 resolves to the 192.168.0.1 server, but because of the round robin rotation, WEB2 will not always be the server whose IP address is 192.168.0.1.

The solution to the problem?  Well, after isolating the problem we contacted the people in charge of the WEB2 end.  They generated new traces on their end and found that a firewall between the servers was misconfigured for the IP 192.168.0.1.  They reconfigured the firewall, and this fixed the problem for our first scenario.
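A toy model shows why one bad address in the rotation looks like an intermittent failure.  The sketch below assumes a strict rotation over 10 addresses, which is a simplification: real round robin DNS combined with client-side caching is less regular.  Only 192.168.0.1 and 192.168.0.2 come from the traces; the other eight addresses are made up for illustration.

```python
from itertools import cycle, islice
from typing import Iterable, List


def failure_rate(servers: List[str], bad_servers: Iterable[str], calls: int) -> float:
    """Fraction of calls that land on a bad server under strict rotation."""
    bad = set(bad_servers)
    rotation = islice(cycle(servers), calls)
    failures = sum(1 for address in rotation if address in bad)
    return failures / calls


# 192.168.0.1 and 192.168.0.2 appear in the traces; the rest are hypothetical.
SERVERS = [f"192.168.0.{n}" for n in range(1, 11)]

if __name__ == "__main__":
    print(failure_rate(SERVERS, {"192.168.0.1"}, 100))  # prints 0.1
```

Under these assumptions roughly one call in ten fails, which matches the feel of a problem that comes and goes without any change on WEB1.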

The second scenario is an IIS server (from now on WEB1, IP: 10.0.0.1) reporting the same error as above.  This time the stack is a little different, even though the exception happens at the same place:

 

System.Net.Sockets.TcpClient.Connect(String hostname, Int32 port)

Microsoft.AnalysisServices.AdomdClient.XmlaClient.GetTcpClient(ConnectionInfo connectionInfo)

The difference here is just that in this scenario our .NET application hosted by IIS is calling a SQL Server (IP: 172.16.0.1) running Analysis Services, as opposed to a web service.  The architecture of this scenario is a little easier to work with, since the problem has already been isolated to one specific SQL Server, which fails when accessed from any web server.

Following the same logic as before, we first generate a network trace from the WEB1 server (again, we don’t have control over the other end, the SQL Server, although a network trace from that server would also be useful).  Looking at the trace we see the following:

7447                {TCP:534, IPv4:533}     10.0.0.1            172.16.0.1         TCP      TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384

7448                {TCP:534, IPv4:533}     172.16.0.1         10.0.0.1            TCP      TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

7467                {TCP:535, IPv4:533}     10.0.0.1            172.16.0.1         TCP      TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384

7468                {TCP:535, IPv4:533}     172.16.0.1         10.0.0.1            TCP      TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

7554                {TCP:538, IPv4:533}     10.0.0.1            172.16.0.1         TCP      TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384

7556                {TCP:538, IPv4:533}     172.16.0.1         10.0.0.1            TCP      TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

What we see here is WEB1 making successive attempts to connect to TCP port 2383 (the default port for the default, non-named instance of Analysis Services in SQL Server 2005), and every attempt is answered with a reset.  We learned that there are two routers between the server WEB1 and the SQL Server.  From the MAC address alone we can’t tell whether the resets are coming from the SQL Server or from one of the routers, since the source MAC address in the reset packets will always be that of the last router in the network path between the servers.  However, taking a closer look at the reset packets, we notice that the TTL has been decremented by 2 (TTL=126, down from the Windows default of 128), as shown below:

 

+ Ethernet: Etype = Internet IP (IPv4)

- Ipv4: Next Protocol = TCP, Packet ID = 14341, Total IP Length = 40

  + Versions: IPv4, Internet Protocol; Header Length = 20

  + DifferentiatedServicesField: DSCP: 0, ECN: 0

    TotalLength: 40 (0x28)

    Identification: 14341 (0x3805)

  + FragmentFlags: 0 (0x0)

    TimeToLive: 126 (0x7E)

    NextProtocol: TCP, 6(0x6)

    Checksum: 9687 (0x25D7)

    SourceAddress: 172.16.0.1

    DestinationAddress: 10.0.0.1

- Tcp: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

    SrcPort: 2383

    DstPort: 20531

    SequenceNumber: 0 (0x0)

    AcknowledgementNumber: 655839664 (0x271751B0)

  + DataOffset: 80 (0x50)

  + Flags: ..R.A...

    Window: 0 (scale factor 0) = 0

    Checksum: 65170 (0xFE92)

    UrgentPointer: 0 (0x0)
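The TimeToLive value in this dump can be turned into a hop count with a simple heuristic.  The sketch below assumes the sender started from one of the common default TTLs (64, 128 or 255) and picks the smallest one that is not below the observed value; the function name estimate_hops is invented here, and the result is an estimate, not a proof.

```python
from typing import Tuple

# Common initial TTL values: 128 is the Windows default, 64 and 255 are
# typical of other systems and network devices.
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl: int) -> Tuple[int, int]:
    """Guess (initial TTL, number of routers crossed) from an observed TTL."""
    if observed_ttl < 1:
        raise ValueError("TTL must be at least 1")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl
    raise ValueError("TTL cannot exceed 255")


if __name__ == "__main__":
    print(estimate_hops(126))  # prints (128, 2)
```

The guess is only as good as the assumption about the initial value: a sender configured with a non-default TTL would make the hop count misleading, so it should be checked against what is known about the network path.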

 

The TTL being decremented by two means this packet has been routed twice, and since we know there are two routers between the servers, we can conclude the resets really originate at the SQL Server and not at either of the routers.  Why is the SQL Server resetting these connection attempts?  The answer to this question is also the solution to the problem:  TCP port 2383 is the one used by SQL Server 2005, but our SQL Server is a SQL Server 2000, whose default Analysis Services instance listens on TCP port 2725 instead.

We confirmed that by running “netstat -b” on the SQL Server.  Here is the abbreviated version of the output:

TCP    svp020064:2725         SVP060005:42802        ESTABLISHED     2344

 

TCP    svp020064:2725         SVP060002:58478        ESTABLISHED     2344

 

TCP    svp020064:2725         SVP060001:2064         ESTABLISHED     2344

We then used the “tasklist” command to identify which process has PID 2344.  The abbreviated version of the output is below:

Image Name      PID     Session Name    Session#    Mem Usage
============    ====    ============    ========    ==========
msmdsrv.exe     2344    Console         0           734,136 K
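The same lookup (port to PID, then PID to image name) can be scripted when there are many connections to sift through.  The sketch below parses output lines in the shape shown above; the function names are mine, and the parsing assumes whitespace-separated columns and image names without spaces, which will not hold for every process.

```python
from typing import List, Optional, Set


def pids_on_port(netstat_lines: List[str], port: int) -> Set[int]:
    """Collect the PIDs whose local TCP endpoint uses the given port."""
    pids = set()
    for line in netstat_lines:
        fields = line.split()
        # Expected shape: TCP  local:port  remote:port  STATE  PID
        if len(fields) != 5 or fields[0] != "TCP":
            continue
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port):
            pids.add(int(fields[4]))
    return pids


def image_for_pid(tasklist_lines: List[str], pid: int) -> Optional[str]:
    """Return the image name tasklist reports for a PID, if present."""
    for line in tasklist_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] == str(pid):
            return fields[0]
    return None


NETSTAT = [
    "TCP    svp020064:2725         SVP060005:42802        ESTABLISHED     2344",
    "TCP    svp020064:2725         SVP060002:58478        ESTABLISHED     2344",
    "TCP    svp020064:2725         SVP060001:2064         ESTABLISHED     2344",
]
TASKLIST = [
    "Image Name      PID     Session Name    Session#    Mem Usage",
    "msmdsrv.exe     2344    Console         0           734,136 K",
]

if __name__ == "__main__":
    for pid in pids_on_port(NETSTAT, 2725):
        print(pid, image_for_pid(TASKLIST, pid))  # prints 2344 msmdsrv.exe
```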

 

The solution is to change the client application so that it connects to the right port, TCP 2725 instead of TCP 2383.
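Before changing the application, a quick check from the client side can confirm which candidate port actually accepts connections.  This is a minimal sketch: the function name open_ports is invented, and the address in the usage section is the anonymized one used in this post, so substitute the real server.

```python
import socket
from typing import Dict, Iterable


def open_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, bool]:
    """Map each port to True if a TCP connection to it succeeds."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, timed out, or unreachable: treated as not open here.
            results[port] = False
    return results


if __name__ == "__main__":
    # 172.16.0.1 is the anonymized SQL Server address from this post.
    print(open_ports("172.16.0.1", [2383, 2725]))
```

A successful connect only shows that something is listening on the port; the netstat and tasklist steps are still what tie the port to the Analysis Services process.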