In this blog post, I’ll be talking about an interesting problem that I dealt with recently: clients running in a certain VLAN were not able to establish HTTPS connections through the TMG server. Due to the nature of the network, the clients had to be configured as SecureNAT clients (my customer could not configure them as web proxy clients or use the TMG Client software, because these machines are guest machines).
I asked for the usual data from our customer to find out what was happening during the problem:
- Network trace & HTTPWatch logs from a test client
- TMG data packager from the TMG server
1. After receiving the data, I started from the client side. What I see there is that the client starts failing right after the initial TCP 3-way handshake:
Note: For privacy purposes, all IP addresses have been replaced with random addresses.
=> Client network trace:
(ip.addr eq 10.110.233.50 and ip.addr eq 172.16.1.1) and (tcp.port eq 50183 and tcp.port eq 443)
453 14:44:32 01.05.2012 27.1395045 0.0000000 10.110.233.50 172.16.1.1 TCP TCP: [Bad CheckSum]Flags=......S., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515135, Ack=0, Win=8192 ( Negotiating scale factor 0x2 ) = 8192
456 14:44:32 01.05.2012 27.1565249 0.0170204 172.16.1.1 10.110.233.50 TCP TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=50183, PayloadLen=2, Seq=2180033885 - 2180033888, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
457 14:44:32 01.05.2012 27.1565443 0.0000194 10.110.233.50 172.16.1.1 TCP TCP: [Bad CheckSum]Flags=...A...., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515136, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
458 14:44:32 01.05.2012 27.1585176 0.0019733 10.110.233.50 172.16.1.1 TLS TLS:TLS Rec Layer-1 HandShake: Client Hello.
465 14:44:32 01.05.2012 27.4744962 0.3159786 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #458] [Bad CheckSum]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
476 14:44:33 01.05.2012 28.0828875 0.6083913 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #458] [Bad CheckSum]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
494 14:44:34 01.05.2012 29.2948179 1.2119304 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #458]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
512 14:44:35 01.05.2012 30.3815127 1.0866948 172.16.1.1 10.110.233.50 TLS TLS:Continued Data: 2 Bytes
513 14:44:35 01.05.2012 30.3815385 0.0000258 10.110.233.50 172.16.1.1 TCP TCP:Flags=...A...., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
516 14:44:35 01.05.2012 30.5019345 0.1203960 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #458]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
535 14:44:37 01.05.2012 31.7018989 1.1999644 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #458] [Bad CheckSum]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
568 14:44:38 01.05.2012 33.1872833 1.4853844 172.16.1.1 10.110.233.50 TLS TLS:Continued Data: 6 Bytes
- The client cannot connect to the remote web site: the SSL/TLS negotiation doesn’t succeed because, from the client’s perspective, no response is received from the web server
2. Then I decided to check things from TMG server perspective. I first checked the network trace that was collected on the internal interface of TMG server through which the client request was received:
6457 14:33:49 01.05.2012 39.8915584 0.0000000 10.110.233.50 172.16.1.1 TCP TCP:Flags=......S., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515135, Ack=0, Win=8192 ( Negotiating scale factor 0x2 ) = 8192
6461 14:33:49 01.05.2012 39.9079944 0.0164360 172.16.1.1 10.110.233.50 TCP TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=50183, PayloadLen=0, Seq=2180033885, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
6462 14:33:49 01.05.2012 39.9084181 0.0004237 10.110.233.50 172.16.1.1 TCP TCP:Flags=...A...., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515136, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
6463 14:33:49 01.05.2012 39.9103936 0.0019755 10.110.233.50 172.16.1.1 TLS TLS:TLS Rec Layer-1 HandShake: Client Hello.
6489 14:33:49 01.05.2012 40.2259991 0.3156055 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #6463]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
6614 14:33:50 01.05.2012 40.8343158 0.6083167 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #6463]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
6806 14:33:51 01.05.2012 42.0457229 1.2114071 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #6463]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
6872 14:33:52 01.05.2012 43.1311020 1.0853791 172.16.1.1 10.110.233.50 TCP TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=50183, PayloadLen=0, Seq=2180033885, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
6873 14:33:52 01.05.2012 43.1314010 0.0002990 10.110.233.50 172.16.1.1 TCP TCP:Flags=...A...., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
6891 14:33:52 01.05.2012 43.2517820 0.1203810 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #6463]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
7110 14:33:54 01.05.2012 44.4514449 1.1996629 10.110.233.50 172.16.1.1 TCP TCP:[ReTransmit #6463]Flags=...AP..., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=127, Seq=367515136 - 367515263, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
7481 14:33:55 01.05.2012 45.9360680 1.4846231 172.16.1.1 10.110.233.50 TCP TCP:Flags=...A.R.., SrcPort=HTTPS(443), DstPort=50183, PayloadLen=0, Seq=2180033886, Ack=367515136, Win=8192 (scale factor 0x0) = 8192
- The TMG server never responds to the client’s TLS Client Hello; it only repeats the SYN/ACK and finally resets the connection
3. Then I decided to check the network trace collected on the external interface of the TMG server:
21 14:33:49 01.05.2012 39.6940232 0.0000000 10.110.235.202 172.16.1.1 TCP TCP:Flags=......S., SrcPort=55073, DstPort=HTTPS(443), PayloadLen=0, Seq=367515135, Ack=0, Win=8192 ( Negotiating scale factor 0x2 ) = 8192
22 14:33:49 01.05.2012 39.7097097 0.0156865 172.16.1.1 10.110.235.202 TCP TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=55073, PayloadLen=0, Seq=2180033885, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
23 14:33:52 01.05.2012 42.9329903 3.2232806 172.16.1.1 10.110.235.202 TCP TCP:Flags=...A..S., SrcPort=HTTPS(443), DstPort=55073, PayloadLen=0, Seq=2180033885, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
24 14:33:55 01.05.2012 45.7379523 2.8049620 172.16.1.1 10.110.235.202 TCP TCP:Flags=...A.R.., SrcPort=HTTPS(443), DstPort=55073, PayloadLen=0, Seq=2180033886, Ack=367515136, Win=8192 (scale factor 0x0) = 8192
- The external web server responds to the TCP SYN forwarded by TMG, but the TMG server doesn’t proceed with the connection
4. Then I checked the Web Proxy/Firewall log on the TMG server:
01.05.2012 14:33 Denied 10.110.233.50 50183 172.16.1.1 443 0xc0040034 FWX_E_SEQ_ACK_MISMATCH
Looking up the details of that error code shows that we fail because we receive a TCP packet with an invalid sequence or acknowledgement number:
http://msdn.microsoft.com/en-us/library/ms812624.aspx/
FWX_E_SEQ_ACK_MISMATCH 0xC0040034 A TCP packet was rejected because it has an invalid sequence number or an invalid acknowledgement number.
So the TMG server drops the TCP ACK packet (the 3rd packet of the TCP 3-way handshake) coming from the client, because it carries an invalid TCP ACK number.
5. The problem was also visible in the ETL trace:
…
… handshake packet is dropped becuase ACK (2180033888) no equal ISN(peer)+1 (2180033886)
… Warning:The packet failed TCP sequence validation
… Warning:The packet is dropped because of SEQ_ACK_MISMATCH
6. When we check the TMG-side network trace, we can see the mismatch there:
SYN (client → server): SequenceNumber: 367515135 (0x15E7D5FF)
SYN/ACK (server → client): SequenceNumber: 2180033885 (0x81F0AD5D), AcknowledgementNumber: 367515136 (0x15E7D600)
ACK (client → server): SequenceNumber: 367515136 (0x15E7D600), AcknowledgementNumber: 2180033888 (0x81F0AD60)
That Acknowledgement number SHOULD HAVE BEEN 2180033886 (0x81F0AD5E)
So TMG ignores the rest of the session (including the TLS Client Hello coming from the client machine).
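The drop decision can be sketched in a few lines. This is an illustrative simplification, not TMG's actual state machine: the rule shown is only that the ACK completing a 3-way handshake must acknowledge exactly the peer's ISN + 1 (the SYN flag consumes one sequence number), modulo 2^32.

```python
MOD = 2 ** 32

def handshake_ack_valid(peer_isn, ack):
    """Simplified sketch of the check behind FWX_E_SEQ_ACK_MISMATCH:
    the final ACK of the 3-way handshake must acknowledge exactly
    the peer's ISN + 1 (the SYN flag consumes one sequence number)."""
    return ack == (peer_isn + 1) % MOD

# Values from the trace above:
print(handshake_ack_valid(2180033885, 2180033888))  # the client's actual ACK
print(handshake_ack_valid(2180033885, 2180033886))  # the ACK TMG expected
```

With the values from the trace, the client's actual ACK fails the check and the expected one passes, which is exactly the SEQ_ACK_MISMATCH drop seen in the log.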
7. When I check the client side trace, I see that the ACK number in the TCP ACK packet is really set to the wrong value (2180033888) by the client:
Frame: Number = 6462, Captured Frame Length = 60, MediaType = ETHERNET
+ Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[00-11-22-33-44-55],SourceAddress:[00-12-34-56-78-1B]
+ Ipv4: Src = 10.110.233.50, Dest = 172.16.1.1, Next Protocol = TCP, Packet ID = 24041, Total IP Length = 40
- Tcp: Flags=...A...., SrcPort=50183, DstPort=HTTPS(443), PayloadLen=0, Seq=367515136, Ack=2180033888, Win=64860 (scale factor 0x0) = 64860
SrcPort: 50183
DstPort: HTTPS(443)
+ DataOffset: 80 (0x50)
+ Flags: ...A....
Window: 64860 (scale factor 0x0) = 64860
Checksum: 0x4B5D, Good
UrgentPointer: 0 (0x0)
8. One might think it’s a problem with the TCP/IP stack on the client, but when we check the TCP SYN/ACK packet (the second packet of the handshake, sent by the TMG server before the client sends the ACK with the wrong acknowledgement number), we see that the client receives that SYN/ACK with 2 bytes of extra data. That is unusual: a TCP SYN/ACK shouldn’t carry any payload, just the TCP header and options:
Frame: Number = 456, Captured Frame Length = 60, MediaType = ETHERNET
+ DestinationAddress: Microsoft Corporation [00-11-22-33-44-55]
+ SourceAddress: Test Data [00-12-34-56-78-1B]
EthernetType: Internet IP (IPv4), 2048(0x800)
- Ipv4: Src = 172.16.1.1, Dest = 10.110.233.50, Next Protocol = TCP, Packet ID = 1342, Total IP Length = 46
+ Versions: IPv4, Internet Protocol; Header Length = 20
+ DifferentiatedServicesField: DSCP: 0, ECN: 0
TotalLength: 46 (0x2E)
Identification: 1342 (0x53E)
+ FragmentFlags: 0 (0x0)
TimeToLive: 120 (0x78)
NextProtocol: TCP, 6(0x6)
Checksum: 46957 (0xB76D)
SourceAddress: 172.16.1.1
DestinationAddress: 10.110.233.50
- Tcp: Flags=...A..S., SrcPort=HTTPS(443), DstPort=50183, PayloadLen=2, Seq=2180033885 - 2180033888, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
SrcPort: HTTPS(443)
DstPort: 50183
+ DataOffset: 96 (0x60)
+ Flags: ...A..S.
Window: 64240 ( Scale factor not supported ) = 64240
Checksum: 0x365C, Good
- TCPOptions:
- MaxSegmentSize: 1
type: Maximum Segment Size. 2(0x2)
OptionLength: 4 (0x4)
MaxSegmentSize: 1380 (0x564)
- TCPPayload: SourcePort = 443, DestinationPort = 50183
UnknownData: Binary Large Object (2 Bytes)
That’s why the TCP/IP stack running on the client sends a TCP ACK number which is 2 greater than it should be:
ACK number sent by the client: AcknowledgementNumber: 2180033888 (0x81F0AD60)
The correct ACK number that should have been sent: AcknowledgementNumber: 2180033886 (0x81F0AD5E)
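The off-by-two ACK follows directly from TCP's acknowledgement rules: the client acknowledges everything it received, i.e. the server's ISN, plus 1 for the SYN flag, plus the payload length. A quick check with the values from the trace:

```python
server_isn = 2180033885
syn_flag = 1       # the SYN flag consumes one sequence number
payload_len = 2    # the 2 mystery bytes attached to the SYN/ACK

# What the client actually acknowledged:
print((server_isn + syn_flag + payload_len) % 2 ** 32)  # 2180033888

# What it would have acknowledged without the extra bytes:
print((server_isn + syn_flag) % 2 ** 32)                # 2180033886
```

So the client's TCP/IP stack behaved correctly for the bytes it received; the real defect is whatever appended those 2 bytes to the SYN/ACK.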
9. And when we check the TCP SYN/ACK packet that is leaving the TMG server, we don’t see those 2 extra bytes:
Frame: Number = 6461, Captured Frame Length = 60, MediaType = ETHERNET
+ Ipv4: Src = 172.16.1.1, Dest = 10.110.233.50, Next Protocol = TCP, Packet ID = 1342, Total IP Length = 44
- Tcp: Flags=...A..S., SrcPort=HTTPS(443), DstPort=50183, PayloadLen=0, Seq=2180033885, Ack=367515136, Win=64240 ( Scale factor not supported ) = 64240
Checksum: 0x365E, Good
So those 2 extra bytes of data are added to the TCP SYN/ACK by something else: the NIC driver on the TMG server, a network device running in between (a wireless access point, for example), or the NIC driver on the client machine.
SUMMARY:
==========
In summary, the HTTPS connectivity problem stems from an issue between the client and the TMG server (including the NIC layer or below on the client and server, and the network devices/links in between the two).
My customer informed me that the issue was visible with all clients, which makes a client-side cause unlikely. I advised my customer to update the NIC drivers on the TMG server side, check the network devices running in the path, and upgrade firmware where possible.
Hope this helps
Thanks,
Murat
In the “network analysis tricks” series of posts, I’ll try to explain some techniques that I use when analyzing network traces.
In this first post, I would like to explain how I find all TCP connection attempts in a network trace.
To see all TCP connection attempts in a network trace, you can use the following filter: (applies to both Network Monitor and Wireshark)
tcp.flags.syn==1
Network Monitor output:
Wireshark output:
After we apply the filter, we see that 10.1.1.4 tries to open different TCP sessions to 10.1.1.3 at TCP ports 135, 1064, 3028 and 3029. Please note that the filter only displays the first two packets of each TCP session, because only the first two TCP segments have the SYN flag set to 1. After finding the TCP sessions, you can see a whole TCP session with a new filter.
For example, if we need to see all the TCP packets in the TCP session 10.1.1.4:3026 <--> 10.1.1.3:135, we can apply the following filter and see all the packets exchanged in that session:
ip.addr==10.1.1.4 and ip.addr==10.1.1.3 and tcp.port==3026 and tcp.port==135
You can extend the filter to limit the output. Please see some examples below:
=> This filter will show all TCP connection attempts to TCP port 80 (HTTP)
tcp.flags.syn==1 and tcp.port==80
=> This filter will show all TCP connection attempts to TCP port 80 (HTTP) between 10.1.1.3 and 10.1.1.4
tcp.flags.syn==1 and tcp.port==80 and ip.addr==10.1.1.3 and ip.addr==10.1.1.4 (for Wireshark)
tcp.flags.syn==1 and tcp.port==80 and IPv4.Address==10.1.1.3 and IPv4.Address==10.1.1.4 (for Network Monitor)
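If you need to run the same SYN search programmatically (for example across many capture files), the test is easy to script. Below is a minimal, illustrative sketch, not a full parser: it walks a classic little-endian libpcap file using only the Python standard library, assumes Ethernet framing and IPv4, and yields every TCP segment with the SYN bit set.

```python
import struct

def tcp_syn_frames(pcap_bytes):
    """Yield (frame_no, src_ip, dst_ip, src_port, dst_port, has_ack)
    for every TCP segment with the SYN flag set, from a classic
    little-endian libpcap capture of Ethernet/IPv4 frames."""
    magic, = struct.unpack_from('<I', pcap_bytes, 0)
    if magic != 0xa1b2c3d4:
        raise ValueError('not a little-endian pcap file')
    offset, frame_no = 24, 0              # skip the 24-byte global header
    while offset + 16 <= len(pcap_bytes):
        frame_no += 1
        _, _, incl_len, _ = struct.unpack_from('<IIII', pcap_bytes, offset)
        frame = pcap_bytes[offset + 16: offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b'\x08\x00':
            continue                      # not an IPv4-over-Ethernet frame
        ip = frame[14:]
        if ip[9] != 6:                    # IP protocol 6 = TCP
            continue
        ihl = (ip[0] & 0x0f) * 4          # IP header length in bytes
        tcp = ip[ihl:]
        if len(tcp) < 14:
            continue                      # truncated TCP header
        src = '.'.join(str(b) for b in ip[12:16])
        dst = '.'.join(str(b) for b in ip[16:20])
        sport, dport = struct.unpack_from('>HH', tcp, 0)
        flags = tcp[13]                   # SYN = 0x02, ACK = 0x10
        if flags & 0x02:
            yield frame_no, src, dst, sport, dport, bool(flags & 0x10)
```

A frame with only SYN set is a connection attempt; SYN plus ACK is the answering half, exactly as with the display filter above.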
If you would like to see all TCP sessions in a network trace (not just the ones that were established within the timeframe covered by the trace) along with some useful information, you can use a nice Wireshark feature called “Conversations”.
You first need to select "Conversations" from "Statistics" menu in Wireshark:
Then you need to click on the "TCP" tab in the "Conversations" window.
So we can see the same four TCP sessions that were found with the previous filter. On top of that, we can see many details for each session, like how many packets/bytes were exchanged in each direction, the duration of a given TCP session within the trace, and the approximate bandwidth usage of a given TCP session.
When troubleshooting network connectivity/performance problems, I very often use the above techniques to find the different TCP sessions, because an application/service might try to establish multiple TCP sessions when doing its work.
In this blog post, I’ll be discussing an Exchange 2010 publishing issue on TMG 2010. The problem was that, after some time, Outlook clients were failing to access the Exchange 2010 server via RPC over HTTPS.
There may be many different reasons for Exchange publishing problems, but in this case everything seemed green in the “Test Rule” output that we ran on the TMG server. So from the TMG server’s perspective everything seemed good. The output I mention is similar to the one below:
Note: This is taken from the excellent tutorial written by Richard Hicks:
http://www.isaserver.org/tutorials/Publishing-Exchange-Outlook-Web-App-OWA-Microsoft-Forefront-Threat-Management-Gateway-TMG-2010-Part2.html
So we decided to collect TMG Data Packager output to better understand what was going on while the problem occurred. In the TMG Data Packager output we realized that some of the incoming TCP connections were being dropped by the TMG server with the error QUOTA EXCEEDED.
Indeed, we saw that the TMG server didn’t respond to some TCP SYNs, and the reason was that the flood mitigation mechanism had been triggered:
Normally a remote Outlook client will establish more than one TCP connection, but that shouldn’t trigger flood mitigation under normal circumstances. Something in the above screenshot might have drawn your attention:
All incoming TCP connections come from the same source IP address: 192.168.200.1
Actually, that was the key to understanding the root cause of the problem. As per the customer’s network setup, the router running in front of the TMG server (which, by the way, had a single NIC in the DMZ network) was NATting the source IP address of all external systems, and that was why all incoming connections appeared to come from the same IP address. After turning flood mitigation off on the TMG server, the issue was resolved. (The customer increased the limits later on, but in general flood mitigation doesn’t add much value here, since all legitimate traffic arrives from the same IP address, 192.168.200.1.) I would also like to thank Franck, who helped me a lot while dealing with this issue.
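To see why NAT defeats a per-source-IP quota, consider this toy model. The limit value is made up for illustration and TMG's real flood mitigation bookkeeping is more elaborate, but the failure mode is the same: one NATted address absorbs the whole quota.

```python
from collections import defaultdict

class QuotaGuard:
    """Toy model of a per-source-IP concurrent-connection quota."""
    def __init__(self, limit=160):       # illustrative limit, not TMG's exact default
        self.limit = limit
        self.active = defaultdict(int)   # source IP -> open connections

    def accept_syn(self, src_ip):
        if self.active[src_ip] >= self.limit:
            return False                 # dropped: QUOTA EXCEEDED
        self.active[src_ip] += 1
        return True

guard = QuotaGuard()
# 500 distinct Outlook clients, but the router NATs them all to one address:
results = [guard.accept_syn("192.168.200.1") for _ in range(500)]
print(results.count(True), "accepted,", results.count(False), "dropped")
# -> 160 accepted, 340 dropped
```

Had the clients kept their real source addresses, each would have stayed far below the limit and no SYN would have been dropped.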
In this post, I would like to talk about some important points regarding network capturing. If a network trace is not collected appropriately, it won’t provide any useful information, and analyzing such a trace will be a waste of time.
Additionally, just collecting the network trace isn’t sufficient if you intend to ask for help analyzing it; you also have to provide some information about the trace itself. I generally collaborate with other colleagues on network trace analysis, and I have a standard template of questions for when I’m approached by a colleague for assistance in analyzing a network trace:
- What is the exact problem definition
- Which network traces were collected on which system
- The IP addresses of the relevant systems (like client/server/DC/DNS)
- OS versions for relevant systems
- Network topology between the source and target systems on which network traces were collected
- The exact date & time of the problem & error seen
- The exact error message seen
- What exact actions were taken when collecting the network traces (in as much detail as possible)
Now let’s talk about some important points that we need to be aware of to be able to collect a usable network trace that will really help you troubleshoot a given problem.
1) First of all, we need to make sure that it really makes sense to collect a network trace for the problem at hand. You can check the previous blog post to get a better idea on this:
http://blogs.technet.com/b/nettracer/archive/2012/06/22/when-do-we-need-collect-network-traces.aspx
2) Especially in switched networks, when we collect a network trace from a given node (a client or server), only the following traffic will be seen by the capturing agent (like Network monitor/Wireshark/...) running on the node:
- Packets sent out by the node itself
- Packets sent to the node’s unicast address
- Packets sent to unknown unicast addresses (switch doesn’t have that MAC address at its MAC address table yet so it floods the frame everywhere)
- Packets sent to broadcast address
- Packets sent to multicast addresses
So we won’t be able to see the packets sent to/received by client2 in a network trace collected on client1. If you really have to see the packets of a node other than the one on which the network trace is collected, you have to configure port mirroring (and your LAN switch has to support it as well). Most LAN switches used in enterprise networks support port mirroring. Below is a link describing such a configuration on Cisco LAN switches:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008015c612.shtml
Switched Port Analyzer (SPAN) Configuration Example
3) If you’re troubleshooting a communication or performance problem between two processes running on the same node, that traffic won’t leave the machine, and hence won’t be captured by the capturing agent (Network Monitor/Wireshark); it is looped back by the TCP/IP stack. As an example, you won’t be able to see the network activity taking place between Internet Explorer and a web server running on the same machine. If you need to troubleshoot such a scenario, you might try to collect an ETL trace instead, but the node will have to be running Windows 7 or Windows 2008 R2 for that. Please see the following post for more details on collecting such an ETL trace:
http://blogs.technet.com/b/nettracer/archive/2010/10/06/how-it-works-under-the-hood-a-closer-look-at-tcpip-and-winsock-etl-tracing-on-windows-7-and-windows-2008-r2-with-an-example.aspx
4) When collecting a network trace from a busy server, a capture filter might be applied to minimize the amount of traffic captured. We generally prefer not to capture with a filter, because applying one risks excluding traffic that might be really relevant to the issue. If you’re sure about what you have to check, then you may want to apply such a filter. You can find below an example of capturing with a filter with nmcap (the command-line version of Network Monitor):
Note: The following is taken from nmcap /examples output:
This example starts capturing network frames that DO NOT contain ARPs, ICMP, NBtNs and BROWSER frames. If you want to stop capturing, Press Control+C.
nmcap /network * /capture (!ARP AND !ICMP AND !NBTNS AND !BROWSER) /File NoNoise.cap
5) If you really need to capture network traffic from a very busy server and you don’t want to risk excluding network traffic that might be relevant, you might want to capture, say, only the first 256 bytes of each packet. Considering that a standard Ethernet frame is about 1500 bytes, this provides a saving of roughly 80%. You can find below an nmcap example where only the first 256 bytes of each packet are captured:
nmcap /network * /MaxFrameLength 256
6) If network traces will be collected for an extended period, capturing all packets into the same file will make it nearly impossible to analyze (example: a 5 GB network trace). To collect manageable and analyzable network traces, it’s suggested to capture chained, fragmented files. You can find below an example for nmcap again:
nmcap /network * /capture /file ServerTest.chn:200M
Note: nmcap will create a new capture file once the first one is full (200 MB), and so on. So please make sure that you have enough free disk space on the related drive.
Note: The traces created will be named as ServerTest.cap, ServerTest(1).cap, ServerTest(2).cap,...
7) If you have to collect network traces for an unspecified period of time and you would like to see some activity taking place some time before the problem, you may have to collect network traces in a circular fashion which is possible with dumpcap (command line version of Wireshark for trace collection). You can see an example below:
dumpcap -i 2 -w c:\traces\servername.pcap -b filesize:204800 -b files:80
Note 1: Interface id "2" will be monitored and each capture file will be 204800 KB (200 MB)
Note 2: The command assumes that the c:\traces folder already exists. Also please make sure that there's enough free space on that drive (C: in this instance); 16 GB of free space will be required to create and save 80 x 200 MB traces.
Note 3: Eighty different files will be created with the "servername_0000n_Date&time.pcap" naming syntax.
Example:
servername_00001_20120622134811.pcap
servername_00002_20120622135617.pcap
servername_00003_20120622141512.pcap
...
Note 4: When all eighty files have been created and are full, dumpcap will start overwriting, beginning with the oldest trace file
Note 5: The trace can be stopped at any time by pressing Ctrl+C
8) It’s important to mark network traces with pings, so that you can narrow down the time period you need to focus on in the trace. For example, you can ping the client’s default gateway just before and right after reproducing the problem.
Example1:
<<Start network trace on the client>>
ping -l 22 -n 2 IP-address-of-default-gateway
<<Reproduce the problem now. Example: Try to connect to www.microsoft.com from IE and once you get the “page not found” run the second ping>>
ping -l 33 -n 2 IP-address-of-default-gateway
<<Stop network trace on the client>>
Example2:
ping -l 22 -n 5 IP-address-of-the-file-server
start > run > \\server\share
<<assuming that it takes 5+ seconds to open up the share content. Once the share content is listed, please run the below command>>
ping -l 33 -n 5 IP-address-of-the-file-server
<<Please write down the following information : the exact date&time of this test / how long it took to display the share content / exact \\server\share that you accessed >>
dir \\server\share
<<Please write down the following information : how long it took to display the share content when you used "dir" command>>
ping -l 44 -n 5 IP-address-of-the-file-server
When you start analyzing a network trace collected in that fashion, you can easily focus on a certain range of packets in the trace. Example:
Packet1
Packet2
Packet3
Packet4
Packet5
<<22 bytes ICMP echo request>>
Packet6
Packet7
Packet8
Packet9
Packet10
<<33 bytes ICMP echo request>>
Packet11
Packet12
...
We know that the issue was reproduced between the 22-byte and 33-byte ping markers, so we can focus only on the activity taking place between packet #6 and packet #10. Considering that it was a 50,000-packet trace, you have now isolated the problem down to 5 packets (you may not always be that lucky).
You might be wondering "how can I identify those 22- and 33-byte ICMP packets in the network trace?" Here's a trick that I generally use: I apply the following Wireshark filters to the network trace:
ip.len==50 and icmp (to identify the 22 bytes ping)
ip.len==61 and icmp (to identify the 33 bytes ping)
9) One of the most important points to take into consideration is collecting simultaneous network traces where possible. By “simultaneous network traces” I mean collecting a network trace on the source and on the target system at the same time. That may not always be possible, especially if one of those systems is not controlled by you (for example, if you’re troubleshooting a connectivity problem to a web site that belongs to another company).
Other than that, I cannot stress enough how important it is to collect simultaneous network traces. When troubleshooting network connectivity issues, without simultaneous traces you cannot conclude whether the target server received the packet, whether it sent a response back to the source, or whether the source received the response. Similarly, in network performance issues, you cannot conclude whether a response delay stems from the network path in between or from the target/source systems. Let me try to explain what I mean with a couple of examples:
We look at a client side network trace and see that the client sends 3 x TCP SYN segments to target without a response:
No. Time Delta Source Destination Protocol Info
141154 2011-03-31 16:52:29.488847 0.000000 192.168.4.71 10.1.1.1 TCP 37389 > 443 [SYN] Seq=0 Win=65535 Len=0
141158 2011-03-31 16:52:29.488847 0.000000 192.168.4.71 10.1.1.1 TCP 37389 > 443 [SYN] Seq=0 Win=65535 Len=0
144808 2011-03-31 16:52:29.801347 0.312500 192.168.4.71 10.1.1.1 TCP 37389 > 80 [SYN] Seq=0 Win=65535 Len=0
By looking at the client side trace, can you answer the following?
=> Did the target server really receive the above 3 TCP SYN segments?
=> Did the target server send a response back to the above TCP SYN segment?
=> Did the target server really send the response and we didn’t see it at the client side?
None of these questions can be answered from the client-side trace alone. You cannot say whether the target server really received those TCP SYNs, whether it received them and sent a response back, or whether it sent no response at all. To answer them correctly, you have to see the story from the target server’s perspective, by looking at a network trace collected on that system.
We look at a client side network trace and see that HTTP response is sent by the HTTP server after 4 seconds:
Time Delta Source Destination Protocol Info
16:57:37.537895 0.000000 192.168.4.71 10.17.200.49 TCP 45221 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM=1
16:57:37.787895 0.250000 192.168.4.71 10.17.200.49 TCP 45221 > 80 [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT]
16:57:37.787895 0.000000 10.17.200.49 192.168.4.71 TCP 80 > 45221 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1380
16:57:37.787895 0.000000 192.168.4.71 10.17.200.49 HTTP GET /images/downloads/cartoons/thumb_1.jpg HTTP/1.1
16:57:38.053520 0.265625 10.17.200.49 192.168.4.71 TCP 80 > 45221 [ACK] Seq=1 Ack=356 Win=6432 Len=0
16:57:42.084770 4.031250 10.17.200.49 192.168.4.71 HTTP HTTP/1.1 200 OK (JPEG JFIF image)
16:57:42.084770 0.000000 10.17.200.49 192.168.4.71 HTTP Continuation or non-HTTP traffic
16:57:42.084770 0.000000 192.168.4.71 10.17.200.49 TCP 45221 > 80 [ACK] Seq=356 Ack=2761 Win=65535 [TCP CHECKSUM
16:57:42.350395 0.265625 10.17.200.49 192.168.4.71 HTTP Continuation or non-HTTP traffic
....
=> Does the 4 second delay come from the target server or a network device running in between?
=> Did the target server wait for 4 seconds before responding or did it immediately send a response back but we see it after 4 seconds at the client side?
Again, these questions cannot be answered from the client-side trace alone. You cannot say whether the 4-second delay really comes from the target web server or from a network device (a web proxy, for example) running in between. To answer them correctly, you have to see the story from the target server’s perspective, by looking at a network trace collected on that system.
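Once you do have both traces, the comparison can even be scripted: key each SYN on its TCP 4-tuple plus the initial sequence number (to tell retransmitted attempts from new ones) and diff the two sets. A hypothetical sketch, assuming you have already exported the SYN entries from each trace:

```python
def syns_lost_in_transit(client_syns, server_syns):
    """SYNs seen leaving the client but never arriving at the server.
    Each entry is ((src_ip, src_port, dst_ip, dst_port), initial_seq)."""
    seen_at_server = set(server_syns)
    return [s for s in client_syns if s not in seen_at_server]

# Illustrative data (relative sequence numbers, as Wireshark displays them):
client = [(("192.168.4.71", 37389, "10.1.1.1", 443), 0),
          (("192.168.4.71", 37390, "10.1.1.1", 443), 0)]
server = [(("192.168.4.71", 37390, "10.1.1.1", 443), 0)]
print(syns_lost_in_transit(client, server))
```

Here the 37389 attempt never shows up at the server, so the loss is somewhere on the path rather than on the server itself.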
In the second post of “network analysis tricks” series, I’ll explain how to focus on a certain range of packets in a network trace.
When I ask for a network trace from a customer, I almost always ask for ICMP markers before and after reproducing the problem. You can see an example action plan below to see what I mean:
<<Start network trace on the client>>ping -l 22 -n 1 IP-address-of-default-gateway<<Reproduce the problem now. Example: Try to connect to www.microsoft.com from IE and once you get the “page not found” run the second ping>>ping -l 33 -n 1 IP-address-of-default-gateway<<Stop network trace on the client>>
Note: You can check the following blog post for more information about general network trace collection tricks
http://blogs.technet.com/b/nettracer/archive/2012/06/22/network-traffic-capturing-hints.aspx
In our example, when the network trace is collected as per the above action plan, it’s very clear that we only need to focus on the packets received/sent between the 22-byte and 33-byte ICMP packets sent to the default gateway, and we can safely ignore any packet before the 22-byte ICMP packet or after the 33-byte one. So, if the 22-byte and 33-byte ICMP requests and responses are as given below:
Packet6 <<22 bytes ICMP echo request>>
Packet9 <<22 bytes ICMP echo response>>
Packet13
Packet14
Packet15
Packet16
Packet17
Packet18
Packet19
Packet20 <<33 bytes ICMP echo request>>
Packet21
Packet22
Packet23 <<33 bytes ICMP echo response>>
Packet24
Packet25
Packet26
Packet27
it means we only need to focus on the packets between packet10 and packet19, and can safely ignore all packets before packet10 or after packet19.
You might be wondering how you can identify those 22-byte or 33-byte ICMP packets in a network trace. It’s fairly simple: we filter the network trace based on the ICMP protocol and the IP packet length. For example, the following Wireshark filter will show us all ICMP packets whose payloads are 22 bytes or 33 bytes:
icmp and (ip.len==50 or ip.len==61)
=> The Total Length field in the IP header is set to 50 when we send a 22-byte ping request:
20 bytes IP header + 8 bytes ICMP header + 22 bytes ICMP payload
=> The Total Length field in the IP header is set to 61 when we send a 33-byte ping request:
20 bytes IP header + 8 bytes ICMP header + 33 bytes ICMP payload
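As a quick sanity check, the header arithmetic can be turned into a small helper that builds the Wireshark filter for any pair of marker payload sizes (the function is my own illustration, not a Wireshark API):

```python
IP_HEADER = 20    # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request/reply header

def marker_filter(payload1, payload2):
    """Build a Wireshark display filter matching the two ICMP markers."""
    len1 = IP_HEADER + ICMP_HEADER + payload1
    len2 = IP_HEADER + ICMP_HEADER + payload2
    return f"icmp and (ip.len=={len1} or ip.len=={len2})"

print(marker_filter(22, 33))
# icmp and (ip.len==50 or ip.len==61)
```
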
That would mean we need to focus on the frames between frame #126 and frame #210, since we reproduced the problem between the 22-byte and 33-byte pings as given below:
Of course we'll need to do further filtering based on the problem we're troubleshooting, but for now we have narrowed the traffic we need to analyze down to a certain range.
From this point on, whatever filter we apply to the network trace, we can put the following filter in front of it:
frame.number>=126 and frame.number<=210
Let's say we are troubleshooting a TCP connection problem to TCP port 389 in this network trace. Then we can simply apply the following filter (in Wireshark):
frame.number>=126 and frame.number<=210 and tcp.port==389
since we aren't interested in any packets sent/received on TCP port 389 before frame #126 or after frame #210 (those frames were captured before and after we reproduced the problem).
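Since every subsequent display filter just gets the frame-range clause prepended, a tiny helper keeps that consistent. Again, this is an illustrative function of my own, not part of Wireshark:

```python
def bounded(filter_expr, first=126, last=210):
    """Prepend the marker frame range to a Wireshark display filter."""
    return f"frame.number>={first} and frame.number<={last} and ({filter_expr})"

print(bounded("tcp.port==389"))
# frame.number>=126 and frame.number<=210 and (tcp.port==389)
```

The parentheses around the inner expression matter once the filter contains its own `or` clauses, so the range always constrains the whole expression.
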
Many Microsoft support engineers dealing with customer technical issues ask for network traces to further troubleshoot and isolate a given problem. In this post I wanted to give you an idea of when we generally ask for a network trace, so that you can take a similar approach for similar problems.
Maybe we can start with the question: "When do we need to collect and analyze a network trace?"
Even though the answers might vary, you will generally need to collect and analyze a network trace in the following situations:
• You need to troubleshoot network connectivity problems
• You need to troubleshoot network performance problems
• You need to troubleshoot network security problems
• You need to understand network behavior of protocols and applications for baselining or capacity planning purposes
You can also see below some example problem types that we get from our customers where we ask for network traces:
Network connectivity problem examples:
• Web browser cannot connect to Web server
• Remote share cannot be browsed
• Event Viewer cannot connect to remote event log service
• I get ‘RPC server is unavailable’ when I initiate AD replication
• We get ‘server not found’ error when starting the XYZ application (a 3rd party app)
• Exchange server doesn’t receive e-mails from the internet
• Sharepoint portal cannot be reached from clients in a certain site
• Sharepoint server cannot retrieve data from SQL server
• SCCM server cannot communicate with the SCCM agent
• 3rd party client application cannot connect to 3rd party server application over a VPN tunnel
Network performance problem examples:
• File copy between two servers takes too long
• HTTP downloads from the internet are slow
• Backing up one of our file servers through the network takes too long
• We see delays in browsing our web site
• FTP file transfers are too slow between certain sites
• Windows Explorer is too slow in showing the remote share content
• SQL server query performance over the WAN connections is too slow
• Outbound e-mails queue up on Exchange Edge server
• Outlook client cannot connect to Exchange CAS servers through a load balancer
Network security problem examples:
• We would like to understand why File server1 tries to establish a session to 10.1.1.1 through TCP port 7789
• In our firewall logs, we see that certain clients try to access a certain site. Why do those clients try to access that site?
• We would like to see which process generates a specific TCP session
• We would like to see the authentication protocol that the clients use to authenticate to Server X
• Kerberos/NTLM authentication problems
• Certain SSL authentication issues
• As soon as we connect the client machine to a switchport, the switchport is disabled due to excessive traffic coming from the client. We would like to know the reason behind that