Hi there,
In this blog post, I’ll talk about another network trace analysis scenario.
The problem was that some Windows XP clients were copying files from a NAS device very slowly compared to others. Since a network trace is one of the most useful logs for troubleshooting such problems, I requested one to be collected on a problem Windows XP client. Normally it's best to collect simultaneous network traces from both sides, but it was a bit difficult to collect a trace on the NAS device side, so we were limited to a client-side trace.
Before I start explaining how I got to the bottom of the issue, I would like to provide some background on how Windows reads files over the SMB protocol, so that the resolution part will make more sense:
Windows XP and Windows 2003 use the SMB v1 protocol for remote file system access (creating/reading/writing/deleting/locking files over a network connection). Since this scenario involves reading a file from a remote server, the following SMB activity would be seen between the client and server:
Client Server
===== ======
The client will open the file at the server first:
SMB Create AndX request ---->
<---- SMB Create AndX response
Then the client will send SMB Read AndX requests to retrieve blocks of the file:
SMB Read AndX request ----> (61440 bytes)
<---- SMB Read AndX response
...
Note: The SMBv1 protocol can request at most 61,440 bytes (roughly 61 KB) of data in one SMB Read AndX request.
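The read loop described above can be sketched as follows (a minimal model, not actual SMB client code; the 61,440-byte limit is the one mentioned in the note):

```python
READ_SIZE = 61440  # maximum payload of one SMB1 Read AndX request

def read_andx_requests(file_size):
    """Yield the (offset, length) pairs a client would request, in order."""
    offset = 0
    while offset < file_size:
        length = min(READ_SIZE, file_size - offset)
        yield offset, length
        offset += length

# Consecutive requests advance by exactly 61440 bytes, matching the
# offsets seen later in the trace (263823360, then 263884800, and so on).
reqs = list(read_andx_requests(64 * 1024 * 1024))
print(len(reqs))                 # number of requests for a 64 MB file
print(reqs[1][0] - reqs[0][0])   # 61440
```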
After this short overview, let’s get back to the original problem and analyze the packets taken from the real network trace:
Frame# Time delta Source IP Destination IP Protocol Information
===== ======== ========= ========== ====== ========
....
59269 0.000000 10.1.1.1 10.1.1.2 SMB Read AndX Request, FID: 0x0494, 61440 bytes at offset 263823360
59270 0.000000 10.1.1.2 10.1.1.1 SMB Read AndX Response, 61440 bytes
59271 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65993793
59272 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65995249
59273 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65996705
59320 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66049121
59321 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66050577
59322 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66052033
59323 0.000000 10.1.1.1 10.1.1.2 TCP foliocorp > microsoft-ds [ACK] Seq=67600 Ack=66053489 Win=65535
59325 0.406250 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > folioc [PSH, ACK]Seq=66053489
59326 0.000000 10.1.1.1 10.1.1.2 SMB Read AndX Request, FID: 0x0494, 61440 bytes at offset 263884800
59327 0.000000 10.1.1.2 10.1.1.1 SMB Read AndX Response, 61440 bytes
59328 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59327] microsoft-ds > foliocorp [ACK] Seq=66055297
Now let’s take a closer look at some related frames:
Frame# 59269 => The client requests the next 61 KB of data at offset 263823360 from the file represented by FID 0x0494 (this FID is assigned by the server when the file is first opened/created).
Frame# 59270 => The server starts sending 61440 bytes of data back to the client in an SMB Read AndX response.
Frame# 59271 => The remaining data is sent in chunks the size of the negotiated TCP MSS, typically 1460 bytes (see also frames #59272, #59273, etc.)
The most noticeable thing in the network trace was the large number of such 0.4-second delays (like the one at frame #59325). Those delays were always present at the last fragment of the 61 KB of data returned by the server.
Normally 0.4 seconds might seem like a very small delay, but considering that the client sends n SMB Read AndX requests to the server to read the file, it quickly becomes clear that a 0.4-second delay is huge (for example, the client needs to send roughly 1,000 SMB Read AndX requests to read a 64 MB file).
Generally we're used to seeing some delays in network traces due to packet retransmissions (packet loss), link transfer delays, etc. But seeing a constant 0.4-second delay on every last fragment of a 61 KB block made me suspect that a QoS implementation was in place somewhere between the client and server. By delaying every read request by about 0.4 seconds, the file copy was being slowed down on purpose: traffic shaping/limiting.
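Back-of-the-envelope arithmetic shows how badly a fixed per-request delay hurts this access pattern (the delay and request size come from the trace; the 64 MB file size is just an example):

```python
READ_SIZE = 61440        # bytes per SMB1 Read AndX request
SHAPING_DELAY = 0.4      # seconds added to each read, as seen in the trace

file_size = 64 * 1024 * 1024              # 64 MB example file
requests = -(-file_size // READ_SIZE)     # ceiling division
added_delay = requests * SHAPING_DELAY
print(requests, round(added_delay))       # ~1093 requests, ~437 extra seconds
```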
Since we didn't have a network trace from the NAS device side, we couldn't tell whether the QoS policy was in effect on the NAS device or on a network device in between (we had checked the client side, and there was no QoS configuration in place). After further checking the network devices, it turned out that one of them had an incorrectly configured QoS policy. After making the required changes, the problem was resolved...
Hope this helps
Thanks,
Murat
In today's blog post, I'm going to show you how I found out why a domain controller was contacting random clients in the domain. The customer raised this case out of security concerns: they suspected that a suspicious process might be running on the DC. In general we don't expect domain controllers to contact the clients in the domain, so our customer wanted to understand the reason behind it.
We first verified that the DC was really contacting some clients by collecting a network trace on the DC. Below you can see the DC (DC1) contacting one of those clients (CLIENT1):
Note: DC and client IP addresses are replaced for data privacy.
11415 14:21:12 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=3912, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=70128947, Ack=0, Win=65535 ( ) = 65535 {TCP:515, IPv4:46}
11443 14:21:12 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=3913, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=3133793441, Ack=0, Win=65535 ( ) = 65535 {TCP:518, IPv4:46}
30922 14:33:17 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4118, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2414564040, Ack=0, Win=65535 ( ) = 65535 {TCP:1270, IPv4:46}
30950 14:33:17 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4120, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1797119693, Ack=0, Win=65535 ( ) = 65535 {TCP:1273, IPv4:46}
51472 14:45:22 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4314, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1834145861, Ack=0, Win=65535 ( ) = 65535 {TCP:1403, IPv4:46}
51500 14:45:22 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4315, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=4278939251, Ack=0, Win=65535 ( ) = 65535 {TCP:1406, IPv4:46}
67096 14:57:26 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4514, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1707963693, Ack=0, Win=65535 ( ) = 65535 {TCP:1945, IPv4:46}
67126 14:57:26 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4515, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=3807245641, Ack=0, Win=65535 ( ) = 65535 {TCP:1948, IPv4:46}
74691 15:09:30 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4740, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1036190517, Ack=0, Win=65535 ( ) = 65535 {TCP:1983, IPv4:46}
74721 15:09:31 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4741, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2281072822, Ack=0, Win=65535 ( ) = 65535 {TCP:1986, IPv4:46}
84937 15:21:35 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4930, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=3190224054, Ack=0, Win=65535 ( ) = 65535 {TCP:2104, IPv4:46}
84965 15:21:35 05.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=4931, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2774224583, Ack=0, Win=65535 ( ) = 65535 {TCP:2107, IPv4:46}
At first look, what drew my attention was that the connection attempt was repeated every 12 minutes or so, which suggested something running periodically on the DC. Normally Network Monitor shows you the process that initiates a TCP session, but under heavy load it stops doing so in favor of performance, as it's a costly operation. There are other methods to find the process sending a certain packet, but I decided to let the DC do whatever it would do against the client so we could see the whole activity.
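The ~12-minute period is easy to confirm from the SYN timestamps in the listing above (a quick sketch; the timestamps are copied from the trace):

```python
from datetime import datetime

# Timestamp of the first SYN in each pair of connection attempts above
syn_times = ["14:21:12", "14:33:17", "14:45:22", "14:57:26", "15:09:30", "15:21:35"]

ts = [datetime.strptime(t, "%H:%M:%S") for t in syn_times]
intervals = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
print(intervals)  # every interval is ~725 seconds, i.e. about 12 minutes
```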
So the customer removed the firewall filters and allowed the DC to connect to CLIENT1. After doing so, we collected a new network trace to see the latest situation, and it gave us the expected results:
a) The first interesting finding was that the client was sending a “Master Browser” announcement to the DC (DC1) shortly before one of these connection attempts from the DC side:
47140 07:30:31 08.07.2010 CLIENT1 DC1 BROWSER BROWSER:Master Announcement {SMB:351, UDP:350, IPv4:3}
b) After that browser announcement, the DC contacted the client at TCP port 139 to establish an SMB session:
47595 07:30:33 08.07.2010 DC1 CLIENT1 TCP TCP:Flags=......S., SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594372577, Ack=0, Win=65535 ( ) = 65535 {TCP:373, IPv4:3}
47596 07:30:33 08.07.2010 CLIENT1 DC1 TCP TCP:Flags=...A..S., SrcPort=NETBIOS Session Service(139), DstPort=3787, PayloadLen=0, Seq=2981880191, Ack=2594372578, Win=8192 ( Scale factor not supported ) = 8192 {TCP:373, IPv4:3}
47597 07:30:33 08.07.2010 DC1 CLIENT1 TCP TCP:Flags=...A...., SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594372578, Ack=2981880192, Win=65535 (scale factor 0x0) = 65535 {TCP:373, IPv4:3}
c) Then it initiated a NetBT session to the client:
47598 07:30:33 08.07.2010 DC1 CLIENT1 NbtSS NbtSS:SESSION REQUEST, Length =68 {NbtSS:374, TCP:373, IPv4:3}
47599 07:30:33 08.07.2010 CLIENT1 DC1 NbtSS NbtSS:POSITIVE SESSION RESPONSE, Length =0 {NbtSS:374, TCP:373, IPv4:3}
d) Then it established an SMB connection:
47600 07:30:33 08.07.2010 DC1 CLIENT1 SMB SMB:C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12 {NbtSS:374, TCP:373, IPv4:3}
47602 07:30:33 08.07.2010 CLIENT1 DC1 SMB SMB:R; Negotiate, Dialect is NT LM 0.12 (#5), SpnegoToken (1.3.6.1.5.5.2) {NbtSS:374, TCP:373, IPv4:3}
47614 07:30:34 08.07.2010 DC1 CLIENT1 SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE {NbtSS:374, TCP:373, IPv4:3}
47615 07:30:34 08.07.2010 CLIENT1 DC1 SMB SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED {NbtSS:374, TCP:373, IPv4:3}
47616 07:30:34 08.07.2010 DC1 CLIENT1 SMB SMB:C; Session Setup Andx, NTLM AUTHENTICATE MESSAGE, Workstation: DC1 {NbtSS:374, TCP:373, IPv4:3}
47621 07:30:34 08.07.2010 CLIENT1 DC1 SMB SMB:R; Session Setup Andx {NbtSS:374, TCP:373, IPv4:3}
e) Then it connected to the interprocess communication share (IPC$):
47625 07:30:34 08.07.2010 DC1 CLIENT1 SMB SMB:C; Tree Connect Andx, Path = \\CLIENT1\IPC$, Service = ????? {NbtSS:374, TCP:373, IPv4:3}
47626 07:30:34 08.07.2010 CLIENT1 DC1 SMB SMB:R; Tree Connect Andx, Service = IPC {NbtSS:374, TCP:373, IPv4:3}
f) Then it called RAP (Remote Administration Protocol) APIs like NetServerEnum2 etc:
47630 07:30:34 08.07.2010 DC1 CLIENT1 RAP RAP:NetServerEnum2 Request, InfoLevel = 1, LocalList in {SMB:379, NbtSS:374, TCP:373, IPv4:3}
47631 07:30:34 08.07.2010 CLIENT1 DC1 RAP RAP:NetServerEnum2 Response, Count = 1 {SMB:379, NbtSS:374, TCP:373, IPv4:3}
g) Once it got the requested info, it logged off and disconnected the TCP session:
47642 07:30:34 08.07.2010 DC1 CLIENT1 SMB SMB:C; Logoff Andx {NbtSS:374, TCP:373, IPv4:3}
47643 07:30:34 08.07.2010 CLIENT1 DC1 SMB SMB:R; Logoff Andx {NbtSS:374, TCP:373, IPv4:3}
47650 07:30:34 08.07.2010 DC1 CLIENT1 SMB SMB:C; Tree Disconnect {NbtSS:374, TCP:373, IPv4:3}
47651 07:30:34 08.07.2010 CLIENT1 DC1 SMB SMB:R; Tree Disconnect {NbtSS:374, TCP:373, IPv4:3}
47657 07:30:34 08.07.2010 DC1 CLIENT1 TCP TCP:Flags=...A...F, SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594373651, Ack=2981881320, Win=64407 (scale factor 0x0) = 64407 {TCP:373, IPv4:3}
47658 07:30:34 08.07.2010 CLIENT1 DC1 TCP TCP:Flags=...A...F, SrcPort=NETBIOS Session Service(139), DstPort=3787, PayloadLen=0, Seq=2981881320, Ack=2594373652, Win=15559 (scale factor 0x0) = 15559 {TCP:373, IPv4:3}
47662 07:30:34 08.07.2010 DC1 CLIENT1 TCP TCP:Flags=...A...., SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594373652, Ack=2981881321, Win=64407 (scale factor 0x0) = 64407 {TCP:373, IPv4:3}
h) The same activity was seen every 12 minutes in the network trace.
RESULTS:
========
1) After analyzing the second network trace, the reason behind the DC's connections to clients was clear:
Every 12 minutes or so, the master browser in a network segment informs the domain master browser (which is the DC) that it is a master browser. The DC then connects to that master browser in turn to retrieve its browse list. You can find more details below:
Taken from http://technet.microsoft.com/en-us/library/cc737661(WS.10).aspx How Computer Browser Service Works:
When a domain spans multiple subnets, the master browse servers for each subnet use a unicast Master Announcement message to announce themselves to the domain master browse server. This message notifies the domain master browse server that the sending computer is a master browse server in the same domain. When the domain master browse server receives a Master Browse Server Announcement message, it returns to the “announcing” master browse server a request for a list of the servers in that master browse server’s subnet. When that list is received, the domain master browse server merges it with its own server list.
This process, repeated every 12 minutes, guarantees that the domain master browse server has a complete browse list of all the servers in the domain. Thus, when a client sends a browse request to a backup browse server, the backup browse server can return a list of all the servers in the domain, regardless of the subnet on which those servers are located.
Hi,
I would like to talk about a few network trace analysis cases where we were asked to find out why certain packets (specifically ICMP and UDP) were being sent by Exchange servers. Below you'll find more details about how we found the processes sending those packets:
a) Exchange servers sending UDP packets with random source or destination ports to various clients
In one scenario, our customer's security team wanted to find out why their Exchange servers were sending UDP packets to random clients on the network, again because of security concerns. There was no deterministic pattern in the source or destination UDP ports. The only consistency was that each UDP packet sent by the Exchange servers always had an 8-byte payload. You can see a sample network trace output below:
Note: Addresses were replaced for privacy purposes even though private IP address space was in use.
No. Time Source Destination Protocol Info
105528 2010-01-14 15:20:14.454856 10.1.1.1 172.1.10.14 UDP Source port: 35996 Destination port: mxomss
105530 2010-01-14 15:20:14.454856 10.1.1.1 172.18.10.27 UDP Source port: 35997 Destination port: edtools
105531 2010-01-14 15:20:14.454856 10.1.1.1 172.17.17.95 UDP Source port: 35998 Destination port: fiveacross
105535 2010-01-14 15:20:14.454856 10.1.1.1 172.17.11.51 UDP Source port: 36000 Destination port: kwdb-commn
105540 2010-01-14 15:20:14.454856 10.1.1.1 172.23.98.97 UDP Source port: 36003 Destination port: dicom-tls
105541 2010-01-14 15:20:14.454856 10.1.1.1 172.24.12.8 UDP Source port: 36004 Destination port: dkmessenger
105542 2010-01-14 15:20:14.454856 10.1.1.1 172.28.2.52 UDP Source port: 36005 Destination port: tragic
105545 2010-01-14 15:20:14.454856 10.1.1.1 172.31.5.14 UDP Source port: 36006 Destination port: xds
105546 2010-01-14 15:20:14.454856 10.1.1.1 172.2.10.63 UDP Source port: 36007 Destination port: 4642
105547 2010-01-14 15:20:14.454856 10.1.1.1 172.2.35.68 UDP Source port: 36008 Destination port: foliocorp
105552 2010-01-14 15:20:14.454856 10.1.1.1 172.18.12.55 UDP Source port: 36010 Destination port: saphostctrl
105553 2010-01-14 15:20:14.454856 10.1.1.1 172.48.199.45 UDP Source port: 36011 Destination port: slinkysearch
105554 2010-01-14 15:20:14.454856 10.1.1.1 172.27.133.42 UDP Source port: 36012 Destination port: oracle-oms
105555 2010-01-14 15:20:14.454856 10.1.1.1 172.27.121.40 UDP Source port: 36013 Destination port: proxy-gateway
105558 2010-01-14 15:20:14.454856 10.1.1.1 172.24.7.11 UDP Source port: 36016 Destination port: fcmsys
- Source UDP port is increasing and destination UDP port seems random at first sight
- The data part of the UDP datagrams is always 8 bytes. As an example:
Frame 105540 (50 bytes on wire, 50 bytes captured)
Ethernet II, Src: HewlettP_11:11:11 (00:1c:c4:11:11:11), Dst: All-HSRP-routers_15 (00:00:0c:07:ac:15)
Internet Protocol, Src: 10.1.1.1 (10.1.1.1), Dst: 172.23.98.97 (172.23.98.97)
User Datagram Protocol, Src Port: 36003 (36003), Dst Port: dicom-tls (2762)
Data (8 bytes)
=> To better understand which process might be sending these packets, we decided to collect a kernel TCPIP trace on the source Windows 2003 server. For more information about methods that can be used to identify the process sending a certain packet, please see my previous post on this.
After collecting a network trace and an accompanying kernel TCPIP trace as described in that post (option 4), we managed to catch the UDP packet seen in the network trace above (in fact, the network trace above and the kernel TCPIP trace below were collected together). As an example:
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.023.098.097, 010.001.001.001, 2762, 36003, 0, 0
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.027.153.050, 172.023.021.024, 6004, 36009, 0, 0
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.028.097.111, 172.023.021.024, 2344, 36016, 0, 0
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.027.102.056, 172.023.021.024, 1116, 36022, 0, 0
- For example, in the first line, 10.1.1.1 (the Exchange server) is sending a UDP packet to 172.23.98.97. The packet length is 8 bytes, the source UDP port is 36003, and the destination UDP port is 2762. The process ID sending the UDP packet is 2136; in fact, the process ID is 2136 in all such UDP packets.
- That first line, taken from the kernel trace, corresponds to packet #105540 in the network trace.
=> After checking the “tasklist /SVC” output, we saw that process ID 2136 was store.exe (the Exchange Information Store process):
wmiprvse.exe 5176 Console 0 2,168 K
mad.exe 7176 Console 0 45,792 K
AntigenStore.exe 10092 Console 0 200 K
store.exe 2136 Console 0 1,040,592 K
emsmta.exe 12020 Console 0 29,092 K
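If you save the kernel trace as text, correlating an event with the tasklist output can be scripted. A quick sketch (the field positions are inferred from the sample lines above; the PID-to-name table is taken from the tasklist output):

```python
# One UdpIp Send event from the kernel TCPIP trace (copied from above)
line = ("UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, "
        "172.023.098.097, 010.001.001.001, 2762, 36003, 0, 0")

fields = [f.strip() for f in line.split(",")]
pid, size = int(fields[6]), int(fields[7])          # sending PID, payload bytes
dst_port, src_port = int(fields[10]), int(fields[11])

# PID -> image name, as reported by "tasklist /SVC"
tasklist = {5176: "wmiprvse.exe", 7176: "mad.exe", 2136: "store.exe"}
print(tasklist[pid], size, src_port, dst_port)  # store.exe 8 36003 2762
```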
=> After further investigation on the Exchange side with the help of an Exchange expert, we found out that this traffic was expected and was used as an e-mail notification mechanism:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;811061 XCCC: Exchange Clients Do Not Receive "New Mail" Notification Messages
The Information Store process (Store.exe) sends a User Datagram Protocol (UDP) packet for new mail notifications. However, because the Store process does not run on an Exchange virtual server but on the cluster node, the UDP packet is sent from the IP address of that node. If you fail over the cluster node, the data and Exchange 2000 Server virtual server configuration are moved to the Store process that is running on the other cluster server node. New mail notifications are sent from the IP address of that second cluster node.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
b) Exchange servers sending 1-byte pings to DCs
One of our customers reported that their DCs were receiving constant ICMP Echo requests from a number of member servers, and they wanted help finding the process behind them, again because of security concerns. After some analysis and testing with the help of an Exchange expert colleague of mine, we found out that those ICMP Echo requests were sent by Exchange-related services. The ICMP Echo requests have the following characteristics:
=> The payload is always 1 byte
=> The payload byte itself is “3F”
Those ICMP Echo requests cease once the Exchange-related services are stopped, which is another indication. This behavior is partly explained in the following article:
http://support.microsoft.com/kb/270836 Exchange Server static port mappings
Taken from the article:
Note In a perimeter network firewall scenario, there is no Internet Control Message Protocol (ICMP) connectivity between the Exchange server and the domain controllers. By default, Directory Access (DSAccess) uses ICMP to ping each server to which it connects to determine whether the server is available. When there is no ICMP connectivity, Directory Access responds as if every domain controller were unavailable
=> A sample network trace collected on an Exchange server showed the same 1-byte ICMP Echo requests being sent to the DCs.
In today’s blog post, I’m going to talk about how to troubleshoot a WAN performance problem by using simultaneous network traces.
One of our customers reported that they were having performance problems when retrieving data from a SQL server running in a remote subnet. Interestingly, the problem disappeared when the client and SQL server were running in the same subnet :) Even though that test made it obvious that the WAN connectivity had something to do with the problem, our customer wanted to make sure that there were no issues on the client or server side.
In such problem scenarios, the first thing we ask for is simultaneous network traces collected while reproducing the problem. Looking at a trace from only one side can yield incorrect results and send you looking in the wrong direction. From that perspective, simultaneous network traces are a must.
Provided that simultaneous network traces are collected, how are you going to troubleshoot this problem? Well, there could be a number of approaches but I generally use the following method:
a) Find the same session (most likely a TCP or UDP session) in both traces. This may be harder than expected, especially if there's NAT between the client and server; otherwise it should be easy.
b) After correctly identifying the same session on both sides, your next task is to find the packet delays in the traces. The most practical way to do so is to sort the packets by time delta (the time between two packets displayed back to back in a network trace).
c) Then identify the highest time deltas within each trace. If the problem is a WAN delay, it will generally look like this:
=> At the client side trace, most of the delays are seen in packets that are sent by the server side
=> At the server side trace, most of the delays are seen in packets that are sent by the client side.
In other words, from the client's perspective it's the Server that's causing the delays (and hence the performance problem), while from the server's perspective it's the Client. Actually, both sides are WRONG! The delay stems from the WAN :) I think this also underlines the importance of collecting simultaneous network traces in such scenarios; otherwise you'll reach the wrong conclusion.
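To make the point concrete, here is a toy model of a client-side trace where the only real delay is the WAN (the numbers are made up, in the spirit of the 0.08-0.09 second deltas discussed in this post):

```python
from collections import defaultdict

# (time delta before this packet, side that sent it) as seen in ONE trace
client_side_trace = [
    (0.000, "client"), (0.089, "server"),   # request, then delayed response
    (0.001, "client"), (0.091, "server"),
    (0.002, "client"), (0.088, "server"),
]

# Attribute each wait to the side whose packet we were waiting for
waiting_on = defaultdict(float)
for delta, sender in client_side_trace:
    waiting_on[sender] += delta

# Seen from the client side, all the waiting precedes *server* packets --
# which does NOT mean the server itself is slow.
print(dict(waiting_on))
```

Run the same tally on the server-side trace and the waits flip to the client's packets, which is exactly the "both sides blame each other" symptom described above.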
Well, now let’s take a look at an example to better understand this:
EXAMPLE:
The client is having performance problems retrieving data from the SQL server, and this is only visible when the client and server are in different subnets. There's also NAT (network address translation) between the client and server. We have the IP address information and simultaneous network traces collected while reproducing the problem:
A. IP address information:
=> From client side perspective:
Client IP: 192.168.1.66
Server IP: 10.1.1.7
=> From server side perspective:
Client IP: 172.16.5.14
Server IP: 192.168.200.148
Note: IP address information is changed deliberately for data privacy
B. The session in which we see the performance problem:
Since this was a SQL Server connectivity problem, it was obvious that the client should be contacting the SQL server at TCP port 1433 (assuming the default configuration hadn't been changed). After a few checks I found the following session in the simultaneous network traces:
Wireshark filter for Client side session:
(ip.addr eq 192.168.1.66 and ip.addr eq 10.1.1.7) and (tcp.port eq 46739 and tcp.port eq 1433)
Wireshark filter for the same session seen at Server side trace:
(ip.addr eq 172.16.5.14 and ip.addr eq 192.168.200.148) and (tcp.port eq 63584 and tcp.port eq 1433)
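As a side note, the same display filter can be applied from the command line with tshark to surface the largest time deltas directly (a sketch; `client.pcap` is a hypothetical file name for the client-side capture):

```shell
# Print frame number, time delta and source IP for the filtered session,
# then sort by time delta to list the biggest gaps first
tshark -r client.pcap \
  -Y '(ip.addr eq 192.168.1.66 and ip.addr eq 10.1.1.7) and (tcp.port eq 46739 and tcp.port eq 1433)' \
  -T fields -e frame.number -e frame.time_delta_displayed -e ip.src |
  sort -t $'\t' -k2 -rn | head
```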
Now let’s look at how the communication is seen from the client perspective and server perspective:
C. From client perspective:
406 20.541677 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=11322755 TSER=0 WS=6
407 0.089037 10.1.1.7 192.168.1.66 TCP 1433 > 46739 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1380 WS=8 TSV=359429749 TSER=11322755
408 0.000036 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=11322777 TSER=359429749
409 0.000142 192.168.1.66 10.1.1.7 TDS TDS7/8 0x12 Packet
410 0.089013 10.1.1.7 192.168.1.66 TDS Response Packet
411 0.000058 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [ACK] Seq=27 Ack=27 Win=5888 Len=0 TSV=11322799 TSER=359429758
412 0.000861 192.168.1.66 10.1.1.7 TDS TDS7/8 0x12 Packet
413 0.091710 10.1.1.7 192.168.1.66 TDS TDS7/8 0x12 Packet
414 0.006916 192.168.1.66 10.1.1.7 TDS TDS7/8 0x12 Packet
415 0.093413 10.1.1.7 192.168.1.66 TDS TDS7/8 0x12 Packet
416 0.002583 192.168.1.66 10.1.1.7 TDS Unknown Packet Type: 23[Unreassembled Packet]
425 0.091253 10.1.1.7 192.168.1.66 TDS Response Packet[Unreassembled Packet]
426 0.000434 192.168.1.66 10.1.1.7 TDS Query Packet
427 0.089305 10.1.1.7 192.168.1.66 TDS Response Packet
428 0.000330 192.168.1.66 10.1.1.7 TDS Query Packet
429 0.088216 10.1.1.7 192.168.1.66 TDS Response Packet[Unreassembled Packet]
431 0.040908 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [ACK] Seq=1040 Ack=1372 Win=9280 Len=0 TSV=11322926 TSER=359429804
432 0.174717 192.168.1.66 10.1.1.7 TDS Remote Procedure Call Packet
433 0.103898 10.1.1.7 192.168.1.66 TDS Response Packet[Unreassembled Packet]
434 0.000056 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [ACK] Seq=1947 Ack=1875 Win=10944 Len=0 TSV=11322995 TSER=359429835
467 0.062827 192.168.1.66 10.1.1.7 TDS Remote Procedure Call Packet[Unreassembled Packet]
473 0.091735 10.1.1.7 192.168.1.66 TDS Response Packet[Unreassembled Packet]
474 0.000061 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [ACK] Seq=1992 Ack=1901 Win=10944 Len=0 TSV=11323034 TSER=359429852
475 0.008514 192.168.1.66 10.1.1.7 TDS Remote Procedure Call Packet
476 0.093648 10.1.1.7 192.168.1.66 TDS Response Packet[Unreassembled Packet]
477 0.001301 192.168.1.66 10.1.1.7 TDS Remote Procedure Call Packet[Unreassembled Packet]
478 0.087454 10.1.1.7 192.168.1.66 TDS Response Packet[Unreassembled Packet]
479 0.039780 192.168.1.66 10.1.1.7 TCP 46739 > 1433 [ACK] Seq=2944 Ack=2399 Win=12608 Len=0 TSV=11323092 TSER=359429871
From the client's perspective, responses from the SQL server side arrive with a delay of approximately 0.08 seconds, while the client sends its next query without waiting...
D. From server perspective:
44 21.123385 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [SYN] Seq=0 Win=5840 Len=0 MSS=1380 TSV=11322755 TSER=0 WS=6
45 0.000039 192.168.200.148 172.16.5.14 TCP 1433 > 63584 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8 TSV=359429749 TSER=11322755
46 0.089015 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=11322777 TSER=359429749
47 0.000461 172.16.5.14 192.168.200.148 TDS TDS7/8 0x12 Packet
48 0.000117 192.168.200.148 172.16.5.14 TDS Response Packet
49 0.087810 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=27 Ack=27 Win=5888 Len=0 TSV=11322799 TSER=359429758
50 0.001553 172.16.5.14 192.168.200.148 TDS TDS7/8 0x12 Packet
51 0.000427 192.168.200.148 172.16.5.14 TDS TDS7/8 0x12 Packet
52 0.099177 172.16.5.14 192.168.200.148 TDS TDS7/8 0x12 Packet
53 0.004255 192.168.200.148 172.16.5.14 TDS TDS7/8 0x12 Packet
54 0.091772 172.16.5.14 192.168.200.148 TDS Unknown Packet Type: 23[Unreassembled Packet]
55 0.001104 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
56 0.089833 172.16.5.14 192.168.200.148 TDS Query Packet
57 0.000183 192.168.200.148 172.16.5.14 TDS Response Packet
58 0.089154 172.16.5.14 192.168.200.148 TDS Query Packet
59 0.000150 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
60 0.128460 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=1040 Ack=1372 Win=9280 Len=0 TSV=11322926 TSER=359429804
61 0.178927 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet
62 0.001507 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
63 0.098217 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=1947 Ack=1875 Win=10944 Len=0 TSV=11322995 TSER=359429835
64 0.066525 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
65 0.000168 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
66 0.089069 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=1992 Ack=1901 Win=10944 Len=0 TSV=11323034 TSER=359429852
67 0.011578 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet
68 0.001386 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
70 0.089248 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
71 0.000147 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
72 0.129948 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=2944 Ack=2399 Win=12608 Len=0 TSV=11323092 TSER=359429871
73 0.002327 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet
74 0.001011 192.168.200.148 172.16.5.14 TDS Response Packet
75 0.088519 172.16.5.14 192.168.200.148 TCP 63584 > 1433 [ACK] Seq=3323 Ack=2497 Win=12608 Len=0 TSV=11323115 TSER=359429884
76 0.003928 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
77 0.000144 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
78 0.097206 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet
79 0.001297 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
80 0.089896 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
81 0.000147 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
82 0.099239 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
83 0.002980 172.16.5.14 192.168.200.148 TDS Unknown Packet Type: 0 (Not last buffer)[Unreassembled Packet]
84 0.000025 192.168.200.148 172.16.5.14 TCP 1433 > 63584 [ACK] Seq=2869 Ack=6121 Win=65536 Len=0 TSV=359429923 TSER=11323187
85 0.000272 192.168.200.148 172.16.5.14 TDS Response Packet
86 0.091009 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
87 0.000145 192.168.200.148 172.16.5.14 TDS Response Packet[Unreassembled Packet]
88 0.125878 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
89 0.002955 172.16.5.14 192.168.200.148 TDS Unknown Packet Type: 0 (Not last buffer)[Unreassembled Packet]
90 0.000022 192.168.200.148 172.16.5.14 TCP 1433 > 63584 [ACK] Seq=3783 Ack=8147 Win=65536 Len=0 TSV=359429945 TSER=11323242
91 0.000233 192.168.200.148 172.16.5.14 TDS Response Packet
92 0.095498 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
93 0.000187 192.168.200.148 172.16.5.14TDS Response Packet[Unreassembled Packet]
94 0.111501 172.16.5.14 192.168.200.148 TDS Remote Procedure Call Packet[Unreassembled Packet]
95 0.005919 172.16.5.14 192.168.200.148 TDS Unknown Packet Type: 0 (Not last buffer)[Unreassembled Packet]
96 0.000019 192.168.200.148 172.16.5.14TCP 1433 > 63584 [ACK] Seq=4760 Ack=10928 Win=65536
When checking the trace from the server's perspective, the SQL server almost always returns TDS responses immediately, but the TDS requests from the client arrive at the SQL server with an average delay of about 0.08 seconds.
1) From client’s perspective, SQL server is slow in sending TDS responses
2) From SQL server’s perspective, client is slow in sending TDS requests.
3) The real problem was that the TDS performance issue stemmed from WAN connectivity delay (link delay, network device delays, etc.). In this instance the average WAN delay was about 80 milliseconds (0.08 seconds). Even though that looks like a small number at first glance, consider that you may be exposed to that small delay, say, 100 times while making a transaction; it could easily become an issue (around 8 seconds in total). So never underestimate such small values. :)
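The arithmetic above can be sketched quickly. This is just illustrative back-of-the-envelope math (the 100-round-trip transaction is a hypothetical example, not taken from the trace):

```python
# For chatty request/response protocols like TDS, each request waits for the
# previous response, so per-round-trip delay adds up linearly.
def transaction_time(round_trips: int, wan_delay_s: float, server_time_s: float = 0.0) -> float:
    """Total elapsed time for a serialized request/response exchange."""
    return round_trips * (wan_delay_s + server_time_s)

lan = transaction_time(round_trips=100, wan_delay_s=0.001)  # ~1 ms per round trip on a LAN
wan = transaction_time(round_trips=100, wan_delay_s=0.080)  # ~80 ms per round trip over the WAN

print(f"LAN: {lan:.1f} s, WAN: {wan:.1f} s")  # LAN: 0.1 s, WAN: 8.0 s
```

The same application code that feels instant on a LAN can thus take seconds over a WAN without a single packet being dropped.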
In this blog post, I would like to talk about a network trace analysis scenario where we were requested to find out if excessive ARP requests were legitimate or not.
Here are more details about the problem:
Problem description:
==================
Some Windows Vista clients experience network connectivity problems. After a client gets connected to the network, it is disconnected after a short while, like 1 minute or so. After some initial troubleshooting, it was found that the connectivity problem occurred because the switch port was disabled. Our customer's network team stated that the switch port was disabled because excessive ARP requests were received from the client in a very short amount of time (like 10+ ARP requests within 10 milliseconds from the same source MAC).
How we did the troubleshooting:
=============================
We collected a network trace from one of those client machines to better understand the root cause of the problem. But the problem was that we were not able to see what the client would do after resolving those MAC addresses via ARP, because the switch was shutting down the port due to the mentioned security setting.
It's also very difficult to associate an ARP packet with a process on a Windows machine by using the methods mentioned in my earlier post, so I decided to use method 5 mentioned in the same post: I requested that the customer disable that security setting on a switch port where the problem was seen, so that we would be able to see what the client would do after resolving the MAC addresses via ARP. We collected a new network trace after making that configuration change and got to the bottom of the problem afterwards:
a) In the new network trace, it was clearly seen again that the client was sending excessive ARP requests (20 ARP requests sent for different IP addresses within the same millisecond!)
Wireshark filter:
arp.src.proto_ipv4 == 10.0.80.1 and arp.opcode==0x1
326437 30.192068 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.202? Tell 10.0.80.1
326438 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.165? Tell 10.0.80.1
326440 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.83? Tell 10.0.80.1
326441 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.59? Tell 10.0.80.1
326442 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.220? Tell 10.0.80.1
326443 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.215? Tell 10.0.80.1
326451 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.28? Tell 10.0.80.1
326453 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.60? Tell 10.0.80.1
326454 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.127? Tell 10.0.80.1
326455 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.81? Tell 10.0.80.1
326458 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.29? Tell 10.0.80.1
326462 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.98? Tell 10.0.80.1
326465 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.65? Tell 10.0.80.1
326467 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.129? Tell 10.0.80.1
326468 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.37? Tell 10.0.80.1
326470 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.239? Tell 10.0.80.1
326471 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.15? Tell 10.0.80.1
326472 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.26? Tell 10.0.80.1
326473 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.235? Tell 10.0.80.1
326475 0.000000 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.77? Tell 10.0.80.1
- There are 20 ARP requests sent in a row within the same millisecond.
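A burst like this can also be flagged programmatically. Below is a minimal sketch that applies the same kind of rule the switch used (more than N ARP requests from one source MAC within a short window); it assumes you have already exported (timestamp, source MAC) pairs for the ARP requests from the trace, rather than using any particular capture library:

```python
from collections import defaultdict, deque

def find_arp_bursts(packets, max_requests=10, window_s=0.010):
    """packets: iterable of (timestamp_seconds, source_mac) for ARP requests.
    Returns the MACs that sent more than max_requests within any window_s span."""
    recent = defaultdict(deque)   # source MAC -> timestamps still inside the window
    offenders = set()
    for ts, mac in sorted(packets):
        q = recent[mac]
        q.append(ts)
        while q and ts - q[0] > window_s:   # drop timestamps that fell out of the window
            q.popleft()
        if len(q) > max_requests:
            offenders.add(mac)
    return offenders

# 20 ARP requests within the same millisecond from one MAC -> flagged
burst = [(30.192068, "hp:10:11:91")] * 20
print(find_arp_bursts(burst))  # {'hp:10:11:91'}
```

The MAC string above is shortened for illustration; in practice you would feed in the full source MAC from each ARP frame.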
b) Then I decided to check a few actions that the client would take after resolving the MAC addresses. I started with the first one (10.0.80.202):
((arp.src.proto_ipv4==10.0.80.202 and arp.dst.proto_ipv4==10.0.80.1) or (arp.src.proto_ipv4==10.0.80.1 and arp.dst.proto_ipv4==10.0.80.202)) or (ip.addr==10.0.80.202 and ip.addr==10.0.80.1)
=> The client first resolves the MAC address of 10.0.80.202 about 2 minutes before the problem:
62994 2.518411 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.202? Tell 10.0.80.1
62996 0.000000 HewlettP_11:21:6b HewlettP_10:11:91 ARP 10.0.80.202 is at 00:23:7d:13:2f:6b
=> Then the client establishes a TCP session to TCP port 5357 (wsdapi) and exchanges some data with that system:
62997 0.000000 10.0.80.1 10.0.80.202 TCP 59327 > wsdapi [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=2
63000 0.000000 10.0.80.202 10.0.80.1 TCP wsdapi > 59327 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8
63001 0.000000 10.0.80.1 10.0.80.202 TCP 59327 > wsdapi [ACK] Seq=1 Ack=1 Win=65700 Len=0
63008 0.000000 10.0.80.1 10.0.80.202 TCP 59327 > wsdapi [PSH, ACK] Seq=1 Ack=1 Win=65700 Len=223
63009 0.000000 10.0.80.1 10.0.80.202 TCP 59327 > wsdapi [PSH, ACK] Seq=224 Ack=1 Win=65700 Len=658
63013 0.000000 10.0.80.202 10.0.80.1 TCP wsdapi > 59327 [ACK] Seq=1 Ack=882 Win=65536 Len=0
63133 0.046800 10.0.80.202 10.0.80.1 TCP wsdapi > 59327 [ACK] Seq=1 Ack=882 Win=65536 Len=1460
63134 0.000000 10.0.80.202 10.0.80.1 TCP wsdapi > 59327 [PSH, ACK] Seq=1461 Ack=882 Win=65536 Len=908
63136 0.000000 10.0.80.1 10.0.80.202 TCP 59327 > wsdapi [ACK] Seq=882 Ack=2369 Win=65700 Len=0
=> Then the client stays idle in that TCP session for about 115 seconds, and then it resolves the MAC address of 10.0.80.202 once more (most likely the ARP cache entry had been deleted, so the TCP/IP stack needed to re-resolve the MAC address):
326437 115.073211 HewlettP_10:11:91 Broadcast ARP Who has 10.0.80.202? Tell 10.0.80.1
326483 0.000000 HewlettP_11:21:6b HewlettP_10:11:91 ARP 10.0.80.202 is at 00:23:7d:13:2f:6b
=> Right after resolving the MAC address of the target system once more, the client immediately forcefully closes the TCP session with a TCP RST segment:
326484 0.000000 10.0.80.1 10.0.80.202 TCP 59327 > wsdapi [RST, ACK] Seq=882 Ack=2369 Win=0 Len=0
c) After doing the same check for the other IP addresses whose MAC addresses the client was trying to resolve, I saw that the behavior was the same: the client was connecting to another target at TCP port 5357, exchanging some data with that system and, after staying idle for about 2 minutes, re-resolving the MAC address and forcefully closing the TCP session with a TCP RST.
SUMMARY:
=========
1) After analyzing the network trace as described above, it was much clearer why the client was sending so many ARP requests within the same millisecond to various other targets:
To forcefully close its existing TCP sessions established to TCP port 5357 on those target machines.
2) The next question was whether that activity was legitimate or not. After doing some more research on the related protocol, I came to the following conclusions:
The activity on these TCP sessions (which were forcefully closed by the client) was generated by the WSDAPI protocol. The data within the TCP segments also confirmed that the activity was legitimate (meaningful XML requests and responses were seen in the traces, as documented below):
http://msdn.microsoft.com/en-us/library/ee886312(VS.85).aspx Get (Metadata Exchange) HTTP Request and Message
http://msdn.microsoft.com/en-us/library/ee886311(VS.85).aspx GetResponse (Metadata Exchange) Message
This protocol seems to be used for discovering printers, Web cameras, video systems and manipulating them remotely. More details could be seen at the following articles:
http://msdn.microsoft.com/en-us/library/aa385800(VS.85).aspx About Web Services on Devices
http://blogs.msdn.com/dandris/archive/2007/11/09/wsdapi-101-part-1-web-services-from-10-000-feet.aspx
3) Even though it wasn't entirely clear why the client was closing many WSDAPI sessions at the same time, it was now clear that this was legitimate activity, and our customer wanted to take it further from that point on.
In this blog post, I would like to talk about some facts about network traces. A while ago we were discussing how much we can depend on network trace data when it comes to these questions: “The outgoing packet that we see in a network trace, did it really hit the wire?” and “If we don't see a packet that was supposed to be received by a system in a network trace collected on that system, does it really mean that the machine didn't physically receive the packet?”
Before going to answers, it would be better to clarify the network trace capturing architecture so that you can give the answers for yourself:
NETWORK STACK VIEW IN GENERAL:

Upper layer stacks (Winsock etc.)
==================================
NDIS Protocol drivers (Example: TCPIP)
-------------------------------------------------------
NDIS Intermediate (NDIS 5.x) or NDIS Lightweight filter drivers (NDIS 6.x)
(Examples: NLB / QoS / Teaming / 802.1q tagging / packet capturing agent / NDIS level firewalls etc.)
-------------------------------------------------------
NDIS Miniport drivers (Example: NIC drivers)
-------------------------------------------------------
Hardware NIC
|
----------------- wire (wired/wireless)
Note: The Microsoft Network Monitor capturing agent runs as an NDIS lightweight filter driver from Windows Vista onwards.
Note: Wireshark capturing agent (NPF/WinPcap) runs as an NDIS protocol driver. You can find more information at the below link:
http://www.winpcap.org/docs/docs_411/html/group__NPF.html
Well, after taking a look at the above architecture, I think we can easily answer the questions I asked at the beginning of the post (and also some additional ones). Maybe we need to add one more thing to fully explain this: the order of the NDIS intermediate or NDIS lightweight filter drivers. This is actually not an easy one to answer, but we can say from experience that NLB, teaming, NDIS-level firewall drivers and the like generally run below the packet capturing agent in the Network Monitor scenario. For Wireshark, it looks like all of those run below the capturing engine (WinPcap), since WinPcap runs as an NDIS protocol driver.
QUESTIONS & ANSWERS
Q1. The outgoing packet that we see in a network trace, did it really hit the wire?
A1. No, we cannot say it for sure. After the capturing engine got a copy of the packet, the packet could have been dropped by a lower layer NDIS intermediate or lightweight filter driver like teaming, NDIS level firewall, NIC driver etc...
Q2. Not seeing a packet in a network trace that was supposed to be received by that system, does it really mean that the machine didn’t physically receive the packet?
A2. No, we cannot say for sure. The packet might have been dropped by a lower-level NDIS miniport driver or NDIS intermediate/lightweight filter driver before it reached the capturing engine on the same box.
(Example: on an NLB node, you won't see packets that the node isn't supposed to handle; a lower-level NDIS driver drops them before the capturing engine ever sees them.)
Q3. Seeing a packet received in a network trace, does it really mean that the upper layers (like TCPIP/Winsock/Application/etc) on this machine has *seen* that packet?
A3. No, we cannot say for sure. The packet might have been dropped by an upper-layer NDIS intermediate/lightweight filter driver, or by an upper-layer firewall or similar security software (intrusion detection, URL filtering, etc.).
(Examples: you see incoming TCP SYNs in a network trace, but you don't see the TCP SYN ACKs being sent in response. This generally stems from a firewall or similar security driver running somewhere above the capturing engine; since the packet is dropped before the TCPIP driver has a chance to process it, no response is sent back to the connecting party. Another example is a higher-level filter, such as a Winsock-level filter, that prevents the request from being seen by the application.)
Q4. I see that there’re TCP segments or IP datagrams with incorrect checksums, does that really mean the packet is problematic and causing the performance/connectivity issue that I see?
A4. It depends. If you see those checksum errors for packets sent by A to B, the trace was collected on A, and TCP/IP checksum offloading is enabled (the default for many NICs), then it doesn't indicate a problem. Maybe we should also explain why we see it that way:
TCP/IP checksum offloading is a performance feature supported by many NICs. With this feature, the TCPIP driver passes the packet down to the NIC driver without calculating the checksums for the IP or TCP headers. Since the capturing engine *sees* the packet before the NIC driver does, it concludes that the checksum fields are incorrect, which is normal because those checksums will be calculated by the NIC before the packet hits the wire. In other situations, an incorrect checksum would indicate a problem.
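For reference, the checksum that the NIC fills in later is the standard Internet checksum (RFC 1071): the one's complement of the one's-complement sum of the data's 16-bit words. A minimal sketch in Python, verified against a well-known sample IPv4 header:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum over the given bytes (e.g. an IP header
    with its checksum field zeroed, as the packet looks before offload runs)."""
    if len(data) % 2:                     # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

# A received header verifies correctly when summing it *including* its
# checksum field yields 0 (sample 20-byte IPv4 header, checksum 0xb1e6):
header = bytes.fromhex("4500003c1c4640004006b1e6ac100a63ac100a0c")
assert internet_checksum(header) == 0
```

A capture taken above the NIC sees this field still zeroed (or stale), which is exactly why the analyzer flags it as incorrect even though the packet on the wire is fine.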
Q5. I sometimes see packets whose payloads are greater than standard Ethernet MTU in a network trace. How could that be possible?
A5. Some NIC drivers support LSO (Large Send Offload), which is another optimization. Large Send Offload allows the TCPIP driver to offload TCP data segmentation to the NIC driver to accelerate the process. Also please note that in some instances, generally due to NIC driver issues, LSO might cause performance problems. You can find more information about this feature at the following link:
http://www.microsoft.com/whdc/device/network/taskoffload.mspx Windows Network Task Offload
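As a rough illustration of what LSO saves: without offload, the TCPIP driver itself has to carve a large send into MSS-sized segments; with LSO it hands the whole buffer down once and the NIC does the carving. The segmentation arithmetic is simple (the 61440-byte payload below is just an example):

```python
def segment_count(payload_bytes: int, mss: int = 1460) -> int:
    """Number of MSS-sized TCP segments needed to carry a payload
    (1460 is the usual MSS on standard Ethernet: 1500 MTU - 40 bytes of headers)."""
    return -(-payload_bytes // mss)   # ceiling division

# A single 61440-byte send becomes 43 segments on the wire.
print(segment_count(61440))  # 43
```

This is also why a trace taken on the sending host can show "packets" far larger than the Ethernet MTU: the capturing engine sees the one big buffer before the NIC segments it.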
Hope this helps.
One of the most common reasons for network connectivity or performance problems is packet drop. In this blog post, I’ll be talking about analyzing a packet drop issue, please read on.
One of our customers was complaining about remote SCCM agent policy updates, and a network packet drop issue was suspected. We were then brought in to analyze the problem from a networking perspective. Generally such problems stem from the following points:
a) A problem on the source client (generally at NDIS layer or below). For additional information please see a previous blog post.
b) A problem stemming from network itself (links, firewalls, routers, proxy devices, encryption devices, switches etc). This is the most common problem point in such cases.
c) A problem on the target server (generally at NDIS layer or below). For additional information please see a previous blog post.
In packet drop issues, the most important logs are simultaneous network traces collected on the source and target systems.
You’ll find below more details about how we got to the bottom of the problem:
NETWORK TRACE ANALYSIS:
=======================
For a successful network trace analysis, you need to be familiar with the technology that you're troubleshooting. At the very least, you should have some prior knowledge of the network activity that the related action would generate (for this example, how the SCCM agent retrieves policies from the SCCM server in general). In this instance, after a short discussion with an SCCM colleague of mine, I learned that the SCCM agent sends an HTTP POST request to the SCCM server to retrieve the policies. I analyzed the network traces in light of this fact:
How we see the problematic session at SCCM agent side trace (10.1.1.1):
1917 104.968750 10.1.1.1 172.16.1.1 TCP rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1918 105.000000 172.16.1.1 10.1.1.1 TCP http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
1919 105.000000 10.1.1.1 172.16.1.1 TCP rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
1920 105.000000 10.1.1.1 172.16.1.1 HTTP CCM_POST /ccm_system/request HTTP/1.1
1921 105.000000 10.1.1.1 172.16.1.1 HTTP Continuation or non-HTTP traffic
1922 105.000000 10.1.1.1 172.16.1.1 HTTP Continuation or non-HTTP traffic
1925 105.140625 172.16.1.1 10.1.1.1 TCP http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 Len=0 SLE=1742 SRE=2328
1975 107.750000 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic
2071 112.890625 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic
2264 123.281250 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic
2651 144.171875 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic
3475 185.843750 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic
4392 234.937500 172.16.1.1 10.1.1.1 TCP http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0
How we see the problematic session at SCCM Server side trace (172.16.1.1):
13587 100.765625 10.1.1.1 172.16.1.1 TCP rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460
13588 0.000000 172.16.1.1 10.1.1.1 TCP http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
13591 0.031250 10.1.1.1 172.16.1.1 TCP rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 Len=0
13598 0.031250 10.1.1.1 172.16.1.1 HTTP CCM_POST /ccm_system/request HTTP/1.1
13611 0.078125 10.1.1.1 172.16.1.1 HTTP [TCP Previous segment lost] Continuation or non-HTTP traffic
13612 0.000000 172.16.1.1 10.1.1.1 TCP http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 [TCP CHECKSUM INCORRECT] Len=0 SLE=1742 SRE=2328
30509 129.812500 172.16.1.1 10.1.1.1 TCP http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0
Explanation on color coding:
a) GREEN packets are the ones that we both see at client and server side traces. In other words, those are packets that were either sent by the client and successfully received by the server or sent by the server and successfully received by the client.
b) RED packets are the ones that were sent by the client but not received by the server. To further provide details on this part:
- The frame #1921 that we see at the client side trace (which is a continuation packet for the HTTP POST request) is not visible at the server side trace
- The frames #1975, #2071, #2264, #2651 and #3475 are retransmissions of frame #1921. We cannot see any of those 5 retransmissions of the same original TCP segment at the server side trace, which means that most likely all of those retransmitted segments were lost on the way to the server, even though there's still a minor chance that they were physically received by the server but dropped by an NDIS-level driver (see this post for more details).
The server side forcefully closes the session after approximately 2 minutes (client side frame #4392 and server side frame #30509).
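One more telltale sign in the client side trace is the spacing of those retransmissions: each wait is roughly double the previous one, which is exactly the exponential backoff TCP applies to its retransmission timeout (RTO) when no ACK comes back. A small sketch using the timestamps from the trace above:

```python
# Timestamps of the original segment and its retransmissions, taken from the
# client-side trace above (frames 1920, 1975, 2071, 2264, 2651, 3475).
sends = [105.000000, 107.750000, 112.890625, 123.281250, 144.171875, 185.843750]

gaps = [b - a for a, b in zip(sends, sends[1:])]
print([round(g, 2) for g in gaps])          # [2.75, 5.14, 10.39, 20.89, 41.67]

# Each gap is ~2x the previous one: classic TCP RTO exponential backoff,
# strong evidence that the segment (and every retry) never got an ACK back.
ratios = [b / a for a, b in zip(gaps, gaps[1:])]
assert all(1.8 < r < 2.2 for r in ratios)
```

Seeing this doubling pattern in a trace is a quick way to spot a one-way packet drop even before you have the trace from the other side.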
1) The simultaneous network trace analysis better indicated that this was a packet drop issue.
2) The problem most likely stems from one of these points:
a) An NDIS layer or below problem at the SCCM agent side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)
b) A network related problem (link or network device problem like router/switch/firewall/proxy issue)
c) An NDIS layer or below problem at the SCCM server side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)
3) In such scenarios we may consider doing the following at the client and server side:
a) Updating NIC/teaming drivers
b) Updating 3rd party filter drivers (AV/Firewall/Host IDS software/other kind of filtering devices) or temporarily uninstalling them for the duration of troubleshooting
4) Most of the time such problems stem from network itself. And you may consider taking the following actions on that:
a) Checking cabling/switch port
b) Checking port speed/duplex settings (you may try manually setting them to 100 Mbps / full duplex, for example)
c) Enabling logging on Router/Switch/firewall/proxy or similar devices and check interface statistics or check problematic sessions to see if there are any packet drops
d) Alternatively you may consider collecting network traces at more than 2 points (sending & receiving ends). For example, 4 network traces could be collected: at the source, at the target and at two intermediate points. So that you can follow the session and may isolate the problem to a certain part of your network.
For your information, the specific problem I mentioned in this blog post was stemming from a router running in between the SCCM agent and server.
In today’s blog post I’m going to talk about a network trace analysis scenario where I was requested to analyze a few network traces to understand why a server was trying to contact external web servers when sqllogship.exe was run on it.
Our customer's security team noticed that there were HTTP connection attempts coming from internal SQL servers, even though those servers weren't supposed to be connecting to any external servers. The only thing they were running was something like this:
"C:\Program Files\Microsoft SQL Server\90\Tools\Binn\sqllogship.exe" -Backup 1B55E77D-A000-1EE8-9780-441096E2151 -server PRODDB
And on every attempt, blocked HTTP connections were seen on the firewall. Since we didn't know what the server would do after establishing such an HTTP connection to an external network, we weren't able to comment much on this. I requested that our customer let the firewall allow such an HTTP connection so that we could get more information after the connection was established; this is a method (method 5) I mentioned in one of my earlier posts.
After our customer made that change and re-collected a network trace on the SQL server, it was much clearer why the SQL server was attempting to connect to a remote web server: to verify whether the certificate had been revoked, by downloading the CRL (certificate revocation list):
=> SQL server first resolves the IP address for the name: crl.microsoft.com
23519 2010-06-26 09:23:14.560786 10.11.1.11 10.1.1.1 DNS Standard query A crl.microsoft.com
23520 2010-06-26 09:23:14.561000 10.1.1.1 10.11.1.11 DNS Standard query response CNAME crl.www.ms.akadns.net
|-> crl.microsoft.com: type CNAME, class IN, cname crl.www.ms.akadns.net
|-> crl.www.ms.akadns.net: type CNAME, class IN, cname a1363.g.akamai.net
|-> a1363.g.akamai.net: type A, class IN, addr 193.45.15.18
|-> a1363.g.akamai.net: type A, class IN, addr 193.45.15.50
=> SQL server establishes a TCP session to port 80 at the remote web server running on 193.45.15.50:
69679 2010-06-26 09:24:37.466403 10.11.1.11 193.45.15.50 TCP 2316 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
69697 2010-06-26 09:24:37.554390 193.45.15.50 10.11.1.11 TCP 80 > 2316 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
69698 2010-06-26 09:24:37.554407 10.11.1.11 193.45.15.50 TCP 2316 > 80 [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
=> After the TCP 3-way handshake, the SQL server sends an HTTP GET request to the web server to retrieve the CSPCA.crl file
69699 2010-06-26 09:24:37.554603 10.11.1.11 193.45.15.50 HTTP GET /pki/crl/products/CSPCA.crl HTTP/1.1
|-> GET /pki/crl/products/CSPCA.crl HTTP/1.1\r\n
|-> User-Agent: Microsoft-CryptoAPI/5.131.3790.3959\r\n
|-> Host: crl.microsoft.com\r\n
69729 2010-06-26 09:24:37.642219 193.45.15.50 10.11.1.11 TCP 80 > 2316 [ACK] Seq=1 Ack=199 Win=6432 Len=0
69731 2010-06-26 09:24:37.645483 193.45.15.50 10.11.1.11 PKIX-CRL Certificate Revocation List
|-> HTTP/1.1 200 OK\r\n
|-> Certificate Revocation List
|-> signedCertificateList
|-> algorithmIdentifier (shaWithRSAEncryption)
Note: It looks like this is done due to the following: (Taken from http://support.microsoft.com/kb/944752)
“When the Microsoft .NET Framework 2.0 loads a managed assembly, the managed assembly calls the CryptoAPI function to verify the Authenticode signature on the assembly files to generate publisher evidence for the managed assembly.”
=> Similarly the server sends another HTTP GET request to retrieve CodeSignPCA.crl:
77631 2010-06-26 09:24:52.642968 10.11.1.11 193.45.15.50 HTTP GET /pki/crl/products/CodeSignPCA.crl HTTP/1.1
77747 2010-06-26 09:24:52.733106 193.45.15.50 10.11.1.11 PKIX-CRL Certificate Revocation List
78168 2010-06-26 09:24:53.011176 10.11.1.11 193.45.15.50 TCP 2316 > 80 [ACK] Seq=403 Ack=1961 Win=65535 [TCP CHECKSUM INCORRECT] Len=0
Note: Again it looks like this is done due to the following: (Taken from http://support.microsoft.com/kb/947988 You cannot install SQL Server 2005 Service Pack 1 on a SQL Server 2005 failover cluster if the failover cluster is behind a firewall)
“When the Microsoft .NET Framework starts SSIS, the .NET Framework calls the CryptoAPI function. This function determines whether the certificates that are signed to the SQL Server assembly files are revoked. The CryptoAPI function requires an Internet connection to check the following CRLs for these certificates:
http://crl.microsoft.com/pki/crl/products/CodeSignPCA.crl
http://crl.microsoft.com/pki/crl/products/CodeSignPCA2.crl”
It looks like there are a number of ways to prevent such CRL checks, like disabling “generatePublisherEvidence” or unchecking “Check for publisher's certificate revocation”, as explained in KB944752 and KB947988.
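For reference, the generatePublisherEvidence switch from KB944752 is an application configuration element. A sketch of what the change looks like (the element goes into the application's own .exe.config file; using sqllogship.exe.config here is an assumption for this example):

```xml
<!-- e.g. sqllogship.exe.config, next to the executable -->
<configuration>
    <runtime>
        <!-- Skip Authenticode publisher-evidence generation, which is what
             triggers the CRL download attempts described above -->
        <generatePublisherEvidence enabled="false"/>
    </runtime>
</configuration>
```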
In today’s blog post, I’m going to talk about some registry keys that were removed as of Windows 2008. Recently a colleague raised a customer question about configuring the following AFD related registry keys on Windows 2008:
DynamicBacklogGrowthDelta
EnableDynamicBacklog
MaximumDynamicBacklog
MinimumDynamicBacklog
Actually, our customer was trying to implement the settings mentioned in How To: Harden the TCP/IP Stack, but none of our documentation for Windows Vista, Windows 7, Windows 2008 and Windows 2008 R2 referred to such AFD-related registry keys.
A quick source code review revealed that those registry keys are no longer valid as of Windows 2008. They were mainly used to counter TCP SYN attacks at the Winsock layer on Windows 2003. Since SYN attack protection is built in on Windows Vista, 2008, 2008 R2 and Windows 7 (and cannot even be disabled; please see this blog post for more information on TCP SYN attack protection on Windows Vista/2008/2008 R2/7), it is no longer necessary to deal with SYN attacks at the Winsock layer, and as a result the logic and the registry keys were removed from the AFD driver.
As an additional note, I also wouldn't recommend implementing the EnablePMTUDiscovery registry key, which is mentioned in the same How To: Harden the TCP/IP Stack document, for the reasons mentioned in a previous blog post. Also, the SYN attack protection related registry keys mentioned in the same article don't apply to Windows Vista onwards.
In today’s blog post, I’m going to talk about a local resource access performance issue.
One of our customers reported that SQL server instances running on a Windows 2003 server were failing over to another node especially when CPU load on the server was high for a short period (like 5 seconds or so). After some research by our SQL expert, it was determined that the fail over occurred because the IsAlive() function implemented by the SQL server does a “SELECT @@SERVERNAME” T-SQL query to the local SQL server instance by using the local IP address and that query doesn’t return in a timely manner.
When the loopback interface was monitored during the problem time, it was also seen that the “Output Queue Length” for the MS TCP loopback interface was dramatically increasing, then dropping down to 0, and then dramatically increasing again.
While access to the local SQL instance was that slow, the same SQL instance could be accessed without issues from remote clients. That behavior suggested that there was a problem with loopback access.
RESOLUTION:
===========
One of our senior escalation engineers (thanks to PhilJ) mentioned that loopback access (accessing 127.0.0.1 or any other local IP address) simply queues up work items that are later processed by a function in the AFD driver. The work items queued that way are processed by kernel worker threads running a function provided by the AFD driver. Normally the system can create up to 16 such dynamic worker threads, and those worker threads run at a priority level of 12. If the system has higher-priority work and it isn't possible to create new worker threads, the problem can become visible as explained above.
There’s a way to make more delayed worker threads available initially which could be configured as follows:
HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Executive
AdditionalDelayedWorkerThreads (REG_DWORD)
You can find more information at the following link:
http://msdn.microsoft.com/en-us/library/ee377058(BTS.10).aspx Registry Settings that can be Modified to Improve Operating System Performance
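For reference, that registry value could be set from an elevated command prompt as follows (the value 16 below is just an example, not a recommendation from the article; a reboot is typically needed for it to take effect):

```shell
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Executive" /v AdditionalDelayedWorkerThreads /t REG_DWORD /d 16 /f
```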
Even though the AdditionalDelayedWorkerThreads registry key was set to a value higher than the default, the problem was still in place. Finally, it was decided to get a kernel dump to better understand what was going wrong with those delayed worker threads that made local access so slow. Before going to the conclusion, it's worth stating that this local resource access problem was not specific to SQL and could be experienced with any application locally accessing a resource (there were similar issues reported by other customers, like slow performance seen when local applications access local web services, databases, etc.).
Finally, a kernel dump revealed the real reason why the delayed worker threads couldn't catch up with the work items being posted:
A 3rd party firewall driver was posting many delayed work items, and the delayed worker threads processing those work items were all trying to synchronize on the same executive resource (which was apparently a contention point). Hence, the system was quickly hitting the ceiling for the number of worker threads, new items couldn't be processed in a timely manner, and local resource access became very slow. Here is a sample call stack for one of those delayed worker threads for your reference:
THREAD fffffaee4460aaf0 Cid 0004.0044 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (WrResource) KernelMode Non-Alertable
fffffadc3e9dc0a0 SynchronizationEvent
fffffadc44605ca8 NotificationTimer
Not impersonating
DeviceMap fffffa80000036d0
Owning Process fffffadc44622040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 294068 Ticks: 6 (0:00:00:00.093)
Context Switch Count 200681
UserTime 00:00:00.000
KernelTime 00:00:03.500
Start Address nt!ExpWorkerThread (0xfffff800010039f0)
Stack Init fffffadc1f1c1e00 Current fffffadc1f1c1950
Base fffffadc1f1c2000 Limit fffffadc1f1bc000 Call 0
Priority 13 BasePriority 12 PriorityDecrement 1
Child-SP RetAddr : Args to Child : Call Site
fffffadc`1f1c1990 fffff800`01027682 : fffffadc`1edcb910 fffffadc`1edeb180 00000000`0000000b fffffadc`1ed2b180 : nt!KiSwapContext+0x85
fffffadc`1f1c1b10 fffff800`0102828e : 0000000a`b306fa71 fffff800`011b4dc0 fffffadc`44605c88 fffffadc`44605bf0 : nt!KiSwapThread+0x3c9
fffffadc`1f1c1b70 fffff800`01047688 : 00000000`000000d4 fffffadc`0000001b fffffadc`1edeb100 fffffadc`1edeb100 : nt!KeWaitForSingleObject+0x5a6
fffffadc`1f1c1bf0 fffff800`01047709 : 00000000`00000000 fffffadc`167a6c70 fffffadf`fbc44a00 fffff800`01024d4a : nt!ExpWaitForResource+0x48
fffffadc`1f1c1c60 fffffadc`167a720b : fffffadc`3d41fc20 fffffadc`167a6c70 fffffadc`44605bf0 fffffadc`3d3deef8 : nt!ExAcquireResourceExclusiveLite+0x1ab
fffffadc`1f1c1c90 fffffadc`167a6c87 : 00000000`00000001 fffffadf`fbc44a00 fffff800`011cda18 fffffadc`44605bf0 : XYZ+0x1220b
fffffadc`1f1c1cd0 fffff800`010375ca : 00000000`00000000 fffffadc`1d6f7001 00000000`00000000 fffffadf`fbc44a00 : XYZ+0x11c87
fffffadc`1f1c1d00 fffff800`0124a972 : fffffadc`44605bf0 00000000`00000080 fffffadc`44605bf0 fffffadc`1edf3680 : nt!ExpWorkerThread+0x13b
fffffadc`1f1c1d70 fffff800`01020226 : fffffadc`1edeb180 fffffadc`44605bf0 fffffadc`1edf3680 00000000`00000000 : nt!PspSystemThreadStartup+0x3e
fffffadc`1f1c1dd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16
Note: The driver name was removed on purpose.
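To make the contention pattern above more concrete, here is a minimal Python sketch (illustrative only, this is obviously not kernel code): many worker threads pull work items from a queue, but every item has to be processed while holding one shared lock, analogous to the delayed worker threads all acquiring the same executive resource exclusively. Adding more threads gains nothing, because the lock serializes them:

```python
import threading
import queue

work_items = queue.Queue()
shared_resource = threading.Lock()  # stands in for the contended executive resource
processed = []

def worker():
    while True:
        item = work_items.get()
        if item is None:            # sentinel: no more work for this thread
            break
        with shared_resource:       # every work item funnels through this one
            processed.append(item)  # lock, so extra threads add no throughput

NUM_THREADS = 8
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for i in range(100):                # post 100 work items
    work_items.put(i)
for _ in threads:                   # one sentinel per worker thread
    work_items.put(None)
for t in threads:
    t.join()

print(len(processed))               # prints 100
```

In the real case the situation was worse than this sketch: blocked worker threads caused the kernel to keep spinning up new delayed worker threads until the pool limit was reached, at which point unrelated work items (such as local resource access) queued up behind the contention.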
After the problem with the 3rd party firewall driver was addressed, the issue was resolved. Thanks also to Ignacio J, who drove this case from a technical perspective and shared the resolution details with us.
I hope this helps with similar problems you encounter.
We are sometimes asked to find out which process sends a certain packet. This generally stems from security concerns (such as seeing TCP session attempts in firewall/IDS device logs). Even though it may not always be possible to tell which process sent a certain packet, most of the time we can find the process name without any live debugging.
So here are some methods to identify the process sending a certain packet out to network:
1) Network Monitor (For any OS version)
Network Monitor 3.x can show you the process name correctly most of the time (although there are some situations where it cannot identify the process).
Note: You can download the latest Network Monitor version at the following link: http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=983b941d-06cb-4658-b7f6-3088333d062f
2) Process Monitor (For any OS version)
Process Monitor can also display the process name for TCP/UDP activity.
Note: You can download the Process Monitor tool at the following link: http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
3) For Windows 7 / Windows 2008 R2
You can use the new built-in ETW tracing available at the NDIS layer. All you need to do is start a new ETL packet capture session. This method doesn't even require you to install any sniffing software (Network Monitor, Wireshark, etc.). You can use this option for general packet capturing on Windows 7/Windows 2008 R2 as well:
netsh trace start capture=yes tracefile=c:\temp\capture1.etl
<<Repro the behavior>>
netsh trace stop
Note: You'll need to open the created ETL file with Network Monitor 3.3/3.4
4) For Windows XP/2003
=> You can use the following method to find the process if it's a TCP or UDP packet:
A. To enable the kernel logger and have it generate a log file of TCP/IP activity, follow these steps:
1. Run the Performance Tool, and select the Performance Logs And Alerts node.
2. Select Trace Logs, and then select New Log Settings from the Action menu.
3. When prompted, enter a name for the settings (for example, Microsoft Data).
4. On the dialog box that opens, select the Events Logged By System Provider option and then deselect everything except the Network TCP/IP option.
5. In the Run As edit box, enter the Administrator account name and set the password to match it.
6. Dismiss the dialog box, and wait until the event is logged.
7. Once the event is logged, select Stop from the Action menu in Perfmon.
8. Open a command prompt, change to the C:\Perflogs directory (or the directory you specified for the trace log file), and get the Perfmon trace file from there.
B. Analyze the collected trace file:
1. If you are running Windows XP or Windows Server 2003, run Tracerpt (located in the \Windows\System32 directory) and pass it the name of the trace log file. If you are running Windows 2000, download and run Tracedmp from the Windows 2000 Resource Kit. Both tools generate two files: dumpfile.csv and summary.txt.
2. Rename dumpfile.csv to dumpfile.txt and open it in Microsoft Excel (File, Open): locate the folder containing dumpfile.txt, change the file type box to text files, and double-click dumpfile.txt. When the Text Import Wizard pops up, choose Delimited, click Next, check only Comma under delimiters, click Next, select all the columns in the Data preview box, change the data type to Text, and click Finish.
3. Rows starting with Event Name, EventTrace, or SystemConfig contain information about the trace, hardware, running services, etc.
4. Rows starting with UdpIp or TcpIp have the following layout:
UdpIp or TcpIp, Send or Recv, ?, ?, ?, ?, PID, transport payload size (bytes), Destination IP, Source IP, Destination port, Source port, ?, ?
An example output:
...
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.016.001.001, 172.016.057.015, 2762, 36003, 0, 0
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.027.045.020, 172.016.057.015, 6004, 36009, 0, 0
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.160.076.112, 172.016.057.015, 2344, 36016, 0, 0
UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, 172.048.102.066, 172.016.057.015, 1116, 36022, 0, 0
...
- For example, in the first line, host 172.016.057.015 is sending a UDP packet to 172.016.001.001. The packet length is 8 bytes, the source UDP port is 36003, and the destination UDP port is 2762. The process ID sending the UDP packet is 2136. The tasklist output collected on the same machine was as follows:
...
cqmghost.exe       4972 Console  0      4,324 K
svchost.exe        5024 Console  0      1,524 K
wmiprvse.exe       5316 Console  0      2,192 K
wmiprvse.exe       5468 Console  0        976 K
logon.scr          6080 Console  0        200 K
mssearch.exe       6028 Console  0      5,828 K
wmiprvse.exe       5176 Console  0      2,168 K
mad.exe            7176 Console  0     45,792 K
AntigenStore.exe  10092 Console  0        200 K
store.exe          2136 Console  0  1,040,592 K
emsmta.exe        12020 Console  0     29,092 K
AntigenIMC.exe     1308 Console  0        200 K
svchost.exe        6516 Console  0      1,992 K
...
So the process that was sending the UDP packet was store.exe (process ID: 2136).
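If you'd rather script step B than import the CSV into Excel, the tracerpt output can be filtered with a few lines of Python. This is just a sketch based on the column layout shown above; the sample row and the PID-to-name map come from this example (in practice you'd build the map from the tasklist output):

```python
import csv
import io

# Sketch: extract UdpIp/TcpIp rows from tracerpt's dumpfile.csv and resolve
# the PID column against a tasklist-style PID -> image name map.
# Column layout (from above): field 6 = PID, field 7 = payload bytes,
# fields 8-11 = destination IP, source IP, destination port, source port.

sample = io.StringIO(
    "UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8, "
    "172.016.001.001, 172.016.057.015, 2762, 36003, 0, 0\n"
)
pid_to_name = {2136: "store.exe", 5024: "svchost.exe"}  # built from tasklist

summaries = []
for row in csv.reader(sample, skipinitialspace=True):
    if row[0] not in ("UdpIp", "TcpIp"):
        continue  # skip Event Name/EventTrace/SystemConfig/etc. rows
    pid = int(row[6])
    proto = "UDP" if row[0] == "UdpIp" else "TCP"
    # tracerpt zero-pads the IP octets (172.016.001.001); normalize them
    dst = ".".join(str(int(octet)) for octet in row[8].split("."))
    src = ".".join(str(int(octet)) for octet in row[9].split("."))
    summaries.append(
        f"{pid_to_name.get(pid, '?')} (PID {pid}): {proto} "
        f"{src}:{row[11]} -> {dst}:{row[10]}, {row[7]} bytes"
    )

print(summaries[0])
# store.exe (PID 2136): UDP 172.16.57.15:36003 -> 172.16.1.1:2762, 8 bytes
```

In a real run you'd open dumpfile.csv with `open(...)` instead of the in-memory sample and print one summary line per row, which gives you the same PID-to-process mapping as the manual Excel steps.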
5) Special scenarios
In some special scenarios, you may not be able to find the process using any of the methods above. For example, in a couple of cases I was asked to find out why a certain client was sending so many back-to-back ARP requests (they were causing IDS devices to generate alarms, and the LAN switch was disabling the switch port after a short while), and none of the methods above helped. So I decided on a different approach: let the client do whatever it likes by loosening the existing security settings (for example, changing the switch configuration so that it doesn't disable the port). That way we had a chance to see what the client was doing after those ARP requests. A similar problem was reported by another customer: they noticed that their DCs were trying to contact the clients using various ports/protocols (TCP 445/139, ICMP, Kerberos, etc.). Again, we let the DC do whatever it was trying to do so that we could see the activity and hence the process/reason behind it. I'm going to talk about this in another blog post.
Thanks,
Murat