In this blog post, I’ll talk about another network trace analysis scenario.
The problem was that some Windows XP clients were copying files from a NAS device very slowly compared to others. As one of the most useful logs to troubleshoot such problems, I requested a network trace to be collected on a problem Windows XP client. Normally it’s best to collect simultaneous network traces but it was a bit diffcult to collect a trace at the NAS device side so we were limited to a client side trace.
Before I start explaining how I got to the bottom of the issue, I would like to provide you with some background on how files are read by Windows via SMB protocol so that you’ll better understand the resolution part:
Windows XP and Windows 2003 use SMB v1 protocol for remote file system access (like creating/reading/writing/deleting/locking files over a network connection). Since it was a file read from the remote server in this scenario, the following SMB activity would be seen between the client and server:
The client will open the file at the server first:
SMB Create AndX request ---->
<---- SMB Create AndX response
Then the client will send SMB Read AndX requests to retrieve blocks of the file:
SMB Read AndX request ----> (61440 bytes)
<---- SMB Read AndX response
<---- SMB Read AndX response
Note: SMBv1 protocol could request 61 KB of data at most in one SMB Read AndX request.
After this short overview, let’s get back to the original problem and analyze the packets taken from the real network trace:
Frame# Time delta Source IP Destination IP Protocol Information
===== ======== ========= ========== ====== ========
59269 0.000000 10.1.1.1 10.1.1.2 SMB Read AndX Request, FID: 0x0494, 61440 bytes at offset 263823360
59270 0.000000 10.1.1.2 10.1.1.1 SMB Read AndX Response, 61440 bytes
59271 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65993793
59272 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65995249
59273 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65996705
59320 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66049121
59321 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66050577
59322 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66052033
59323 0.000000 10.1.1.1 10.1.1.2 TCP foliocorp > microsoft-ds [ACK] Seq=67600 Ack=66053489 Win=65535
59325 0.406250 10.1.1.2 10.1.1.1 TCP [Continuation to #59270] microsoft-ds > folioc [PSH, ACK]Seq=66053489
59326 0.000000 10.1.1.1 10.1.1.2 SMB Read AndX Request, FID: 0x0494, 61440 bytes at offset 263884800
59327 0.000000 10.1.1.2 10.1.1.1 SMB Read AndX Response, 61440 bytes
59328 0.000000 10.1.1.2 10.1.1.1 TCP [Continuation to #59327] microsoft-ds > foliocorp [ACK] Seq=66055297
Now let’s take a closer look at some related frames:
Frame# 59269 => Client requests the next 61 KB of data at offset 263823360 from the file represented with FID 0x0494 (this FID is assigned by server side when the file is first opened/created)
Frame# 59270 => Server starts sending 61440 bytes of data back to the client in SMB Read AndX response.
Frame# 59271 => The remaining parts are sent in 1460 bytes chunks because of TCP MSS negotiated, it’s generally 1460 bytes. (like frame# 59272, frame# 59273 etc)
The most noticable thing in the network trace was to see many such 0.4 seconds delays (like the one that we see at frame #59325). Those 0.4 seconds delays were always present at the last fragment of 61 KB of data returned by the server.
Normally 0.4 seconds could be seen as a very low delay but considering that the client will send n x SMB Read Andx request to the server to read the file it will quickly be clear that 0.4 seconds of delay is huge (for example, the client needs to send 1000 SMB Read AndX requests to read a 64 MB file)
Generally we’re used to see some delays in network traces due to packet retransmissions (due to packet loss) or link transfer delayes etc. But seeing a constant delay of 0.4 seconds in every last fragment of a 61 KB block made me suspect that a QoS implementation was in place somewhere between the client and server. By delaying every read request about 0.4 seconds, actually file copy is being slowed down on purpose: traffic shaping/limiting.
Since we didn’t have a network trace collected at the NAS device side, we couldn’t check if the QoS policy was in effect at the NAS device side or on a network device running in between the two. (we checked the client side and there was no QoS configuration in place). After further checking the network devices, it turned out that there was an incorrectly configured QoS policy on one of them. After making the required changes, the problem was resolved...
Hope this helps
I am seeing exactly the same behiavor in a similar environment; could you provide details on the specific incorrectly configured QoS policy on the affected network device?
Yes, it would be very helpful to know the details of the correction made to QoS.