August, 2010

  • Effects of incorrect QoS policies: A story behind a slow file copy...

    Hi there,

     

    In this blog post, I’ll talk about another network trace analysis scenario.

     

    The problem was that some Windows XP clients were copying files from a NAS device very slowly compared to others. Since a network trace is one of the most useful pieces of data for troubleshooting such problems, I requested one to be collected on a problem Windows XP client. Normally it’s best to collect simultaneous traces from both sides, but it was a bit difficult to collect a trace at the NAS device side, so we were limited to a client-side trace.

     

    Before I start explaining how I got to the bottom of the issue, I would like to provide you with some background on how files are read by Windows via the SMB protocol, so that you’ll better understand the resolution:

     

    Windows XP and Windows 2003 use the SMB v1 protocol for remote file system access (creating/reading/writing/deleting/locking files over a network connection). Since this scenario involved reading a file from a remote server, the following SMB activity would be seen between the client and server:

     

    Client                                      Server

    =====                                     ======

    The client will open the file at the server first:

     

    SMB Create AndX request ---->

                                              <---- SMB Create AndX response

     

    Then the client will send SMB Read AndX requests to retrieve blocks of the file:

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

    ...

     

    Note: The SMB v1 protocol can request at most 61440 bytes (about 61 KB) of data in one SMB Read AndX request.
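    The request/response pattern above can be sketched as a simple read loop. This is a minimal sketch, not real SMB1 code: the transport function and its signature are hypothetical stand-ins, and framing, signing, and error handling are omitted.

```python
READ_ANDX_MAX = 61440  # largest read one SMB v1 Read AndX request can ask for

def read_remote_file(send_read_andx, file_size, fid):
    """Issue sequential Read AndX requests until the whole file is retrieved."""
    data = bytearray()
    offset = 0
    while offset < file_size:
        # One request/response round trip per block of up to 61440 bytes
        chunk = send_read_andx(fid, offset, READ_ANDX_MAX)
        if not chunk:          # server returned no data: stop
            break
        data += chunk
        offset += len(chunk)
    return bytes(data)

# Stand-in for the network: serves reads out of an in-memory "file"
remote_file = bytes(200000)
def fake_read(fid, offset, count):
    return remote_file[offset:offset + count]

assert read_remote_file(fake_read, len(remote_file), 0x0494) == remote_file
```

    The point of the loop is what matters for the rest of the post: every block of the file costs one full round trip to the server.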

     

     

    After this short overview, let’s get back to the original problem and analyze the packets taken from the real network trace:

     

    Frame#  Time delta    Source IP             Destination IP     Protocol        Information

    =====     ========     =========           ==========         ======            ========

    ....

    59269    0.000000              10.1.1.1                10.1.1.2                SMB       Read AndX Request, FID: 0x0494, 61440 bytes at offset 263823360

    59270    0.000000              10.1.1.2                10.1.1.1                SMB       Read AndX Response, 61440 bytes

    59271    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65993793

    59272    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65995249

    59273    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65996705

    ...

    59320    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66049121

    59321    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66050577

    59322    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66052033

    59323    0.000000              10.1.1.1                10.1.1.2                TCP        foliocorp > microsoft-ds [ACK] Seq=67600 Ack=66053489 Win=65535

    59325    0.406250              10.1.1.2                10.1.1.1                TCP       [Continuation to #59270] microsoft-ds > folioc [PSH, ACK]Seq=66053489

     

    59326    0.000000              10.1.1.1                10.1.1.2                SMB       Read AndX Request, FID: 0x0494, 61440 bytes at offset 263884800

    59327    0.000000              10.1.1.2                10.1.1.1                SMB       Read AndX Response, 61440 bytes

    59328    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59327] microsoft-ds > foliocorp [ACK] Seq=66055297

    ...

     

    Now let’s take a closer look at some related frames:

     

    Frame# 59269 => The client requests the next 61 KB of data at offset 263823360 from the file represented by FID 0x0494 (the FID is assigned by the server when the file is first opened/created)

     

    Frame# 59270 => Server starts sending 61440 bytes of data back to the client in SMB Read AndX response.

     

    Frame# 59271 => The remaining data is sent in chunks sized according to the negotiated TCP MSS, which is generally 1460 bytes (see also frame# 59272, frame# 59273, etc.)
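    A quick back-of-the-envelope check shows how many TCP segments one response spans (ignoring the SMB/NetBIOS header bytes and any TCP options that shave a few bytes off each segment, as in the actual trace above):

```python
import math

MSS = 1460        # negotiated TCP maximum segment size
SMB_READ = 61440  # data bytes in one Read AndX response

# One 61440-byte response is carried in roughly this many segments:
segments = math.ceil(SMB_READ / MSS)
print(segments)  # 43
```

    So each Read AndX response arrives as a burst of a few dozen full-size TCP segments, which is exactly the pattern of "Continuation" frames seen above.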

     

    The most noticeable thing in the network trace was the large number of such 0.4-second delays (like the one at frame #59325). Those 0.4-second delays were always present at the last fragment of each 61 KB block of data returned by the server.

     

    Normally 0.4 seconds might seem like a small delay, but considering that the client sends one SMB Read AndX request per 61 KB block, it quickly becomes clear that 0.4 seconds is huge (for example, the client needs to send roughly 1000 SMB Read AndX requests to read a 64 MB file)
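    The arithmetic behind that claim is worth spelling out: with a 0.4-second penalty per read, the delays alone dominate the transfer time of a 64 MB file.

```python
import math

FILE_SIZE = 64 * 1024 * 1024  # 64 MB file
READ_SIZE = 61440             # bytes per Read AndX request
DELAY = 0.4                   # observed per-read delay in seconds

requests = math.ceil(FILE_SIZE / READ_SIZE)   # number of Read AndX round trips
added_delay = requests * DELAY                # pure waiting time in seconds
print(requests, round(added_delay))  # 1093 437
```

    Over 7 minutes of added wait for a 64 MB file, before any actual transfer time: that is exactly the "very slow copy" the users reported.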

     

    Generally we’re used to seeing some delays in network traces due to packet retransmissions (caused by packet loss), link transfer delays, etc. But a constant 0.4-second delay on the last fragment of every 61 KB block made me suspect that a QoS implementation was in place somewhere between the client and server. By delaying every read by about 0.4 seconds, the file copy was effectively being slowed down on purpose: traffic shaping/limiting.

     

    Since we didn’t have a network trace collected at the NAS device side, we couldn’t check whether the QoS policy was in effect at the NAS device itself or on a network device in between (we had checked the client side, and there was no QoS configuration in place). After further checking the network devices, it turned out that there was an incorrectly configured QoS policy on one of them. After making the required changes, the problem was resolved...

     

    Hope this helps

     

    Thanks,

    Murat

  • Why should a DC contact clients in the domain?

    Hi there,

     

    In today’s blog post, I’m going to show you how I found out why a domain controller was contacting random clients in the domain. The customer raised this case out of security concerns: they suspected that a suspicious process might be running on the DC. In general we don’t expect domain controllers to contact the clients in the domain, so our customer wanted to understand the reason behind it.

     

    We first verified that the DC was really contacting some clients by collecting a network trace on the DC. You can see one of those clients (client1) contacted by the DC (DC1):

    Note: DC and client IP addresses are replaced for data privacy.

     

    ...

    11415      14:21:12 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=3912, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=70128947, Ack=0, Win=65535 (  ) = 65535     {TCP:515, IPv4:46}

    11443      14:21:12 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=3913, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=3133793441, Ack=0, Win=65535 (  ) = 65535 {TCP:518, IPv4:46}

    30922      14:33:17 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4118, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2414564040, Ack=0, Win=65535 (  ) = 65535 {TCP:1270, IPv4:46}

    30950      14:33:17 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4120, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1797119693, Ack=0, Win=65535 (  ) = 65535 {TCP:1273, IPv4:46}

    51472      14:45:22 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4314, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1834145861, Ack=0, Win=65535 (  ) = 65535 {TCP:1403, IPv4:46}

    51500      14:45:22 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4315, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=4278939251, Ack=0, Win=65535 (  ) = 65535 {TCP:1406, IPv4:46}

    67096      14:57:26 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4514, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1707963693, Ack=0, Win=65535 (  ) = 65535 {TCP:1945, IPv4:46}

    67126      14:57:26 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4515, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=3807245641, Ack=0, Win=65535 (  ) = 65535 {TCP:1948, IPv4:46}

    74691      15:09:30 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4740, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=1036190517, Ack=0, Win=65535 (  ) = 65535 {TCP:1983, IPv4:46}

    74721      15:09:31 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4741, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2281072822, Ack=0, Win=65535 (  ) = 65535 {TCP:1986, IPv4:46}

    84937      15:21:35 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4930, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=3190224054, Ack=0, Win=65535 (  ) = 65535 {TCP:2104, IPv4:46}

    84965      15:21:35 05.07.2010              DC1  CLIENT1         TCP          TCP:Flags=......S., SrcPort=4931, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2774224583, Ack=0, Win=65535 (  ) = 65535 {TCP:2107, IPv4:46}

    ...

     

    At first look, it drew my attention that the connection attempt was repeated every 12 minutes or so, so it had to be something running periodically on the DC. Normally Network Monitor shows you the process initiating such TCP sessions, but under heavy load it stops doing so in favor of performance, as it’s a costly operation. There are other methods to find the process sending a certain packet, but I decided to let the DC do whatever it would do against the client so we could see the whole activity.
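    The periodicity is easy to verify from the SYN timestamps in the trace above; taking the first SYN of each connection-attempt pair:

```python
from datetime import datetime

# First SYN of each connection-attempt pair, from the trace above
syn_times = ["14:21:12", "14:33:17", "14:45:22", "14:57:26", "15:09:30", "15:21:35"]
stamps = [datetime.strptime(t, "%H:%M:%S") for t in syn_times]
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(gaps)  # [725.0, 725.0, 724.0, 724.0, 725.0] -> roughly 12 minutes
```

    A near-constant ~725-second interval is a strong hint that a timer-driven service, rather than user activity, is behind the connections.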

     

    So the customer removed firewall filters and allowed the DC to connect to Client1. After doing so we collected a new network trace to see the latest situation. We got the expected results by examining the new network trace:

     

    a) The first interesting finding was that the client was sending a “Master Browser” announcement to the DC (DC1) shortly before one of these connection attempts from the DC side:

     

    47140      07:30:31 08.07.2010              CLIENT1             DC1 BROWSER              BROWSER:Master Announcement       {SMB:351, UDP:350, IPv4:3}

     

    b) After that browser announcement, the DC contacted the client at TCP port 139 to establish an SMB session:

     

    47595      07:30:33 08.07.2010              DC1 CLIENT1             TCP          TCP:Flags=......S., SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594372577, Ack=0, Win=65535 (  ) = 65535              {TCP:373, IPv4:3}

    47596      07:30:33 08.07.2010              CLIENT1  DC1            TCP          TCP:Flags=...A..S., SrcPort=NETBIOS Session Service(139), DstPort=3787, PayloadLen=0, Seq=2981880191, Ack=2594372578, Win=8192 ( Scale factor not supported ) = 8192   {TCP:373, IPv4:3}

    47597      07:30:33 08.07.2010              DC1 CLIENT1             TCP          TCP:Flags=...A...., SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594372578, Ack=2981880192, Win=65535 (scale factor 0x0) = 65535    {TCP:373, IPv4:3}

     

    c) Then it initiated a NetBT session to the client:

     

    47598      07:30:33 08.07.2010              DC1 CLIENT1             NbtSS      NbtSS:SESSION REQUEST, Length =68 {NbtSS:374, TCP:373, IPv4:3}

    47599      07:30:33 08.07.2010              CLIENT1             DC1 NbtSS      NbtSS:POSITIVE SESSION RESPONSE, Length =0                {NbtSS:374, TCP:373, IPv4:3}

     

    d) Then it established an SMB connection:

     

    47600      07:30:33 08.07.2010              DC1 CLIENT1             SMB        SMB:C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12       {NbtSS:374, TCP:373, IPv4:3}

    47602      07:30:33 08.07.2010              CLIENT1             DC1 SMB        SMB:R; Negotiate, Dialect is NT LM 0.12 (#5), SpnegoToken (1.3.6.1.5.5.2)         {NbtSS:374, TCP:373, IPv4:3}

    47614      07:30:34 08.07.2010              DC1 CLIENT1             SMB        SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE                {NbtSS:374, TCP:373, IPv4:3}

    47615      07:30:34 08.07.2010              CLIENT1             DC1 SMB        SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED                {NbtSS:374, TCP:373, IPv4:3}

    47616      07:30:34 08.07.2010              DC1 CLIENT1             SMB        SMB:C; Session Setup Andx, NTLM AUTHENTICATE MESSAGE, Workstation: DC1           {NbtSS:374, TCP:373, IPv4:3}

    47621      07:30:34 08.07.2010              CLIENT1             DC1 SMB        SMB:R; Session Setup Andx  {NbtSS:374, TCP:373, IPv4:3}

     

    e) Then it connected to the interprocess communication share (IPC$):

     

    47625      07:30:34 08.07.2010              DC1 CLIENT1             SMB        SMB:C; Tree Connect Andx, Path = \\CLIENT1\IPC$, Service = ?????      {NbtSS:374, TCP:373, IPv4:3}

    47626      07:30:34 08.07.2010              CLIENT1             DC1 SMB        SMB:R; Tree Connect Andx, Service = IPC            {NbtSS:374, TCP:373, IPv4:3}

     

    f) Then it called RAP (Remote Administration Protocol) APIs like NetServerEnum2 etc:

     

    47630      07:30:34 08.07.2010              DC1 CLIENT1             RAP         RAP:NetServerEnum2 Request, InfoLevel = 1,  LocalList in                 {SMB:379, NbtSS:374, TCP:373, IPv4:3}

    47631      07:30:34 08.07.2010              CLIENT1             DC1 RAP         RAP:NetServerEnum2 Response, Count = 1         {SMB:379, NbtSS:374, TCP:373, IPv4:3}

     

    g) Once it got the requested info, it logged off and disconnected the TCP session:

     

    47642      07:30:34 08.07.2010              DC1 CLIENT1             SMB        SMB:C; Logoff Andx               {NbtSS:374, TCP:373, IPv4:3}

    47643      07:30:34 08.07.2010              CLIENT1             DC1 SMB        SMB:R; Logoff Andx               {NbtSS:374, TCP:373, IPv4:3}

    47650      07:30:34 08.07.2010              DC1 CLIENT1             SMB        SMB:C; Tree Disconnect        {NbtSS:374, TCP:373, IPv4:3}

    47651      07:30:34 08.07.2010              CLIENT1             DC1 SMB        SMB:R; Tree Disconnect        {NbtSS:374, TCP:373, IPv4:3}

    47657      07:30:34 08.07.2010              DC1 CLIENT1             TCP          TCP:Flags=...A...F, SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594373651, Ack=2981881320, Win=64407 (scale factor 0x0) = 64407    {TCP:373, IPv4:3}

    47658      07:30:34 08.07.2010              CLIENT1             DC1 TCP          TCP:Flags=...A...F, SrcPort=NETBIOS Session Service(139), DstPort=3787, PayloadLen=0, Seq=2981881320, Ack=2594373652, Win=15559 (scale factor 0x0) = 15559  {TCP:373, IPv4:3}

    47662      07:30:34 08.07.2010              DC1 CLIENT1             TCP          TCP:Flags=...A...., SrcPort=3787, DstPort=NETBIOS Session Service(139), PayloadLen=0, Seq=2594373652, Ack=2981881321, Win=64407 (scale factor 0x0) = 64407    {TCP:373, IPv4:3}

     

    h) The same activity was seen every 12 minutes in the network trace.

     

    RESULTS:

    ========

    1) After analyzing the second network trace, the reason behind the DC’s connections to clients was now clear:

    Every 12 minutes or so, the master browser in a network segment informs the domain master browser (which is the DC) that it’s a master browser, and the DC then connects to that master browser to retrieve its browse list. You can find more details below:

    Taken from http://technet.microsoft.com/en-us/library/cc737661(WS.10).aspx How Computer Browser Service Works:

    When a domain spans multiple subnets, the master browse servers for each subnet use a unicast Master Announcement message to announce themselves to the domain master browse server. This message notifies the domain master browse server that the sending computer is a master browse server in the same domain. When the domain master browse server receives a Master Browse Server Announcement message, it returns to the “announcing” master browse server a request for a list of the servers in that master browse server’s subnet. When that list is received, the domain master browse server merges it with its own server list.

    This process, repeated every 12 minutes, guarantees that the domain master browse server has a complete browse list of all the servers in the domain. Thus, when a client sends a browse request to a backup browse server, the backup browse server can return a list of all the servers in the domain, regardless of the subnet on which those servers are located.

    ...

     

    Hope this helps

     

    Thanks,

    Murat

     

  • Exchange servers send ICMP and UDP packets to clients or Domain Controllers, why?

    Hi,

     

    I would like to talk about a few network trace analysis cases where we were asked to find out why certain packets (specifically ICMP and UDP) were being sent by Exchange servers. Below you’ll find more details about how we found the processes sending those packets:

     

    a) Exchange servers sending UDP packets with random source or destination ports to various clients

     

    In one scenario, our customer’s security team wanted to find out, because of security concerns, why the Exchange servers were sending UDP packets to random clients on the network. There was no deterministic pattern in the source or destination UDP ports; the only consistency was that each UDP packet sent by the Exchange servers always carried an 8-byte payload. You can see a sample network trace output below:

     

    Note: Addresses were replaced for privacy purposes even though private IP address space was in use.

     

    No.     Time                       Source                Destination           Protocol Info

    ...

    105528 2010-01-14 15:20:14.454856 10.1.1.1     172.1.10.14        UDP      Source port: 35996  Destination port: mxomss

    105530 2010-01-14 15:20:14.454856 10.1.1.1     172.18.10.27        UDP      Source port: 35997  Destination port: edtools

    105531 2010-01-14 15:20:14.454856 10.1.1.1     172.17.17.95         UDP      Source port: 35998  Destination port: fiveacross

    105535 2010-01-14 15:20:14.454856 10.1.1.1     172.17.11.51         UDP      Source port: 36000  Destination port: kwdb-commn

    105540 2010-01-14 15:20:14.454856 10.1.1.1     172.23.98.97          UDP      Source port: 36003  Destination port: dicom-tls

    105541 2010-01-14 15:20:14.454856 10.1.1.1     172.24.12.8         UDP      Source port: 36004  Destination port: dkmessenger

    105542 2010-01-14 15:20:14.454856 10.1.1.1     172.28.2.52         UDP      Source port: 36005  Destination port: tragic

    105545 2010-01-14 15:20:14.454856 10.1.1.1     172.31.5.14        UDP      Source port: 36006  Destination port: xds

    105546 2010-01-14 15:20:14.454856 10.1.1.1     172.2.10.63         UDP      Source port: 36007  Destination port: 4642

    105547 2010-01-14 15:20:14.454856 10.1.1.1     172.2.35.68          UDP      Source port: 36008  Destination port: foliocorp

    105552 2010-01-14 15:20:14.454856 10.1.1.1     172.18.12.55         UDP      Source port: 36010  Destination port: saphostctrl

    105553 2010-01-14 15:20:14.454856 10.1.1.1     172.48.199.45         UDP      Source port: 36011  Destination port: slinkysearch

    105554 2010-01-14 15:20:14.454856 10.1.1.1     172.27.133.42         UDP      Source port: 36012  Destination port: oracle-oms

    105555 2010-01-14 15:20:14.454856 10.1.1.1     172.27.121.40         UDP      Source port: 36013  Destination port: proxy-gateway

    105558 2010-01-14 15:20:14.454856 10.1.1.1     172.24.7.11         UDP      Source port: 36016  Destination port: fcmsys

    ...

     

    - Source UDP port is increasing and destination UDP port seems random at first sight

    - The data part of the UDP datagrams is always 8 bytes. As an example:

     

    Frame 105540 (50 bytes on wire, 50 bytes captured)

    Ethernet II, Src: HewlettP_11:11:11 (00:1c:c4:11:11:11), Dst: All-HSRP-routers_15 (00:00:0c:07:ac:15)

    Internet Protocol, Src: 10.1.1.1 (10.1.1.1), Dst: 172.23.98.97 (172.23.98.97)

    User Datagram Protocol, Src Port: 36003 (36003), Dst Port: dicom-tls (2762)

    Data (8 bytes)

     

     

    => To better understand which process might be sending those packets, we decided to collect a kernel TCPIP trace on the source Windows 2003 server. For more information about methods that can be used to identify the process sending a certain packet, please see my previous post on this

     

    After collecting a network trace and an accompanying kernel TCPIP trace as described in the other post (option 4), we managed to catch the UDP packets seen in the above network trace (the network trace above and the kernel TCPIP trace below were actually collected together). As an example:

     

    ...

           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.023.098.097, 010.001.001.001, 2762, 36003, 0, 0

           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.027.153.050, 172.023.021.024, 6004, 36009, 0, 0

           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.028.097.111, 172.023.021.024, 2344, 36016, 0, 0

           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.027.102.056, 172.023.021.024, 1116, 36022, 0, 0

    ...

     

    - For example, in the first line above, 10.1.1.1 (the Exchange server) is sending a UDP packet to 172.23.98.97. The packet length is 8 bytes, the source UDP port is 36003 and the destination UDP port is 2762. The process ID sending the UDP packet is 2136; in fact, in all such UDP packets the process ID is always 2136.

    - That first line taken from the kernel trace corresponds to packet #105540 seen in the network trace
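    Pulling those values out of the kernel-trace lines is straightforward. In this sketch the column meanings are inferred from the values shown above (PID in column 7, payload size in column 8, then destination/source IPs and ports); that layout is an assumption, not documented field names.

```python
# Hedged sketch: parse one of the kernel trace lines shown above.
line = ("UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, "
        "2136, 8, 172.023.098.097, 010.001.001.001, 2762, 36003, 0, 0")

fields = [f.strip() for f in line.split(",")]
pid, size = int(fields[6]), int(fields[7])
# IPs are zero-padded per octet; normalize them
dst_ip = ".".join(str(int(octet)) for octet in fields[8].split("."))
dst_port, src_port = int(fields[10]), int(fields[11])
print(pid, size, dst_ip, dst_port, src_port)
# 2136 8 172.23.98.97 2762 36003 -> matches frame 105540 in the network trace
```

    Matching (dst_ip, dst_port, src_port) against a frame in the network trace is what ties a packet on the wire to a PID.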

     

    => After checking the “tasklist /SVC” output, we saw that process ID 2136 was store.exe (the Exchange Information Store process):

     

    ...

    wmiprvse.exe                  5176 Console                    0      2,168 K

    mad.exe                       7176 Console                    0     45,792 K

    AntigenStore.exe             10092 Console                    0        200 K

    store.exe                     2136 Console                    0  1,040,592 K

    emsmta.exe                   12020 Console                    0     29,092 K

    ...
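    The PID-to-image-name lookup can also be scripted from such tasklist-style output; a small sketch using lines copied from the listing above:

```python
# Map PIDs to image names from tasklist-style output (rows from above)
listing = """\
wmiprvse.exe                  5176 Console                    0      2,168 K
store.exe                     2136 Console                    0  1,040,592 K
emsmta.exe                   12020 Console                    0     29,092 K
"""

by_pid = {}
for row in listing.splitlines():
    parts = row.split()
    by_pid[int(parts[1])] = parts[0]   # PID column -> image name

print(by_pid[2136])  # store.exe
```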

     

     

    => After further investigation on the Exchange side with the help of an Exchange expert, we found that this traffic was expected: it is used as an e-mail notification mechanism:

     

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;811061  XCCC: Exchange Clients Do Not Receive "New Mail" Notification Messages

     

    The Information Store process (Store.exe) sends a User Datagram Protocol (UDP) packet for new mail notifications. However, because the Store process does not run on an Exchange virtual server but on the cluster node, the UDP packet is sent from the IP address of that node. If you fail over the cluster node, the data and Exchange 2000 Server virtual server configuration are moved to the Store process that is running on the other cluster server node. New mail notifications are sent from the IP address of that second cluster node.

     

    ///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

     

    b) Exchange servers sending 1 byte pings to DCs

     

    One of our customers reported that their DCs were getting constant ICMP echo requests from a number of member servers, and they wanted help finding the process behind it because of security concerns. After some analysis and testing with the help of an Exchange expert colleague of mine, we found that those ICMP echo requests were sent by Exchange-related services. The ICMP echo requests have the following characteristics:

     

    => Its payload is always 1 byte

    => The payload itself is “3F”

     

    Those ICMP echo requests cease once the Exchange-related services are stopped, which is another indication. This behavior is partly explained in the following article:

     

    http://support.microsoft.com/kb/270836  Exchange Server static port mappings

     

    Taken from the article:

     

    Note In a perimeter network firewall scenario, there is no Internet Control Message Protocol (ICMP) connectivity between the Exchange server and the domain controllers. By default, Directory Access (DSAccess) uses ICMP to ping each server to which it connects to determine whether the server is available. When there is no ICMP connectivity, Directory Access responds as if every domain controller were unavailable

    => You can also see this in a sample network trace output collected on an Exchange server.

     

    Hope this helps

     

    Thanks,

    Murat 

  • Why does it take so long to retrieve data from a SQL server over a WAN connection?

    Hi there,

     

    In today’s blog post, I’m going to talk about how to troubleshoot a WAN performance problem by using simultaneous network traces.

     

    One of our customers reported that they were having performance problems when retrieving data from a SQL server running in a remote subnet. Interestingly, the problem disappeared when the client and SQL server were running in the same subnet. Even though that test made it obvious that the WAN connectivity had something to do with the problem, our customer wanted to make sure that there were no issues on the client side or the server side.

     

    In such problem scenarios, the first thing we ask for is simultaneous network traces collected while reproducing the problem. Looking at a trace from only one side might yield incorrect results and send you in the wrong direction. From that perspective, simultaneous network traces are a must.

     

    Provided that simultaneous network traces are collected, how are you going to troubleshoot this problem? Well, there could be a number of approaches but I generally use the following method:

     

    a) Find the same session (most likely a TCP or UDP session) in both traces. This may be harder than expected, especially if there’s NAT between the client and server; otherwise it should be easy.

     

    b) After correctly identifying the same session in both traces, your next task is to find the packet delays. The most practical way to do so is to order the packets by time delta (the time between two packets displayed back to back in a network trace).

     

    c) After that, you can identify the highest time deltas within the trace (for example, by re-ordering packets based on time delta). If the problem is a WAN delay issue, it will generally look like this:

     

    => In the client-side trace, most of the delays are seen on packets sent by the server side.

    => In the server-side trace, most of the delays are seen on packets sent by the client side.

     

    In other words, from the client’s perspective it’s the server causing the delays, and hence the performance problem; from the server’s perspective it’s the client. Actually both sides are WRONG! The delay stems from the WAN. I think this also underlines the importance of collecting simultaneous network traces in such scenarios; otherwise you will get the wrong results.
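    The idea behind steps (b) and (c) can be sketched with hypothetical per-packet data: attribute each inter-packet gap to the sender of the packet it precedes, and see which side accumulates the delay.

```python
# Hypothetical (sender, time-delta) pairs from a one-sided capture; the
# delta is the gap between a packet and the one shown before it.
def delay_by_sender(packets):
    totals = {}
    for sender, delta in packets:
        totals[sender] = totals.get(sender, 0.0) + delta
    return totals

# In a client-side capture of a WAN-delay problem, the big deltas sit on
# packets arriving from the server:
client_view = [("client", 0.0002), ("server", 0.089),
               ("client", 0.0003), ("server", 0.091)]
totals = delay_by_sender(client_view)
print(max(totals, key=totals.get))  # 'server' looks like the culprit here
```

    Run the same tally on the server-side capture and the "culprit" flips to the client, which is exactly the symmetric illusion described above.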

     

    Well, now let’s take a look at an example to better understand this:

     

    EXAMPLE:

    ========

    The client is having performance problems retrieving data from the SQL server, and this is only visible when the client and server are running in different subnets. There’s also NAT (network address translation) between the client and server. We have the IP address information and simultaneous network traces collected while reproducing the problem:

     

    A. IP address information:

     

    => From client side perspective:

     

    Client IP:             192.168.1.66

    Server IP:            10.1.1.7

     

    => From server side perspective:

     

    Client IP:             172.16.5.14

    Server IP:            192.168.200.148

     

    Note: IP address information has been changed deliberately for data privacy

     

     

    B. The session in which we see the performance problem:

     

    Since this was a SQL server connectivity problem, it was obvious that the client would be contacting the SQL server on TCP port 1433 (assuming the default configuration hadn’t been changed). After a few checks I found the following session in the simultaneous network traces:

     

    Wireshark filter for Client side session:

    (ip.addr eq 192.168.1.66 and ip.addr eq 10.1.1.7) and (tcp.port eq 46739 and tcp.port eq 1433)

     

    Wireshark filter for the same session seen at Server side trace:

    (ip.addr eq 172.16.5.14 and ip.addr eq 192.168.200.148) and (tcp.port eq 63584 and tcp.port eq 1433)
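    Note that NAT rewrote both the addresses and the client port (46739 on the client side vs. 63584 on the server side). One hedged way to confirm that two such filters really select the same session is to compare a protocol-level fingerprint, such as the sequence of non-empty TCP payload sizes, which NAT normally leaves untouched. A minimal sketch with hypothetical payload lengths:

```python
# Sketch (hypothetical data): NAT rewrites IPs/ports but not the application
# payload, so the sequence of non-zero payload lengths should match between
# the client-side and server-side captures of one session.
def payload_fingerprint(payload_lens, k=8):
    """First k non-empty payload sizes of a capture, as a comparable tuple."""
    return tuple(n for n in payload_lens if n > 0)[:k]

client_side = [0, 0, 0, 26, 26, 115, 93, 0, 310]   # hypothetical lengths
server_side = [0, 0, 0, 26, 26, 115, 93, 0, 310]
same = payload_fingerprint(client_side) == payload_fingerprint(server_side)
print(same)  # True -> likely the same TCP session
```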

     

     

    Now let’s look at how the communication is seen from the client perspective and server perspective:

     

    C. From client perspective:

     

    No.     Time            Source                    Destination             Protocol Info

        406 20.541677   192.168.1.66           10.1.1.7                  TCP      46739 > 1433 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=11322755 TSER=0 WS=6

        407 0.089037    10.1.1.7                    192.168.1.66          TCP      1433 > 46739 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1380 WS=8 TSV=359429749 TSER=11322755

        408 0.000036    192.168.1.66            10.1.1.7                  TCP      46739 > 1433 [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=11322777 TSER=359429749

        409 0.000142    192.168.1.66            10.1.1.7                  TDS      TDS7/8 0x12 Packet

        410 0.089013    10.1.1.7                    192.168.1.66          TDS      Response Packet

        411 0.000058    192.168.1.66            10.1.1.7                  TCP      46739 > 1433 [ACK] Seq=27 Ack=27 Win=5888 Len=0 TSV=11322799 TSER=359429758

        412 0.000861    192.168.1.66            10.1.1.7                  TDS      TDS7/8 0x12 Packet

        413 0.091710    10.1.1.7                    192.168.1.66          TDS      TDS7/8 0x12 Packet

        414 0.006916    192.168.1.66            10.1.1.7                  TDS      TDS7/8 0x12 Packet

        415 0.093413    10.1.1.7                    192.168.1.66          TDS      TDS7/8 0x12 Packet

        416 0.002583    192.168.1.66            10.1.1.7                  TDS      Unknown Packet Type: 23[Unreassembled Packet]

        425 0.091253    10.1.1.7                    192.168.1.66          TDS      Response Packet[Unreassembled Packet]

        426 0.000434    192.168.1.66            10.1.1.7                  TDS      Query Packet

        427 0.089305    10.1.1.7                    192.168.1.66          TDS      Response Packet

        428 0.000330    192.168.1.66            10.1.1.7                  TDS      Query Packet

        429 0.088216    10.1.1.7                    192.168.1.66          TDS      Response Packet[Unreassembled Packet]

        431 0.040908    192.168.1.66            10.1.1.7                  TCP      46739 > 1433 [ACK] Seq=1040 Ack=1372 Win=9280 Len=0 TSV=11322926 TSER=359429804

        432 0.174717    192.168.1.66            10.1.1.7                  TDS      Remote Procedure Call Packet

       433 0.103898    10.1.1.7                     192.168.1.66          TDS      Response Packet[Unreassembled Packet]

        434 0.000056    192.168.1.66            10.1.1.7                  TCP      46739 > 1433 [ACK] Seq=1947 Ack=1875 Win=10944 Len=0 TSV=11322995 TSER=359429835

        467 0.062827    192.168.1.66            10.1.1.7                  TDS      Remote Procedure Call Packet[Unreassembled Packet]

        473 0.091735    10.1.1.7                    192.168.1.66          TDS      Response Packet[Unreassembled Packet]

        474 0.000061    192.168.1.66            10.1.1.7                  TCP      46739 > 1433 [ACK] Seq=1992 Ack=1901 Win=10944 Len=0 TSV=11323034 TSER=359429852

        475 0.008514    192.168.1.66            10.1.1.7                  TDS      Remote Procedure Call Packet

        476 0.093648    10.1.1.7                    192.168.1.66          TDS      Response Packet[Unreassembled Packet]

        477 0.001301    192.168.1.66            10.1.1.7                  TDS      Remote Procedure Call Packet[Unreassembled Packet]

        478 0.087454    10.1.1.7                    192.168.1.66          TDS      Response Packet[Unreassembled Packet]

        479 0.039780    192.168.1.66            10.1.1.7                  TCP      46739 > 1433 [ACK] Seq=2944 Ack=2399 Win=12608 Len=0 TSV=11323092 TSER=359429871

    ...

     

    From the client’s perspective, responses from the SQL server side arrive with a delay of approximately 0.08 seconds. The client then sends the next query without waiting...

     

    D. From server perspective:

     

    No.     Time        Source                Destination           Protocol Info

         44 21.123385   172.16.5.14              192.168.200.148       TCP      63584 > 1433 [SYN] Seq=0 Win=5840 Len=0 MSS=1380 TSV=11322755 TSER=0 WS=6

         45 0.000039    192.168.200.148       172.16.5.14           TCP      1433 > 63584 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8 TSV=359429749 TSER=11322755

         46 0.089015    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=11322777 TSER=359429749

         47 0.000461    172.16.5.14               192.168.200.148       TDS      TDS7/8 0x12 Packet

         48 0.000117    192.168.200.148       172.16.5.14           TDS      Response Packet

         49 0.087810    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=27 Ack=27 Win=5888 Len=0 TSV=11322799 TSER=359429758

         50 0.001553    172.16.5.14               192.168.200.148       TDS      TDS7/8 0x12 Packet

         51 0.000427    192.168.200.148       172.16.5.14           TDS      TDS7/8 0x12 Packet

         52 0.099177    172.16.5.14               192.168.200.148       TDS      TDS7/8 0x12 Packet

         53 0.004255    192.168.200.148       172.16.5.14           TDS      TDS7/8 0x12 Packet

         54 0.091772    172.16.5.14               192.168.200.148       TDS      Unknown Packet Type: 23[Unreassembled Packet]

         55 0.001104    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         56 0.089833    172.16.5.14               192.168.200.148       TDS      Query Packet

         57 0.000183    192.168.200.148       172.16.5.14           TDS      Response Packet

         58 0.089154    172.16.5.14               192.168.200.148       TDS      Query Packet

         59 0.000150    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         60 0.128460    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=1040 Ack=1372 Win=9280 Len=0 TSV=11322926 TSER=359429804

         61 0.178927    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet

         62 0.001507    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         63 0.098217    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=1947 Ack=1875 Win=10944 Len=0 TSV=11322995 TSER=359429835

         64 0.066525    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         65 0.000168    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         66 0.089069    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=1992 Ack=1901 Win=10944 Len=0 TSV=11323034 TSER=359429852

         67 0.011578    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet

         68 0.001386    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         70 0.089248    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         71 0.000147    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         72 0.129948    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=2944 Ack=2399 Win=12608 Len=0 TSV=11323092 TSER=359429871

         73 0.002327    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet

         74 0.001011    192.168.200.148       172.16.5.14           TDS      Response Packet

         75 0.088519    172.16.5.14               192.168.200.148       TCP      63584 > 1433 [ACK] Seq=3323 Ack=2497 Win=12608 Len=0 TSV=11323115 TSER=359429884

         76 0.003928    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         77 0.000144    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         78 0.097206    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet

         79 0.001297    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         80 0.089896    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         81 0.000147    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         82 0.099239    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         83 0.002980    172.16.5.14               192.168.200.148       TDS      Unknown Packet Type: 0 (Not last buffer)[Unreassembled Packet]

         84 0.000025    192.168.200.148       172.16.5.14           TCP      1433 > 63584 [ACK] Seq=2869 Ack=6121 Win=65536 Len=0 TSV=359429923 TSER=11323187

         85 0.000272    192.168.200.148       172.16.5.14           TDS      Response Packet

         86 0.091009    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         87 0.000145    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         88 0.125878    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         89 0.002955    172.16.5.14               192.168.200.148       TDS      Unknown Packet Type: 0 (Not last buffer)[Unreassembled Packet]

         90 0.000022    192.168.200.148       172.16.5.14           TCP      1433 > 63584 [ACK] Seq=3783 Ack=8147 Win=65536 Len=0 TSV=359429945 TSER=11323242

         91 0.000233    192.168.200.148       172.16.5.14           TDS      Response Packet

         92 0.095498    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         93 0.000187    192.168.200.148       172.16.5.14           TDS      Response Packet[Unreassembled Packet]

         94 0.111501    172.16.5.14               192.168.200.148       TDS      Remote Procedure Call Packet[Unreassembled Packet]

         95 0.005919    172.16.5.14               192.168.200.148       TDS      Unknown Packet Type: 0 (Not last buffer)[Unreassembled Packet]

         96 0.000019    192.168.200.148       172.16.5.14           TCP      1433 > 63584 [ACK] Seq=4760 Ack=10928 Win=65536

     

    When checking the trace from the server’s perspective, the SQL server almost always returns TDS responses immediately. But the TDS requests from the client side arrive at the SQL server with an average delay of about 0.08 seconds.

     

    RESULTS:

    ========

    1) From the client’s perspective, the SQL server is slow in sending TDS responses.

     

    2) From the SQL server’s perspective, the client is slow in sending TDS requests.

     

    3) The real problem was that the TDS performance issue stemmed from WAN connectivity delay (link delay, network device delays, etc.). In this instance the average WAN delay was about 80 milliseconds (0.08 seconds). Even though it looks like a small number at first glance, considering that you may be exposed to that small delay, say, 100 times while making a transaction, this could easily become an issue (around 8 seconds). So never underestimate such small values. :)
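    To put that arithmetic in perspective: a chatty client/server exchange pays the round-trip delay once per request/response pair. Here is a minimal back-of-the-envelope sketch in Python (the helper function is my own illustration; only the 80 ms figure comes from this scenario):

```python
# Estimate how a per-round-trip WAN delay compounds for a "chatty"
# protocol that needs one network round trip per request/response pair.

def total_delay(rtt_seconds: float, round_trips: int) -> float:
    """Time spent purely waiting on the network."""
    return rtt_seconds * round_trips

# ~80 ms WAN delay, 100 request/response pairs in one transaction:
print(total_delay(0.08, 100))   # roughly 8 seconds of pure network wait
```

    The per-query processing time on the server was negligible here, so nearly all of the perceived slowness was this accumulated wait.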

     

    Hope this helps

     

    Thanks,

    Murat

  • Excessive ARP requests, is it malware or legitimate activity?

    Hi there,

     

    In this blog post, I would like to talk about a network trace analysis scenario where we were requested to find out if excessive ARP requests were legitimate or not.

     

    Here are more details about the problem:

     

    Problem description:

    ==================

    Some Windows Vista clients experience network connectivity problems. After a client gets connected to the network, it is disconnected after a short while, like 1 minute or so. After some initial troubleshooting, it was found that the connectivity problem occurred because the switch port was disabled. Our customer’s network team stated that the switch port was disabled because excessive ARP requests were received from the client in a very short amount of time (like 10+ ARP requests within 10 milliseconds from the same source MAC, etc.).

     

    How we did the troubleshooting:

    =============================

    We collected a network trace from one of those client machines to better understand the root cause of the problem. But the problem was that we were not able to see what the client would do after resolving those MAC addresses via ARP, because the switch was shutting down the port due to the mentioned security setting.

     

    It’s also very difficult to associate an ARP packet with a process on a Windows machine using the methods mentioned in my earlier post, so I decided to use method 5 from that same post: I asked the customer to disable that security setting on a switch port where the problem was seen, so that we would be able to see what the client would do after resolving the MAC addresses via ARP. We collected a new network trace after making that configuration change and got to the bottom of the problem afterwards:

     

    a) In the new network trace, it was clearly seen again that the client was sending excessive ARP requests (20 ARP requests sent for different IP addresses within the same millisecond!)

     

    Wireshark filter:

    arp.src.proto_ipv4 == 10.0.80.1 and arp.opcode==0x1

     

    No.     Time        Source                Destination           Protocol Info

    326437 30.192068   HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.202?  Tell 10.0.80.1

    326438 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.165?  Tell 10.0.80.1

    326440 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.83?  Tell 10.0.80.1

    326441 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.59?  Tell 10.0.80.1

    326442 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.220?  Tell 10.0.80.1

    326443 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.215?  Tell 10.0.80.1

    326451 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.28?  Tell 10.0.80.1

    326453 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.60?  Tell 10.0.80.1

    326454 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.127?  Tell 10.0.80.1

    326455 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.81?  Tell 10.0.80.1

    326458 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.29?  Tell 10.0.80.1

    326462 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.98?  Tell 10.0.80.1

    326465 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.65?  Tell 10.0.80.1

    326467 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.129?  Tell 10.0.80.1

    326468 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.37?  Tell 10.0.80.1

    326470 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.239?  Tell 10.0.80.1

    326471 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.15?  Tell 10.0.80.1

    326472 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.26?  Tell 10.0.80.1

    326473 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.235?  Tell 10.0.80.1

    326475 0.000000    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.77?  Tell 10.0.80.1

     

    - There are 20 ARP requests sent in a row within the same millisecond
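    For illustration, the switch’s port-security rule can be modeled as a sliding-window rate check. This is a hypothetical Python sketch (the window, threshold and MAC string are made up, not taken from the customer’s actual switch configuration):

```python
# Hypothetical model of the switch rule: flag any source MAC that sends
# more than `limit` ARP requests within a `window` of seconds.
from collections import deque

def arp_burst(events, window=0.010, limit=10):
    """events: iterable of (timestamp, src_mac); returns offending MACs."""
    recent = {}            # src_mac -> timestamps still inside the window
    offenders = set()
    for ts, mac in sorted(events):
        q = recent.setdefault(mac, deque())
        q.append(ts)
        while ts - q[0] > window:
            q.popleft()
        if len(q) > limit:
            offenders.add(mac)
    return offenders

# 20 requests from one MAC within the same millisecond trips the rule:
print(arp_burst([(0.000, "hp:10:11:91")] * 20))   # {'hp:10:11:91'}
```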

     

    b) Then I decided to check a few actions that the client would take after resolving the MAC address. I started with the first one (10.0.80.202):

     

    Wireshark filter:

    ((arp.src.proto_ipv4==10.0.80.202 and arp.dst.proto_ipv4==10.0.80.1) or (arp.src.proto_ipv4==10.0.80.1 and arp.dst.proto_ipv4==10.0.80.202)) or (ip.addr==10.0.80.202 and ip.addr==10.0.80.1)

     

    => The client first resolves the MAC address of 10.0.80.202 about 2 minutes before the problem:

     

    No.     Time        Source                Destination           Protocol Info

      62994 2.518411    HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.202?  Tell 10.0.80.1

      62996 0.000000    HewlettP_11:21:6b     HewlettP_10:11:91     ARP      10.0.80.202 is at 00:23:7d:13:2f:6b

     

    => Then the client establishes a TCP session to TCP port 5357 (wsdapi) and exchanges some data with that system:

     

      62997 0.000000    10.0.80.1            10.0.80.202           TCP      59327 > wsdapi [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=2

      63000 0.000000    10.0.80.202           10.0.80.1            TCP      wsdapi > 59327 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8

      63001 0.000000    10.0.80.1            10.0.80.202           TCP      59327 > wsdapi [ACK] Seq=1 Ack=1 Win=65700 Len=0

      63008 0.000000    10.0.80.1            10.0.80.202           TCP      59327 > wsdapi [PSH, ACK] Seq=1 Ack=1 Win=65700 Len=223

      63009 0.000000    10.0.80.1            10.0.80.202           TCP      59327 > wsdapi [PSH, ACK] Seq=224 Ack=1 Win=65700 Len=658

      63013 0.000000    10.0.80.202           10.0.80.1            TCP      wsdapi > 59327 [ACK] Seq=1 Ack=882 Win=65536 Len=0

      63133 0.046800    10.0.80.202           10.0.80.1            TCP      wsdapi > 59327 [ACK] Seq=1 Ack=882 Win=65536 Len=1460

      63134 0.000000    10.0.80.202           10.0.80.1            TCP      wsdapi > 59327 [PSH, ACK] Seq=1461 Ack=882 Win=65536 Len=908

      63136 0.000000    10.0.80.1            10.0.80.202           TCP      59327 > wsdapi [ACK] Seq=882 Ack=2369 Win=65700 Len=0

     

    => Then the client goes idle in that TCP session for about 115 seconds, and then it resolves the MAC address of 10.0.80.202 once more (most likely the ARP cache entry had been deleted, so the source TCP/IP stack needed to re-resolve the MAC address):

     

    326437 115.073211  HewlettP_10:11:91     Broadcast             ARP      Who has 10.0.80.202?  Tell 10.0.80.1

    326483 0.000000    HewlettP_11:21:6b     HewlettP_10:11:91     ARP      10.0.80.202 is at 00:23:7d:13:2f:6b

     

    => Right after resolving the MAC address of the target system once more, the client immediately forcefully closes the TCP session with a TCP RST segment:

     

    326484 0.000000    10.0.80.1            10.0.80.202           TCP      59327 > wsdapi [RST, ACK] Seq=882 Ack=2369 Win=0 Len=0

     

    c) After doing the same check for the other IP addresses whose MAC addresses the client was trying to resolve, I saw that the behavior was the same: the client was connecting to another target at TCP port 5357, exchanging some data with that system, staying idle for about 2 minutes, then re-resolving the MAC address and forcefully closing the TCP session with a TCP RST.
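    This also explains why the ARP requests cluster so tightly: WSDAPI sessions opened during the same discovery pass all reach the same idle timeout together, so their teardown-time ARP re-resolutions land within milliseconds of each other. A small Python sketch with hypothetical timestamps (my illustration, not actual trace data):

```python
# 20 WSDAPI sessions opened nearly simultaneously during one discovery
# pass; each goes idle and hits the same ~115 s timeout, so all 20
# teardown-time ARP re-resolutions land in one tiny burst.

open_times = [2.5 + i * 0.001 for i in range(20)]   # 20 sessions, ~same time
IDLE_TIMEOUT = 115.0                                # observed ~115 s idle
teardown_times = [t + IDLE_TIMEOUT for t in open_times]

burst_width = max(teardown_times) - min(teardown_times)
print(burst_width < 0.05)   # True: all 20 teardowns within 50 ms
```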

     

    SUMMARY:

    =========

    1) After analyzing the network trace as described above, it was clearer why the client was sending so many ARP requests within the same millisecond to various other targets:

     

    To forcefully close its existing TCP sessions established to TCP port 5357 on those target machines.

     

    2) The next question was whether that activity was legitimate or not. After doing some more research on the related protocol, I came up with the following results:

     

    The activity on these TCP sessions (which were forcefully closed by the client) was generated by the WSDAPI protocol. The data within the TCP segments also confirmed that the activity was legitimate; we saw meaningful XML requests and responses in the traces, matching the messages documented below:

     

    http://msdn.microsoft.com/en-us/library/ee886312(VS.85).aspx Get (Metadata Exchange) HTTP Request and Message

    http://msdn.microsoft.com/en-us/library/ee886311(VS.85).aspx GetResponse (Metadata Exchange) Message

     

    This protocol seems to be used for discovering devices such as printers, web cameras and video systems, and for manipulating them remotely. More details can be found in the following articles:

     

    http://msdn.microsoft.com/en-us/library/aa385800(VS.85).aspx About Web Services on Devices

    http://blogs.msdn.com/dandris/archive/2007/11/09/wsdapi-101-part-1-web-services-from-10-000-feet.aspx

     

    3) Even though it wasn’t very clear why the client was closing many WSDAPI sessions at the same time, it was now clear that it was legitimate activity, and our customer wanted to take it further from that point on.

     

    Hope this helps

     

    Thanks,

    Murat

  • Is it real or Matrix? Some facts about network traces...

    Hi there,

     

    In this blog post, I would like to talk about some facts about network traces. A while ago we were discussing how much we can depend on network trace data when it comes to questions like these: “The outgoing packet that we see in a network trace, did it really hit the wire?”, “If a packet that was supposed to be received by a system doesn’t appear in a network trace collected on that system, does it really mean that the machine didn’t physically receive the packet?”

     

    Before going to the answers, it would be better to clarify the network trace capturing architecture so that you can come up with the answers yourself:

     

    NETWORK STACK VIEW IN GENERAL:

     

    Upper layer stacks (Winsock etc)

    ==================================

    NDIS Protocol drivers (Example: TCPIP)

    -------------------------------------------------------

    NDIS Intermediate (NDIS 5.x) or NDIS Lightweight filter drivers (NDIS 6.x)

    (Examples: NLB / QoS / Teaming / 802.1q tagging / packet capturing agent / NDIS level firewalls etc)

    -------------------------------------------------------

    NDIS Miniport drivers (Example: NIC drivers)

    ==================================

    Hardware NIC

    ==================================

               |

               |

    ----------------- Wired/wireless

     

     

    Note: The Microsoft Network Monitor capturing agent runs as an NDIS lightweight filter driver from Windows Vista onwards.

    Note: Wireshark capturing agent (NPF/WinPcap) runs as an NDIS protocol driver. You can find more information at the below link:

    http://www.winpcap.org/docs/docs_411/html/group__NPF.html

     

     

    Well, after taking a look at the above architecture, I think we can easily answer the questions I asked at the beginning of the post (and also some additional questions). Maybe we need to add one more thing to fully explain this: the order of NDIS intermediate or NDIS lightweight filter drivers. This is actually not an easy one to answer, but we can say from experience that NLB, teaming, NDIS-level firewall drivers, etc. generally run below the packet capturing agent (in the Network Monitor scenario; for Wireshark, all of those run below the capturing engine, since WinPcap runs as an NDIS protocol driver).

     

    QUESTIONS & ANSWERS

     

    Q1. The outgoing packet that we see in a network trace, did it really hit the wire?

    A1. No, we cannot say that for sure. After the capturing engine gets a copy of the packet, the packet could still be dropped by a lower-layer NDIS intermediate or lightweight filter driver (teaming, an NDIS-level firewall, etc.) or by the NIC driver...

     

    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

     

    Q2. Not seeing a packet in a network trace that was supposed to be received by that system, does it really mean that the machine didn’t physically receive the packet?

    A2. No, we cannot say that for sure. The packet might have been dropped by a lower-level NDIS miniport driver or an NDIS intermediate/lightweight filter driver before it reached the capturing engine on the same box.

     

    (Example: You cannot see packets on an NLB node if that node isn’t supposed to handle them; the packets may be dropped by a lower-level NDIS driver before the capturing engine ever sees them.)

     

    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

     

    Q3. Seeing a packet received in a network trace, does it really mean that the upper layers (like TCPIP/Winsock/application/etc.) on this machine have *seen* that packet?

    A3. No, we cannot say that for sure. The packet might have been dropped by an upper-layer NDIS intermediate/lightweight filter driver, an upper-layer firewall, or similar security software (intrusion detection, URL filtering, etc.).

     

    (Example: You see incoming TCP SYNs in a network trace but you don’t see TCP SYN-ACKs being sent in response. This generally stems from a firewall or similar security driver running somewhere above the capturing engine; since the packet is dropped before the TCPIP driver has a chance to process it, no response is sent back to the connecting party. Another example could be a higher-level filter, such as a Winsock-level filter, that prevents the request from being seen by the application.)

     

    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

     

    Q4. I see that there’re TCP segments or IP datagrams with incorrect checksums, does that really mean the packet is problematic and causing the performance/connectivity issue that I see?

    A4. It depends. If you see those checksum errors for packets sent by A to B, the trace was collected on A, and TCP/IP checksum offloading is enabled (as it is by default for many NICs), then it doesn’t indicate a problem. Maybe we should also explain why we see it that way:

     

    TCP/IP header checksum offloading is a performance feature supported by many NICs. With this feature, the TCPIP driver passes the packet down to the NIC driver without calculating the checksums for the IP or TCP headers. Since the capturing engine *sees* the packet before the NIC driver does, it concludes that the checksum-related fields are incorrect (which is normal, because those checksums will be calculated by the NIC before the packet hits the wire). In other situations, this would indicate a problem.
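    For reference, the checksum the capture tool recomputes is the RFC 1071 Internet checksum (a 16-bit ones’ complement sum). Here is a minimal Python sketch of it (my own illustration):

```python
# RFC 1071 Internet checksum: 16-bit ones' complement sum of the data.
# This is what a dissector recomputes when it flags "checksum incorrect".

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                            # pad to a 16-bit boundary
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

# Well-known IPv4 header example with the checksum field zeroed out:
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(hdr)))   # 0xb861
```

    With offloading enabled, the checksum field captured above the NIC driver is still an uncomputed placeholder, so a recomputation like this one naturally disagrees with it.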

     

    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

     

    Q5. I sometimes see packets whose payloads are greater than standard Ethernet MTU in a network trace. How could that be possible?

    A5. Some NIC drivers support the LSO (Large Send Offload) feature, which is another optimization. Large Send Offload allows the TCPIP driver to offload TCP data segmentation to the NIC to accelerate the process. Also please note that in some instances, generally due to NIC driver issues, LSO might cause performance problems. You can find more information about this feature at the following link:

     

    http://www.microsoft.com/whdc/device/network/taskoffload.mspx Windows Network Task Offload
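    Conceptually, LSO moves the payload-slicing step below the capturing engine, which is why the trace shows one oversized “packet”. A simplified Python sketch of the segmentation step only (real LSO hardware also fixes up sequence numbers, checksums, flags, etc.):

```python
# Slice one oversized TCP send buffer into MSS-sized wire segments,
# roughly what the NIC does when Large Send Offload is enabled.

def segment(payload: bytes, mss: int = 1460):
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

chunks = segment(b"x" * 64240)   # one ~64 KB send handed down by TCPIP
print(len(chunks))               # 44 wire-sized segments of 1460 bytes
```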

     

     

    Hope this helps.

     

    Thanks,

    Murat

  • Where are my packets? Analyzing a packet drop issue...

    One of the most common reasons for network connectivity or performance problems is packet drop. In this blog post, I’ll be talking about analyzing a packet drop issue, please read on.

     

    One of our customers was complaining about problems with remote SCCM agent policy updates, and a network packet drop issue was suspected. We were then involved to analyze the problem from a networking perspective. Generally such problems might stem from the following points:

     

    a) A problem on the source client (generally at NDIS layer or below). For additional information please see a previous blog post.

     

    b) A problem stemming from network itself (links, firewalls, routers, proxy devices, encryption devices, switches etc). This is the most common problem point in such cases.

     

    c) A problem on the target server (generally at NDIS layer or below). For additional information please see a previous blog post.

     

    In packet drop issues, the most important logs are simultaneous network traces collected on the source and target systems.

     

    You’ll find below more details about how we got to the bottom of the problem:

     

    NETWORK TRACE ANALYSIS:

    =======================

    For a successful network trace analysis, you need to be familiar with the technology that you’re troubleshooting. At the very least, you should have some prior knowledge of the network activity that the related action would generate (for this example, how the SCCM agent retrieves policies from the SCCM server). In this instance, after a short discussion with an SCCM colleague of mine, I learned that the SCCM agent sends an HTTP POST request to the SCCM server to retrieve the policies. I analyzed the network traces in the light of this fact:

     

    How we see the problematic session at SCCM agent side trace (10.1.1.1):

     

    No.     Time        Source                Destination           Protocol Info

      1917 104.968750  10.1.1.1        172.16.1.1            TCP      rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460

       1918 105.000000  172.16.1.1      10.1.1.1              TCP      http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

       1919 105.000000  10.1.1.1        172.16.1.1            TCP      rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

       1920 105.000000  10.1.1.1        172.16.1.1            HTTP     CCM_POST /ccm_system/request HTTP/1.1

       1921 105.000000  10.1.1.1        172.16.1.1            HTTP     Continuation or non-HTTP traffic

       1922 105.000000  10.1.1.1        172.16.1.1            HTTP     Continuation or non-HTTP traffic

       1925 105.140625  172.16.1.1      10.1.1.1              TCP      http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 Len=0 SLE=1742 SRE=2328

       1975 107.750000  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       2071 112.890625  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       2264 123.281250  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       2651 144.171875  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       3475 185.843750  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       4392 234.937500  172.16.1.1      10.1.1.1              TCP      http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0

     

    How we see the problematic session at SCCM Server side trace (172.16.1.1):

     

    No.     Time        Source                Destination           Protocol Info

      13587 100.765625  10.1.1.1        172.16.1.1            TCP      rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460

      13588 0.000000    172.16.1.1            10.1.1.1        TCP      http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

      13591 0.031250    10.1.1.1        172.16.1.1            TCP      rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 Len=0

      13598 0.031250    10.1.1.1        172.16.1.1            HTTP     CCM_POST /ccm_system/request HTTP/1.1

      13611 0.078125    10.1.1.1        172.16.1.1            HTTP     [TCP Previous segment lost] Continuation or non-HTTP traffic

      13612 0.000000    172.16.1.1            10.1.1.1        TCP      http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 [TCP CHECKSUM INCORRECT] Len=0 SLE=1742 SRE=2328

      30509 129.812500  172.16.1.1            10.1.1.1        TCP      http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0

     

     

    Explanation on color coding:

     

    a) GREEN packets are the ones that we both see at client and server side traces. In other words, those are packets that were either sent by the client and successfully received by the server or sent by the server and successfully received by the client.

     

    b) RED packets are the ones that were sent by the client but not received by the server. To further provide details on this part:

    - The frame #1921 that we see at the client side trace (which is a continuation packet for the HTTP POST request) is not visible at the server side trace

    - The frames #1975, 2071, 2264, 2651 and 3475 are retransmissions of frame #1921. We cannot see any of those 5 retransmissions of the same original TCP segment at the server side trace either, which means that most likely all of those retransmitted segments were lost on the way to the server, even though there’s still a minor chance that they were physically received by the server but dropped by an NDIS-level driver (see this post for more details)
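    As a side note, the retransmission timestamps themselves are a useful sanity check. A quick Python sketch over the client-side frame times above shows the classic pattern of the retry interval roughly doubling each time, consistent with TCP retransmission timeout backoff:

```python
# Client-side timestamps (seconds) of the original segment (#1921)
# and its five retransmissions, taken from the trace above.
times = [105.000, 107.750, 112.890, 123.281, 144.171, 185.843]

gaps = [b - a for a, b in zip(times, times[1:])]
ratios = [g2 / g1 for g1, g2 in zip(gaps, gaps[1:])]

print([round(g, 2) for g in gaps])          # [2.75, 5.14, 10.39, 20.89, 41.67]
print(all(1.8 < r < 2.2 for r in ratios))   # True: each retry ~doubles
```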

     

    The server side forcefully closes the session approximately 2 minutes later. (Client side frame #4392 and server side frame #30509)

     

    RESULTS:

    ========

    1) The simultaneous network trace analysis clearly showed that this was a packet drop issue.

     

    2) The problem most likely stems from one of these points:

     

    a) An NDIS layer or below problem at the SCCM agent side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)

    b) A network related problem (link or network device problem like router/switch/firewall/proxy issue)

    c) An NDIS layer or below problem at the SCCM server side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)

     

    3) In such scenarios we may consider doing the following at the client and server side:

     

    a) Updating NIC/teaming drivers

    b) Updating 3rd party filter drivers (AV/Firewall/Host IDS software/other kind of filtering devices) or temporarily uninstalling them for the duration of troubleshooting

     

    4) Most of the time such problems stem from the network itself, and you may consider taking the following actions on that front:

     

    a) Checking cabling/switch port

    b) Checking port speed/duplex settings (you may try manually setting them to 100 Mbps / Full Duplex, for example)

    c) Enabling logging on routers/switches/firewalls/proxies or similar devices and checking interface statistics or the problematic sessions to see if there are any packet drops

    d) Alternatively, you may consider collecting network traces at more than the usual 2 points (the sending and receiving ends). For example, 4 network traces could be collected: at the source, at the target and at two intermediate points. That way you can follow the session along the path and isolate the problem to a certain part of your network.
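Once you have traces from several points, the comparison itself is simple set arithmetic. A hedged sketch, assuming each capture point's trace has been reduced to the set of IP Identification values seen for the problem flow (the capture point names and ID values below are hypothetical):

```python
# Compare consecutive capture points along the path; packets present at an
# upstream point but missing downstream were dropped between those two points.
def locate_drop(points):
    """points: ordered list of (name, set_of_ip_ids) along the path."""
    findings = []
    for (up_name, up_ids), (down_name, down_ids) in zip(points, points[1:]):
        lost = up_ids - down_ids
        if lost:
            findings.append((up_name, down_name, sorted(lost)))
    return findings

points = [
    ("source",        {101, 102, 103, 104}),
    ("intermediate1", {101, 102, 103, 104}),
    ("intermediate2", {101, 103}),          # 102 and 104 never arrived here
    ("target",        {101, 103}),
]
print(locate_drop(points))
# [('intermediate1', 'intermediate2', [102, 104])]
```

In this made-up example the drop is isolated to the link or device between intermediate1 and intermediate2, which is exactly the kind of narrowing-down the 4-trace approach gives you.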

     

    For your information, the specific problem I mentioned in this blog post stemmed from a router running between the SCCM agent and server.

     

    Hope this helps

     

    Thanks,

    Murat

     

  • Does sqllogship.exe have anything to do with web servers in the internet? Story behind CRL check for certificates ...

    Hi there,

     

    In today’s blog post I’m going to talk about a network trace analysis scenario where I was requested to analyze a few network traces to understand why a server was trying to contact external web servers when sqllogship.exe was run on it.

     

    Our customer’s security team noticed HTTP connection attempts coming from internal SQL servers that weren’t supposed to connect to any external servers. The only thing those servers were running was something like this:

     

    "C:\Program Files\Microsoft SQL Server\90\Tools\Binn\sqllogship.exe" -Backup 1B55E77D-A000-1EE8-9780-441096E2151 -server PRODDB

     

    And in every attempt, blocked HTTP connections were seen on the firewall. Since we didn’t know what the server would do after establishing such an HTTP connection to an external network, we weren’t able to comment much on this. I asked our customer to let the firewall allow the HTTP connection so that we could gather more information after the connection was established; this is a method (method 5) I mentioned in one of my earlier posts.

     

    After our customer made the change and re-collected a network trace on the SQL server, it became much clearer why the SQL server was attempting to connect to a remote web server: to verify whether the certificate was revoked by downloading the CRL (certificate revocation list):

     

     

    => SQL server first resolves the IP address for the name: crl.microsoft.com

     

    No.     Time                       Source                Destination           Protocol Info

    23519    2010-06-26 09:23:14.560786        10.11.1.11           10.1.1.1         DNS       Standard query A crl.microsoft.com

    23520    2010-06-26 09:23:14.561000        10.1.1.1                    10.11.1.11     DNS       Standard query response CNAME crl.www.ms.akadns.net

    |-> crl.microsoft.com: type CNAME, class IN, cname crl.www.ms.akadns.net

    |-> crl.www.ms.akadns.net: type CNAME, class IN, cname a1363.g.akamai.net

    |-> a1363.g.akamai.net: type A, class IN, addr 193.45.15.18

    |-> a1363.g.akamai.net: type A, class IN, addr 193.45.15.50
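Following the CNAME chain in a response like this can also be done programmatically. A small illustrative sketch (the record tuples simply mirror the answer section shown above; the function name is mine):

```python
def resolve_chain(answers, qname):
    """Follow CNAMEs from qname, then collect the final name's A records."""
    cnames = {name: target for name, rtype, target in answers
              if rtype == "CNAME"}
    chain, name = [qname], qname
    while name in cnames:
        name = cnames[name]
        chain.append(name)
    addrs = [value for n, rtype, value in answers
             if n == name and rtype == "A"]
    return chain, addrs

# The answer section from the trace above
answers = [
    ("crl.microsoft.com",     "CNAME", "crl.www.ms.akadns.net"),
    ("crl.www.ms.akadns.net", "CNAME", "a1363.g.akamai.net"),
    ("a1363.g.akamai.net",    "A",     "193.45.15.18"),
    ("a1363.g.akamai.net",    "A",     "193.45.15.50"),
]
print(resolve_chain(answers, "crl.microsoft.com"))
# (['crl.microsoft.com', 'crl.www.ms.akadns.net', 'a1363.g.akamai.net'],
#  ['193.45.15.18', '193.45.15.50'])
```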

     

    => SQL server establishes a TCP session to port 80 at the remote web server running on 193.45.15.50:

     

    No.     Time                       Source                Destination           Protocol Info

      69679 2010-06-26 09:24:37.466403          10.11.1.11            193.45.15.50                       TCP      2316 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1460

      69697 2010-06-26 09:24:37.554390          193.45.15.50   10.11.1.11                TCP      80 > 2316 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460

      69698 2010-06-26 09:24:37.554407          10.11.1.11            193.45.15.50                       TCP      2316 > 80 [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

     

    => After the TCP 3-way handshake, the SQL server sends an HTTP GET request to the web server to retrieve the CSPCA.crl file:

     

    No.     Time                       Source                Destination           Protocol Info

      69699 2010-06-26 09:24:37.554603          10.11.1.11            193.45.15.50       HTTP     GET /pki/crl/products/CSPCA.crl HTTP/1.1

        |-> GET /pki/crl/products/CSPCA.crl HTTP/1.1\r\n

        |-> User-Agent: Microsoft-CryptoAPI/5.131.3790.3959\r\n

        |-> Host: crl.microsoft.com\r\n

     

      69729 2010-06-26 09:24:37.642219          193.45.15.50   10.11.1.11                TCP      80 > 2316 [ACK] Seq=1 Ack=199 Win=6432 Len=0

      69731 2010-06-26 09:24:37.645483          193.45.15.50   10.11.1.11                PKIX-CRL Certificate Revocation List

        |-> HTTP/1.1 200 OK\r\n

    ...

        |-> Certificate Revocation List

        |-> signedCertificateList

        |-> algorithmIdentifier (shaWithRSAEncryption)

     

     

    Note: It looks like this is done due to the following (taken from http://support.microsoft.com/kb/944752):

    “When the Microsoft .NET Framework 2.0 loads a managed assembly, the managed assembly calls the CryptoAPI function to verify the Authenticode signature on the assembly files to generate publisher evidence for the managed assembly.”

     

     

    => Similarly the server sends another HTTP GET request to retrieve CodeSignPCA.crl:

     

    No.     Time                       Source                Destination           Protocol Info

      77631 2010-06-26 09:24:52.642968          10.11.1.11            193.45.15.50                       HTTP     GET /pki/crl/products/CodeSignPCA.crl HTTP/1.1

      77747 2010-06-26 09:24:52.733106          193.45.15.50   10.11.1.11                PKIX-CRL Certificate Revocation List

      78168 2010-06-26 09:24:53.011176          10.11.1.11            193.45.15.50                       TCP      2316 > 80 [ACK] Seq=403 Ack=1961 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

    ...

     

     

    Note: Again it looks like this is done due to the following (taken from http://support.microsoft.com/kb/947988 “You cannot install SQL Server 2005 Service Pack 1 on a SQL Server 2005 failover cluster if the failover cluster is behind a firewall”):

     

    “When the Microsoft .NET Framework starts SSIS, the .NET Framework calls the CryptoAPI function. This function determines whether the certificates that are signed to the SQL Server assembly files are revoked. The CryptoAPI function requires an Internet connection to check the following CRLs for these certificates:

    http://crl.microsoft.com/pki/crl/products/CodeSignPCA.crl

    http://crl.microsoft.com/pki/crl/products/CodeSignPCA2.crl”

     

    It looks like there are a number of ways to prevent such CRL checks, like disabling “generatePublisherEvidence” in the application configuration file or unchecking “Check for publisher’s certificate revocation”, as explained in KB944752 or KB947988.
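For the .NET side, KB944752's suggested fix is a configuration file change. As a sketch only: for sqllogship.exe this would be a sqllogship.exe.config file next to the executable (assuming you want to disable publisher evidence generation for just this application rather than machine-wide):

```xml
<configuration>
  <runtime>
    <!-- Skip the Authenticode/CRL check when loading managed assemblies -->
    <generatePublisherEvidence enabled="false"/>
  </runtime>
</configuration>
```

This only affects the publisher evidence path; applications that explicitly demand publisher evidence would still trigger the check.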

     

    Hope this helps

     

    Thanks,

    Murat

     

  • Where have those AFD driver related registry (DynamicBacklogGrowthDelta / EnableDynamicBacklog ...) keys gone?

    Hi there,

     

    In today’s blog post, I’m going to talk about some registry keys that were removed as of Windows 2008. Recently a colleague raised a customer question about configuring the following AFD related registry keys on Windows 2008:

     

    DynamicBacklogGrowthDelta
    EnableDynamicBacklog
    MaximumDynamicBacklog
    MinimumDynamicBacklog

     

    Actually, our customer was trying to implement the settings mentioned in How To: Harden the TCP/IP Stack, but none of our documentation for Vista/2008/Windows 7 and Windows 2008 R2 referred to such AFD related registry keys.

     

    A quick source code review revealed that those registry keys are no longer valid as of Windows 2008. They were mainly used to counter TCP SYN attacks at the Winsock layer on Windows 2003. Since SYN attack protection is built in on Windows Vista, 2008, 2008 R2 and Windows 7 (and cannot even be disabled - please see this blog post for more information on TCP SYN attack protection on Windows Vista/2008/2008 R2/7), it’s no longer necessary to deal with SYN attacks at the Winsock layer, and as a result the logic and the registry keys were removed from the AFD driver.
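In similar cases it can be handy to check which of the configured AFD values are actually obsolete. The helper below is hypothetical (the function name is mine, and it works on an already-read list of value names from HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters rather than touching the registry itself):

```python
# The four dynamic-backlog settings removed from the AFD driver as of
# Windows 2008, per the source review mentioned above.
OBSOLETE_AFD_VALUES = {
    "DynamicBacklogGrowthDelta",
    "EnableDynamicBacklog",
    "MaximumDynamicBacklog",
    "MinimumDynamicBacklog",
}

def ignored_afd_values(present_values):
    """Return the configured value names that have no effect on Vista/2008+."""
    return sorted(OBSOLETE_AFD_VALUES & set(present_values))

print(ignored_afd_values(["EnableDynamicBacklog", "MaximumDynamicBacklog",
                          "DefaultReceiveWindow"]))
# ['EnableDynamicBacklog', 'MaximumDynamicBacklog']
```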

     

    As an additional note, I also wouldn’t recommend implementing the EnablePMTUDiscovery registry key, which is also mentioned in How To: Harden the TCP/IP Stack, for the reasons covered in a previous blog post. Also, the SYN attack protection related registry keys mentioned in the same article don’t apply to Windows Vista onwards.

     

    Hope this helps

     

    Thanks,

    Murat

  • Why are local resources accessed slowly when loopback or local IP address is used whereas accessing the same resources over the network works fine?

    Hi there,

     

    In today’s blog post, I’m going to talk about a local resource access performance issue.

     

    One of our customers reported that SQL Server instances running on a Windows 2003 server were failing over to another node, especially when CPU load on the server was high for a short period (5 seconds or so). After some research by our SQL expert, it was determined that the failover occurred because the IsAlive() function implemented by SQL Server runs a “SELECT @@SERVERNAME” T-SQL query against the local SQL Server instance using the local IP address, and that query wasn’t returning in a timely manner.

     

    When the loopback interface was monitored during the problem time, it was also seen that the “Output Queue Length” counter for the MS TCP Loopback interface was dramatically increasing, then dropping down to 0, then dramatically increasing again.

    While local access to the SQL instance was that slow, the same SQL instance could be accessed without issues from remote clients. That behavior suggested that there was a problem with loopback access.

     

    RESOLUTION:

    ===========

    One of our senior escalation engineers (thanks to PhilJ) mentioned that loopback access (accessing 127.0.0.1 or any other local IP address) simply queues up work items which are later processed by a kernel worker thread running a function provided by the AFD driver. Normally the system can create up to 16 such dynamic worker threads, and those worker threads run at a priority level of 12. If the system had higher priority work and it wasn’t possible to create new worker threads, the problem might become visible as explained above.

     

    There’s a way to make more delayed worker threads available initially, which can be configured as follows:

     

    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Executive

    AdditionalDelayedWorkerThreads  (REG_DWORD)

     

    You can find more information at the following link:

    http://msdn.microsoft.com/en-us/library/ee377058(BTS.10).aspx Registry Settings that can be Modified to Improve Operating System Performance
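As a sketch only, the setting could be deployed as a .reg file like the one below; the value 0x10 (16 extra threads) is purely illustrative, not a recommendation:

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Executive]
; 16 additional delayed worker threads -- illustrative value only
"AdditionalDelayedWorkerThreads"=dword:00000010
```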

     

    Even though the AdditionalDelayedWorkerThreads registry value was set to a higher value than the default, the problem was still there. Finally, it was decided to get a kernel dump to better understand what was going wrong with those delayed worker threads that made local access so slow. Before going to the conclusion, it’s worth stating that this local resource access problem was not specific to SQL Server and could be experienced by any application locally accessing a resource (similar issues were reported by other customers, like slow performance when local applications access local web services, databases etc.)

     

    Finally, the kernel dump revealed the real reason why the delayed worker threads couldn’t catch up with the work items being posted:

     

    A 3rd party firewall driver was posting many delayed work items, and the delayed worker threads processing these work items were all trying to synchronize on the same executive resource (which was apparently a contention point). Hence the system was quickly hitting the ceiling for the number of worker threads, new items couldn’t be processed in a timely manner and local resource access became very slow. Here’s a sample call stack from one of those delayed worker threads for your reference:

     

    THREAD fffffaee4460aaf0  Cid 0004.0044  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (WrResource) KernelMode Non-Alertable

        fffffadc3e9dc0a0  SynchronizationEvent

        fffffadc44605ca8  NotificationTimer

    Not impersonating

    DeviceMap                 fffffa80000036d0

    Owning Process            fffffadc44622040       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      294068         Ticks: 6 (0:00:00:00.093)

    Context Switch Count      200681            

    UserTime                  00:00:00.000

    KernelTime                00:00:03.500

    Start Address nt!ExpWorkerThread (0xfffff800010039f0)

    Stack Init fffffadc1f1c1e00 Current fffffadc1f1c1950

    Base fffffadc1f1c2000 Limit fffffadc1f1bc000 Call 0

    Priority 13 BasePriority 12 PriorityDecrement 1

    Child-SP          RetAddr           : Args to Child                                                           : Call Site

    fffffadc`1f1c1990 fffff800`01027682 : fffffadc`1edcb910 fffffadc`1edeb180 00000000`0000000b fffffadc`1ed2b180 : nt!KiSwapContext+0x85

    fffffadc`1f1c1b10 fffff800`0102828e : 0000000a`b306fa71 fffff800`011b4dc0 fffffadc`44605c88 fffffadc`44605bf0 : nt!KiSwapThread+0x3c9

    fffffadc`1f1c1b70 fffff800`01047688 : 00000000`000000d4 fffffadc`0000001b fffffadc`1edeb100 fffffadc`1edeb100 : nt!KeWaitForSingleObject+0x5a6

    fffffadc`1f1c1bf0 fffff800`01047709 : 00000000`00000000 fffffadc`167a6c70 fffffadf`fbc44a00 fffff800`01024d4a : nt!ExpWaitForResource+0x48

    fffffadc`1f1c1c60 fffffadc`167a720b : fffffadc`3d41fc20 fffffadc`167a6c70 fffffadc`44605bf0 fffffadc`3d3deef8 : nt!ExAcquireResourceExclusiveLite+0x1ab

    fffffadc`1f1c1c90 fffffadc`167a6c87 : 00000000`00000001 fffffadf`fbc44a00 fffff800`011cda18 fffffadc`44605bf0 : XYZ+0x1220b

    fffffadc`1f1c1cd0 fffff800`010375ca : 00000000`00000000 fffffadc`1d6f7001 00000000`00000000 fffffadf`fbc44a00 : XYZ+0x11c87

    fffffadc`1f1c1d00 fffff800`0124a972 : fffffadc`44605bf0 00000000`00000080 fffffadc`44605bf0 fffffadc`1edf3680 : nt!ExpWorkerThread+0x13b

    fffffadc`1f1c1d70 fffff800`01020226 : fffffadc`1edeb180 fffffadc`44605bf0 fffffadc`1edf3680 00000000`00000000 : nt!PspSystemThreadStartup+0x3e

    fffffadc`1f1c1dd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16

     

    Note: The driver name was removed on purpose.
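The contention pattern from the dump can be modeled with a toy example (nothing below is the real AFD or firewall code): many worker threads all serialize on a single shared lock, so adding more workers doesn't make the work items complete any faster - they still pass through the contended resource one at a time, just like the worker threads blocked in ExpWaitForResource above.

```python
import threading

def run_pool(n_workers, n_items, shared_lock):
    """Drain a work queue with n_workers threads that all contend on one lock."""
    processed = []
    queue = list(range(n_items))
    queue_lock = threading.Lock()

    def worker():
        while True:
            with queue_lock:
                if not queue:
                    return
                item = queue.pop()
            with shared_lock:            # every worker funnels through here,
                processed.append(item)   # a stand-in for the contended resource

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

done = run_pool(n_workers=16, n_items=1000, shared_lock=threading.Lock())
print(len(done))  # 1000 -- all items processed, but strictly one at a time
```

With the lock held around the "work", throughput is bounded by the lock holder no matter how many worker threads exist, which is why raising AdditionalDelayedWorkerThreads didn't help in this case.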

     

    After the problem with the 3rd party firewall driver was addressed, the issue was resolved. Thanks also to Ignacio J, who drove this case from a technical perspective and shared the resolution details with us.

     

    Hope this helps in your similar problems.

     

    Thanks,

    Murat

     

  • Have you ever wanted to see which Windows process sends a certain packet out to network?

    Hi there,

    We are sometimes asked to find out which process sends a certain packet. This generally stems from security concerns (like seeing some TCP session attempts in firewall/IDS device logs etc.). Even though it may not always be possible to tell which process sent a certain packet, most of the time we can find the process name without any live debugging.

    So here are some methods to identify the process sending a certain packet out to network:

    1) Network Monitor (For any OS version)

    Network Monitor 3.x can show you the process name correctly most of the time (even though there could be some situations where you cannot identify the process).

     

    Note: You can download the latest Network Monitor version at the following link:
    http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=983b941d-06cb-4658-b7f6-3088333d062f

    2) Process Monitor (For any OS version)

    Process Monitor can also display the process name for TCP/UDP activity.


    Note: You can download Process Monitor tool at the following link:
    http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx 


    3) For Windows 7 / Windows 2008 R2

    You can use the new built-in ETL tracing available at the NDIS layer. All you need to do is start a new ETL packet capture session. This method doesn't even require you to install any sniffing software (Network Monitor/Wireshark etc.). You can use this option for general packet capturing on Windows 7/Windows 2008 R2 as well:

    netsh trace start capture=yes tracefile=c:\temp\capture1.etl

    <<Repro the behavior>>

    netsh trace stop

    Note: You'll need to open the created ETL file with Network Monitor 3.3/3.4



    4) For Windows XP/2003

    => You can use the following method to find out the process if it's a TCP or UDP packet

    A. To enable the kernel logger and have it generate a log file of TCP/IP activity, follow these steps:

    1. Run the Performance Tool, and select the Performance Logs And Alerts node.
    2. Select Trace Logs, and then select New Log Settings from the Action menu.
    3. When prompted, enter a name for the settings (for example, Microsoft Data).
    4. On the dialog box that opens, select the Events Logged By System Provider option and then deselect everything except the Network TCP/IP option.
    5. In the Run As edit box, enter the Administrator account name and set the password to match it.
    6. Dismiss the dialog box, and wait until the event is logged.
    7. Once the event is logged, select Stop from the Action menu in Perfmon.
    8. Open a command prompt, change to the C:\Perflogs directory (or the directory into which you specified that the trace log file be stored) and get the Perfmon trace file from there.

    B. Analyze the collected trace file:

    1. If you are running Windows XP or Windows Server 2003, run Tracerpt (located in the \Windows\System32 directory) and pass it the name of the trace log file. If you are running Windows 2000, download and run Tracedmp from the Windows 2000 Resource Kit. Both tools generate two files: dumpfile.csv and summary.txt.
    2. Rename dumpfile.csv to dumpfile.txt, start Microsoft Excel and choose File, Open. In the Open dialog box, locate the folder where dumpfile.txt is and change the file type box to text files; dumpfile.txt becomes visible, so double-click it. When the Text Import Wizard pops up, choose Delimited and click Next; on the next page check only Comma under Delimiters and click Next; then select all the columns in the Data preview box, change the data type to Text and click Finish.
    3. Rows starting with Event Name, EventTrace or SystemConfig contain information about the trace, hardware, running services etc.
    4. Rows starting with UdpIp or TcpIp describe the network activity and have the following format:

    UdpIp or TcpIp, Send or Recv, ?, ?, ?, ?, PID, transport payload length (bytes), Destination IP, Source IP, Destination port, Source port, ?, ?

    An example output:

    ...
           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.016.001.001, 172.016.057.015, 2762, 36003, 0, 0
           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.027.045.020, 172.016.057.015, 6004, 36009, 0, 0
           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.160.076.112, 172.016.057.015, 2344, 36016, 0, 0
           UdpIp,       Send, 0xFFFFFFFF,   129079416141424158,          0,          0,     2136,        8, 172.048.102.066, 172.016.057.015, 1116, 36022, 0, 0
    ...

    - For example, in the first line, host 172.016.057.015 is sending a UDP packet to 172.016.001.001. The packet length is 8 bytes, the source UDP port is 36003 and the destination UDP port is 2762. The process ID sending the UDP packet is 2136. And the tasklist output collected on the same machine was as follows:

    ...
    cqmghost.exe                  4972 Console                    0      4,324 K
    svchost.exe                   5024 Console                    0      1,524 K
    wmiprvse.exe                  5316 Console                    0      2,192 K
    wmiprvse.exe                  5468 Console                    0        976 K
    logon.scr                     6080 Console                    0        200 K
    mssearch.exe                  6028 Console                    0      5,828 K
    wmiprvse.exe                  5176 Console                    0      2,168 K
    mad.exe                       7176 Console                    0     45,792 K
    AntigenStore.exe             10092 Console                    0        200 K
    store.exe                     2136 Console                    0  1,040,592 K
    emsmta.exe                   12020 Console                    0     29,092 K
    AntigenIMC.exe                1308 Console                    0        200 K
    svchost.exe                   6516 Console                    0      1,992 K
    ...

    So the process that was sending the UDP packets was store.exe (process ID: 2136)
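Steps B.2-B.4 can also be done without Excel. A hedged sketch, assuming the dumpfile.csv column layout shown above and a PID-to-name table built from the tasklist output (the function name is mine):

```python
# Parse TcpIp/UdpIp Send rows from the Tracerpt dumpfile and map the PID
# column (index 6) to a process name taken from `tasklist` output.
import csv
import io

def sends_by_pid(dump_csv, pid_to_name):
    events = []
    for row in csv.reader(io.StringIO(dump_csv)):
        cells = [c.strip() for c in row]
        if cells and cells[0] in ("UdpIp", "TcpIp") and cells[1] == "Send":
            pid = int(cells[6])
            dest_ip, src_ip = cells[8], cells[9]        # dest before source
            dest_port, src_port = cells[10], cells[11]
            events.append((pid_to_name.get(pid, f"pid {pid}"),
                           src_ip, src_port, dest_ip, dest_port))
    return events

# First dumpfile row from the example output above
dump = ("UdpIp, Send, 0xFFFFFFFF, 129079416141424158, 0, 0, 2136, 8,"
        " 172.016.001.001, 172.016.057.015, 2762, 36003, 0, 0")
print(sends_by_pid(dump, {2136: "store.exe"}))
# [('store.exe', '172.016.057.015', '36003', '172.016.001.001', '2762')]
```

The pid_to_name dictionary would normally be built by parsing the same tasklist output shown above, collected at the time of the trace (PIDs are reused, so a stale tasklist can mislead you).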


    5) Special scenarios

    In some special scenarios, you may not be able to find the process by using the above methods. For example, in a couple of cases I was asked to find out why a certain client was sending so many ARP requests back to back (they were causing alarms to be generated by IDS devices, and the LAN switch was disabling the switch port after a short while), and none of the above methods helped. Then I decided to take a different approach: let the client do whatever it likes by loosening the existing security settings (like changing the switch configuration so that it doesn't disable the port). That way we had a chance to see what the client was doing after those ARP requests. A similar problem was reported by another customer: they realized that their DCs were trying to get in touch with clients using different ports/protocols like TCP 445/139, ICMP, Kerberos etc. Again we let the DC do whatever it was trying to do, so that we were able to see the activity and hence the process/reason behind it. I'm going to talk about this in another blog post.


    Hope this helps

    Thanks,
    Murat