• Why are local resources accessed slowly when a loopback or local IP address is used, whereas accessing the same resources over the network works fine?

    Hi there,

     

    In today’s blog post, I’m going to talk about a local resource access performance issue.

     

    One of our customers reported that SQL Server instances running on a Windows Server 2003 cluster were failing over to another node, especially when CPU load on the server was high for a short period (around 5 seconds or so). After some research by our SQL expert, it was determined that the failover occurred because the IsAlive() function implemented by SQL Server runs a “SELECT @@SERVERNAME” T-SQL query against the local SQL Server instance using the local IP address, and that query wasn’t returning in a timely manner.

     

    When the loopback interface was monitored during the problem period, it was also seen that the “Output Queue Length” counter for the MS TCP Loopback interface was increasing dramatically, dropping down to 0, and then increasing dramatically again in a sawtooth pattern.

    While access to the local SQL instance was this slow, the same instance could be accessed without issues from remote clients. That behavior suggested a problem specific to loopback access.

     

    RESOLUTION:

    ===========

    One of our senior escalation engineers (thanks to PhilJ) mentioned that loopback access (accessing 127.0.0.1 or any other local IP address) simply queues up work items to be processed later by a function in the AFD driver. The work items queued that way are processed by kernel worker threads running a function provided by the AFD driver. Normally the system can create up to 16 such dynamic worker threads, and they run at a priority level of 12. If the system had higher-priority work and it wasn’t possible to create new worker threads, the problem could manifest as described above.
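    To illustrate the pattern (a purely illustrative user-mode simulation, not the actual AFD/kernel code): when a bounded pool of worker threads all serialize on one shared resource, extra threads add no throughput and queued work items back up:

```python
import threading, queue, time

MAX_WORKERS = 16          # the kernel caps dynamic delayed worker threads at 16
work_queue = queue.Queue()
contended_lock = threading.Lock()   # stands in for the contended executive resource
processed = []

def worker():
    while True:
        item = work_queue.get()
        if item is None:
            break
        with contended_lock:        # every worker funnels through one resource
            time.sleep(0.01)        # simulated work held under the lock
            processed.append(item)
        work_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(MAX_WORKERS)]
for t in threads:
    t.start()

# Post 50 work items. Because the workers serialize on the lock, total time is
# ~50 * 0.01 s regardless of the 16 threads -- the pool size never helps.
start = time.time()
for i in range(50):
    work_queue.put(i)
work_queue.join()
elapsed = time.time() - start

for _ in threads:
    work_queue.put(None)            # shut the workers down
for t in threads:
    t.join()

print(f"processed {len(processed)} items in {elapsed:.2f}s")
```

    In the real case the contended resource was inside a third-party driver, but the effect is the same: the queue drains at the speed of one thread.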

     

    There’s a way to make more delayed worker threads available initially, which can be configured as follows:

     

    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Executive

    AdditionalDelayedWorkerThreads  (REG_DWORD)

     

    You can find more information at the following link:

    Registry Settings that can be Modified to Improve Operating System Performance: http://msdn.microsoft.com/en-us/library/ee377058(BTS.10).aspx
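    For example, a .reg file to apply the setting could look like the sketch below. The value 0x10 (16 additional threads) is purely illustrative; pick a value appropriate for your workload after testing, and note that a reboot is typically required for Session Manager settings to take effect.

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Executive]
; Illustrative value only (0x10 = 16 additional delayed worker threads)
"AdditionalDelayedWorkerThreads"=dword:00000010
```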

     

    Even though the AdditionalDelayedWorkerThreads registry key was set to a higher value than the default, the problem persisted. Finally, it was decided to get a kernel dump to better understand what was going wrong with those delayed worker threads such that local access was so slow. Before getting to the conclusion, it’s worth noting that this local resource access problem was not specific to SQL Server and could be experienced by any application accessing a local resource (other customers reported similar issues, such as slow performance when local applications accessed local web services, databases, etc.).

     

    Finally, a kernel dump revealed the real reason the delayed worker threads couldn’t keep up with the work items being posted:

     

    A 3rd party firewall driver was posting many delayed work items, and the delayed worker threads processing those work items were all trying to synchronize on the same executive resource (which was apparently a contention point). As a result, the system quickly hit the ceiling on the number of worker threads, new items couldn’t be processed in a timely manner, and local resource access became very slow. Here is a sample call stack from one of those delayed worker threads for reference:

     

    THREAD fffffaee4460aaf0  Cid 0004.0044  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (WrResource) KernelMode Non-Alertable

        fffffadc3e9dc0a0  SynchronizationEvent

        fffffadc44605ca8  NotificationTimer

    Not impersonating

    DeviceMap                 fffffa80000036d0

    Owning Process            fffffadc44622040       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      294068         Ticks: 6 (0:00:00:00.093)

    Context Switch Count      200681            

    UserTime                  00:00:00.000

    KernelTime                00:00:03.500

    Start Address nt!ExpWorkerThread (0xfffff800010039f0)

    Stack Init fffffadc1f1c1e00 Current fffffadc1f1c1950

    Base fffffadc1f1c2000 Limit fffffadc1f1bc000 Call 0

    Priority 13 BasePriority 12 PriorityDecrement 1

    Child-SP          RetAddr           : Args to Child                                                           : Call Site

    fffffadc`1f1c1990 fffff800`01027682 : fffffadc`1edcb910 fffffadc`1edeb180 00000000`0000000b fffffadc`1ed2b180 : nt!KiSwapContext+0x85

    fffffadc`1f1c1b10 fffff800`0102828e : 0000000a`b306fa71 fffff800`011b4dc0 fffffadc`44605c88 fffffadc`44605bf0 : nt!KiSwapThread+0x3c9

    fffffadc`1f1c1b70 fffff800`01047688 : 00000000`000000d4 fffffadc`0000001b fffffadc`1edeb100 fffffadc`1edeb100 : nt!KeWaitForSingleObject+0x5a6

    fffffadc`1f1c1bf0 fffff800`01047709 : 00000000`00000000 fffffadc`167a6c70 fffffadf`fbc44a00 fffff800`01024d4a : nt!ExpWaitForResource+0x48

    fffffadc`1f1c1c60 fffffadc`167a720b : fffffadc`3d41fc20 fffffadc`167a6c70 fffffadc`44605bf0 fffffadc`3d3deef8 : nt!ExAcquireResourceExclusiveLite+0x1ab

    fffffadc`1f1c1c90 fffffadc`167a6c87 : 00000000`00000001 fffffadf`fbc44a00 fffff800`011cda18 fffffadc`44605bf0 : XYZ+0x1220b

    fffffadc`1f1c1cd0 fffff800`010375ca : 00000000`00000000 fffffadc`1d6f7001 00000000`00000000 fffffadf`fbc44a00 : XYZ+0x11c87

    fffffadc`1f1c1d00 fffff800`0124a972 : fffffadc`44605bf0 00000000`00000080 fffffadc`44605bf0 fffffadc`1edf3680 : nt!ExpWorkerThread+0x13b

    fffffadc`1f1c1d70 fffff800`01020226 : fffffadc`1edeb180 fffffadc`44605bf0 fffffadc`1edf3680 00000000`00000000 : nt!PspSystemThreadStartup+0x3e

    fffffadc`1f1c1dd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16

     

    Note: The driver name was removed on purpose.

     

    After the problem with the 3rd party firewall driver was addressed, the issue was resolved. Also thanks to Ignacio J, who drove this case from a technical perspective and shared the resolution details with us.

     

    Hope this helps with similar problems.

     

    Thanks,

    Murat

     

  • Effects of incorrect QoS policies: A story behind a slow file copy...

    Hi there,

     

    In this blog post, I’ll talk about another network trace analysis scenario.

     

    The problem was that some Windows XP clients were copying files from a NAS device very slowly compared to others. Since a network trace is one of the most useful logs for troubleshooting such problems, I requested one to be collected on a problem Windows XP client. Normally it’s best to collect simultaneous network traces, but it was a bit difficult to collect a trace on the NAS device side, so we were limited to a client-side trace.

     

    Before I explain how I got to the bottom of the issue, I’d like to provide some background on how Windows reads files via the SMB protocol, so the resolution part is easier to follow:

     

    Windows XP and Windows 2003 use the SMB v1 protocol for remote file system access (creating/reading/writing/deleting/locking files over a network connection). Since this scenario involved reading a file from the remote server, the following SMB activity would be seen between the client and server:

     

    Client                                      Server

    =====                                     ======

    The client will open the file at the server first:

     

    SMB Create AndX request ---->

                                              <---- SMB Create AndX response

     

    Then the client will send SMB Read AndX requests to retrieve blocks of the file:

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

     

    SMB Read AndX request   ----> (61440 bytes)

                                              <---- SMB Read AndX response

    ...

     

    Note: The SMB v1 protocol can request at most 61 KB (61440 bytes) of data in one SMB Read AndX request.
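    Given the 61440-byte ceiling above, a quick sketch shows how many Read AndX round trips a file read takes:

```python
import math

SMB1_MAX_READ = 61440  # max bytes per SMB v1 Read AndX request (see note above)

def read_requests_needed(file_size_bytes: int) -> int:
    """Number of Read AndX requests a client issues to read a whole file."""
    return math.ceil(file_size_bytes / SMB1_MAX_READ)

# A 64 MB file takes roughly a thousand request/response round trips:
print(read_requests_needed(64 * 1024 * 1024))   # -> 1093
```

    Every one of those round trips pays the network's latency, which is why per-request delays matter so much for SMB v1 copies.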

     

     

    After this short overview, let’s get back to the original problem and analyze the packets taken from the real network trace:

     

    Frame#  Time delta    Source IP             Destination IP     Protocol        Information

    =====     ========     =========           ==========         ======            ========

    ....

    59269    0.000000              10.1.1.1                10.1.1.2                SMB       Read AndX Request, FID: 0x0494, 61440 bytes at offset 263823360

    59270    0.000000              10.1.1.2                10.1.1.1                SMB       Read AndX Response, 61440 bytes

    59271    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65993793

    59272    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65995249

    59273    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=65996705

    ...

    59320    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66049121

    59321    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66050577

    59322    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59270] microsoft-ds > foliocorp [ACK] Seq=66052033

    59323    0.000000              10.1.1.1                10.1.1.2                TCP        foliocorp > microsoft-ds [ACK] Seq=67600 Ack=66053489 Win=65535

    59325    0.406250              10.1.1.2                10.1.1.1                TCP       [Continuation to #59270] microsoft-ds > foliocorp [PSH, ACK] Seq=66053489

     

    59326    0.000000              10.1.1.1                10.1.1.2                SMB       Read AndX Request, FID: 0x0494, 61440 bytes at offset 263884800

    59327    0.000000              10.1.1.2                10.1.1.1                SMB       Read AndX Response, 61440 bytes

    59328    0.000000              10.1.1.2                10.1.1.1                TCP        [Continuation to #59327] microsoft-ds > foliocorp [ACK] Seq=66055297

    ...

     

    Now let’s take a closer look at some related frames:

     

    Frame #59269 => The client requests the next 61 KB of data at offset 263823360 from the file represented by FID 0x0494 (this FID is assigned by the server when the file is first opened/created).

     

    Frame #59270 => The server starts sending 61440 bytes of data back to the client in an SMB Read AndX response.

     

    Frame #59271 => The remaining data is sent in chunks sized by the negotiated TCP MSS, typically 1460 bytes (see also frames #59272, #59273, etc.).

     

    The most noticeable thing in the network trace was the many 0.4-second delays like the one at frame #59325. Those delays were always present at the last fragment of each 61 KB block of data returned by the server.

     

    Normally 0.4 seconds might seem like a small delay, but considering that the client sends one SMB Read AndX request per 61 KB block of the file, it quickly becomes clear that 0.4 seconds is huge (for example, the client needs to send roughly a thousand SMB Read AndX requests to read a 64 MB file).
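    A back-of-the-envelope sketch (illustrative numbers taken from the trace above) of what that per-block stall costs:

```python
BLOCK = 61440            # bytes per SMB Read AndX response
STALL = 0.4              # observed delay before each block's last fragment (s)

file_size = 64 * 1024 * 1024            # the 64 MB example from the text
blocks = -(-file_size // BLOCK)         # ceiling division -> number of requests
added_delay = blocks * STALL            # stall time alone, ignoring transfer time

# Even with zero transmission time, the stalls cap effective throughput:
max_throughput_kb_s = BLOCK / STALL / 1024

print(f"{blocks} requests, {added_delay:.0f}s of pure stall, "
      f"<= {max_throughput_kb_s:.0f} KB/s")
```

    So a shaped link that stalls every block for 0.4 s caps the copy at roughly 150 KB/s and adds over 7 minutes to a 64 MB transfer, which matches the "very slow" copies the clients were reporting.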

     

    Generally we’re used to seeing some delays in network traces due to packet retransmissions (caused by packet loss), link transfer delays, etc. But seeing a constant 0.4-second delay on every last fragment of a 61 KB block made me suspect that a QoS implementation was in place somewhere between the client and server. By delaying every read by about 0.4 seconds, the file copy was effectively being slowed down on purpose: traffic shaping/limiting.

     

    Since we didn’t have a network trace collected on the NAS device side, we couldn’t tell whether the QoS policy was in effect on the NAS device or on a network device in between (we had checked the client side, and there was no QoS configuration in place). After further checking the network devices, it turned out that there was an incorrectly configured QoS policy on one of them. After making the required changes, the problem was resolved.

     

    Hope this helps

     

    Thanks,

    Murat

  • Where have those AFD driver related registry (DynamicBacklogGrowthDelta / EnableDynamicBacklog ...) keys gone?

    Hi there,

     

    In today’s blog post, I’m going to talk about some registry keys that were removed as of Windows Server 2008. Recently a colleague raised a customer question about configuring the following AFD-related registry keys on Windows Server 2008:

     

    DynamicBacklogGrowthDelta
    EnableDynamicBacklog
    MaximumDynamicBacklog
    MinimumDynamicBacklog

     

    Our customer was actually trying to implement the settings mentioned in “How To: Harden the TCP/IP Stack”. But none of our documentation for Windows Vista, 2008, Windows 7, or Windows 2008 R2 referred to such AFD-related registry keys.

     

    A quick source code review revealed that those registry keys are no longer valid as of Windows Server 2008. They were mainly used to counter TCP SYN attacks at the Winsock layer on Windows 2003. Since SYN attack protection is built in on Windows Vista, 2008, 2008 R2, and Windows 7 (and can’t even be disabled - please see this blog post for more information on TCP SYN attack protection on Windows Vista/2008/2008 R2/7), it’s no longer necessary to deal with SYN attacks at the Winsock layer, and as a result the logic and the registry keys were removed from the AFD driver.

     

    As an additional note, I also wouldn’t recommend implementing the EnablePMTUDiscovery registry key mentioned in the same “How To: Harden the TCP/IP Stack” document, for the reasons explained in a previous blog post. The SYN attack protection related registry keys mentioned in that article also don’t apply to Windows Vista onwards.

     

    Hope this helps

     

    Thanks,

    Murat

  • Does sqllogship.exe have anything to do with web servers on the Internet? The story behind CRL checks for certificates ...

    Hi there,

     

    In today’s blog post I’m going to talk about a network trace analysis scenario where I was asked to analyze a few network traces to understand why a server was trying to contact external web servers whenever sqllogship.exe was run on it.

     

    Our customer’s security team noticed http connection attempts coming from internal SQL servers that weren’t supposed to be connecting to any external servers. The only thing those servers were running was something like this:

     

    "C:\Program Files\Microsoft SQL Server\90\Tools\Binn\sqllogship.exe" -Backup 1B55E77D-A000-1EE8-9780-441096E2151 -server PRODDB

     

    And on every attempt, blocked http connections were seen on the firewall. Since we didn’t know what the server would do after establishing such an HTTP connection to an external network, we couldn’t comment much on it. I asked our customer to let the firewall allow such an http connection so that we could get more information after the connection was established; this is a method (method 5) I mentioned in one of my earlier posts.

     

    After our customer made that change and re-collected a network trace on the SQL server, it became much clearer why the SQL server was attempting to connect to a remote web server: to verify whether its certificates had been revoked, by downloading the CRL (certificate revocation list):

     

     

    => The SQL server first resolves the IP address for the name crl.microsoft.com:

     

    No.     Time                       Source                Destination           Protocol Info

    23519    2010-06-26 09:23:14.560786        10.11.1.11           10.1.1.1         DNS       Standard query A crl.microsoft.com

    23520    2010-06-26 09:23:14.561000        10.1.1.1                    10.11.1.11     DNS       Standard query response CNAME crl.www.ms.akadns.net

    |-> crl.microsoft.com: type CNAME, class IN, cname crl.www.ms.akadns.net

    |-> crl.www.ms.akadns.net: type CNAME, class IN, cname a1363.g.akamai.net

    |-> a1363.g.akamai.net: type A, class IN, addr 193.45.15.18

    |-> a1363.g.akamai.net: type A, class IN, addr 193.45.15.50

     

    => The SQL server establishes a TCP session to port 80 on the remote web server at 193.45.15.50:

     

    No.     Time                       Source                Destination           Protocol Info

      69679 2010-06-26 09:24:37.466403          10.11.1.11            193.45.15.50                       TCP      2316 > 80 [SYN] Seq=0 Win=65535 Len=0 MSS=1460

      69697 2010-06-26 09:24:37.554390          193.45.15.50   10.11.1.11                TCP      80 > 2316 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460

      69698 2010-06-26 09:24:37.554407          10.11.1.11            193.45.15.50                       TCP      2316 > 80 [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

     

    => After the TCP 3-way handshake, the SQL server sends an HTTP GET request to the web server to retrieve the CSPCA.crl file:

     

    No.     Time                       Source                Destination           Protocol Info

      69699 2010-06-26 09:24:37.554603          10.11.1.11            193.45.15.50       HTTP     GET /pki/crl/products/CSPCA.crl HTTP/1.1

        |-> GET /pki/crl/products/CSPCA.crl HTTP/1.1\r\n

        |-> User-Agent: Microsoft-CryptoAPI/5.131.3790.3959\r\n

        |-> Host: crl.microsoft.com\r\n

     

      69729 2010-06-26 09:24:37.642219          193.45.15.50   10.11.1.11                TCP      80 > 2316 [ACK] Seq=1 Ack=199 Win=6432 Len=0

      69731 2010-06-26 09:24:37.645483          193.45.15.50   10.11.1.11                PKIX-CRL Certificate Revocation List

        |-> HTTP/1.1 200 OK\r\n

    ...

        |-> Certificate Revocation List

        |-> signedCertificateList

        |-> algorithmIdentifier (shaWithRSAEncryption)

     

     

    Note: It looks like this is done for the following reason (taken from http://support.microsoft.com/kb/944752):

    “When the Microsoft .NET Framework 2.0 loads a managed assembly, the managed assembly calls the CryptoAPI function to verify the Authenticode signature on the assembly files to generate publisher evidence for the managed assembly.”

     

     

    => Similarly the server sends another HTTP GET request to retrieve CodeSignPCA.crl:

     

    No.     Time                       Source                Destination           Protocol Info

      77631 2010-06-26 09:24:52.642968          10.11.1.11            193.45.15.50                       HTTP     GET /pki/crl/products/CodeSignPCA.crl HTTP/1.1

      77747 2010-06-26 09:24:52.733106          193.45.15.50   10.11.1.11                PKIX-CRL Certificate Revocation List

      78168 2010-06-26 09:24:53.011176          10.11.1.11            193.45.15.50                       TCP      2316 > 80 [ACK] Seq=403 Ack=1961 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

    ...

     

     

    Note: Again, it looks like this is done for the following reason (taken from http://support.microsoft.com/kb/947988, “You cannot install SQL Server 2005 Service Pack 1 on a SQL Server 2005 failover cluster if the failover cluster is behind a firewall”):

     

    “When the Microsoft .NET Framework starts SSIS, the .NET Framework calls the CryptoAPI function. This function determines whether the certificates that are signed to the SQL Server assembly files are revoked. The CryptoAPI function requires an Internet connection to check the following CRLs for these certificates:

    http://crl.microsoft.com/pki/crl/products/CodeSignPCA.crl

    http://crl.microsoft.com/pki/crl/products/CodeSignPCA2.crl”

     

    There appear to be a number of ways to prevent such CRL checks, such as disabling “generatePublisherEvidence” or unchecking “Check for publisher’s certificate revocation”, as explained in KB944752 and KB947988.
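    The generatePublisherEvidence option, for example, is set in the application’s .exe.config file as described in KB944752. A minimal sketch (the file name here assumes it sits next to sqllogship.exe):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- e.g. sqllogship.exe.config, placed in the same folder as sqllogship.exe -->
<configuration>
  <runtime>
    <!-- Skip Authenticode publisher-evidence generation (and its CRL lookup)
         when the .NET Framework loads this assembly; see KB944752. -->
    <generatePublisherEvidence enabled="false"/>
  </runtime>
</configuration>
```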

     

    Hope this helps

     

    Thanks,

    Murat

     

  • Where are my packets? Analyzing a packet drop issue...

    One of the most common causes of network connectivity or performance problems is packet drop. In this blog post, I’ll talk about analyzing a packet drop issue; please read on.

     

    One of our customers was complaining about remote SCCM agent policy updates, and a network packet drop issue was suspected. We were then brought in to analyze the problem from a networking perspective. Generally such problems can stem from the following points:

     

    a) A problem on the source client (generally at NDIS layer or below). For additional information please see a previous blog post.

     

    b) A problem stemming from network itself (links, firewalls, routers, proxy devices, encryption devices, switches etc). This is the most common problem point in such cases.

     

    c) A problem on the target server (generally at NDIS layer or below). For additional information please see a previous blog post.

     

    In packet drop issues, the most important logs are simultaneous network traces collected on the source and target systems.
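    Conceptually, comparing two simultaneous captures to find dropped packets reduces to a set difference over the segments of the flow: a client-to-server segment present in the client trace but absent from the server trace was lost somewhere in between. A simplified sketch, using hypothetical (sequence number, length) pairs rather than real trace data:

```python
# Segments of one TCP flow as (seq, len) pairs -- hypothetical example values.
client_sent = {(1, 281), (282, 1460), (1742, 586)}   # seen in the client trace
server_seen = {(1, 281), (1742, 586)}                # same flow, server trace

# Anything the client sent that the server never saw was dropped in transit.
dropped = client_sent - server_seen
print(sorted(dropped))   # -> [(282, 1460)]
```

    Real trace comparison tools do essentially this, keyed on sequence numbers (and IP IDs) after aligning the two captures on a common packet.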

     

    You’ll find below more details about how we got to the bottom of the problem:

     

    NETWORK TRACE ANALYSIS:

    =======================

    For a successful network trace analysis, you need to be familiar with the technology you’re troubleshooting. At a minimum, you should have some prior knowledge of the network activity the related action would generate (for this example, how the SCCM agent retrieves policies from the SCCM server). After a short discussion with an SCCM colleague of mine, I learned that the SCCM agent sends an HTTP POST request to the SCCM server to retrieve the policies. I analyzed the network traces in light of this fact:

     

    How the problematic session looks in the SCCM agent side trace (10.1.1.1):

     

    No.     Time        Source                Destination           Protocol Info

      1917 104.968750  10.1.1.1        172.16.1.1            TCP      rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460

       1918 105.000000  172.16.1.1      10.1.1.1              TCP      http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

       1919 105.000000  10.1.1.1        172.16.1.1            TCP      rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

       1920 105.000000  10.1.1.1        172.16.1.1            HTTP     CCM_POST /ccm_system/request HTTP/1.1

       1921 105.000000  10.1.1.1        172.16.1.1            HTTP     Continuation or non-HTTP traffic

       1922 105.000000  10.1.1.1        172.16.1.1            HTTP     Continuation or non-HTTP traffic

       1925 105.140625  172.16.1.1      10.1.1.1              TCP      http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 Len=0 SLE=1742 SRE=2328

       1975 107.750000  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       2071 112.890625  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       2264 123.281250  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       2651 144.171875  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       3475 185.843750  10.1.1.1        172.16.1.1            HTTP     [TCP Retransmission] Continuation or non-HTTP traffic

       4392 234.937500  172.16.1.1      10.1.1.1              TCP      http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0

     

    How the problematic session looks in the SCCM server side trace (172.16.1.1):

     

    No.     Time        Source                Destination           Protocol Info

      13587 100.765625  10.1.1.1        172.16.1.1            TCP      rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460

      13588 0.000000    172.16.1.1            10.1.1.1        TCP      http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

      13591 0.031250    10.1.1.1        172.16.1.1            TCP      rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 Len=0

      13598 0.031250    10.1.1.1        172.16.1.1            HTTP     CCM_POST /ccm_system/request HTTP/1.1

      13611 0.078125    10.1.1.1        172.16.1.1            HTTP     [TCP Previous segment lost] Continuation or non-HTTP traffic

      13612 0.000000    172.16.1.1            10.1.1.1        TCP      http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 [TCP CHECKSUM INCORRECT] Len=0 SLE=1742 SRE=2328

      30509 129.812500  172.16.1.1            10.1.1.1        TCP      http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0

     

     

    Explanation of the color coding:

     

    a) GREEN packets are the ones we see in both the client and server side traces. In other words, those packets were either sent by the client and successfully received by the server, or sent by the server and successfully received by the client.

     

    b) RED packets are the ones that were sent by the client but never received by the server. In more detail:

    - Frame #1921 in the client side trace (a continuation packet of the HTTP POST request) is not visible in the server side trace.

    - Frames #1975, #2071, #2264, #2651 and #3475 are retransmissions of frame #1921. We don’t see any of those 5 retransmissions of the same original TCP segment in the server side trace either, which means most likely all of them were lost on the way to the server, although there’s still a minor chance that they were physically received by the server but dropped by an NDIS-level driver (see this post for more details).
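    The retransmission timing in the client side trace is itself informative: the gaps between frames #1921, #1975, #2071, #2264, #2651 and #3475 roughly double each time, which is the classic TCP retransmission timeout (RTO) exponential backoff and confirms the client never saw an ACK for that segment. A quick sketch using the timestamps from the trace above:

```python
# Capture timestamps (s) of the original segment and its 5 retransmissions,
# taken from the client-side trace (frames #1921, 1975, 2071, 2264, 2651, 3475).
times = [105.000000, 107.750000, 112.890625, 123.281250, 144.171875, 185.843750]

intervals = [b - a for a, b in zip(times, times[1:])]
# -> [2.75, 5.140625, 10.390625, 20.890625, 41.671875]

# Each retransmission interval is roughly double the previous one: RTO backoff.
ratios = [b / a for a, b in zip(intervals, intervals[1:])]
print([round(r, 2) for r in ratios])
```

    Seeing this clean doubling pattern (rather than random delays) is a strong hint that the segment is being silently dropped somewhere, not merely delayed.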

     

    The server side forcefully closes the session after approximately 2 minutes (client side frame #4392 / server side frame #30509).

     

    RESULTS:

    ========

    1) The simultaneous network trace analysis clearly showed that this was a packet drop issue.

     

    2) The problem most likely stems from one of these points:

     

    a) An NDIS layer or below problem at the SCCM agent side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)

    b) A network related problem (link or network device problem like router/switch/firewall/proxy issue)

    c) An NDIS layer or below problem at the SCCM server side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)

     

    3) In such scenarios we may consider doing the following at the client and server side:

     

    a) Updating NIC/teaming drivers

    b) Updating 3rd party filter drivers (AV/Firewall/Host IDS software/other kind of filtering devices) or temporarily uninstalling them for the duration of troubleshooting

     

    4) Most of the time, such problems stem from the network itself, and you may consider taking the following actions:

     

    a) Checking cabling/switch port

    b) Checking port speed/duplex settings (you may try manually setting 100 Mb/full duplex, for example)

    c) Enabling logging on routers/switches/firewalls/proxies or similar devices, and checking interface statistics or the problematic sessions for packet drops

    d) Alternatively, you may consider collecting network traces at more than two points (not just the sending and receiving ends). For example, four traces could be collected: at the source, at the target, and at two intermediate points, so that you can follow the session and isolate the problem to a certain part of your network.

     

    For your information, the specific problem described in this blog post stemmed from a router running between the SCCM agent and the server.

     

    Hope this helps

     

    Thanks,

    Murat