• Getting started with Storage Replica in Windows Server Technical Preview

    Storage Replica (SR) is a new feature that enables storage-agnostic, block-level, synchronous replication between servers for disaster recovery, as well as stretching of a failover cluster for high availability. Synchronous replication enables mirroring of data in physical sites with crash-consistent volumes ensuring zero data loss at the file system level. Asynchronous replication allows site extension beyond metropolitan ranges with the possibility of data loss.

    Ned Pyle, the Product Manager for Storage Replica, has written a great “getting started” guide here:

    http://social.technet.microsoft.com/Forums/windowsserver/en-US/f843291f-6dd8-4a78-be17-ef92262c158d/getting-started-with-windows-volume-replication?forum=WinServerPreview

    I got mine going after adding the Windows Storage Replication feature in Server Manager:

    image

    It’s configured in Failover Clustering:

    image

    I’m working with a customer who is really excited that in-box volume replication has come to Windows Server. It’s going to be interesting to discover best practices and ideal use cases for Storage Replica as we get closer to the final release.
  • Migrating DFS Namespace from Windows 2000 Server mode to Windows Server 2008 mode

    Hi,
     
    I recently helped a customer with this tricky little exercise.
    The idea was to do the upgrade during office hours with as little downtime as possible and to run it remotely from one server.
     
    It’s following the basic guide here: http://technet.microsoft.com/en-us/library/cc753875.aspx
     
    But this formal guide wasn’t very “real world”. It forgets that there are clients out there with cached referrals who take a long time to realise that the namespace links have changed.
     
    In my example, I use these names, which you’ll need to change:

    · The existing 2003 DFS server is FILE1. All the commands below are run on this server in C:\temp.

    · The first new 2012 R2 DFS server is FILE2

    · The domain is child.corp.contoso.com

    · The DFS Namespace in the domain is called “Testing”, so the path is \\child.corp.contoso.com\Testing

    · All DFS links point to 3 target servers: TARGET1, TARGET2 and TARGET3

     
    First, start by setting the existing 2003 DFS servers to issue FQDNs in their DFS referrals. This was a requirement of the customer as they wanted other forests to access this namespace, and wanted to be as efficient as possible.
    Note this requires restarting DFS-N:
     

    dfsutil server registry DfsDnsConfig set \\FILE1
    sc \\FILE1 stop DFS
    ping -n 5 127.0.0.1 > NUL
    sc \\FILE1 start DFS
    Run this for each of the 3 existing DFS servers.

     
    Next we copy the DFS root folder and share (including all security) to the new DFS servers:
     

    robocopy C:\DFSRoots\ \\FILE2\c$\DFSRoots\ Testing /COPYALL /E /XJ
    reg copy \\FILE1\HKLM\System\CurrentControlSet\Services\LanManServer\Shares \\FILE2\HKLM\System\CurrentControlSet\Services\LanManServer\Shares /s /f

     
    Run these commands for each of the new DFS-N servers.
     
    We then set the new DFS servers to use FQDNs in their referrals:
     

    dfsutil server registry DfsDnsConfig set \\FILE2

     
    We need to restart the Server service so it can start sharing the shares we copied just before. This will also restart the DFS-N service for us, so the FQDN change will work:
     

    echo net stop LanManServer /yes > \\file2\c$\Temp\restartsvc.bat
    echo net start LanManServer >> \\file2\c$\Temp\restartsvc.bat
    echo net start DFS >> \\FILE2\c$\Temp\restartsvc.bat
    psexec \\FILE2 -d -accepteula c:\temp\restartsvc.bat
    Echo Waiting for 30 seconds for the Server service on FILE2 to restart
    ping -n 30 127.0.0.1 > NUL

     
    PsExec is needed to run the batch file locally on FILE2: once the Server service is stopped, it can’t be started remotely, because the Server service itself is what accepts the remote commands.
     
    Next, we need to export the existing DFS-N configuration and change all the short-name paths to FQDNs:
     

    dfsutil root export \\child.corp.contoso.com\Testing C:\temp\export.xml
    REM Get FNR.exe from here: http://findandreplace.codeplex.com/
    fnr.exe --cl --dir "C:\temp" --fileMask "export.xml" --find "\\\\TARGET1\\" --replace "\\\\TARGET1.child.corp.contoso.com\\" --silent
    fnr.exe --cl --dir "C:\temp" --fileMask "export.xml" --find "\\\\TARGET2\\" --replace "\\\\TARGET2.child.corp.contoso.com\\" --silent
    fnr.exe --cl --dir "C:\temp" --fileMask "export.xml" --find "\\\\TARGET3\\" --replace "\\\\TARGET3.child.corp.contoso.com\\" --silent
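
    If FNR is not available, the same short-name to FQDN substitution can be sketched in a few lines of Python. This is only an illustration: the function name to_fqdn and the sample line are made up for the example, and the server and domain names are the example values from this post.

```python
import re

# Example values from this post; change them for your environment.
DOMAIN = "child.corp.contoso.com"
TARGETS = ["TARGET1", "TARGET2", "TARGET3"]

def to_fqdn(xml_text, targets=TARGETS, domain=DOMAIN):
    """Rewrite \\\\NAME\\ prefixes to \\\\NAME.domain\\ (hypothetical helper)."""
    for name in targets:
        # Require a backslash straight after the name, so paths that are
        # already fully qualified (\\NAME.domain\) are left untouched.
        pattern = re.compile(r"\\\\" + re.escape(name) + r"\\", re.IGNORECASE)
        replacement = "\\\\" + name + "." + domain + "\\"
        xml_text = pattern.sub(lambda m, r=replacement: r, xml_text)
    return xml_text

# Illustrative line only; not copied from a real export.
sample = r'<Target State="ONLINE">\\TARGET1\Share</Target>'
print(to_fqdn(sample))
```

    To use it on the real export, read C:\temp\export.xml as text, pass it through to_fqdn and write the result back. Keep a copy of the original file first.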

     
    We add the new DFS-N servers to the existing namespace:
     

    dfsutil target add \\FILE2.child.corp.contoso.com\Testing

     
    Repeat this for each of the 3 new namespace servers.
     
    Then we find the PDCe for the domain. We need to restart the DFS-N service on the PDCe, as it doesn’t seem to pick up newly added namespace servers otherwise:
     

    nltest /dnsgetdc:child.corp.contoso.com /PDC | find ".">PDC.txt
    FOR /F "delims=. " %i IN (PDC.txt) DO sc \\%i stop DFS
    ping -n 5 127.0.0.1>NUL
    FOR /F "delims=. " %i IN (PDC.txt) DO sc \\%i start DFS
    ping -n 5 127.0.0.1>NUL

     
    You now need to wait for a DAY for the new namespace servers to actually take effect on the SMB clients out there on the network.
    If a session has an open write connection to any file on the DFS namespace path, then it will NOT update its target list.
    Verify this by logging on to the client computer and running:
     

    dfsutil.exe cache referral
    If it shows something like this…:
    Entry: \child.corp.contoso.com\Testing
    ShortEntry: \child\Testing
    Expires in 276 seconds
    UseCount: 0 Type:0x81 ( REFERRAL_SVC DFS )
      0:[\OLD-SERVER-1\Testing] AccessStatus: 0 ( ACTIVE TARGETSET )
       1:[\OLD-SERVER-3\Testing]
       2:[\OLD-Server-2\Testing]

     
    …and doesn’t list the new server names, then do NOT proceed to the next step. You can try flushing the DFS-N referral cache, but if there are open files, this won’t have any effect. You will need to reboot the client before it learns of the new DFS-N servers.
     

    dfsutil.exe cache referral flush
    You are looking for an output which looks like this:
    Entry: \child.corp.contoso.com\Testing
    ShortEntry: \child.corp.contoso.com\Testing
    Expires in 276 seconds
    UseCount: 0 Type:0x81 ( REFERRAL_SVC DFS )
      0:[\OLD-SERVER-1\Testing] AccessStatus: 0 ( ACTIVE TARGETSET )
       1:[\OLD-SERVER-3\Testing]
       2:[\NEW-SERVER-3.child.corp.contoso.com\Testing]
       3:[\NEW-SERVER-2.child.corp.contoso.com\Testing]
       4:[\OLD-SERVER-2\Testing]
       5:[\NEW-Server-1.child.corp.contoso.com\Testing]
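
    Checking this by eye on many clients is error-prone. As a rough sketch, saved output from dfsutil.exe cache referral could be compared against the server names you expect. The function names here are hypothetical, and the parsing assumes the output format shown in this post.

```python
import re

def referral_targets(cache_output):
    # Collect server names from lines such as "0:[\SERVER\Testing] ...".
    names = []
    for line in cache_output.splitlines():
        match = re.match(r"\s*\d+:\[\\+([^\\\]]+)\\", line)
        if match:
            names.append(match.group(1).lower())
    return names

def missing_servers(cache_output, expected):
    # Return the expected servers that the client has not yet learned about.
    seen = set(referral_targets(cache_output))
    return [name for name in expected if name.lower() not in seen]

# Illustrative output in the same shape as the samples in this post.
sample = r"""
Entry: \child.corp.contoso.com\Testing
  0:[\OLD-SERVER-1\Testing] AccessStatus: 0 ( ACTIVE TARGETSET )
  1:[\NEW-SERVER-1.child.corp.contoso.com\Testing]
"""
expected = [
    "NEW-SERVER-1.child.corp.contoso.com",
    "NEW-SERVER-2.child.corp.contoso.com",
]
print(missing_servers(sample, expected))
```

    An empty list means the client already knows every expected namespace server.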

     
    If you are sure that all the important SMB clients have updated their lists of possible namespace servers to include the 3 old servers and the 3 new servers (so 6 in all), then you can do the next step.
    The next step can only be run in the GUI because dfsutil cannot disable links. Open the DFS console, select the 3 old namespace servers and choose “Disable Namespace Server” for each of them:
     
    This will make sure that anyone who is still using the old Namespace servers can continue to do so, but that anyone who asks for a new referral will NOT receive the old servers.
     
    WAIT ANOTHER DAY.
    We need to make sure that all SMB clients end their sessions (by closing all their open write handles) and create new referral caches with only the new DFS-N servers in them.
    Log on to one of the big SMB clients on the network and verify that the OLD namespace servers do NOT appear in the referral cache:
     

    dfsutil.exe cache referral
    This should now look like this:
    Entry: \child.corp.contoso.com\Testing
    ShortEntry: \child.corp.contoso.com\Testing
    Expires in 276 seconds
    UseCount: 0 Type:0x81 ( REFERRAL_SVC DFS )
       0:[\NEW-SERVER-1.child.corp.contoso.com\Testing] AccessStatus: 0 ( ACTIVE TARGETSET )
       1:[\NEW-SERVER-3.child.corp.contoso.com\Testing]
       2:[\NEW-SERVER-2.child.corp.contoso.com\Testing]

     
    If old server names appear in the list, do NOT continue to the next step. Again, you can try to flush the DFS referral cache, but this is unlikely to work. The SMB client will likely need to be restarted (again).
     
    Remove the old namespace servers. Don’t use the GUI, as it will also attempt to remove the share, which may cause extra difficulties if you need to roll back this step:
     

    dfsutil target remove \\FILE1\Testing

     
    Repeat this command for each of the 3 old namespace servers.
     
    We will now delete the existing DFS-N namespace and create a new one which is in Windows Server 2008 mode (aka v2). Existing sessions will be unaffected as they work from their referral cache. New sessions may fail if they are created in the next few seconds. Retrying will succeed without any intervention.
    Run these commands on one of the new DFS-N servers. Copy the export.xml file from the old DFS-N server where you have been running the previous commands from.
    This needs to be run locally due to the time it takes to re-create the links. It will be around 15 seconds when run locally and about 1-2 minutes if run remotely:
     

    dfsutil root remove \\child.corp.contoso.com\Testing
    dfsutil root AddDom \\FILE2.child.corp.contoso.com\Testing V2
    dfsutil root import set C:\temp\export.xml \\child.corp.contoso.com\Testing NoBackup
    dfsutil target add \\FILE3.child.corp.contoso.com\Testing
    dfsutil target add \\FILE4.child.corp.contoso.com\Testing

     
    Where FILE3 and FILE4 in the example above are the additional new DFS-N servers.
     
    That will do it. You are now running the same exact namespace setup with the same permissions, but on new computers, using FQDN referrals and a v2 namespace.
  • Office 2013 Security Baselines for SCM are live

    Hi,

    Pat Fetty recently blogged about the new SCM baselines for Office 2013 going live.

    I opened up my local copy of SCM and imported the content:

    Prompt to import Office 2013 baselines

    .cab and att files

    The .cab file contains the security settings. The “att” file contains the attachments which are Word documents describing the security baseline settings.

    You may get prompted at this point to accept the security details of the package. Inspect the certificates to make sure they are issued by Microsoft and are trusted by your computer.

    User and Computer product-specific settings

    There are user and computer settings, separated by individual Office programs or core Office settings.

    Done!


     

    Browsing these new settings looks like this:

    SCM displaying Office 2013 settings

    Once you export these settings into a GPO Backup and import them onto an existing blank GPO in your domain, you’ll want the ADMX/ADML files which relate to the Office 2013 settings. And you’ll probably want to save them into your PolicyDefinitions folder in SYSVOL:

    \\your.domain.name\SYSVOL\your.domain.name\Policies\PolicyDefinitions

    Get them here:

    http://www.microsoft.com/en-us/download/details.aspx?id=35554

    Office 2013 ADMX/ADML file download 

  • A backup server flooded by DPCs

    Hi,

    I’ve just finished working on a case with a customer that was so interesting that it deserved a blog post to round it off.

    These were the symptoms:

    Often while logged in to the server, things would appear to freeze: no screen updates and little mouse responsiveness. If you could start a program (perfmon, Task Manager, Notepad etc.), you wouldn’t be able to type into it, and if you did, it would crash.

    This Windows Server 2008 R2 server runs TSM backup software with thousands of servers on the network sending their backup jobs to it. At any one time there could be hundreds of backup jobs running. The load was lower during the day, but it was always working hard dealing with constant backups of database snapshots from servers. The backup clients are Windows, UNIX, Solaris, you name it…

    When the server froze, you’d see 4 of the 24 logical CPUs lock at 100% and the other 20 CPUs would saw-tooth from locking at 100% to using 20-30%. The freeze would happen for minutes at a time.

    CPUs 0,2,4,6 locked at 100%, others saw-tooth

    There are 2 Intel 10Gb NICs in a team using Intel's teaming software. The team and the switches are set up with LACP to enable inbound load balancing and failover.

    By running perfmon remotely before the freeze happens we could see that the 4 CPUs that are locked at 100% are locked by DPCs. We used the counter “Processor Information\% DPC Time”.

    A DPC is best defined in Windows Internals 6th Ed. (Book 1, Chapter 3):

    A DPC is a function that performs a system task—a task that is less time-critical than the current one. The functions are called deferred because they might not execute immediately. DPCs provide the operating system with the capability to generate an interrupt and execute a system function in kernel mode. The kernel uses DPCs to process timer expiration (and release threads waiting for the timers) and to reschedule the processor after a thread’s quantum expires. Device drivers use DPCs to process interrupts.

    Because this is a backup server, we’re expecting that the bulk of our hardware DPCs will be generated by incoming network packets and raised by the NICs. Though they could have been coming from the tape library or the storage arrays.

    To look into what exactly is generating DPCs and how long the DPCs last for, we need to run Windows Performance Toolkit, specifically WPR.exe (Windows Performance Recorder). We have to do this carefully. We don’t want to increase the load of the server by capturing the Network and CPU activity of a server which already has high activity on the CPU and Network, and has shown a past history of crashing. But we want to run the capture while the server is in a frozen state. A tricky thing. So we ran this batch file:

    Start /HIGH /NODE 1 wpr.exe -start CPU -start Network -filemode -recordtempto S:\temp

    ping -n 20 127.0.0.1 > nul

    Start /HIGH /NODE 1 wpr.exe -stop S:\temp\profile_is_CPU_Network.etl

    If the server you are profiling has a lot of RAM (24GB or more), you’ll want to protect your non-paged pool from growing and harming your server. To do that you should review this blog and add this switch to the start command: -start "C:\Program Files (x86)\Windows Kits\8.0\Windows Performance Toolkit\SampleGeneralProfileForLargeServers.wprp"

    We’re starting on NUMA node 1 as the NICs were bound to NUMA node 0 and the “Processor Information” perfmon trace we took earlier showed that the CPUs on NUMA node 0 were locked. We’re starting the recorder with a “high” prioritization so that we can be sure it gets the CPU time it needs to work. We’re not writing to RAM, we’re recording to disk in the hopes that if the trace crashes we’ll at least have a partial trace to use. We made sure that S: in this example was a SAN disk to ensure it had the required speed to keep up with the huge data we’re expecting. We’re pinging 20 times to make sure our trace is 20 seconds long. And finally we’re starting a trace of CPU and Network profiles.

    Note that to gather stacks we first had to stop the Kernel (aka the Executive) from paging its own memory out of RAM to the pagefile, where we cannot analyze it. To do this, run wpr -disablepagingexecutive on and then reboot.

    We retrieved 3 traces in all:

      1. The first trace, to diagnose the problem
      2. The second trace, after we changed the 2 things that were generating about 50% of our problem
      3. The final trace, after we changed the last thing, which was generating the other 50% of the problem

        Diagnosis

        So this blog now becomes a short tutorial on how you can use WPA (Windows Performance Analyzer) to locate the source of DPC issues. WPA is a VERY powerful tool and diagnosing problems is part science, part art, meaning that no two diagnoses are ever done in the same way. This is just how I used WPA in this case. For this analysis, you’ll need the debugging tools installed and symbols configured and loaded.

        CPU Usage (Sampled)\Utilization By CPU

        First I want to see which CPUs are pegged. For that we use “CPU Usage (Sampled)\Utilization By CPU”, then select a time range by right-clicking:

        Choose a round number (10 seconds in my example) as it makes it easier to quickly calculate how many things happened per minute when comparing to the graphs for the later scenarios:

        Select Time Range

        I chose 20 seconds to 30 seconds as it is a 10 second window where there was heavy load and not blips due to tracing starting or stopping. Then “Zoom” by right clicking again.

        Now all your graphs will be focused on that time range.

        Then shift-select the CPUs which are pegged. In this case it is CPUs 0, 2, 4 and 6. This is because the cores are Hyperthreaded and the NICs cannot interrupt a logical CPU which is the result of Hyperthreading (CPUs 1, 3, 5, 7 etc.). And they are low-numbered CPUs because they are located on NUMA node 0.

        Once they are selected, right-click and choose “Filter to Selection”:

        Filter to Selection

        Next we want to add a column for DPCs so we can see how much of the CPUs’ time was spent locked processing DPCs. To add columns, right-click the column title bar (in the screen above this has “Line # | CPU || Count | Weight (in view) | Timestamp”) in the centre of the right-hand pane and select the columns you want to display. Once the DPC/ISR column has been added, drag it to the left side of the yellow bar, next to the CPU column:

        Choose columns

        Expanding out the CPU items, we see that DPCs account for almost all of the CPU activity on these CPUs (the count figure for each CPU’s activity is 10 seconds of CPU time, and the count of CPU time for DPCs under it is over 9 seconds).

        DPC duration by Module, Function

        The next WPA graph we need is the one which can show how long the DPCs last for. We drag in the first graph under “DPC/ISR” called “DPC duration by Module, Function”:

        DPC duration by Module, Function

        In the far-right column (“Duration”), we can see how long each module spends waiting with a DPC. This says that 36.8 seconds were spent on DPCs for NDIS.SYS alone. How can it be 36.8 seconds if the sample window is 10 seconds? Well, it is CPU seconds, and we have 24 CPUs, so we could potentially have 240 CPU seconds in all.
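
        The CPU-seconds arithmetic can be written out as a small sketch. The figures are the ones quoted in this post. Treating all 36.8 seconds as landing on the 4 busy CPUs is an assumption for illustration, although the per-CPU view further down supports it.

```python
LOGICAL_CPUS = 24        # logical CPUs in the server
WINDOW_SECONDS = 10      # zoomed time range in WPA
BUSY_CPUS = 4            # CPUs 0, 2, 4 and 6
NDIS_DPC_SECONDS = 36.8  # "Duration" reported for NDIS.SYS

# CPU-seconds available in the window, for the whole box and for the busy CPUs.
total_budget = LOGICAL_CPUS * WINDOW_SECONDS
busy_budget = BUSY_CPUS * WINDOW_SECONDS

# Assumption: nearly all NDIS DPC time was spent on the 4 busy CPUs.
share_of_busy = NDIS_DPC_SECONDS / busy_budget

print(total_budget, busy_budget, round(share_of_busy * 100))
```

        Under that assumption, NDIS DPCs alone consumed roughly nine tenths of everything the 4 busy CPUs could deliver in the window.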

        The next biggest waiter for DPCs is storport.sys. But at 1 second, it’s not even close.

        The column with the blue text is called “Duration (Fragmented) (ms) Avg” and is the average time a DPC lasts for during this sample window. The NDIS.SYS DPCs last around 0.22 milliseconds, or 220 microseconds. The count of DPCs for NDIS and storport are comparatively similar (163,000 and 123,000 respectively), but because NDIS took so long on each DPC on average, it ended up locking the CPU for longer than storport did.
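
        As a sanity check, the average can be approximated by dividing the total duration by the DPC count. This sketch uses the rounded figures from the text (and treats storport’s total as exactly 1 second), so it only approximates the averages WPA displays.

```python
def average_dpc_microseconds(total_seconds, dpc_count):
    # Average duration of one DPC, in microseconds.
    return total_seconds / dpc_count * 1_000_000

ndis_avg = average_dpc_microseconds(36.8, 163_000)
storport_avg = average_dpc_microseconds(1.0, 123_000)

print(f"NDIS: {ndis_avg:.0f} us, storport: {storport_avg:.0f} us")
```

        Similar counts with very different totals mean the difference is in how long each DPC runs, not in how many there are.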

        So let’s add the CPU column, move it to the left side of the yellow line with it as the first column to pivot on:

        Filter to busy CPUs

        We can see that our targeted CPUs, 0, 2, 4 and 6, have very high durations of DPC waits (using the last column for “Duration”, again), with no other CPU spending very much time in a DPC wait state. So we select these CPUs and filter.

        Expanding out the CPUs, we see that there are many different sources of DPCs, but that NDIS is really the biggest source of DPC waits. So we will now move the “Module” column to be the left-most column and remove the CPU column from view. We then right click on NDIS.SYS and “Filter to Selection” again as we only want to focus on DPCs from NDIS on CPUs 0, 2, 4, 6:

        Filter to NDIS

        One function, ndisInterruptDPC is causing our DPC waits. This is the one we’ll focus on. If we expand this, it will list every single DPC and how long that wait is. Select every single one of these rows by scrolling to the very bottom of the table (in this example there are 163,230 individual DPCs):

        Copy Column Selection

        Right click on the column called “Duration” and choose “Copy Other” and then “Copy Column Selection”. This will copy only the values in the “Duration” column. We can paste this into Excel and create a graph which shows the duration of the DPCs as a function of the number of DPCs present:

        Taken from Excel

        I have added a red line at 0.1 milliseconds because, according to the hardware development kit for driver manufacturers, a DPC should not last longer than 100 microseconds. DPCs above the red line are therefore misbehaving, and they account for the bulk of our time spent waiting on DPCs.
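
        If you would rather not use Excel, the same question (how many DPCs exceed 100 microseconds, and how much of the time they account for) can be answered with a short script over the copied “Duration” values. This is a sketch: the function name is hypothetical, the durations are assumed to be in milliseconds, and the sample values are invented, not taken from the trace.

```python
THRESHOLD_MS = 0.1  # 100 microseconds

def summarize(durations_ms, threshold_ms=THRESHOLD_MS):
    # Split DPC durations into "slow" (above threshold) and the rest.
    if not durations_ms:
        raise ValueError("no durations supplied")
    slow = [d for d in durations_ms if d > threshold_ms]
    return {
        "count": len(durations_ms),
        "slow_count": len(slow),
        "slow_share_of_count": len(slow) / len(durations_ms),
        "slow_share_of_time": sum(slow) / sum(durations_ms),
    }

# Invented values for illustration only.
example = [0.02, 0.05, 0.08, 0.30, 0.45, 0.60]
print(summarize(example))
```

        In the invented sample, half of the DPCs are slow but they account for most of the total time, which is the same pattern the graph shows.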

        So, we have established that we have slow DPCs on NDIS, and lots of them, and that they are locking our 4 CPUs. Our NICs aren’t able to spread their DPCs to any other CPUs and Hyperthreading isn’t really helping our specific issue. But what is causing the networking stack to generate so many slow DPC locks?

        DPC/ISR Usage by Module, Stack

        The final graph in WPA will show us this. From the category “CPU Usage (Sampled)”, drag in a graph called “DPC/ISR Usage by Module, Stack”. Filter to DPC (which will exclude ISRs) and our top candidates are:

        DPC/ISR Usage by Module, Stack

        1. ntoskrnl.exe (the Windows Kernel)
        2. NETIO.SYS (Network IO operations)
        3. tcpip.sys (TCP/IP)
        4. NDIS.SYS (Network layer standard interface between OS and NIC drivers)
        5. IDSvia64.sys (Symantec Intrusion Detection System)
        6. ixn62x64.sys (Intel NIC driver for NDIS 6.2, x64)
        7. iansw60e.sys (Intel NIC teaming software driver for NDIS 6.0)

        To see what these are doing we simply expand the stack columns by clicking the triangle of the row with the highest count, looking for informative driver names and a large drop in the number of counts present, indicating that this particular function is causing a consumption of CPU time.

        NTOSKRNL is running high because we are capturing. The kernel is spending time gathering ETL data. This can be ignored.

        NETIO is redirecting network packets to/from tcpip.sys for a function called InetInspectReceive:

        NETIO.sys stack expansion

        TCP/IP is dealing with the NETIO commands above to do this “Receive Inspection”:

        TCPIP.SYS stack expansion

        NDIS.SYS is dealing with 2 main functions in tcpip.sys: TcpTcbFastDatagram and InetInspectReceive again:

        NDIS.SYS stack expansion

        Other than ntoskrnl, these 3 Windows networking drivers all have entries for the drivers listed as 5, 6 and 7 above in their stacks.

        Diagnosis Summary

        Lots of DPCs are caused by 3 probable sources:

        1. Incoming packet inspection by the Symantec IDS system.
           • The IDS system has to take every packet, compare it to a signature definition, and, if clean, allow it to pass. This action is causing slow DPCs.
        2. The NIC driver could be stale/buggy and generating slow DPCs.
           • There is no evidence for this, but it’s usually a good place to start. There could be TCP offloading or acceleration features in the NIC and/or driver which haven’t been enabled but may improve network performance.
        3. And finally, the NIC teaming software is getting in between the NICs and the CPUs.
           • That is, after all, the job of the NIC teaming software: to trick Windows into thinking that the incoming packets from 2 distinct hardware devices are actually coming from 1 device. The problem here, however, is that this insertion into the networking stack is pure software, and it is likely causing very slow DPCs.

        Action Plan

        Our actions were to make changes over 2 separate outage windows:

        1. Update the NIC driver and enable Intel I/OAT in the BIOS of the server.
           • I/OAT is described in the spec sheet for the NIC like this: “When enabled within multi-core environments, the Intel Ethernet Server Adapter X520-T2 offers advanced networking features. Intel I/O Acceleration Technology (Intel I/OAT), for efficient distribution of Ethernet workloads across CPU cores. Load balancing of interrupts using MSI-X enables more efficient response times and application performance. CPU utilization can be lowered further through stateless offloads such as TCP segmentation offload, header replications/splitting and Direct Cache Access (DCA).”
        2. Uninstall the NIC teaming software.
           • 3rd party NIC teaming software inhibits many TCP offloading features, and in this case generates large numbers of slow DPCs.
        3. On the second outage we uninstalled the IDS system.
           • IDS was not configured on this (or any other) server. But as the software had the potential to become enabled, it was grabbing every incoming packet for inspection, despite the fact that it wasn’t configured to inspect the packet or act on violations in any way. Stopping the service is insufficient. Manually removing the driver from the hidden, non-Plug and Play section of Device Manager isn’t sufficient either, because the software will reinstall it at the next boot. Only a full uninstall will do.

        After dissolving the NIC Team

        Here is what the picture looked like after we dissolved the NIC team, updated the NIC driver and enabled Intel I/OAT in the BIOS.

        DPC duration - No teaming, I/OAT enabled

        In this 10 second sample we can see that the 4 CPU cores are still effectively locked, as the CPU time due to NDIS DPCs is 37.7 seconds (out of a possible maximum of 40 seconds). The number of DPCs has decreased by more than half to 55,000, meaning that the average duration of a DPC has become very long at 682 microseconds, triple the average time from before we removed the NIC team and enabled I/OAT.

        Taken from Excel

        The blue area of the graph above is the picture we had from before changes were made. The pink/orange area is the picture of DPC durations after removing NIC teaming and enabling I/OAT.

        So why did the average duration of DPCs get longer?

        It could be that the IDS software now does not need to relinquish its DPCs to make room on the same CPU cores as the DPCs for the NIC teaming driver. These 2 drivers must be locked to the same CPUs. With no need to relinquish a DPC due to another DPC of equal priority, the IDS DPCs are free to use the CPU for longer periods of time before being forced off.

        At any rate, it certainly isn’t fixed yet.

        After uninstalling Symantec IDS

        And finally here’s what the picture looked like after we uninstalled the IDS portion of the Symantec package. Remember, this service was not configured to be enabled in any way.

        DPC duration - no IDS

        You can see that the average time has dropped from 220 microseconds to 90 microseconds – below the 100 microsecond threshold required by the Driver Development Kit.

        In this 10 second sample there were 127,000 DPCs from NDIS on the 4 heavily used CPUs, but the CPU time they consumed was 11 seconds, a reduction from 36.8 seconds.
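
        Putting the three traces side by side makes the trend easier to see. This sketch recomputes the average DPC duration from the rounded totals and counts quoted in this post, so the results only approximate the averages WPA reported.

```python
DDK_LIMIT_US = 100  # the 100 microsecond guideline mentioned earlier

def average_us(cpu_seconds, dpc_count):
    # Average duration of one DPC, in microseconds.
    return cpu_seconds / dpc_count * 1_000_000

# (label, NDIS DPC CPU-seconds, NDIS DPC count) per 10 second sample
traces = [
    ("before changes", 36.8, 163_000),
    ("no teaming, I/OAT enabled", 37.7, 55_000),
    ("IDS uninstalled", 11.0, 127_000),
]

for label, cpu_seconds, dpc_count in traces:
    avg = average_us(cpu_seconds, dpc_count)
    verdict = "over" if avg > DDK_LIMIT_US else "under"
    print(f"{label}: about {avg:.0f} us per DPC ({verdict} the limit)")
```

        Only the last trace comes in under the limit, even though it handles more than twice as many DPCs as the second one.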

        Taken from Excel

        The blue area of the graph above is the picture we had from before changes were made. The pink/orange area is the picture of DPC durations after removing NIC teaming and enabling I/OAT. And the green area is the picture after IDS is removed.

        This is a dramatic improvement. Nearly all DPCs are below the 100 microsecond limit. The system is able to process the incoming load without locking up for high priority, long lasting DPCs.

        What about RSS?

        We’re not quite done though. 4 of our CPUs are still working very hard, often pegged at 100%. But why only 4? This is a 2-socket system with 6 cores on each socket. That gives us 12 CPUs where we can run DPCs. DPCs from one NIC are bound to one NUMA node. We already dissolved our NIC team, so we only have 1 NIC in action, so we are limited to 6 cores. RSS can spread DPCs over CPUs in powers of 2, meaning 1, 2, 4, 8, 16 or 32 cores. With 6 cores available, that means we can use at most 4 CPUs per NIC.
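
        The “at most 4” figure is just the largest power of 2 that fits into the cores available to one NIC. A minimal sketch, taking the power-of-2 rule as described in this post (the function name is made up):

```python
def rss_cpu_limit(cores_available):
    # Largest power of two that does not exceed cores_available.
    if cores_available < 1:
        raise ValueError("need at least one core")
    return 1 << (cores_available.bit_length() - 1)

print(rss_cpu_limit(6))   # one NUMA node with 6 cores
print(rss_cpu_limit(12))  # if a NIC could span both nodes
```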

        To scale out we would need to add more NICs and limit RSS on each of those NICs to 2 cores. We’d need to bind 3 NICs to NUMA node 0 and 3 to NUMA node 1. We’d also need to set the starting CPUs for those NICs to be cores 0, 2, 4, 6, 8 and 10. That way we can use every possible core.
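
        That layout can be generated rather than worked out by hand. This is a sketch under simple assumptions: cores are numbered 0 to 11 with the first half on NUMA node 0, and the function name and dictionary keys are invented for the example. How these core numbers map to logical processor numbers with Hyperthreading enabled is not covered here.

```python
def nic_layout(nic_count=6, cores_per_nic=2, numa_nodes=2):
    # Spread NICs evenly over NUMA nodes and give each its own block of cores.
    nics_per_node = nic_count // numa_nodes
    layout = []
    for nic in range(nic_count):
        layout.append({
            "nic": nic,
            "numa_node": nic // nics_per_node,
            "base_core": nic * cores_per_nic,
            "max_rss_cores": cores_per_nic,
        })
    return layout

for entry in nic_layout():
    print(entry)
```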

        But to do this, we’d need to ensure that we can have multiple NICs without using the teaming software, which means we’d need to assign each NIC a unique IP address. To do that we need to make sure that the TSM clients can deal with targeting a server name with multiple IP addresses in DNS for that name, and that if connectivity to the first IP address is lost, TSM can fail over to one of the other IP addresses. We’ll test TSM and get back with our results later.

        But we need one more fundamental check before doing that: We need to make sure that the incoming packet, hitting a specific NUMA node and core is going to end up hitting the right thread of the TSM server where that packet is going to be dealt with and backed up. If we can’t align a backup client to the incoming NIC and align that NIC to the backup software thread that should process it, then we’ll be causing intra-CPU interrupts, or worse yet, cross NUMA interrupts. This would make the entire system much less scalable.

        image

        So this is how this would all look. The registry key to set the NUMA node to bind a NIC to is “*NumaNodeId” (including the * at the start). To set the base CPU, use “*RssBaseProcNumber”. To set the maximum number of processors to use, set “*MaxRssProcessors”.

        These keys are explained here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff570864(v=vs.85).aspx

        and here: Performance Tuning Guidelines for Windows Server 2008 R2

        And more general information on how RSS works in Windows Server 2008 are here: Scalable Networking- Eliminating the Receive Processing Bottleneck—Introducing RSS

        Our problem in the above picture, however, is that our process doesn’t know to run its threads on the NUMA node and cores where the incoming packets are arriving. Had this been SQL Server, we could have run separate instances configured to start using specific CPUs. Hopefully, one day, TSM will operate like this and become NUMA-node aware.

        I know this has been a long post, but for those who have read down to here, I do hope this has helped you with your troubleshooting using WPT.

      • Low throughput when copying files

        Hi,

        I have been helping a customer with a tricky issue recently regarding slow network performance for SMB file copies over their network.

        It came about after they took the settings defined in Security Compliance Manager for their member servers and deployed them as a Group Policy to their server OU. After doing this, they saw an 80% reduction in the performance in SMB file copies. But when we used Ntttcp.exe to test the network throughput via a test data stream, the throughput was not affected. Only SMB was affected.

        They had Windows Server 2008 R2 SP1 VMs on ESX with 1 virtual 10Gb NIC patched to a team of 2 physical 10Gb NICs. When 2 servers tried to copy a set of large test files without the SCM security settings applied, they could reach around 400Mbps. When we applied the settings, that dropped to around 80Mbps.

        In the SCM security definitions, there are 234 settings defined. We had to find out which one of these settings caused their issue.

        image

        We could see that the CPUs of the VM were going nuts with a wild saw-tooth pattern of all CPUs. We tried adding more CPUs and the saw-tooth pattern simply spread without making any major change in achievable throughput.

        The process consuming the CPU time in Task Manager was ‘System’.

        So, to break into ‘System’ a little more, we ran Windows Performance Recorder (WPR) to get a trace of CPU activity, like this:

        image

        And in the trace, we expanded out “CPU Usage (Sampled)”, and added the graph for “DPC and ISR by Module, Stack”:

        DPC and ISR by Module, Stack

        This showed us that all our CPU time was spent processing DPCs generated by a driver called cng.sys.

        image

        This is “Kernel Cryptography, Next Generation”, which relates to the server’s or client’s ability to perform cryptographic calculations in the Kernel when doing things like sending or receiving encrypted information, or information which has been signed. Signing in this case could be creating a signature hash for chunks of transmitted data to prove that it hasn’t been modified while on the wire.
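
        To make the idea of signing concrete, here is a minimal sketch of a keyed signature over a chunk of data, using HMAC-SHA256 truncated to 16 bytes. It is an illustration of the concept only, not the SMB wire format, and the key and chunk are random stand-ins. The point is that every chunk sent and received costs a hash calculation, which is where the CPU time goes.

```python
import hashlib
import hmac
import os

def sign_chunk(key, chunk):
    # Keyed hash over the chunk; only the first 16 bytes are kept.
    return hmac.new(key, chunk, hashlib.sha256).digest()[:16]

def verify_chunk(key, chunk, signature):
    # Constant-time comparison of the expected and received signatures.
    return hmac.compare_digest(sign_chunk(key, chunk), signature)

key = os.urandom(16)            # stand-in for a session key
chunk = os.urandom(64 * 1024)   # stand-in for a block of file data
signature = sign_chunk(key, chunk)

# Flip one bit to simulate modification on the wire.
tampered = bytes([chunk[0] ^ 0x01]) + chunk[1:]

print(verify_chunk(key, chunk, signature))     # True
print(verify_chunk(key, tampered, signature))  # False
```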

        This, combined with the fact that only SMB was affected, led us to think it was SMB signing that was our issue.

        SMBv2 uses these 2 GPO settings to define SMB signing:

        image

        1. Microsoft Network Server: Digitally sign communications (always)
        2. Microsoft Network Client: Digitally sign communications (always)

        The settings relate to SMBv2. Note that they change the default, in-box setting from “Disabled” to the Microsoft recommended SCM setting of “Enabled”.

        For SMBv1 on Windows 2003 and older, the GPO settings are:

        1. Microsoft Network Server: Digitally sign communications (if client agrees)
        2. Microsoft Network Client: Digitally sign communications (if server agrees)

        Once we removed the “always” settings, the transfer speed returned to the 400Mbps we expected.

        We discussed the usefulness of this setting. In their network, it would be best to keep the “server” side setting enabled on DCs only. This ensures that the GPO files which clients download from the DCs during a Group Policy refresh have not been altered. These files are security sensitive but usually very small, so we don’t mind slightly slower transfer speeds for them.

         

        Here are some additional resources we used when investigating SMB signing:

        http://blogs.technet.com/b/josebda/archive/2010/12/01/the-basics-of-smb-signing-covering-both-smb1-and-smb2.aspx

        http://msdn.microsoft.com/en-us/library/a64e55aa-1152-48e4-8206-edd96444e7f7#id218

        http://blogs.msdn.com/b/openspecification/archive/2009/07/06/negtokeninit2.aspx?Redirected=true

        http://blogs.msdn.com/b/openspecification/archive/2009/04/10/smb-maximum-transmit-buffer-size-and-performance-tuning.aspx

        http://blogs.technet.com/b/filecab/archive/2012/05/03/smb-3-security-enhancements-in-windows-server-2012.aspx

        http://support.microsoft.com/kb/320829

        http://blogs.technet.com/b/neilcar/archive/2004/10/26/247903.aspx

        http://gallery.technet.microsoft.com/NTttcp-Version-528-Now-f8b12769