Networking performance has increasingly become one of the most important factors in overall system performance. Many of the factors that affect networking performance fall into three categories: network adapter hardware and driver performance, network stack performance, and the way applications interact with the network stack. We will highlight some of the more important network adapter properties and advanced features available to yield optimal performance.
Networking performance is measured in throughput and response time. It is also important to reach the optimal operating point without over-utilizing system resources. To that end, network adapter vendors have enhanced their hardware to improve scalability, reduce the time needed to process data, and dynamically adjust hardware and software parameters to match workload characteristics.
Over the past decade, networking speeds have increased by orders of magnitude to keep up with increasingly network-intensive applications, and the load that networking service routines place on the host processors has grown with them. In light of these changes, it has become increasingly important to consider offloading some tasks to the hardware and to further optimize how the hardware interacts with the software in order to scale and improve performance. Relevant features include Task Offload, TCP Offload, Interrupt Moderation, dynamic tuning on the hardware, Jumbo Frames, and Receive Side Scaling (RSS). These are particularly important for high-end network adapters used in configurations that require top performance.
The Microsoft networking stack can offload one or more of the following tasks to a network adapter that supports the corresponding offload capabilities:
Checksum Offload: For most common IPv4 and IPv6 network traffic, offloading the checksum calculation to the network adapter hardware offers a significant performance advantage. It reduces the number of CPU cycles required per byte, which improves overall system performance. UDP checksum offload support was added in Windows Server 2008 and was not available in prior releases.
Large Send Offload: Applications that send large amounts of data rely on the network stack to segment the large buffers into smaller, Ethernet-sized buffers. That size is typically 576 bytes on the Internet and 1500 bytes on LANs. Large Send Offload (LSO) allows send data to be coalesced into 64 KB segments and the segmentation work to be offloaded to the hardware. Offloading the work reduces host CPU cycles and can improve overall system performance. Giant Send Offload (GSO) is a superset of LSO that allows send buffers to be coalesced into segments larger than 64 KB.
IP Security (IPSec) Offload: Windows offers the ability to offload the encryption work of IPSec to the network adapter hardware. Encryption, especially 3DES, has a very high cycles-per-byte ratio. It is therefore no surprise that offloading IPSec to the network adapter hardware can yield large performance gains, depending on the scenario and the workload characteristics.
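To make the per-byte CPU cost of checksumming and segmentation concrete, here is a minimal Python sketch of the work a host does in software when neither is offloaded. It is an illustration only, not the Windows stack's implementation: the checksum follows the one's-complement style of RFC 1071, the helper names are hypothetical, and the 1460-byte segment size is an assumption (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers). A real TCP checksum also covers a pseudo-header, which is omitted here.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def segment(payload: bytes, mss: int) -> list[bytes]:
    """Split one large send buffer into pieces of at most mss bytes."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


# Assumption: 1500-byte MTU minus 40 bytes of IPv4 and TCP headers.
ASSUMED_MSS = 1460

large_send = bytes(64 * 1024)  # one 64 KB send buffer, as in the LSO case
segments = segment(large_send, ASSUMED_MSS)
checksums = [internet_checksum(s) for s in segments]  # touches every byte
print(len(segments), "segments, last one is", len(segments[-1]), "bytes")
```

Every byte of the buffer is visited at least once by the checksum loop and once more by the copy in `segment`, which is the cost that the offloads move to the adapter.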
A simple network adapter interrupts the host processor upon receiving a packet or when sending a packet is complete. In many scenarios where this causes high processor utilization, it is best to coalesce several packets for each interrupt and reduce the number of times the host processor is interrupted. Because tuning interrupt moderation for throughput at the expense of response time is a common mistake, most network adapter vendors have implemented dynamic interrupt moderation schemes in their solutions.
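The trade-off between interrupt count and added delay can be sketched with a small simulation. This is a hypothetical model, not any vendor's scheme: the adapter is assumed to raise an interrupt once `max_packets` packets are pending, or once `timeout_us` microseconds have passed since the first pending packet, whichever comes first. All names and parameter values are illustrative.

```python
def count_interrupts(arrivals_us, max_packets, timeout_us):
    """Count interrupts for packet arrival times given in microseconds.

    Hypothetical model: fire when max_packets are pending, or when a packet
    arrives at least timeout_us after the first pending packet.
    """
    interrupts = 0
    pending = 0
    deadline = None
    for t in arrivals_us:
        if pending and t >= deadline:
            interrupts += 1          # timer expired before this arrival
            pending = 0
        if pending == 0:
            deadline = t + timeout_us  # timer starts with the first packet
        pending += 1
        if pending >= max_packets:
            interrupts += 1          # batch is full
            pending = 0
    if pending:
        interrupts += 1              # flush whatever is left at the end
    return interrupts


# 100 packets arriving every 10 microseconds.
arrivals = list(range(0, 1000, 10))
per_packet = count_interrupts(arrivals, max_packets=1, timeout_us=0)
batched = count_interrupts(arrivals, max_packets=8, timeout_us=100)
print(per_packet, "interrupts without moderation,", batched, "with batches of 8")
```

In this model a larger batch or timeout lowers the interrupt count, but a packet may wait up to `timeout_us` before the host sees it, which is the response-time cost the paragraph above warns about.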
Supporting larger Maximum Transmission Units (MTUs) and thus larger frame sizes, specifically Jumbo Frames, will reduce the number of trips needed to send the same amount of data. This results in a significant reduction of CPU utilization.
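As a rough worked example, the sketch below counts the frames needed to move 1 MiB of payload at two MTU sizes. The 9000-byte jumbo MTU and the 40 bytes of IPv4 plus TCP headers per frame are assumptions chosen for illustration; actual jumbo frame sizes depend on the adapter and the network.

```python
# Assumption: 20 bytes of IPv4 header plus 20 bytes of TCP header, no options.
ASSUMED_HEADER_BYTES = 40


def frames_needed(total_bytes: int, mtu: int) -> int:
    """Frames required to carry total_bytes of payload at the given MTU."""
    payload_per_frame = mtu - ASSUMED_HEADER_BYTES
    # Integer ceiling division, so a partial last frame still counts.
    return (total_bytes + payload_per_frame - 1) // payload_per_frame


one_mib = 1024 * 1024
standard = frames_needed(one_mib, 1500)
jumbo = frames_needed(one_mib, 9000)  # 9000 is an assumed jumbo MTU
print(standard, "frames at MTU 1500,", jumbo, "frames at MTU 9000")
```

Under these assumptions the jumbo configuration needs roughly one sixth as many frames, and each frame avoided is one less round of per-packet processing on the host.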
For large-scale applications, being able to process networking requests simultaneously on multiple processors ensures both improved performance through parallelism and distribution of the CPU load among the system processors. Receive Side Scaling, when supported by the underlying hardware, provides this capability for TCP traffic. The technology is recommended for Web and file server scenarios where a server is servicing requests from a large number of connections.
Most network adapters allow administrators to manually configure send and receive resources through the Advanced Networking tab for the adapter. The most common ones are the receive and send buffers, which are usually set to low values by default, resulting in sub-optimal performance. A small subset of adapters dynamically adjust their networking resources, so manual configuration is unnecessary.
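The idea behind spreading receive processing across processors can be sketched as follows: hash the fields that identify a connection, then use the hash to look up a processor in an indirection table, so that all packets of one connection land on the same processor. This is a simplified illustration. RSS hardware uses a Toeplitz hash with a secret key; the sketch swaps in CRC32 as a stand-in, and the table size, processor count, and function names are hypothetical.

```python
import ipaddress
import struct
import zlib


def flow_hash(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    """Hash a TCP/IPv4 4-tuple. CRC32 is a stand-in, not the real RSS hash."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    return zlib.crc32(key)


def pick_cpu(hash_value: int, indirection_table: list[int]) -> int:
    """Use the low bits of the hash to index the indirection table."""
    return indirection_table[hash_value % len(indirection_table)]


NUM_CPUS = 4          # hypothetical processor count
TABLE_ENTRIES = 128   # hypothetical indirection table size
table = [i % NUM_CPUS for i in range(TABLE_ENTRIES)]

flows = [("10.0.0.%d" % i, 40000 + i, "10.0.1.1", 80) for i in range(1, 9)]
for flow in flows:
    print(flow, "-> CPU", pick_cpu(flow_hash(*flow), table))
```

Because the mapping depends only on the connection's addresses and ports, packets of one connection stay on one processor while different connections spread across the table.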
When choosing which network adapters to use, you should always get 64-bit PCI-Express adapters. Using 32-bit adapters will limit the amount of data the adapter can transfer and will provide sub-optimal performance when copying data into the adapter’s buffers. Also, an adapter that is not PCI-E will be capped at around 7 gigabits per second of pure data transfer, excluding headers. This becomes a bottleneck if you are planning to use 10 Gigabit adapters in your setup and would like to get their full bandwidth.
The high-performance network adapter features discussed earlier are necessary for achieving the best performance in most scenarios and workloads, in both single-processor and multiprocessor system configurations. The following are guidelines for where to put the adapter in the system and how to best optimize the distribution of the network adapter interrupts.
As mentioned earlier, it is best to place the network adapter in a 64-bit PCI-Express slot, which gives the best performance. If you will be using PCI or PCI-X slots, try to place your adapter in a slot that does not share the bus with other devices in the system. Most hardware vendors provide diagrams, on the inside of the servers or on their websites, that describe the layout of the machine. Sharing a bus with other devices can degrade performance because of bus latency and contention.
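A quick calculation shows why a parallel bus struggles to feed a 10 Gigabit adapter. The sketch below computes theoretical peak bandwidth as bus width times clock rate, using the nominal figures for conventional PCI (32-bit, 33 MHz) and PCI-X (64-bit, 133 MHz). Which bus the figure quoted above refers to is an assumption on our part; the numbers are peaks that ignore protocol overhead and bus sharing, so achievable rates are lower.

```python
def peak_gbps(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a parallel bus: one transfer of width_bits per clock."""
    return width_bits * clock_mhz * 1e6 / 1e9


# Nominal bus parameters; real systems lose some of this to overhead.
buses = {
    "PCI 32-bit / 33 MHz": (32, 33),
    "PCI-X 64-bit / 133 MHz": (64, 133),
}

for name, (width, clock) in buses.items():
    print(f"{name}: {peak_gbps(width, clock):.3f} Gbit/s peak")
```

Even the faster of the two peaks falls short of 10 gigabits per second before any overhead is subtracted.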
Interrupts generated by the network adapter can be partitioned to a single or a select group of processors to improve performance. An example would be partitioning your application threads to run on the same processor where network traffic is processed to preserve cache locality. In order to bind an adapter’s interrupts to a select group of processors, we recommend using the Interrupt Affinity Policy tool.
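A set of processors is commonly expressed on Windows as an affinity bitmask, with bit n standing for processor n. The sketch below shows how such a mask is built and decoded. It does not call the Interrupt Affinity Policy tool, and whether the tool takes its input in exactly this form is an assumption; the function names are hypothetical.

```python
def affinity_mask(cpus) -> int:
    """Build a bitmask in which bit n is set for each processor index n."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("processor index must be non-negative")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask: int) -> list[int]:
    """Decode a bitmask back into a sorted list of processor indices."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Example: keep network interrupts and the application on processors 2 and 3.
mask = affinity_mask([2, 3])
print(hex(mask), cpus_in_mask(mask))
```

Using the same mask for the adapter's interrupts and for the application's threads is one way to keep both on the same processors, which is the cache-locality arrangement described above.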
· Windows Server 2008 Performance Tuning Guide: http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx
· High Performance Network Adapters and Drivers: http://www.microsoft.com/whdc/device/network/NetAdapters-Drvs.mspx
Ahmed Talat, Performance Manager, Windows Server Performance Team