Designing Applications for High Performance - Part 1
25 April 08 05:40 AM | winsrvperf | 1 Comments   

Rick Vicik - Architect, Windows Server Performance Team

 

Now that processors won’t be getting dramatically faster each year, application developers must learn how to design their applications for scalability and efficiency on multiple processor systems. I have spent the last 20 years in SQL Server development and the Windows Server Performance Group looking into multi-processor performance and scalability problems.  Over the years, I have encountered a number of recurring patterns that I would like to get designers to avoid.  In this three part series, I will go over these inefficiencies and provide suggestions to avoid them in order to improve application scalability and efficiency.  The guidelines are oriented towards server applications, but the basic principles apply to all applications.

 

The underlying problem is processors are much faster than RAM and need hardware caches or else they would spend most of their time waiting for memory access.  The effectiveness of any cache depends on locality of reference.  Poor locality can reduce performance by an order of magnitude, even with a single processor.  The problem is worse with multiple processors because data is often replicated in different caches and updates must be coordinated to give the illusion of a single copy (performing the magic of cache coherency is hard).  Also, applications might generate information that needs sharing across processors, which can overload the interconnect mechanism (e.g. bus) and slow down all memory requests, even for “innocent bystanders”.    

 

The following are some of the common pitfalls that can hurt overall performance:

·         Using too many threads and doing frequent updates to shared data.  This results in a high number of context switches due to lock collisions when several threads try to update the protected data. 

·         Cache effectiveness is reduced because thread data seldom has enough time in the cache before getting pushed out of the cache by other threads.

 

These are some of the things application designers can do to reduce the problem:

·         Minimize the need to have multiple threads update shared data through data partitioning across processors and minimize the amount of information that must cross boundaries (OO-design and the desire to have context-free components often results in “chatty” interfaces).

·         Minimize the number of context switches by keeping the number of threads close to the number of processors and minimize the reasons for them to block (locks, handing off work, handling IO-completion, etc.).  

 

To illustrate how partitioning an application yields better performance than sharing data and contending for locks, I will use a simple, static web server scenario as an example.  The data in this scenario can be characterized as either payload (cached, previously-served pages) or control (work queues, statistics, freelists, etc.).  Figure 1 shows that the combination of updates and shared data must be avoided.  Even when the payload is read-only, the control data is usually update-intensive (e.g. ref-counts).

 

 

The recommendation is to partition everything by processor or by NUMA node.  This can never be fully achieved in real applications, but it guides the design in the right direction.  Ideally, there should be per processor threads and each thread’s affinity gets set to the respective processor.  Each thread should have its own IO completion port and be event-driven.  There should be a network interface card (NIC) for each processor and the interrupts from each NIC should be bound to the corresponding processor by using the IntFilter utility on Windows 2003 or the IntPolicy utility on Windows 2008 and later.  Another alternative is using a NIC that supports Receive Side Scaling (RSS).  An intelligent network switch can perform link or port aggregation to distribute incoming requests to the multiple NICs.  Since the payload (cache of previously-served pages) is read-only, it can be read from any CPU.  Full partitioning (including disk data) would require distributing the requests to the partition that owns the subset of data.  That is beyond the capability of the network switch. 

 

Figure 2 illustrates one proposed design.  Each thread would loop on its completion port, servicing events as they occur (e.g. if a requested web page is not in cache, issue an asynchronous read to bring it in and attach the serving of that page to the IO completion).  The only updated shared data left is for the purpose of managing the cache (ref-counting, updating hash-synonym list, evicting older contents).  Frequently-updated statistical counters should also be kept per processor and rolled-up infrequently.
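The per processor statistical counters mentioned above can be sketched in portable C++ as follows.  This is a minimal illustration, not the design from Figure 2: the names PaddedCounter and PartitionedCounter are hypothetical, the 64-byte cache line size is an assumption, and the caller is assumed to pass the index of the processor (or partition) it runs on.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// One counter per slot, aligned so that neighbouring slots never share a
// cache line. 64 bytes is an assumed line size, not a queried value.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

class PartitionedCounter {
public:
    // At least one slot is always allocated so the modulo below is safe.
    explicit PartitionedCounter(std::size_t slots)
        : slots_(slots == 0 ? 1 : slots) {}

    // Hot path: each thread updates only its own slot, so increments on
    // different processors do not contend for the same cache line.
    void add(std::size_t slot, std::uint64_t n = 1) {
        slots_[slot % slots_.size()].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Cold path: roll up all slots. The sum is approximate while writers
    // are still running, which is usually acceptable for statistics.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& c : slots_) {
            sum += c.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<PaddedCounter> slots_;
};
```

The roll-up reads every slot, so it should be called rarely (for example, when statistics are reported) rather than on every request.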

 

An application needs to be aware of the number of processors because it may need to distribute the load if the link or port aggregation technique isn’t good enough.  It may also need to perform a type of load balancing if the requests differ significantly in processing time.  Soft affinity (set via SetThreadIdealProcessor API) may be enough, but if the threads are hard-affinitized to processors (set via SetThreadAffinityMask API), periodic work-stealing logic may be needed to avoid some processors going idle while work queues up on others.  The handling of I/O completion gets trickier, but more details are provided later and I will explain how using Vista can help.   

 

The first part of this series will cover Threads and side effects associated with having too many active threads contending for resources or trying to update the same piece of memory.  It will also provide an overview of some of the improvements that have gone into Vista for thread handling.

 

Threading Issues

 

Having too many active threads hurts an application, especially when shared data is updated frequently, because locks are needed to protect the data.  When locks are taken frequently, even if the total time spent holding locks is very small, each thread runs only briefly before blocking on a lock.  By the time any thread runs again, its cache-state has been wiped out.  Also, preemption while holding a lock is more likely.  A good designer never holds a lock while making a call that could block because that inflates lock hold time.  Unfortunately, the designer doesn’t have much control over preemption and page faults, which also inflate lock hold time.

 

  

Guidelines for reducing the number of threads


Applications mainly have too many threads to simplify the code rather than to create parallelism.  The classic anti-pattern for this is handing off work to another thread and waiting for it to complete when the proper approach should be to make a function call. The exception to this rule is if the consumer needs to be in a different process or thread for isolation reasons.  But even then, the operation should always be asynchronous because the consumer may not be responding.

 

Another reason for having too many threads is not using asynchronous IO where appropriate.  It is not necessary to have “lazy-writer” or “read-ahead” threads.  Issue the IO asynchronously and handle the completion in the main state machine. 

 

Other reasons are less under the control of the application designer.  One is the need for separate “input handler” threads when trying to create a unified state machine to handle IO Completion Ports and RPC.  Another is that using multiple components (RPC, COM, etc.) results in multiple thread pools in the application, because some components have their own thread pools and each is unaware of the others when it makes its thread-throttling decisions.

 

Ideally, an application should have one thread per processor and it should never block.  In real practice, it is almost impossible to avoid calling foreign code that can block that single thread.  The compromise is to have a per processor main thread that executes the state machine and never calls code that may block.  Potentially-blocking operations must be handed-off to a thread pool so that if they do block, the main thread can still run.

 

Design recommendations for an application thread pool
 

A well designed thread pool should minimize the active “filler” threads (i.e. those released when the current thread blocks).  The application should have a single thread-pool which throttles “filler” threads by “parking” the excess ones at safe stopping points (i.e. when not holding any locks).  The worker threads should obtain their own work as opposed to having separate distribution threads (or a “listener” thread to set up a new connection).  Load balancing should not require a separate “load balancer” thread.  Idle workers should attempt to “steal” work from others (this should be kept at a minimum because it might cause cross-processor traffic). 
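The “workers obtain their own work and steal when idle” idea can be sketched as follows.  This is an illustration under simplifying assumptions, not the Vista thread-pool implementation: the names WorkQueue and next_task are hypothetical, and each queue is protected by a plain mutex rather than being lock-free.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// A per-worker queue. The owner pushes and pops at the back; thieves take
// from the front, so owner and thief usually touch different items.
template <typename Task>
class WorkQueue {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> g(m_);
        q_.push_back(std::move(t));
    }

    // Called by the owning worker: take the most recently queued task.
    bool pop(Task& out) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        out = std::move(q_.back());
        q_.pop_back();
        return true;
    }

    // Called by other workers: take the oldest task.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::deque<Task> q_;
};

// Worker `self` looks in its own queue first and scans the other queues
// only when it would otherwise go idle, which keeps stealing (and the
// cross-processor traffic it causes) to a minimum.
template <typename Task>
bool next_task(std::vector<WorkQueue<Task>>& queues, std::size_t self, Task& out) {
    if (queues[self].pop(out)) return true;
    for (std::size_t i = 1; i < queues.size(); ++i) {
        std::size_t victim = (self + i) % queues.size();
        if (queues[victim].steal(out)) return true;
    }
    return false;
}
```

No distribution or load balancer thread appears anywhere in the sketch: each worker calls next_task from its own loop.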

 

The Vista thread-pool has some improvements that can help.  The input queue is lock-free and thread-agnostic IO completion has eliminated the need for specialized “IO threads”.  It is possible to receive input from IO Completion Ports and ALPC (which eliminates the need for a separate input-handler thread).  The APIs to do this are TpBindFileToDirect and TpBindAlpcToDirect.

 

Common threading practices

 

·         Completion Port Thread Throttling

Each IO Completion Port has an active thread limit and keeps track of the number of active threads associated with the port.  The OS thread scheduler updates the active thread count when a thread blocks or resumes and it releases a “filler” thread to take the place of the blocked one.  The scheduler cannot “take back” a filler thread when the original thread resumes.  This is not an issue if threads hardly ever block on anything except the completion port.  The completion port thread-throttling mechanism cannot automatically “park” excess filler threads because it has no knowledge of the application and doesn’t know when it is safe to do so: the thread could be holding a lock, and while the system could detect that a thread holds a system lock, it cannot detect a user lock.

 

·         Switching Threads between or during Requests

A server application can spin up a thread to service each connection or it can maintain a pool of threads that service a larger number of connections.  Typically the switching of threads among connections occurs on a request boundary, but it could occur during the request (i.e. when it blocks).  No thread-throttling is required because no extra threads are released.  It can be done with setjmp/longjmp-style user stack switching or by queuing a “resume” work packet instead of blocking inside a work packet.

 

·         Handling Multiple Input Signals

It is often necessary for an application to handle input from multiple sources (e.g. shutdown event, registry change, IO completions, device/power notifications, incoming RPC).  Unfortunately there is no unified way to handle all of these.  The WaitForMultipleObjects API can handle some of the cases but it doesn’t cover IO Completion Port and RPC.  Also, WaitForMultipleObjects is limited to 64 objects and has a significant setup/teardown cost.  In many cases, WaitForMultiple(Any) can be replaced with a single event plus a type code in the payload data.  Another optimization is to use RegisterWaitForSingleObject to avoid “burning” a thread which sits waiting on an event.  Instead of having a separate thread that does nothing but wait on a registry change event, RegisterWaitForSingleObject can automatically queue a work item to a thread pool where it gets processed in the main loop along with IO completions, etc.
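The “single event plus a type code in the payload data” pattern can be sketched in portable C++ as follows.  This is only an illustration: a condition variable stands in for the Windows event, and the names SignalType, Signal, SignalQueue and run_until_shutdown are hypothetical.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// The type code that replaces "which of N handles was signaled".
enum class SignalType { Shutdown, ConfigChanged, IoCompleted };

struct Signal {
    SignalType type;
    std::string payload;
};

// All sources post to one queue; the consumer waits on one object only.
class SignalQueue {
public:
    void post(Signal s) {
        {
            std::lock_guard<std::mutex> g(m_);
            q_.push_back(std::move(s));
        }
        cv_.notify_one();
    }

    // Blocks until a signal is available, then returns the oldest one.
    Signal wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Signal s = std::move(q_.front());
        q_.pop_front();
        return s;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Signal> q_;
};

// Main loop: dispatch on the type code until Shutdown arrives.
// Returns the number of non-shutdown signals that were handled.
int run_until_shutdown(SignalQueue& q) {
    int handled = 0;
    for (;;) {
        Signal s = q.wait();
        switch (s.type) {
            case SignalType::Shutdown:
                return handled;
            case SignalType::ConfigChanged:
            case SignalType::IoCompleted:
                ++handled;  // a real application would act on s.payload here
                break;
        }
    }
}
```

Adding a new input source then means adding a type code and a case in the dispatch, not another object to wait on.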

 

·         OS Thread Scheduling Basics

The thread is the unit of scheduling and the thread with the highest priority gets to run (not the ‘n’ highest where ‘n’ is the number of processors).  Applications specify “priority class”, not actual priority.  Typically, a thread is boosted when readied and the boost decays as processor time is consumed.  If a thread consumes a “quantum” of processor time without waiting, it must round-robin with equal priority threads.  Prior to Vista, determining when a thread consumed a quantum of processor time was done using timer-based sampling.  Now it is done using the hardware cycle counter and is much more accurate.

 

When a thread is readied, a search is done for a processor to run it.  First, an attempt is made to find an idle processor. While searching, the thread’s “ideal” processor is favored, followed by last processor and current processor.  NUMA nodes and physical vs. hyper-threaded processors are considered during the search (e.g. if ideal processor is not available, try other processors on the same NUMA node; if hyper-threaded, attempt to use idle physical processors first).  Secondly, if an idle processor cannot be found, attempt to preempt the thread’s “ideal” processor.  If thread’s priority is not high enough to preempt, queue it to the “ideal” processor.  A thread’s “Ideal” processor can be set using SetThreadIdealProcessor; otherwise it is assigned by the system in a way to spread the load but keep threads of the same process on the same NUMA node.

 

·         HyperThreading Specifics
The thread scheduler is hyper-threading aware and the OS uses the Yield instruction to avoid starving other virtual processors on the same physical processor when spinning.  User code that spins should use the YieldProcessor API for the same reason.  The GetLogicalProcessorInformation API can be used to get information about the relationship among cores and nodes as well as information about the caches such as size, line size and associativity.
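
A spin loop that is friendly to the sibling logical processor might look like the following sketch.  It is a portable stand-in, not Windows code: cpu_relax and SpinLock are hypothetical names, cpu_relax plays the role that YieldProcessor plays on Windows, and the x86 branch assumes a GCC- or Clang-compatible compiler.

```cpp
#include <atomic>
#include <thread>

// Tell the processor that the caller is in a spin-wait loop. On x86 this
// emits the PAUSE instruction; on other targets the sketch falls back to
// yielding the thread, which is a heavier operation.
inline void cpu_relax() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#else
    std::this_thread::yield();
#endif
}

class SpinLock {
public:
    // Returns true if the lock was free and is now held by the caller.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        for (;;) {
            if (try_lock()) return;
            // Wait on a plain load so that spinning processors do not keep
            // writing to the cache line, and relax between polls.
            while (locked_.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
        }
    }

    void unlock() {
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};
```

A lock like this only makes sense when hold times are very short; otherwise the waiting thread should block instead of spinning.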

 

 

Stay tuned for our next installment "Data Structures and Locking Issues" ...

Networking Adapter Performance Guidelines
18 March 08 07:46 AM | winsrvperf | 1 Comments   

Networking performance has increasingly become one of the most important factors in overall system performance. Many of the factors that affect networking performance fall under the following three categories: Network adapter hardware and driver performance, network stack performance, and the way applications interact with the network stack. We will highlight some of the more important networking adapter properties and advanced features available to yield optimal performance.

Introduction

Networking performance is measured in throughput and response time. It is also important to achieve the optimal performance operating point without over-utilizing the system resources. To reach that optimal performance point, the network adapter vendors have enhanced their hardware capabilities to improve scalability, the amount of time to process data, and built in capabilities to dynamically adjust hardware and software parameters depending on workload characteristics.

Network Adapter Hardware

Over the past decade, networking speeds have increased by orders of magnitude to keep up with increasingly network-intensive applications, and the load that networking service routines place on the host processors has increased with them.  In light of these changes, it has become increasingly important to consider offloading some of the tasks to the hardware and to further optimize how the hardware interacts with the software to scale and improve performance.  Some of the features include Task Offload, TCP Offload, Interrupt Moderation, dynamic tuning on the hardware, Jumbo Frames, and Receive Side Scaling (RSS).  These are particularly important for the high-end network adapters that will be used in configurations requiring top performance.

Task Offload Features

The Microsoft networking stack can offload one or more of the following tasks to a network adapter that has the support for the offload capabilities.  The following are the supported offload tasks:

Checksum Offload: For most common IPv4 and IPv6 network traffic, offloading the checksum calculation to the network adapter hardware offers a significant performance advantage: it reduces the number of CPU cycles required per byte, which improves overall system performance.  UDP checksum offload support was added in Windows Server 2008 and was not available in prior releases.
 

Large Send Offload: Applications that send down large amounts of data rely on the network stack to segment the large buffers into smaller Ethernet size buffers.  That size is typically 576 bytes on the Internet and 1500 bytes on LANs.  Large Send Offload (LSO) allows send data to be coalesced into 64KB segments and the segmentation work to be offloaded to the hardware.  Offloading the work reduces the host CPU cycles and can improve overall system performance.  Giant Send Offload (GSO) is a superset of LSO which allows send buffers to be coalesced into segments that are greater than 64KB.

IP Security (IPSec) Offload: Windows offers the ability to offload the encryption work of IPSec to the network adapter hardware. Encryption, especially 3DES, has a very high cycles/byte ratio. Therefore, it is no surprise that offloading IPSec to the network adapter hardware can yield large performance gains, depending on the scenario and the workload characteristics.

Interrupt Moderation

A simple network adapter interrupts the host processor upon receiving a packet or when sending a packet is complete. In many scenarios where there is high processor utilization, it is best to coalesce several packets for each interrupt and reduce the number of times the host processor is interrupted.  Because it is a common mistake to tune interrupt moderation for throughput and hurt response time, most network adapter vendors have implemented dynamic interrupt moderation schemes in their solutions.

Jumbo Frame Support

Supporting larger Maximum Transmission Units (MTUs) and thus larger frame sizes, specifically Jumbo Frames, will reduce the number of trips needed to send the same amount of data.  This results in a significant reduction of CPU utilization.
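
The reduction in trips can be estimated with a little arithmetic, sketched below.  The assumptions are mine, not the original post's: frames_needed is a hypothetical helper, each frame is assumed to carry 40 bytes of IPv4 and TCP headers with no options, and 9000 bytes is used as an example jumbo MTU.

```cpp
#include <cstdint>

// Number of frames needed to carry `payload_bytes` of TCP data when each
// frame holds at most (mtu - header_bytes) bytes of payload.
// The default of 40 assumes a 20-byte IPv4 header plus a 20-byte TCP
// header, with no options.
constexpr std::uint64_t frames_needed(std::uint64_t payload_bytes,
                                      std::uint64_t mtu,
                                      std::uint64_t header_bytes = 40) {
    // Integer ceiling division: round up to a whole number of frames.
    return (payload_bytes + (mtu - header_bytes) - 1) / (mtu - header_bytes);
}
```

Under these assumptions, sending 1 MB (1,048,576 bytes) takes frames_needed(1048576, 1500) = 719 frames with a standard 1500-byte MTU and frames_needed(1048576, 9000) = 118 frames with a 9000-byte MTU, roughly a sixfold reduction in per-frame processing.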

Receive Side Scaling

For large scale applications, being able to simultaneously process networking requests on multiple processors ensures both improved performance through parallelism and distribution of the CPU load among the system processors.  Receive Side Scaling, when supported by the underlying hardware, provides this capability for TCP traffic and the technology is recommended for Web and File Server scenarios where a server is servicing requests from a large number of connections.

Network Adapter Resources

Most network adapters allow administrators to manually configure send and receive resources through the Advanced Networking tab for the adapter.  The most common ones are the receive and send buffers, which most of the time are configured to the low mark resulting in sub-optimal performance.  A small subset of adapters allow for dynamic adjustment for their networking resources so a manual configuration is unnecessary.

Network Adapter Characteristics

When choosing which network adapters to use, you should always get 64-bit PCI-Express adapters.  Using 32-bit adapters will limit the amount of data the adapter can transfer and will provide sub-optimal performance when copying data into the adapter’s buffers.  Also, an adapter that is not PCI-E will get capped at around 7 gigabits per second of pure data transfer, excluding headers.  This becomes a bottleneck if you are planning to use 10 Gigabit adapters in your setup and would like to get the full bandwidth out of them.

Network Adapter Tuning

The high performance features discussed earlier about network adapters are necessary for achieving best performance on most scenarios and workloads in single and multiple processor system configurations.  The following are guidelines for where to put the adapter in the system and how to best optimize the distribution of the network adapter interrupts. 

Bus Characteristics

It was mentioned earlier that it is best to place the network adapter in a 64-bit PCI-Express slot.  That gives the best performance.  If you will be using PCI or PCI-X slots, then you should try to place your adapter in a slot that does not share the bus with other devices in the system.  Most hardware vendors have diagrams on the inside of the servers or on their websites that describe the layout of the machine.  Sharing a bus with other devices can degrade performance because of latency on the bus and contention.

Interrupt Binding

Interrupts generated by the network adapter can be partitioned to a single or a select group of processors to improve performance.  An example would be partitioning your application threads to run on the same processor where network traffic is processed to preserve cache locality.  In order to bind an adapter’s interrupts to a select group of processors, we recommend using the Interrupt Affinity Policy tool. 

Resources

·         Windows Server 2008 Performance Tuning Guide
http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx

·         High Performance Network Adapters and Drivers
http://www.microsoft.com/whdc/device/network/NetAdapters-Drvs.mspx


Ahmed Talat
Performance Manager
Windows Server Performance Team

Windows Server 2008 - Scalability and Performance Presentation
07 March 08 12:16 AM | winsrvperf | 0 Comments   

Hi all, I thought I’d forward around a link to the WS08 performance presentation we did for the Server 2008 launch.  We cover a number of the areas/roles of the product and provide comparisons against Server 2003 – have a look at the attached XPS document!

Cheers,
Bill Karagounis
Group Program Manager
Windows Server Performance Team

Hyper-V and Multiprocessor VMs
29 February 08 01:43 AM | winsrvperf | 14 Comments   

Thanks for visiting our blog! I’m a development lead in the Windows Server Performance team and I led the performance effort on Hyper-V for Windows Server 2008 over the past three and a half years.

 

We’ve worked with the product team throughout the Hyper-V development cycle to deliver a competitive product and we’re excited about shipping Hyper-V RTM this year, with the Hyper-V Beta shipping in Windows Server 2008 this week!

 

Architectural Overview

 

Hyper-V uses a hypervisor-based architecture and leverages the driver model of Windows for broad hardware support. The hypervisor partitions a server into containers of CPU and memory. As a micro-kernel, it provides mechanisms for inter-partition communication upon which our new high-performance synthetic I/O architecture is built. The root partition owns physical I/O devices and provides services including I/O implemented by the virtualization stack to the child partitions.

 

The virtualization stack implements emulated I/O devices such as an IDE controller and a DEC 21140A network adapter. However, it is expensive to virtualize such devices. Sending a single I/O might require multiple trips between the virtualization stack and child partition. Instead, Hyper-V exposes synthetic I/O devices that are specially designed for VM environments. These devices are attached to VMBus, which is a plug-and-play capable bus that uses shared memory for efficient inter-partition communication. The Windows guests detect the devices on VMBus and load the appropriate drivers.

 

Synthetic I/O in Hyper-V uses a client-server architecture with Virtualization Service Providers (VSPs) in the root and Virtualization Service Clients (VSCs) in the child. This architecture significantly reduces the cost of sending an I/O. Virtual Server customers should observe a major reduction in CPU usage in I/O-intensive loads when they migrate their VMs to Hyper-V.

 

In addition, we developed operating system enlightenments for Windows Server 2008, which make the NT kernel and memory manager smarter in VM environments, again to reduce the cost of virtualization.

 

Multi-Processor Guests

 

For this first blog post, I want to highlight one of the major performance features in Hyper-V: multi-processor virtual machines. Hyper-V supports 4P VMs for Windows Server 2008 guests and 2P VMs for Windows Server 2003 SP2 guests. For more intensive server workloads, you might consider virtualizing them in 2P or 4P VMs on Hyper-V. Of course, you should use multi-processor VMs only if the workload requires it since there is some cost to having additional processors.

 

However, operating system kernels and drivers use spin locks which do not block and spin until the lock is acquired, with the assumption that the lock is held for a short period. Virtualization breaks this assumption as virtual processors (VPs) are time-sliced. If a VP is preempted while holding a spin lock, other VPs may spin for a long time wasting CPU cycles.

 

We developed innovations in the hypervisor and Windows Server 2008 kernel to try to prevent long spin wait conditions and also to efficiently detect and handle them when they do occur. We also designed the hypervisor, including the scheduler and memory virtualization logic, to be lock-free on most critical paths to ensure good scalability on multi-processor systems.

 

As a result, Windows Server 2008 as a 4P guest scales well compared to the physical 4P system. This is one example of Windows Server 2008 as a guest and Hyper-V together providing performance advantages. We plan to continue improving our scalability on multi-processor systems and multi-processor VMs in subsequent releases.

 

Closing Thoughts

 

Thanks for reading this far! I would encourage you to try Hyper-V Beta in Windows Server 2008, which launched this week. And take a look at the Windows Server 2008 and Virtualization web site for more information.

 

I look forward to writing more on our work on Hyper-V performance. Please add our blog to your RSS feeds!

 

Regards,

 

John Sheu

Senior Development Lead

Windows Server Performance Team

Welcome!
07 February 08 05:54 PM | winsrvperf | 0 Comments   

Welcome to the Microsoft Windows Server Performance team blog! As the Group Program Manager for the team, I’m delighted to introduce the team and provide the first post to kick off the blog.

The Windows Server Performance team is a part of the Core Operating System Division at Microsoft. Our charter is to understand and improve the performance of Windows Server. As a matter of engineering, the sort of work we do involves:

·         Measurement of performance;

·         Analysis to identify bottlenecks;

·         Identification and implementation of architectural and code changes to improve performance; and

·         To close the loop, verification that the changes we made did what we expected them to do :)

The work gives us an opportunity to see the OS as a whole, and to study the interaction between software components and between hardware and software.

We tend to focus on core scenarios and capabilities in Windows Server (other teams focus on role-specific performance), and look for ways to improve efficiency and scalability. Some of the areas we cover (and will post about) are virtualization, multi-core/multi-proc scalability, file systems (local and remote), network and disk I/O, and server power. We also plan to discuss OS and server application performance in general, and to share some of what we have learnt over the years.

We’ve already published v1.0 of our performance tuning whitepaper for Windows Server 2008 http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx - have a look and give us feedback (there’s a feedback link in the document). We will also be doing a Server 2008 performance webcast on Feb 8th to talk directly about the performance features and improvements in Server 2008.

Additionally, we’ll have some of the team on hand & presenting at the joint launch of Windows Server 2008, SQL Server 2008 and Visual Studio 2008 in Los Angeles on February 27th. See the launch site for more detail. We hope to see you there!

We are all looking forward to the day when everyone has the opportunity to use the next major release of our OS.

Thanks,
Bill Karagounis
Group Program Manager
Windows Server Performance Team
