An application needs to be aware of the number of processors because it may need to distribute the load if the link or port aggregation technique isn’t good enough. It may also need to perform a type of load balancing if the requests differ significantly in processing time. Soft affinity (set via SetThreadIdealProcessor API) may be enough, but if the threads are hard-affinitized to processors (set via SetThreadAffinityMask API), periodic work-stealing logic may be needed to avoid some processors going idle while work queues up on others. The handling of I/O completion gets trickier, but more details are provided later and I will explain how using Vista can help.
The first part of this series will cover Threads and side effects associated with having too many active threads contending for resources or trying to update the same piece of memory. It will also provide an overview of some of the improvements that have gone into Vista for thread handling.
Threading Issues
An application that has too many active threads is a bad thing, especially when shared data is updated frequently because locks are needed to protect the data. When locks are taken frequently, even if the total time spent holding locks is very small, each thread runs only briefly before blocking on a lock. By the time any thread runs again, its cache-state has been wiped out. Also, preemption while holding a lock is more likely. A good designer never holds a lock while making a call that could block because that inflates lock hold time. Unfortunately, the designer doesn’t have much control over preemption and page faults, which also inflate lock hold time.
Guidelines for reducing the number of threads
Applications mainly have too many threads to simplify the code rather than to create parallelism. The classic anti-pattern for this is handing off work to another thread and waiting for it to complete when the proper approach should be to make a function call. The exception to this rule is if the consumer needs to be in a different process or thread for isolation reasons. But even then, the operation should always be asynchronous because the consumer may not be responding.
Another reason for having too many threads is not using asynchronous IO where appropriate. It is not necessary to have “lazy-writer” or “read-ahead” threads. Issue the IO asynchronously and handle the completion in the main state machine.
Other reasons that are less under the control of the application designer are the need for separate “input handler” threads when trying to create a unified state machine to handle IO Completion Ports and RPC. Also, using multiple components (RPC, COM, etc) results in multiple thread pools in the application because some components have their own thread-pools and each is unaware of the others when it makes its thread-throttling decisions.
Ideally, an application should have one thread per processor and it should never block. In real practice, it is almost impossible to avoid calling foreign code that can block that single thread. The compromise is to have a per processor main thread that executes the state machine and never calls code that may block. Potentially-blocking operations must be handed-off to a thread pool so that if they do block, the main thread can still run.
Design recommendations for an application thread pool
A well designed thread pool should minimize the active “filler” threads (i.e. those released when the current thread blocks). The application should have a single thread-pool which throttles “filler” threads by “parking” the excess ones at safe stopping points (i.e. when not holding any locks). The worker threads should obtain their own work as opposed to having separate distribution threads (or a “listener” thread to set up a new connection). Load balancing should not require a separate “load balancer” thread. Idle workers should attempt to “steal” work from others (this should be kept at a minimum because it might cause cross-processor traffic).
The Vista thread-pool has some improvements that can help. The input queue is lock-free and thread-agnostic IO completion has eliminated the need for specialized “IO threads”. It is possible to receive input from IO Completion Ports and ALPC (which eliminates the need for a separate input-handler thread). The APIs to do this are TpBindFileToDirect and TpBindAlpcToDirect.
Common threading practices
· Completion Port Thread Throttling
Each IO Completion Port has an active thread limit and keeps track of the number of active threads associated with the port. The OS thread scheduler updates the active thread count when a thread blocks or resumes and it releases a “filler” thread to take the place of the blocked one. The scheduler cannot “take back” a filler thread when the original thread resumes. This is not an issue if threads hardly ever block on anything except the completion port. The completion port thread-throttling mechanism cannot automatically “park” excess filler threads because it has no knowledge of the application and doesn’t know when it is safe to do so (it could be holding a lock... could detect if holding system lock, but not user lock).
· Switch Threads between Requests or During
A server application can spin up a thread to service each connection or it can maintain a pool of threads that service a larger number of connections. Typically the switching of threads among connections occurs on a request boundary, but it could occur during the request (i.e. when it blocks). No thread-throttling is required because no extra threads are released. It can be done with SetJump/LongJump type user stack switching or by queuing a “resume” work packet instead of blocking inside a work packet.
· Handling Multiple Input Signals
It is often necessary for an application to handle input from multiple sources (e.g. shutdown event, registry change, IO completions, device/power notifications, incoming RPC). Unfortunately there is no unified way to handle all of these. The WaitForMultipleObjects API can handle some of the cases but it doesn’t cover IO Completion Port and RPC. Also, WaitForMultipleObjects is limited to 64 objects and has a significant setup/teardown cost. In many cases, WaitForMultiple(Any) can be replaced with a single event plus a type code in the payload data. Another optimization is to use RegisterWaitForSingleObject to avoid “burning” a thread which sits waiting on an event. Instead of having a separate thread that does nothing but wait on a registry change event, RegisterWaitForSingleObject can automatically queue a work item to a thread pool where it gets processed in the main loop along with IO completions, etc.
· OS Thread Scheduling Basics
The thread is the unit of scheduling and the thread with the highest priority gets to run (not the ‘n’ highest where ‘n’ is the number of processors). Applications specify “priority class” not actual priority. Typically, a thread is boosted when readied and the boost decays as processor time is consumed. If a thread consumes a “quantum” of a processor time without waiting, it must round-robin with equal priority threads. Prior to Vista, determining when a thread consumed a quantum of processor time was done using timer-based sampling. Now it is done using the hardware cycle counter and is much more accurate.
When a thread is readied, a search is done for a processor to run it. First, an attempt is made to find an idle processor. While searching, the thread’s “ideal” processor is favored, followed by last processor and current processor. NUMA nodes and physical vs. hyper-threaded processors are considered during the search (e.g. if ideal processor is not available, try other processors on the same NUMA node; if hyper-threaded, attempt to use idle physical processors first). Secondly, if an idle processor cannot be found, attempt to preempt the thread’s “ideal” processor. If thread’s priority is not high enough to preempt, queue it to the “ideal” processor. A thread’s “Ideal” processor can be set using SetThreadIdealProcessor; otherwise it is assigned by the system in a way to spread the load but keep threads of the same process on the same NUMA node.
· HyperThreading Specifics
The thread scheduler is hyper-threading aware and the OS uses the Yield instruction to avoid starving other virtual processors on the same physical processor when spinning. User code that spins should use the YieldProcessor API for the same reason. The GetLogicalProcessorInformation API can be used to get information about the relationship among cores and nodes as well as information about the caches such as size, linesize and associativity.
Stay tuned for our next installment "Data Structures and Locking Issues" ...