Tuning Windows Server 2008 for PHP
25 July 08 07:23 AM | winsrvperf | 3 Comments   

Tom Hawthorn, Karthik Mahesh - Windows Server Performance Team

A significant percentage of web sites utilize PHP as a platform for dynamic content.  During the development of Windows 2008, Microsoft included improvements that enable PHP to run more efficiently than previous Windows releases.  This article describes how to tune Windows 2008, IIS 7.0 and PHP for environments with a single site and high concurrent user traffic.

Introduction

Windows 2008 and IIS 7.0 have new features and optimizations that allow PHP to run more efficiently and robustly.   The most significant improvement Microsoft made was to update the CGI support in IIS 7.0 to conform to the Fast CGI standard.  Fast CGI saves a process create/destroy operation per request by pooling and reusing multiple worker processes.  This translates into significant performance improvements for PHP applications on Windows.

By default, the Fast CGI support in IIS 7.0 is configured to be conservative when allocating resources in order to best support the scenario where hundreds of web sites are running on a single physical server.  This was thought to be the most important deployment environment for PHP on Windows.  This article describes the differences between tuning a Windows server for a multi-site scenario versus a single site scenario assuming high overall traffic.

Multi-site versus Single-site Web Servers

Let’s talk about two typical environments for web servers where resource management and performance become top concerns: the “web hosting” environment and the “enterprise” environment.

In web hosting environments servers host hundreds of sites on a single physical machine.  There are dozens of companies that sell pre-packaged web sites for under $20 per month.  They achieve cost-efficiency by deploying hundreds of sites per single server machine and they place limits on the amount of traffic that each site can service.  Perhaps it is not much of a surprise that a huge percentage of web sites on the internet run on shared hardware.  Individually, hosted sites are low traffic but the aggregate traffic adds up to some serious load.  Administrators must take care to isolate the sites from each other for security reasons and must ensure that no single site can consume all the resources on the machine.  The default configuration values in Windows 2008/IIS 7.0 are optimized for the web hosting environment.

The enterprise environment is virtually the polar opposite from the hosted environment from the perspective of how software should manage resources.  Instead of strictly limiting and aggressively reclaiming the resources per-site, an administrator wants to give all of the machine’s resources to a single site.  Administrators achieve scale and robustness by load balancing the internet traffic across multiple machines serving the same web site content.  This article describes how to tune Windows 2008 and IIS 7.0 for the enterprise environment.

Request Concurrency

Web requests are made from the user’s web browser to a web server.  The server receives the request, processes it and sends back some data.  An HTTP request is usually small, maybe a few hundred bytes.  However, the response may be large and the ephemeral memory required to generate the response can be even larger.  During the period in which the web page is executing code or waiting for a response from somewhere else (i.e. a database, disk, another web site, etc…) the memory associated with the web request cannot be released.  Once the request is completed usually memory can be released with the exception of cached items.

I refer to the term “request concurrency” to describe the number of requests being processed on a web server at any given moment in time.  As average request concurrency grows on a web server, so does the average memory utilization.  In order to conserve memory a web server can limit request concurrency by queuing new incoming requests rather than servicing it immediately if the number of in-flight requests exceeds some limit.  This approach has the side effect of increasing latency because user requests may need to wait for in-flight requests to complete before they are handled.

Increasing the Default Concurrency Limits

On Windows 2008, an HTTP request will be handled by multiple software component layers beginning with the network stack and travelling up into IIS and then sometimes into third party technologies such as PHP.  Each layer will perform some work on the HTTP request before handing the request on to the next layer.  Each time a layer receives a new HTTP request, it has the option of queuing it or processing it.  Therefore, increasing concurrency limits involves modifying configuration associated with multiple layers.  This section describes configuration parameters in http.sys, IIS 7.0, FastCGI and PHP.

 

IIS 7/queueLength

Description:

This parameter controls the maximum number of requests that IIS 7.0 will allow to be queued simultaneously.  It allows the system to be more robust in handling spikes in request concurrency beyond the configured limits.

 

Normally, if a web request is received by a web server and its queue is full the web server will return an HTTP error 503 (service unavailable).  Increasing the queue limit value has no impact on a web server that does not exceed its queue limits under normal conditions.  On web servers that experience occasional bursts of requests that would exceed the default queue limits, increasing the limit may allow the server to satisfy all requests without error but with a higher latency.  On web servers that are overloaded during steady-state operation increasing this value may have a detrimental effect.

 

Default Value:

1000

 

Tuned Value:

65535

 

Command Line:

appcmd.exe set set apppool "DefaultAppPool" /queueLength:65535

 

 

 

IIS 7/appConcurrentRequestLimit

Description:

This parameter controls the maximum number of in-flight requests in the IIS 7.0 layer.  This includes requests that are being processed or are queued by the CGI layer.

 

Increasing this value on a web server that never experiences more than 5000 concurrent requests should have no impact.  On web servers that receive very large numbers of concurrent requests and that have available resources during steady state load, increasing this setting will allow the server to fully utilize its memory and CPU.  Servers that are already 100% utilized may be negatively impacted by increasing concurrent request limits.

 

Default Value:

5000

 

Tuned Value:

100000

 

Command Line:

appcmd.exe set config /section:serverRuntime /appConcurrentRequestLimit:100000

 

 

 

Http/MaxConnections

Description:

This parameter controls the maximum number of concurrent TCPIP connections that HTTP will allow.

 

By default, only 5000 concurrent TCPIP connections are allowed by the HTTP driver in Windows.  There is typically only one outstanding HTTP request per connection, therefore increasing any other concurrency limit is pointless unless the maximum number of concurrent connections is also increased.  Each connection maintained by Windows will use some kernel memory and requires some CPU to maintain state.  I don’t recommend increasing this limit on 32 bit machines because of the limited kernel address space.

 

Default Value:

5000

 

Tuned Value:

100000

 

Command Line:

reg add HKLM\System\CurrentControlSet\Services\HTTP\Parameters /v MaxConnections /t REG_DWORD /d 1000000

 

 

 

FastCGI/Php Concurrency

Description:

This is actually two parameters, the first is the maximum concurrent requests and the second is the number of requests that can be executed by a fast CGI process before the process is recycled. 

 

The CGI model requires only a single concurrent request per pooled process.  So the max instances parameter tells IIS how many processes to start up.  Each process will consume significant resources on the server so the initial recommendation of 32 is somewhat conservative.  Increasing the number of requests that each process can handle before being recycled merely decreases the rate of process creation/destruction and reduces the average CPU required to process each request.

 

Default Value:

4/200

 

Tuned Value:

32/10000

 

Instructions:

1.       notepad %windir%\system32\inetsrv\config\applicationhost.config

2.       find the "fastCGI" element, change it to the following (assuming php-cgi.exe is in c:\php)

 

<fastCgi>

  <application fullPath="C:\PHP\php-cgi.exe" instanceMaxRequests=“10000" maxInstances="32">

    <environmentVariables>

      <environmentVariable name=”PHP_FCGI_MAX_REQUESTS” value=”10000”/>

    </environmentVariables>

  </application>

</fastCgi>

 

Conclusion

Tuning a Windows 2008 machine for PHP performance in enterprise environments is all about increasing the default concurrency limits.  Remember, if you try out some of the tunings in this article make sure to test the effects of the changes in a controlled environment before deploying them to your front line servers.   Increasing the concurrency limits will generally have the effect of increasing the steady state memory utilization and CPU if concurrency is a bottleneck on your system.  If you don’t have enough memory or your CPU is already fully utilized, don’t increase the concurrency limits!  Finally, the tuned values in this article are values that I found empirically in my own test environment.  They may or may not be the right values for your environment so play around with them to find out what works for you.  Happy tuning!

Designing Applications for High Performance - Part III
26 June 08 12:44 AM | winsrvperf | 4 Comments   
 
Rick Vicik - Architect, Windows Server Performance Team

 

The third, and final, part of this series covers I/O Completions and Memory Management techniques.  I will go through the different ways to handle I/O completions with some recommendations and optimizations introduced in Vista and later releases of Windows.  I will also cover tradeoffs associated with designing single and multiple process applications.  Finally, I will go through memory fragmentation, heap management, and provide a list of the new and expanded NUMA topology APIs.   

 

Some common I/O issues

It is recommended to use asynchronous I/O to avoid switching threads and to maximize performance.  Asynchronous I/O is more complex because it needs to determine which of the many outstanding I/Os completed.  Those I/Os may be to different handles, mixed file and network I/O, etc.  There are many different ways to handle I/O completion, not all of which are suitable for high performance (e.g. WaitForMultipleObjects, I/O Completion Routines).  For highest performance, use I/O Completion Ports.  Prior to Vista, scanning the pending overlapped structures was necessary to achieve the highest performance, but the improvements in Vista have made that technique obsolete.  However, it should be noted that an asynchronous write can block when extending a file and there is no asynchronous OpenFile. 

 

The old DOS SetFilePointer API is an anachronism.  One should specify the file offset in the overlapped structure even for synchronous I/O.  It should never be necessary to resort to the hack of having private file handles for each thread.

 

Overview of I/O Completion Processing

The processor that receives the hardware interrupt runs the interrupt service routine (ISR).  Interrupts are either distributed across all processors or the interrupts from a specific device can be bound to a particular processor or set of processors.  The ISR queues a DPC (usually to the same processor, otherwise an IPI is required) to perform the part of I/O completion processing that doesn’t need access to the user address space of the process that issued the I/O.  The DPC queues a “Special Kernel APC” to the thread that issued the I/O to copy status and byte count information to the overlapped structure in the user process.  In the case of buffered I/O, the APC also copies data from the kernel buffer to the user buffer.  In the case of I/O Completion Routines (not ports), the “Special Kernel APC” queues a user APC to itself to call the user-specified function.  Moreover, prior to Vista every I/O completion required a context-switch to the thread that issued the I/O in order to run the APC.

 

These APCs are disruptive because as the number of processors and threads increases, the probability that the APC will preempt some other thread also increases.  This disruption is less likely to happen by fully partitioning the application.  That includes having per processor threads and binding interrupts to specific processors. 

 

What’s new for I/O in Vista and Above

·         Ability to flag an I/O as low priority.  This reduces the competition between background and foreground tasks, and improves I/O bandwidth utilization.  Low priority IO is exposed via SetPriorityClass PROCESS_MODE_BACKGROUND_BEGIN, also by NtSetInformationProcess(process,ProcessIoPriority,...

·         There are no disruptive APCs running when using I/O Completion Ports.  Also, this can be accomplished for Overlapped structs if the user locks them in memory by using the SetFileIoOverlappedRange call.

·         Ability to retrieve up to ‘n’ I/O completions with a single call to GetQueuedCompletionStatusEx. 

·         The option to skip setting the event in the file handle and skip queuing a dummy completion if the I/O completes in-line (i.e. did not return PENDING status).  These can be done by making a call to SetFileCompletionNotificationModes. 

·         The Dispatcher lock is not taken when a completion is queued to a port and no threads are waiting on that port.  Similarly, no lock gets taken when removing a completion if there are items in the queue when GetQueuedCompletionStatus is called because again the thread does not need to wait for an item to be inserted.  If the call to GetQueuedCompletionStatus was made with zero timeout, then no waiting takes place.  On the other hand, the lock is taken if queuing a completion wakes a waiting thread or if calling GetQueuedCompletionStatus results in a thread waiting.

 

I/O Completion Port Example

Let’s take an example where the main thread loops on GetQueuedCompletionStatus and calls the service function which was specified when the I/O was issued (passed via an augmented Overlapped structure).  The service functions issue only asynchronous I/O and do not wait, therefore the only wait in the main thread is really on the call made to GetQueuedCompletionStatus.  The following are some examples of “events” whose completion we wait on and suggestions on what to do next once they complete:

 

-          If the completion is for a new connection establishment, set up a session structure and issue an asynchronous network receive. 

-          If the completion is for a network receive, parse the request to determine the file name and issue a call to TransmitFile API. 

-          If the completion is for a network send, log the request and issue an asynchronous network receive. 

-          If the completion is for a user signal (from PostQueuedCompletionStatus), call the routine specified in the payload.

 

The timeout parameter on GetQueuedCompletionStatus (GQCS) can cause it to wait forever, return after the specified time, or return immediately.  Completions are queued and processed FIFO, but threads are queued and released LIFO.  That favors the running thread and treats the others as a stack of “filler” threads.  Because in Vista the Completion Ports are integrated with the thread pool and scheduler, when a thread that is associated with a port waits (except on the port) and the active thread limit hasn’t been exceeded, another thread is released from the port to take the place of the one that waited.  When the waiting thread runs again, the active thread count of the port is incremented.  Unfortunately, there is no way to “take back” a thread that is released this way.  If the threads can wait and resume in many places besides GQCS (as is usually the case), it is very common for too many threads to be active.

 

PostQueuedCompletionStatus allows user signals to be integrated with I/O completion handling which allows for a unified state machine.

 

Characteristics of I/O Completion Ports

An I/O Completion Port can be associated with many file (or socket) handles, but not the reverse.  The association cannot be changed without closing and reopening the handle.  It is possible to create a port and associate it with a file handle using a single system call, but additional calls are needed to associate a port with multiple handles. 

 

While you don’t need an event in the Overlapped structure when using Completion Ports because the event is never waited on, if you leave it out, the event in the file handle will be set and that incurs extra locking.

 

In Vista, applications that use Completion Ports get the performance benefit of eliminating the IO Completion APC without any code changes or even having to recompile.  This is true even if buffered IO is used.  The other way to get the benefit of IO Completion APC elimination (locking the overlapped structure) requires code changes and cannot be used with buffered IO.

 

Even if the I/O completes in-line (and PENDING is not returned), the I/O completion event is set and a completion is queued to the port unless the SetFileCompletionNotificationModes option is used.

 

if( !ReadFile(fh,buf,size,&actual,&ioreq)){

    // could be an error or asynchronous I/O successfully queued

    if( GetLastError() == ERROR_IO_PENDING ) {

      // asynchronous I/O queued and did not complete “in-line”

    } else {

      // asynchronous I/O not queued or was serviced in-line and failed

    }

} else {

    // completed in-line, but still must consume completion

    // unless new option specified

}

 

 Memory Management Issues

When designing an application, developers are often faced with questions like - Should the application be designed with a single process or multiple processes?  Should there be a separate process for each processor or node?  In this section, we try to answer some of these questions while providing the advantages and disadvantages for each approach.

 

Advantages for designing applications with multiple processes include isolation, reliability and security.  First, an application can take advantage of more than 4GB of physical memory because each process can use up to 2GB for user-mode data.  Second, if memory is corrupted by bad code in one of the processes, the others are unaffected (unless shared memory is corrupted) and the application as a whole does not need to be terminated.  Also, separate address spaces provide isolation that can’t be duplicated with multiple threads in a single process.

 

Some disadvantages of using multiple processes include higher cost of a context switch compared to a thread switch in the same process due to the TLB getting flushed.  Also there are possible performance bottlenecks introduced by the mechanism chosen for Inter-Process Communication (IPC).  Examples of IPC include RPC, pipes, ALPC, and shared memory, so it is important that the right kind of IPC is chosen. Some estimates for round trip cost to send 100 bytes via RPC: 27,000 cycles, local named pipes: 26,000 cycles, ALPC: 13,000.  IPC via shared memory is the fastest but it erodes the isolation benefit of separate processes because bad code can potentially corrupt data in the shared memory.  Also with shared memory it is not always possible to use the data “in-place” and copying incurs an added cost of 2.5 cycles per byte copied.

 

Advantages for designing applications with a single process include not needing cross-process communication, cross process locks, etc.  Single process application can also approximate some of the advantages associated with multiple processes via workarounds.  For instance, exception-handing can trap a failing thread making it unnecessary to terminate the entire process.  The 2GB user virtual address limit is gone on x64 and can be worked around to some degree on 32bits using Address Windowing Extension (AWE) or the 3GB switch to change the user/kernel split of the 4GB virtual address space from 2:2 to 3:1. 

 

Shared Memory Issues

Shared memory is the fastest IPC but sacrifices some of the isolation that was the justification for using separate processes.  Shared memory is secure to outsiders once set up, but the mechanism by which the multiple processes gain access to it has some vulnerability.  Either a name must be used that is known to all the processes (which is susceptible to “name squatting”) or else a handle must be passed to the processes that didn’t create the shared memory (using some other IPC mechanism). 

 

Managing updates to shared memory:

1. Data is read-write to all processes, use cross-process lock to guard data or lock-free structures. 

2. Data is read-only to all but 1 process which does all updates (w/o allowing readers to see inconsistencies)

3. Same as 2 but kernel does updates

4. Data is read-only, unprotect briefly to update (suffering TLB flush due to page protection change).

 

Global Data defined in a DLL is normally process-wide but the linker “/SECTION:.MySeg,RWS” option can be used to make it system-wide if that is what is needed.  Just loading the DLL causes it to be set up as opposed to the usual CreateSection/MapViewOfSection APIs.  The downside is that the size is fixed at compile time.

 

Memory Fragmentation – What is it?

Fragmentation can occur in the Heap or in the process Virtual Address Space.  It is a consequence of the “best fit” memory allocation policy and a usage pattern that mixes large, short-lived allocations with small, long-lived ones.   It leaves a trail of free blocks (each too small to be used) which cannot be coalesced because of the small, long-lived allocations between them.  It cannot happen if all allocations are the same size or if all are freed at the same time.  Avoid fragmentation by not mixing wildly different sizes and lifetimes.  Large allocations and frees should be infrequent and batched.  Consider rolling your own “zone” heap for frequent, small allocations that are freed at the same time (e.g. constructing a parse tree).  Obtain a large block of memory and claim space in it using InterlockedExchangeAdd (to avoid locking).  If the zone is per-thread, there is no need for even the interlocked instruction.  Use the Low Fragmentation Heap whenever possible.  It is NUMA-aware and lock-free in most cases.  It replaces the heap look-asides and covers up to 16KB allocations.  It combines the management of free and uncommitted space to eliminate linear scans.  It is enabled automatically in Vista or by calling HeapSetInformation on the heap handle.

 

Best practices when managing the Heap

·         Don’t use GlobalAlloc, LocalAlloc, WalkHeap, ValidateHeap, or LockHeap.  GlobalAlloc and LocalAlloc are old functions which may take an additional lock even if the underlying heap uses lock-free look-asides.  The WalkHeap and ValidateHeap functions disable the heap look-asides.

 

·         Don’t “shrink to fit” (i.e. allocate a buffer for largest possible message, then realloc to actual size) or “expand to fit” (i.e. allocate typical size, realloc ‘n’ bytes at a time until it fits).  These are fragmentation-producing allocation patterns and realloc often involves copying the data.

 

·         Don’t use the heap for large buffers (>16KB).  Buffers obtained from the heap are not aligned on a natural boundary.  Use VirtualAlloc for large buffers, but do it infrequently, carve them up yourself and recycle them.

 

·         New in Vista: dynamic kernel virtual address allocation... no longer need to manually juggle the sizes of the various kernel memory pools when one of them runs out of space (e.g. desktop heap).

 

·         New in Vista:  Prefetch API - PreFetchCacheLine(adr,options).  The API has a large dependency on the hardware’s support for prefetch.

 

NUMA Support

APIs to get topology information (depends on hardware for the information):

 

  GetNumaHighestNodeNumber

  GetNumaProcessorNode (specified processor is on which node)

  GetNumaNodeProcessorMask (which processors are on the specified node)

  GetNumaAvailableMemory (current free memory on specified node)

 

Use existing affinity APIs to place threads on desired nodes.

 

Memory is allocated on node where thread is running when the memory is touched for the first time (not at allocation time).  For better control over where memory is allocated, use new “ExNuma” versions of the memory allocation APIs.  The additional parameter specifies node.  It is a preferred, not absolute specification and it is 1-based because 0 signifies no preference.

 

  VirtualAllocExNuma(..., Node)

  MapViewOfFileExNuma(..., Node)

  CreateFileMappingExNuma(..., Node)

 

Large pages and the TLB

The Translation Look-aside Buffer (TLB) is a critical resource on machines with large amounts of physical memory.  Server applications often have large blocks of memory which should be treated as a whole by the memory manager (instead of 4KB or 8KB chunks), e.g. a database cache.  Using large pages for those items reduces the number of TLB entries, improves TLB hit ratio, and decreases CPI.  Also, the data item is either all in memory or all out.  

 

  minsize = GetLargePageMinimum();

  p = VirtualAlloc(null, n*minsize, MEM_LARGE_PAGES, ...);

Power and Hyper-V are now part of the Windows Server 2008 Tuning Guide!
17 June 08 08:58 PM | winsrvperf | 3 Comments   

The guide has been updated with sections on Power and Hyper-V guidelines and best practices.  Check out the updated Tuning Guide and tell us what you think by following the feedback link at the top of the Tuning Guide.  We look forward to hearing from you!

Ahmed Talat
Performance Manager
Windows Server Performance Team

Filed under: , ,
Designing Applications for High Performance - Part II
21 May 08 02:48 AM | winsrvperf | 0 Comments   

Rick Vicik - Architect, Windows Server Performance Team

The second part of this series covers Data Structures and Locks. I will provide general guidance on which data structures to use under certain circumstances and how to use locks without having a negative impact on performance.  Finally, there will be examples covering common problems/solutions and a simple cookbook detailing functional requirements and recommendations when using data structures and locks.

 

In order to avoid cache line thrashing and a high rate of lock collisions, the following are suggested guidelines when designing an application:

 

·         Minimize the need for a large number of locks by partitioning data amongst threads or processors. 

·         Be aware of the hardware cache-line size because accessing different data items that fall on the same cache-line is considered a collision by the hardware. 

·         Use the minimum mechanism necessary in data structures.  For example, don’t use a doubly-linked list to implement a queue unless it is necessary to remove from the middle or scan both ways. 

·         Don’t use a FIFO queue for a free list. 

·         Use lock-free techniques when possible.  For example, claim buffer space with InterlockedAdd and use an S-List for free lists or producer/consumer queues. 

 

The “ABA” problem

The “ABA” problem occurs when InterlockedCompareExchange is used to implement lock-free data structures because what is really needed is the ability to detect any change, even a set of multiple changes that restores the original value.  If 2 items are removed from an S-List and the first is replaced, just comparing values would not detect that the list has changed (and the local copy of the ‘next’ pointer from the first item on the list is “stale”).  The solution is to add a version# to the comparison so that it fails after any change, even one that restored the old value.  That gets tricky in x64 because the size of a pointer is the same as the maximum size InterlockedCompareExchange.

 

The “Updating of Shared Data” problem

The primary concern when handling updates to shared data is to be aware of all the ways an item can be reached.  When removing from the middle of a doubly-linked list, an item becomes unreachable when the next and previous pointers of the adjacent items are nulled.  The tricky part is that a thread may have made a local copy of the “next” pointer from the previous item before it was nulled but hasn’t yet accessed that “next” item.  A “lurking reader” must make its presence known to others by using a refcount or by taking per-item locks in a “crabbing” fashion as it traverses the list.  The refcount can’t be on the target item because the lurking reader hasn’t gotten there yet.  It must be on the pointer that was used to get there or else logically apply to the list as a whole.  The “crabbing” technique is usually too expensive in most cases.  It is almost always necessary to have a lock which guards the list.

 

True Lock free Operations

The simple test for a true “lock-free” operation is whether or not a thread can die anywhere during the operation and not prevent other threads from doing the operation.  It is commonly thought that replacing a pointer with a “keep out” indicator is better than using a lock, but it is really just folding the lock and pointer into the same data item.  If the thread dies after inserting the “keep out” indicator, no other thread can get in.

 

Guidelines for good locking practices and things to avoid

 

·         Hash Lookup (cache index or “timer wheel”)

A good general practice is to use an array of synonym list heads with a lock per list (or set of lists), where locks fall on different cache lines.  This design does not increase the number of lock acquires per lookup compared to the single lock implementation while reducing the lock collision rate.  Use a doubly-linked list for synonyms to support removal from the middle (if lookup always precedes removal, a singly-linked list can be used).  If the access pattern is mostly lookups with occasional insert/remove, use a Read/Write lock with a “writer priority” policy to provide fairness.  If the entry is not found, then an exclusive lock is needed to insert it.  If the lock doesn’t support atomic promotion, then must drop, reacquire and rescan.  To avoid memory allocations (and possibly waiting) while holding lock, allocate the new block between the dropping and re-acquiring of the lock.

 

·         N:M Relationship

Suppose a company allows a many-to-many relationship between employees and projects.  That requires a set of intersecting lists to represent the relationships and a single lock can probably be used to guard it.  That solution might be sufficient if most access is read-only, but will become a bottleneck if updates are frequent.  A finer-grain solution is to have a lock on each instance of project and employee.  Adding or removing a relationship requires taking the 2 intersecting locks, which is slightly more than the single lock implementation.  A number of optimizations are possible to avoid lock-ordering issues:  Deferred removal of intersection blocks, one-at-a-time insertion & removal, InterlockedCompareExchange for removal.

 

·         Lock Convoy

FIFO locks guarantee fairness and forward progress at the expense of causing lock convoys.  The term originally meant several threads executing the same part of the code as a group resulting in higher collisions than if they were randomly distributed throughout the code (much like automobiles being grouped into packets by traffic lights).  The particular phenomenon I’m talking about is worse because once it forms the implicit handoff of lock ownership keeps the threads in lock-step.