Configuring Windows Server 2008 Power Parameters for Increased Power Efficiency
04 December 08 11:10 PM | winsrvperf | 4 Comments   

Matthew Robben here. I’m a Program Manager on the Windows Server Performance team, and my primary responsibility is Windows Server power management. Server power efficiency is a topic of considerable importance: in today’s difficult economy, IT organizations need to contain and reduce costs. Yet the cost of energy to power and cool a 1U server is now more than the amortized cost of the server (over 3 years).

 

Energy-efficient hardware and software reduce operational costs and directly impact an organization’s bottom line. We’re in the midst of developing Windows Server 2008 R2, and one of our goals for the product is to build a server operating system that is more power efficient than all of our previous releases. Furthermore, to help IT administrators better understand server power management and optimize their current Windows Server 2008 installations, we’re releasing a comprehensive white paper called “Power In, Dollars Out: Reducing the Flows in the Data Center” today. The white paper gives detailed explanations of many factors affecting server power efficiency, and contains a list of best practices for optimization.

 

One of the stated best practices is to properly configure Windows Server 2008’s power management features. According to the Green Grid, just turning on processor power management (PPM) features in the operating system can reduce power consumption by 20%. In Windows Server, this can be done simply by choosing the Balanced or Power Saver power policies (found in the Power Options applet in the Control Panel). Of course, PPM is a complicated technology, with many more toggles than a simple on/off switch. We’ve done quite a bit of work on the Windows Server PPM algorithms and parameters during R2 development. One of the results of this work was the development of a set of parameters that can boost power efficiency by up to 10% on standard benchmark workloads.

 

Good news - you don’t need to wait until R2 to deploy these new parameters on your servers. This blog post will describe PPM technology, explain the parameters involved, and show benchmark test results for the parameter changes on a commodity server. It will also give you a handy command-line walkthrough of the powercfg.exe commands necessary to implement these changes in your environment.

 

First, some context. Power management requires cooperation from the hardware and the operating system to work efficiently. For example, hardware might support low power states, but the operating system schedules computational work and is in the best position to decide when low power states can be leveraged. The Advanced Configuration and Power Interface (ACPI) defines an interface between the operating system and server hardware to be used for power management purposes.

ACPI Processor Performance States

The processor has traditionally consumed the most power in a server, which makes it a great candidate for power-efficiency optimizations. To add detail and flexibility for processor power management, ACPI defines a few sets of states for processors. Performance states, or P-states, are one such set that can be leveraged to increase power efficiency.

P-States

Processors can transition between multiple performance states, or P-states. P-states define incremental levels of processor performance, from P0 (most performant) to Pn (least performant). The ACPI specification does not specify a maximum number of P-states, so Pn is used to refer to the highest numbered, lowest performant P-state that a processor supports.

Each successively higher numbered P-state consumes less power than the previous P‑state. Processors can dynamically switch between these states during operation to provide only as much computational capacity as is necessary, which saves power during periods of low usage.

Figure 1 below shows a hypothetical set of six P-states that would be available to a processor. Note that the maximum P-state (P0) has the highest frequency, while successively higher numbered P-states reduce in frequency. In this case, the minimum P-state is P5, so the terms Pn and P5 would be interchangeable.

P-state explanatory chart

Figure 1. Illustration of P-state number and corresponding frequency

Tuning P-State Parameters for Increased Power Efficiency

Windows Server contains a number of configurable P-state parameters. These can be used to finely tune the power/performance balance of Windows Server PPM. The defaults for these parameters are tuned to deliver excellent power efficiency for most systems and workloads out of the box. However, these are “safe” defaults. They balance performance and power efficiency. Default settings are shown in Table 1. Note that “P-state increase” in this context refers to a transition to a lower numbered, more performant P-state, whereas “P-state decrease” refers to a transition to a higher numbered, less performant P-state. Looking back to Figure 1, an increase would mean moving upward in the chart while a decrease would mean moving downward.

Table 1. Default P-State Parameter Settings in Windows Server 2008

Name | Default | Description
Time Check | 100 ms | The time interval at which the operating system considers a change of the current P-state.
Increase Time | 100 ms | The minimum time period that must expire before considering a P-state increase.
Decrease Time | 300 ms | The minimum time period that must expire before considering a P-state decrease.
Increase Percentage | 30% | The utilization percentage [1] that the CPU must exceed to increase P-state.
Decrease Percentage | 50% | The utilization percentage that the CPU must be below to decrease P-state.
Domain Accounting Policy | 0 (On) | Determines how the kernel power manager accumulates idle time. Settings: 0 (On): idle time is accumulated only when all processors in an idle state domain [2] are idle. 1 (Off): idle time is accumulated and P-states are calculated for each processor without regard to any other processor in the domain.
Increase Policy | IDEAL (0) | Determines how P-state transition decisions are made. Settings: IDEAL (0): calculates the target P-state based only on processor utilization and then finds a nearby available P-state on the system. SINGLE (1): calculates an ideal P-state but only increases or decreases by one P-state per time check interval. ROCKET (2): transitions to the highest P-state available on increase or lowest P-state available on decrease.
Decrease Policy | SINGLE (1) | (See Increase Policy.)

[1] The utilization percentage referenced here is not the same as the CPU usage counter in the Task Manager tool. Without going into more details, this setting is best optimized through empirical experimentation.

[2] A “state domain” is a dependency between different processor cores or packages on a server. Often, processor designs require that if one core is at a particular performance or idle state, the other cores or packages in the domain must also be at the same state. The hardware notifies the operating system of this dependency by establishing a domain through the ACPI interface.
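To make the three policy settings in Table 1 a little more concrete, here is a minimal Python sketch of how each one might choose the next P-state. It is an illustration only, not the kernel’s algorithm: the function name next_pstate and its arguments are hypothetical, and it assumes the ideal P-state has already been calculated from utilization (this post does not describe that calculation).

```python
IDEAL, SINGLE, ROCKET = 0, 1, 2  # policy values as listed in Table 1


def next_pstate(current, ideal, policy, lowest_state):
    """Return the P-state number to move to (0 is the most performant).

    current      -- current P-state number
    ideal        -- ideal P-state number (assumed to be computed elsewhere)
    policy       -- IDEAL, SINGLE, or ROCKET
    lowest_state -- n in Pn, the least performant state available
    """
    if ideal == current:
        return current
    # A lower number is more performant, so this is a P-state "increase".
    increasing = ideal < current
    if policy == IDEAL:
        return ideal
    if policy == SINGLE:
        # Move only one state per time check interval.
        return current - 1 if increasing else current + 1
    if policy == ROCKET:
        # Jump straight to the end of the range.
        return 0 if increasing else lowest_state
    raise ValueError(f"unknown policy {policy!r}")


# Six states (P0..P5), currently at P3, ideal state is P0.
for name, policy in (("IDEAL", IDEAL), ("SINGLE", SINGLE), ("ROCKET", ROCKET)):
    print(name, "->", "P%d" % next_pstate(3, 0, policy, 5))
```

With the defaults (increase IDEAL, decrease SINGLE), a busy processor can jump directly to a more performant state but steps down only one state at a time, which is one way the defaults favor P-state increases.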

During Windows Server 2008 R2 development, our team determined a set of parameters that can boost energy efficiency with a very minor performance cost. Notice in Table 1 that the decrease time default is larger than the increase time default. This setting favors P-state increases over decreases. The default increase and decrease percentage settings of 30 and 50 percent, the default domain accounting policy, and the increase and decrease policy defaults favor P‑state increases as well.

 

To tune the machine for more aggressive power savings, we suggest reducing the decrease time to 100 ms to match the increase time, changing the increase and decrease policies to favor P-state decrease, and switching the domain accounting policy to 1 (Off). We left the increase and decrease percentages at their defaults to ensure that the system PPM parameters were not completely biased toward power savings and to reduce negative performance consequences. Table 2 summarizes these changes.

Important:  Modifying any of these parameters changes the behavior of performance state handling from the out-of-box experience. Before you deploy to production servers, validate the effects of any changes in a test environment.

Table 2. Default and New PPM Parameter Values

Setting | Default value | “Aggressive” value
Time Check | 100 ms | 100 ms
Increase Time | 100 ms | 100 ms
Decrease Time | 300 ms | 100 ms
Increase Percentage | 30 % | 30 %
Decrease Percentage | 50 % | 50 %
Domain Accounting Policy | 0 (On) | 1 (Off)
Increase Policy | 0 (Ideal) | 1 (Single)
Decrease Policy | 1 (Single) | 0 (Ideal)

 

These parameters can only be set using the powercfg.exe command-line tool, which is installed by default to the Windows\System32 folder on Windows Server 2008. The commands to change the P-state settings by using powercfg.exe are given at the end of this post.

Energy Efficiency Analysis of P-State Settings

To test the efficiency of these new power settings (henceforth called “Aggressive” settings), we performed a set of benchmark runs on a four-socket quad-core server. Table 3 gives the system configuration.

Table 3. Four-Socket Quad-Core Server Configuration

System configuration
Processors | 4 × quad-core 2.9-GHz
Memory | 32 × 4-GB DDR2 667-MHz DIMMs
Disk | 4 × 72-GB, 15,000-RPM SCSI
Network adapter | 2 × 1-Gbps

 

We ran the SPECPower benchmark with both the default settings and the Aggressive power saving settings. Figure 2 and Figure 3 show the power usage and power efficiency across different workload levels. The Aggressive settings exhibit significantly better power efficiency than the default settings at a majority of the load levels. The maximum power saving is achieved at the 60‑percent workload level on this configuration, with an approximately 10‑percent improvement in power efficiency compared to the default settings. There is a negligible reduction in overall throughput at utilization levels above 97%.

Power savings of Default vs. Aggressive parameters

Figure 2.  System power across varying SPECPower load levels

Efficiency of default vs. aggressive settings

Figure 3.  System power efficiency across varying SPECPower load levels

These settings were tested on commodity servers with the SPECPower workload. Your particular hardware and workload might deliver different results. Please test any parameter changes before deploying in your production environment.

 

Changing P-State Parameters with Powercfg.exe

If you decide you want to deploy the new P-state parameter settings in your environment, you’ll first need to verify that your Windows Server 2008 installation is configured to use the Balanced power policy. Verify this by going to Power Options in the Control Panel.

 

Done? Next, you need to start a command prompt with administrator privileges. Get the binary dataset that represents the current power settings for P‑states with the following command line:

>powercfg /getpossiblevalue sub_processor procperf 1

You should see the following:

Type: BINARY

Value: 640864000000A0860100E09304001E00000032000000

 

This value represents an encoded dataset of power policy parameters. The parameter values for this dataset can be shown with the decode command:

>powercfg /ppmperf /decode 640864000000A0860100E09304001E00000032000000

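Verify that your power parameter values match the defaults shown below and in Table 1. Note that the decoded Increase Time and Decrease Time values are in microseconds, so 100000 and 300000 correspond to the 100 ms and 300 ms shown in Table 1. If your parameter settings do not match these values, your Windows Server parameters may have already been reconfigured for optimal power efficiency in your environment.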

Busy Adjust Threshold: 100

Time Check: 100

Increase Time: 100000

Decrease Time: 300000

Increase Percent: 30

Decrease Percent: 50

Domain Accounting Policy: 0

Increase Policy: 0

Decrease Policy: 1

Next, you need to change the parameter values to match the “Aggressive” settings described in this post. To do so, use the following command:

>powercfg /ppmperf /encode base 640864000000A0860100E09304001E00000032000000 /decreasetime 100000 /domainaccountingpolicy 1 /increasepolicy 1 /decreasepolicy 0

After executing this command, powercfg will print out a binary dataset representing the new values, like the one shown below.

640364000000A0860100A08601001E00000032000000
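If you are curious what changed between the two datasets, the following Python sketch decodes and re-encodes them. The byte layout is an assumption, inferred only by comparing the default and “Aggressive” datasets in this post with their decoded values; it is not taken from any documentation, and the function names decode_dataset and encode_dataset are hypothetical. Use powercfg itself for real changes.

```python
import struct

# Assumed layout (inferred from the two datasets in this post):
#   byte 0      Busy Adjust Threshold
#   byte 1      packed policies: bit 0 = domain accounting policy,
#               bits 1-2 = increase policy, bits 3-4 = decrease policy
#   bytes 2-21  five little-endian 32-bit values (times and percentages)
_LAYOUT = struct.Struct("<BBIIIII")


def decode_dataset(hex_text):
    """Decode a 22-byte dataset string into a dict of named values."""
    busy, packed, check, inc_time, dec_time, inc_pct, dec_pct = _LAYOUT.unpack(
        bytes.fromhex(hex_text)
    )
    return {
        "Busy Adjust Threshold": busy,
        "Time Check": check,
        "Increase Time": inc_time,
        "Decrease Time": dec_time,
        "Increase Percent": inc_pct,
        "Decrease Percent": dec_pct,
        "Domain Accounting Policy": packed & 0x1,
        "Increase Policy": (packed >> 1) & 0x3,
        "Decrease Policy": (packed >> 3) & 0x3,
    }


def encode_dataset(values):
    """Encode a dict produced by decode_dataset back into a hex string."""
    packed = (
        values["Domain Accounting Policy"]
        | (values["Increase Policy"] << 1)
        | (values["Decrease Policy"] << 3)
    )
    raw = _LAYOUT.pack(
        values["Busy Adjust Threshold"],
        packed,
        values["Time Check"],
        values["Increase Time"],
        values["Decrease Time"],
        values["Increase Percent"],
        values["Decrease Percent"],
    )
    return raw.hex().upper()


default = decode_dataset("640864000000A0860100E09304001E00000032000000")
aggressive = dict(default)
aggressive.update({
    "Decrease Time": 100000,
    "Domain Accounting Policy": 1,
    "Increase Policy": 1,
    "Decrease Policy": 0,
})
print(encode_dataset(aggressive))
```

Under this assumed layout, applying the four “Aggressive” changes to the decoded defaults reproduces the dataset printed by powercfg above, which is a useful sanity check before applying a value.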

 

You need to apply the new dataset by using the setpossiblevalue command:

>powercfg /setpossiblevalue /sub_processor /procperf 2 binary 640364000000A0860100A08601001E00000032000000

 

Finally, use the setactive command to enable the new parameter set. No reboot is necessary for these parameters to take effect.

>powercfg /setactive scheme_balanced

 

If you want to restore the default settings, use the setpossiblevalue command with the default dataset value (shown below), and follow it with a setactive command:

>powercfg /setpossiblevalue /sub_processor /procperf 2 binary 640864000000A0860100E09304001E00000032000000

>powercfg /setactive scheme_balanced 

 

That’s it! You’ve taken your first step to increasing energy efficiency in your datacenter. As our white paper explains, there’s even more you can do. It’s a highly recommended read for cost-sensitive administrators.

 

Thanks for reading!

 

Matthew Robben

Program Manager

Windows Server Performance Team

 

Greater than 64 Logical Processor support on Windows Server 2008 R2
22 November 08 01:19 AM | winsrvperf | 1 Comments   

In the past few weeks, there have been a number of new feature announcements around Windows 7 and Windows Server 2008 R2 at the PDC and WinHEC conferences.  From a Server perspective, Power Management, Virtualization, and Greater than 64 processor support are considered the top three features for Windows Server 2008 R2.  I will focus on the greater than 64 processor support, given it is a new milestone for Windows, and sets the stage for competing on much larger and higher end servers.  The support enables large scale database customers to deploy their solutions on Windows and expect good scalability numbers.  Of course, the scalability realized is highly dependent on the applications and drivers being able to scale well beyond 64 processors.  To do so, application and driver developers are strongly encouraged to read up on the greater than 64 processor work here to see what has changed, and what type of code modifications are necessary to take full advantage of this new capability.  The document describes the architecture and terminology, and goes into more detail about the new APIs.

 

Ahmed Talat

Performance Manager

Windows Server Performance Team

Hyper-V and VHD Performance - Dynamic vs. Fixed
19 September 08 07:08 PM | winsrvperf | 10 Comments   

My name is Tim Litton, I work as a Program Manager within the Microsoft Windows Server team, and my particular area of focus is performance optimization for Hyper-V.

 

With the recent release of Hyper-V, customers are starting to ask us how to configure Hyper-V to get the best performance.  It’s generally recognized that there is overhead in running a virtualized environment, but the question that really needs to be answered is how much?

 

With this in mind, I thought I’d share some of our recent testing of Hyper-V and how disk workloads perform when using Fixed or Dynamic VHDs.  The goal here is to provide some data that backs up the tuning guidance that can be found here: http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx.

 

The following graph shows the relative performances for a number of different scenarios (with Dynamic VHD being the baseline).

 

Hyper-V VHD Performance - Dynamic vs. Fixed

 

Fixed VHD performs better than Dynamic VHD in most scenarios by roughly 10% to 15%, with the exception of 4k writes, where Fixed VHD performs significantly better.

 

We ran 16 virtual machines when performing these tests (see “How We Tested” below) with the goal of evaluating how well Hyper-V performed in the server consolidation scenario.  Being able to consolidate a number of physical machines onto a single machine and have the virtual machines able to handle the load is a very important design goal for Hyper-V.

 

The exact result that a customer is going to see will depend on quite a few variables (e.g. how large and how frequent the reads and writes are, how many outstanding I/Os there can be at one time), so performing real-world testing is the best way to assess what impact virtualization will have.

 

Recently, QLogic  published a benchmark for I/O throughput for storage devices going through Windows Server 2008 Hyper-V (http://www.qlogic.com/promos/products/hyper-v.aspx) that closely matches the native performance, thus demonstrating Hyper-V’s ability to bring the advantages of virtualization to large-scale datacenters.

 

How We Tested

Hardware: HP DL580 G5, 16 x 2.4 GHz (Intel E7340), 16 GB RAM

 

Disk: HP P800, 25 spindles

 

Virtual Machine Setup: 16 Virtual machines, each running Windows Server 2008 Enterprise Edition (64 bit), 1 CPU, 796 MB RAM

 

Testing Software: We used IO Meter (http://www.iometer.org) to generate the workload for the I/O system, with a maximum number of 8 outstanding I/Os per virtual machine to a 100MB file.

Getting system topology information on Windows
13 September 08 12:16 AM | winsrvperf | 2 Comments   

On Windows Server 2008 and later, applications can programmatically get information about how the underlying hardware components relate to one another.  Examples include spatial locality and memory latency.  This article describes how developers can get the system topology information and use it to build scalable solutions on multi-processor and NUMA (Non-Uniform Memory Architecture) systems.

To start things off, the following is a refresher of some definitions that will be used throughout the article:

Term | Definition
Affinity Mask | A bitmask specifying a set of up to 64 processors.
Core | A single physical processing unit.  It can be represented to software as one or more logical processors if simultaneous hardware multithreading (SMT) is enabled.
Hyper-threading | Hyper-Threading Technology (HT) is Intel's trademark for their implementation of the simultaneous multithreading technology.
Logical Processor (LP) | A single processing unit as it is represented to the operating system.  This may represent a single physical processor or a portion of one.
Node | A set of up to 64 processors that are likely in close proximity and share some common resources (i.e., memory). It is usually faster to access resources from a processor within the node than from a processor outside of the node.
Proximity Domain | A representation of hardware groupings, each with a membership ID.  Associations are only made to processors, I/O devices, and memory.  The HAL provides information about processors and memory proximity domains through its interface.

 

From an application’s perspective, a physical system is composed of three components: processors, memory, and I/O devices.  These components will be arranged into one or more nodes interconnected by an unknown mechanism.  The following figure is one type of configuration:

        

In this example, each node contains four processors shown by squares:   Node X contains processors X0, X1, X2, and X3.  Other systems may have a different number of processors per node, which may be multiple packages or multiple cores on the same physical package.  Attached to each node is a certain amount of memory and I/O devices.  An application can programmatically identify where the different pieces of hardware are, how they relate to one another, and then partition its I/O processing and storage  to achieve optimum scalability and performance.

Application Knowledge of Processors

Applications can determine the physical relations of processors in a system by calling GetLogicalProcessorInformation with a sufficiently large buffer that will return the requested information in an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures.  Each entry in the array describes a collection of processors denoted by the affinity mask and the type of relation this collection holds to each other.  The following table outlines the type of possible relations:

Value | Meaning
RelationProcessorCore (0) | The specified logical processors share a single processor core, for example Intel’s Hyper-Threading™ technology.
RelationNumaNode (1) | The specified logical processors are part of the same NUMA node.  (Also available from GetNumaNodeProcessorMask).
RelationCache (2) | The specified logical processors share a cache.
RelationProcessorPackage (3) | The specified logical processors share a physical package, for example multi-core processors share the same package.

 

An application may be interested only in which processors belong to a specific NUMA node so it can schedule work on these processors and improve performance by keeping state in that NUMA node’s memory.  The application may also want to avoid scheduling work on processors that belong to the same core (e.g. Hyper-Threading) to avoid resource contention.
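As a rough illustration of working with these affinity masks, the Python sketch below picks one logical processor per core within a NUMA node. It does not call GetLogicalProcessorInformation; the records list is hypothetical sample data shaped like (relation, affinity mask) pairs, and the helper names are made up for this example.

```python
RELATION_PROCESSOR_CORE = 0  # relation values from the table above
RELATION_NUMA_NODE = 1


def processors_in_mask(mask):
    """Return the logical processor numbers whose bits are set in mask."""
    return [bit for bit in range(64) if (mask >> bit) & 1]


def one_processor_per_core(records, node_mask):
    """Pick the lowest-numbered logical processor of each core in a node."""
    chosen = []
    for relation, mask in records:
        if relation == RELATION_PROCESSOR_CORE and mask & node_mask:
            chosen.append(processors_in_mask(mask & node_mask)[0])
    return sorted(chosen)


# Hypothetical system: two NUMA nodes, each with two cores of two LPs.
records = [
    (RELATION_PROCESSOR_CORE, 0b00000011),
    (RELATION_PROCESSOR_CORE, 0b00001100),
    (RELATION_PROCESSOR_CORE, 0b00110000),
    (RELATION_PROCESSOR_CORE, 0b11000000),
    (RELATION_NUMA_NODE, 0b00001111),
    (RELATION_NUMA_NODE, 0b11110000),
]

node_masks = [m for r, m in records if r == RELATION_NUMA_NODE]
for node_mask in node_masks:
    print(processors_in_mask(node_mask), "->",
          one_processor_per_core(records, node_mask))
```

An application could then set thread affinity to the chosen processors so that threads on the same node do not compete for a single core.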

Application Knowledge of Memory

An application may also be interested in knowing how much memory is available on a specific NUMA node before deciding to make allocations.  Making a call to GetNumaAvailableMemoryNode will provide this kind of information.  An example might be an application interested in keeping data its threads are working on (e.g. some sort of a software cache) in memory that belongs to the same node hosting the processors to which the threads have their affinities set.  This way, when the data is not resident in the processor’s cache, the cost of reading and writing to the data in local memory is less expensive than accessing remote memory from another node.  When it is time to make the allocation, Windows provides the VirtualAllocExNuma API that takes in a preferred node number as parameter for determining in which node the application would like the memory to reside.  This is an example of an application choosing to allocate memory from a specific node.

Application Knowledge of Devices

Every driver loaded on the system has an associated interface that it supports and registers when the driver starts.  Storage and networking drivers are common examples.

By calling SetupDiGetClassDevs, an application sets up a list of devices supporting a particular interface and calls SetupDiEnumDeviceInfo to enumerate through the list and get a specific device entry.  Once an application knows which devices it is interested in, it can then use the device properties to identify how a particular device relates to other components in the system like processors and memory. 

1)      Call SetupDiGetDeviceProperty requesting DEVPKEY_Numa_Proximity_Domain.

2)      If a device does not have this property, then move to that device’s parent.  Repeat until either a proximity domain is found or the root device has been reached.

3)      If no proximity domain information is found, then it is possible the device locality information is not exposed.

4)      Given a proximity domain, applications can figure out which NUMA node a device belongs to by calling GetNumaProximityNode.
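The parent-walking logic in steps 1 to 3 can be sketched in a few lines of Python. This is a model of the search only: the Device class is a hypothetical stand-in for a device node and its property, not a SetupAPI type, and no Windows API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical stand-in for a device node; not a SetupAPI type."""
    name: str
    parent: Optional["Device"] = None
    proximity_domain: Optional[int] = None


def find_proximity_domain(device):
    """Walk up the parent chain until a proximity domain is found.

    Returns None when the root is reached without finding one, which
    corresponds to locality information not being exposed.
    """
    node = device
    while node is not None:
        if node.proximity_domain is not None:
            return node.proximity_domain
        node = node.parent
    return None


# Hypothetical tree: the adapter has no property, but its parent bridge does.
root = Device("root")
bridge = Device("pci-bridge", parent=root, proximity_domain=1)
adapter = Device("network-adapter", parent=bridge)
disk = Device("disk", parent=root)

print(find_proximity_domain(adapter))
print(find_proximity_domain(disk))
```

In a real application the returned proximity domain would then be passed to GetNumaProximityNode, as described in step 4.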

Brian Railing

Windows Server Performance Team

Tuning Windows Server 2008 for PHP
25 July 08 07:23 AM | winsrvperf | 3 Comments   

Tom Hawthorn, Karthik Mahesh - Windows Server Performance Team

A significant percentage of web sites utilize PHP as a platform for dynamic content.  During the development of Windows 2008, Microsoft included improvements that enable PHP to run more efficiently than previous Windows releases.  This article describes how to tune Windows 2008, IIS 7.0 and PHP for environments with a single site and high concurrent user traffic.

Introduction

Windows 2008 and IIS 7.0 have new features and optimizations that allow PHP to run more efficiently and robustly.   The most significant improvement Microsoft made was to update the CGI support in IIS 7.0 to conform to the Fast CGI standard.  Fast CGI saves a process create/destroy operation per request by pooling and reusing multiple worker processes.  This translates into significant performance improvements for PHP applications on Windows.

By default, the Fast CGI support in IIS 7.0 is configured to be conservative when allocating resources in order to best support the scenario where hundreds of web sites are running on a single physical server.  This was thought to be the most important deployment environment for PHP on Windows.  This article describes the differences between tuning a Windows server for a multi-site scenario versus a single site scenario assuming high overall traffic.

Multi-site versus Single-site Web Servers

Let’s talk about two typical environments for web servers where resource management and performance become top concerns: the “web hosting” environment and the “enterprise” environment.

In web hosting environments servers host hundreds of sites on a single physical machine.  There are dozens of companies that sell pre-packaged web sites for under $20 per month.  They achieve cost-efficiency by deploying hundreds of sites per single server machine and they place limits on the amount of traffic that each site can service.  Perhaps it is not much of a surprise that a huge percentage of web sites on the internet run on shared hardware.  Individually, hosted sites are low traffic but the aggregate traffic adds up to some serious load.  Administrators must take care to isolate the sites from each other for security reasons and must ensure that no single site can consume all the resources on the machine.  The default configuration values in Windows 2008/IIS 7.0 are optimized for the web hosting environment.

The enterprise environment is virtually the polar opposite from the hosted environment from the perspective of how software should manage resources.  Instead of strictly limiting and aggressively reclaiming the resources per-site, an administrator wants to give all of the machine’s resources to a single site.  Administrators achieve scale and robustness by load balancing the internet traffic across multiple machines serving the same web site content.  This article describes how to tune Windows 2008 and IIS 7.0 for the enterprise environment.

Request Concurrency

Web requests are made from the user’s web browser to a web server.  The server receives the request, processes it and sends back some data.  An HTTP request is usually small, maybe a few hundred bytes.  However, the response may be large and the ephemeral memory required to generate the response can be even larger.  During the period in which the web page is executing code or waiting for a response from somewhere else (e.g. a database, disk, another web site) the memory associated with the web request cannot be released.  Once the request is completed, memory can usually be released, with the exception of cached items.

I use the term “request concurrency” to describe the number of requests being processed on a web server at any given moment in time.  As average request concurrency grows on a web server, so does the average memory utilization.  In order to conserve memory, a web server can limit request concurrency by queuing new incoming requests rather than servicing them immediately if the number of in-flight requests exceeds some limit.  This approach has the side effect of increasing latency because user requests may need to wait for in-flight requests to complete before they are handled.

Increasing the Default Concurrency Limits

On Windows 2008, an HTTP request will be handled by multiple software component layers beginning with the network stack and travelling up into IIS and then sometimes into third party technologies such as PHP.  Each layer will perform some work on the HTTP request before handing the request on to the next layer.  Each time a layer receives a new HTTP request, it has the option of queuing it or processing it.  Therefore, increasing concurrency limits involves modifying configuration associated with multiple layers.  This section describes configuration parameters in http.sys, IIS 7.0, FastCGI and PHP.

 

IIS 7/queueLength

Description:

This parameter controls the maximum number of requests that IIS 7.0 will allow to be queued simultaneously.  It allows the system to be more robust in handling spikes in request concurrency beyond the configured limits.

 

Normally, if a web request is received by a web server and its queue is full the web server will return an HTTP error 503 (service unavailable).  Increasing the queue limit value has no impact on a web server that does not exceed its queue limits under normal conditions.  On web servers that experience occasional bursts of requests that would exceed the default queue limits, increasing the limit may allow the server to satisfy all requests without error but with a higher latency.  On web servers that are overloaded during steady-state operation increasing this value may have a detrimental effect.

 

Default Value:

1000

 

Tuned Value:

65535

 

Command Line:

appcmd.exe set apppool "DefaultAppPool" /queueLength:65535
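To see why a longer queue trades 503 errors for latency, here is a toy Python model of a burst of requests arriving at once. It is a simplification under stated assumptions, not how IIS is implemented: the function burst_outcome is hypothetical, and the concurrency limit of 500 used in the example is an arbitrary illustrative number rather than an IIS default.

```python
def burst_outcome(burst, concurrency_limit, queue_length):
    """Split a burst of simultaneous requests into three buckets.

    Requests are processed up to concurrency_limit, then queued up to
    queue_length, and anything left over is rejected (HTTP 503).
    """
    processing = min(burst, concurrency_limit)
    queued = min(burst - processing, queue_length)
    rejected = burst - processing - queued
    return {"processing": processing, "queued": queued, "rejected": rejected}


# A burst of 3000 requests against a hypothetical limit of 500 in flight.
print("default queueLength:", burst_outcome(3000, 500, 1000))
print("tuned queueLength:  ", burst_outcome(3000, 500, 65535))
```

With the default queue length the overflow is rejected, while the tuned value holds the whole burst in the queue, so every request is eventually served but the queued ones wait longer.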

 

 

 

IIS 7/appConcurrentRequestLimit

Description:

This parameter controls the maximum number of in-flight requests in the IIS 7.0 layer.  This includes requests that are being processed or are queued by the CGI layer.

 

Increasing this value on a web server that never experiences more than 5000 concurrent requests should have no impact.  On web servers that receive very large numbers of concurrent requests and that have available resources during steady state load, increasing this setting will allow the server to fully utilize its memory and CPU.  Servers that are already 100% utilized may be negatively impacted by increasing concurrent request limits.

 

Default Value:

5000

 

Tuned Value:

100000

 

Command Line:

appcmd.exe set config /section:serverRuntime /appConcurrentRequestLimit:100000

 

 

 

Http/MaxConnections

Description:

This parameter controls the maximum number of concurrent TCP/IP connections that HTTP will allow.

 

By default, only 5000 concurrent TCP/IP connections are allowed by the HTTP driver in Windows.  There is typically only one outstanding HTTP request per connection, therefore increasing any other concurrency limit is pointless unless the maximum number of concurrent connections is also increased.  Each connection maintained by Windows will use some kernel memory and requires some CPU to maintain state.  I don’t recommend increasing this limit on 32-bit machines because of the limited kernel address space.

 

Default Value:

5000

 

Tuned Value:

100000

 

Command Line:

reg add HKLM\System\CurrentControlSet\Services\HTTP\Parameters /v MaxConnections /t REG_DWORD /d 100000

 

 

 

FastCGI/Php Concurrency

Description:

This is actually two parameters: the first is the maximum number of concurrent requests, and the second is the number of requests that a FastCGI process can execute before the process is recycled.

 

The CGI model requires only a single concurrent request per pooled process.  So the max instances parameter tells IIS how many processes to start up.  Each process will consume significant resources on the server so the initial recommendation of 32 is somewhat conservative.  Increasing the number of requests that each process can handle before being recycled merely decreases the rate of process creation/destruction and reduces the average CPU required to process each request.

 

Default Value:

4/200

 

Tuned Value:

32/10000

 

Instructions:

1.       notepad %windir%\system32\inetsrv\config\applicationhost.config

2.       find the "fastCGI" element, change it to the following (assuming php-cgi.exe is in c:\php)

 

<fastCgi>

  <application fullPath="C:\PHP\php-cgi.exe" instanceMaxRequests="10000" maxInstances="32">

    <environmentVariables>

      <environmentVariable name="PHP_FCGI_MAX_REQUESTS" value="10000"/>

    </environmentVariables>

  </application>

</fastCgi>

 

Conclusion

Tuning a Windows Server 2008 machine for PHP performance in enterprise environments is all about increasing the default concurrency limits.  Remember, if you try out some of the tunings in this article, make sure to test the effects of the changes in a controlled environment before deploying them to your front-line servers.  If concurrency is a bottleneck on your system, increasing the concurrency limits will generally increase steady-state memory and CPU utilization.  If you don’t have enough memory or your CPU is already fully utilized, don’t increase the concurrency limits!  Finally, the tuned values in this article are values that I found empirically in my own test environment.  They may or may not be the right values for your environment, so play around with them to find out what works for you.  Happy tuning!

Designing Applications for High Performance - Part III
26 June 08 12:44 AM | winsrvperf | 4 Comments   
 
Rick Vicik - Architect, Windows Server Performance Team

 

The third, and final, part of this series covers I/O Completions and Memory Management techniques.  I will go through the different ways to handle I/O completions with some recommendations and optimizations introduced in Vista and later releases of Windows.  I will also cover tradeoffs associated with designing single and multiple process applications.  Finally, I will go through memory fragmentation, heap management, and provide a list of the new and expanded NUMA topology APIs.   

 

Some common I/O issues

It is recommended to use asynchronous I/O to avoid switching threads and to maximize performance.  Asynchronous I/O is more complex because it needs to determine which of the many outstanding I/Os completed.  Those I/Os may be to different handles, mixed file and network I/O, etc.  There are many different ways to handle I/O completion, not all of which are suitable for high performance (e.g. WaitForMultipleObjects, I/O Completion Routines).  For highest performance, use I/O Completion Ports.  Prior to Vista, scanning the pending overlapped structures was necessary to achieve the highest performance, but the improvements in Vista have made that technique obsolete.  However, it should be noted that an asynchronous write can block when extending a file and there is no asynchronous OpenFile. 

 

The old DOS SetFilePointer API is an anachronism.  One should specify the file offset in the overlapped structure even for synchronous I/O.  It should never be necessary to resort to the hack of having private file handles for each thread.

 

Overview of I/O Completion Processing

The processor that receives the hardware interrupt runs the interrupt service routine (ISR).  Interrupts are either distributed across all processors or the interrupts from a specific device can be bound to a particular processor or set of processors.  The ISR queues a DPC (usually to the same processor, otherwise an IPI is required) to perform the part of I/O completion processing that doesn’t need access to the user address space of the process that issued the I/O.  The DPC queues a “Special Kernel APC” to the thread that issued the I/O to copy status and byte count information to the overlapped structure in the user process.  In the case of buffered I/O, the APC also copies data from the kernel buffer to the user buffer.  In the case of I/O Completion Routines (not ports), the “Special Kernel APC” queues a user APC to itself to call the user-specified function.  Moreover, prior to Vista every I/O completion required a context-switch to the thread that issued the I/O in order to run the APC.

 

These APCs are disruptive because as the number of processors and threads increases, the probability that the APC will preempt some other thread also increases.  Fully partitioning the application makes this disruption less likely.  That includes having per-processor threads and binding interrupts to specific processors.

 

What’s new for I/O in Vista and Above

·         Ability to flag an I/O as low priority.  This reduces the competition between background and foreground tasks, and improves I/O bandwidth utilization.  Low priority I/O is exposed via SetPriorityClass with PROCESS_MODE_BACKGROUND_BEGIN, and also via NtSetInformationProcess(process,ProcessIoPriority,...

·         There are no disruptive APCs running when using I/O Completion Ports.  Also, this can be accomplished for Overlapped structs if the user locks them in memory by using the SetFileIoOverlappedRange call.

·         Ability to retrieve up to ‘n’ I/O completions with a single call to GetQueuedCompletionStatusEx. 

·         The option to skip setting the event in the file handle and skip queuing a dummy completion if the I/O completes in-line (i.e. did not return PENDING status).  These can be done by making a call to SetFileCompletionNotificationModes. 

·         The Dispatcher lock is not taken when a completion is queued to a port and no threads are waiting on that port.  Similarly, no lock gets taken when removing a completion if there are items in the queue when GetQueuedCompletionStatus is called because again the thread does not need to wait for an item to be inserted.  If the call to GetQueuedCompletionStatus was made with zero timeout, then no waiting takes place.  On the other hand, the lock is taken if queuing a completion wakes a waiting thread or if calling GetQueuedCompletionStatus results in a thread waiting.

 

I/O Completion Port Example

Let’s take an example where the main thread loops on GetQueuedCompletionStatus and calls the service function which was specified when the I/O was issued (passed via an augmented Overlapped structure).  The service functions issue only asynchronous I/O and do not wait, therefore the only wait in the main thread is really on the call made to GetQueuedCompletionStatus.  The following are some examples of “events” whose completion we wait on and suggestions on what to do next once they complete:

 

-          If the completion is for a new connection establishment, set up a session structure and issue an asynchronous network receive. 

-          If the completion is for a network receive, parse the request to determine the file name and issue a call to TransmitFile API. 

-          If the completion is for a network send, log the request and issue an asynchronous network receive. 

-          If the completion is for a user signal (from PostQueuedCompletionStatus), call the routine specified in the payload.

 

The timeout parameter on GetQueuedCompletionStatus (GQCS) can cause it to wait forever, return after the specified time, or return immediately.  Completions are queued and processed FIFO, but threads are queued and released LIFO.  That favors the running thread and treats the others as a stack of “filler” threads.  Because in Vista the Completion Ports are integrated with the thread pool and scheduler, when a thread that is associated with a port waits (except on the port) and the active thread limit hasn’t been exceeded, another thread is released from the port to take the place of the one that waited.  When the waiting thread runs again, the active thread count of the port is incremented.  Unfortunately, there is no way to “take back” a thread that is released this way.  If the threads can wait and resume in many places besides GQCS (as is usually the case), it is very common for too many threads to be active.

 

PostQueuedCompletionStatus allows user signals to be integrated with I/O completion handling which allows for a unified state machine.

 

Characteristics of I/O Completion Ports

An I/O Completion Port can be associated with many file (or socket) handles, but not the reverse.  The association cannot be changed without closing and reopening the handle.  It is possible to create a port and associate it with a file handle using a single system call, but additional calls are needed to associate a port with multiple handles. 

 

While you don’t need an event in the Overlapped structure when using Completion Ports because the event is never waited on, if you leave it out, the event in the file handle will be set and that incurs extra locking.

 

In Vista, applications that use Completion Ports get the performance benefit of eliminating the IO Completion APC without any code changes or even having to recompile.  This is true even if buffered IO is used.  The other way to get the benefit of IO Completion APC elimination (locking the overlapped structure) requires code changes and cannot be used with buffered IO.

 

Even if the I/O completes in-line (and PENDING is not returned), the I/O completion event is set and a completion is queued to the port unless the SetFileCompletionNotificationModes option is used.

 

if( !ReadFile(fh,buf,size,&actual,&ioreq)){

    // could be an error or asynchronous I/O successfully queued

    if( GetLastError() == ERROR_IO_PENDING ) {

      // asynchronous I/O queued and did not complete “in-line”

    } else {

      // asynchronous I/O not queued or was serviced in-line and failed

    }

} else {

    // completed in-line, but still must consume completion

    // unless new option specified

}

 

Memory Management Issues

When designing an application, developers are often faced with questions like - Should the application be designed with a single process or multiple processes?  Should there be a separate process for each processor or node?  In this section, we try to answer some of these questions while providing the advantages and disadvantages for each approach.

 

Advantages of designing applications with multiple processes include isolation, reliability and security.  First, on 32-bit systems an application can take advantage of more than 4GB of physical memory because each process can use up to 2GB for user-mode data.  Second, if memory is corrupted by bad code in one of the processes, the others are unaffected (unless shared memory is corrupted) and the application as a whole does not need to be terminated.  Also, separate address spaces provide isolation that can’t be duplicated with multiple threads in a single process.

 

Some disadvantages of using multiple processes include higher cost of a context switch compared to a thread switch in the same process due to the TLB getting flushed.  Also there are possible performance bottlenecks introduced by the mechanism chosen for Inter-Process Communication (IPC).  Examples of IPC include RPC, pipes, ALPC, and shared memory, so it is important that the right kind of IPC is chosen. Some estimates for round trip cost to send 100 bytes via RPC: 27,000 cycles, local named pipes: 26,000 cycles, ALPC: 13,000.  IPC via shared memory is the fastest but it erodes the isolation benefit of separate processes because bad code can potentially corrupt data in the shared memory.  Also with shared memory it is not always possible to use the data “in-place” and copying incurs an added cost of 2.5 cycles per byte copied.

 

Advantages of designing applications with a single process include not needing cross-process communication, cross-process locks, etc.  A single-process application can also approximate some of the advantages associated with multiple processes via workarounds.  For instance, exception handling can trap a failing thread, making it unnecessary to terminate the entire process.  The 2GB user virtual address limit is gone on x64 and can be worked around to some degree on 32-bit systems using Address Windowing Extensions (AWE) or the 3GB switch to change the user/kernel split of the 4GB virtual address space from 2:2 to 3:1.

 

Shared Memory Issues

Shared memory is the fastest IPC but sacrifices some of the isolation that was the justification for using separate processes.  Shared memory is secure to outsiders once set up, but the mechanism by which the multiple processes gain access to it has some vulnerability.  Either a name must be used that is known to all the processes (which is susceptible to “name squatting”) or else a handle must be passed to the processes that didn’t create the shared memory (using some other IPC mechanism). 

 

Managing updates to shared memory:

1. Data is read-write to all processes, use cross-process lock to guard data or lock-free structures. 

2. Data is read-only to all but 1 process which does all updates (w/o allowing readers to see inconsistencies)

3. Same as 2 but kernel does updates

4. Data is read-only, unprotect briefly to update (suffering TLB flush due to page protection change).

 

Global Data defined in a DLL is normally process-wide but the linker “/SECTION:.MySeg,RWS” option can be used to make it system-wide if that is what is needed.  Just loading the DLL causes it to be set up as opposed to the usual CreateSection/MapViewOfSection APIs.  The downside is that the size is fixed at compile time.

 

Memory Fragmentation – What is it?

Fragmentation can occur in the Heap or in the process Virtual Address Space.  It is a consequence of the “best fit” memory allocation policy and a usage pattern that mixes large, short-lived allocations with small, long-lived ones.  It leaves a trail of free blocks (each too small to be used) which cannot be coalesced because of the small, long-lived allocations between them.  It cannot happen if all allocations are the same size or if all are freed at the same time.

 

Avoid fragmentation by not mixing wildly different sizes and lifetimes.  Large allocations and frees should be infrequent and batched.  Consider rolling your own “zone” heap for frequent, small allocations that are freed at the same time (e.g. constructing a parse tree).  Obtain a large block of memory and claim space in it using InterlockedExchangeAdd (to avoid locking).  If the zone is per-thread, there is no need for even the interlocked instruction.

 

Use the Low Fragmentation Heap whenever possible.  It is NUMA-aware and lock-free in most cases.  It replaces the heap look-asides and covers up to 16KB allocations.  It combines the management of free and uncommitted space to eliminate linear scans.  It is enabled automatically in Vista or by calling HeapSetInformation on the heap handle.
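
 

The “zone” heap idea can be sketched in a few lines.  This is a minimal sketch in portable C11 rather than Win32: atomic_fetch_add stands in for InterlockedExchangeAdd, and the zone_t type and zone_* function names are hypothetical, not part of any Windows API.  It assumes that every allocation in the zone is freed together by releasing the whole block.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical zone type: one large block plus a bump offset. */
typedef struct {
    unsigned char *base;     /* start of the large block        */
    size_t         capacity; /* size of the block in bytes      */
    atomic_size_t  used;     /* bytes claimed so far            */
} zone_t;

/* Obtain the large block once.  Returns 0 on success, -1 on failure. */
int zone_init(zone_t *z, size_t capacity)
{
    assert(z != NULL && capacity > 0);
    z->base = malloc(capacity);
    z->capacity = capacity;
    atomic_init(&z->used, 0);
    return z->base ? 0 : -1;
}

/* Claim 'size' bytes with a single atomic add and no lock.
 * The add returns the OLD offset, which is where this claim starts. */
void *zone_alloc(zone_t *z, size_t size)
{
    size_t rounded = (size + 15u) & ~(size_t)15u; /* offsets stay multiples of 16 */
    if (rounded < size)
        return NULL;                              /* size overflowed when rounded */
    size_t old = atomic_fetch_add(&z->used, rounded);
    if (old > z->capacity || rounded > z->capacity - old)
        return NULL;                              /* zone exhausted */
    return z->base + old;
}

/* Free every allocation in the zone at once. */
void zone_release(zone_t *z)
{
    free(z->base);
    z->base = NULL;
    z->capacity = 0;
    atomic_store(&z->used, 0);
}
```

A parser, for example, might call zone_init once per request, call zone_alloc for every tree node, and call zone_release when the request completes.  A failed claim leaves the offset advanced, which is harmless here because the zone is only ever released as a whole.  For a per-thread zone the atomic add could be replaced by a plain add.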

 

Best practices when managing the Heap

·         Don’t use GlobalAlloc, LocalAlloc, HeapWalk, HeapValidate, or HeapLock.  GlobalAlloc and LocalAlloc are old functions which may take an additional lock even if the underlying heap uses lock-free look-asides.  The HeapWalk and HeapValidate functions disable the heap look-asides.

 

·         Don’t “shrink to fit” (i.e. allocate a buffer for largest possible message, then realloc to actual size) or “expand to fit” (i.e. allocate typical size, realloc ‘n’ bytes at a time until it fits).  These are fragmentation-producing allocation patterns and realloc often involves copying the data.

 

·         Don’t use the heap for large buffers (>16KB).  Buffers obtained from the heap are not aligned on a natural boundary.  Use VirtualAlloc for large buffers, but do it infrequently, carve them up yourself and recycle them.

 

·         New in Vista: dynamic kernel virtual address allocation... no longer need to manually juggle the sizes of the various kernel memory pools when one of them runs out of space (e.g. desktop heap).

 

·         New in Vista:  Prefetch API - PreFetchCacheLine(adr,options).  The API has a large dependency on the hardware’s support for prefetch.

 

NUMA Support

APIs to get topology information (depends on hardware for the information):

 

  GetNumaHighestNodeNumber

  GetNumaProcessorNode (specified processor is on which node)

  GetNumaNodeProcessorMask (which processors are on the specified node)

  GetNumaAvailableMemory (current free memory on specified node)

 

Use existing affinity APIs to place threads on desired nodes.

 

Memory is allocated on node where thread is running when the memory is touched for the first time (not at allocation time).  For better control over where memory is allocated, use new “ExNuma” versions of the memory allocation APIs.  The additional parameter specifies node.  It is a preferred, not absolute specification and it is 1-based because 0 signifies no preference.

 

  VirtualAllocExNuma(..., Node)

  MapViewOfFileExNuma(..., Node)

  CreateFileMappingNuma(..., Node)

 

Large pages and the TLB

The Translation Look-aside Buffer (TLB) is a critical resource on machines with large amounts of physical memory.  Server applications often have large blocks of memory which should be treated as a whole by the memory manager (instead of 4KB or 8KB chunks), e.g. a database cache.  Using large pages for those items reduces the number of TLB entries, improves TLB hit ratio, and decreases CPI.  Also, the data item is either all in memory or all out.  

 

  minsize = GetLargePageMinimum();

  p = VirtualAlloc(NULL, n*minsize, MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, ...);

Power and Hyper-V are now part of the Windows Server 2008 Tuning Guide!
17 June 08 08:58 PM | winsrvperf | 3 Comments   

The guide has been updated with sections on Power and Hyper-V guidelines and best practices.  Check out the updated Tuning Guide and tell us what you think by following the feedback link at the top of the Tuning Guide.  We look forward to hearing from you!

Ahmed Talat
Performance Manager
Windows Server Performance Team

Designing Applications for High Performance - Part II
21 May 08 02:48 AM | winsrvperf | 1 Comments   

Rick Vicik - Architect, Windows Server Performance Team

The second part of this series covers Data Structures and Locks. I will provide general guidance on which data structures to use under certain circumstances and how to use locks without having a negative impact on performance.  Finally, there will be examples covering common problems/solutions and a simple cookbook detailing functional requirements and recommendations when using data structures and locks.

 

In order to avoid cache line thrashing and a high rate of lock collisions, the following are suggested guidelines when designing an application:

 

·         Minimize the need for a large number of locks by partitioning data amongst threads or processors. 

·         Be aware of the hardware cache-line size because accessing different data items that fall on the same cache-line is considered a collision by the hardware. 

·         Use the minimum mechanism necessary in data structures.  For example, don’t use a doubly-linked list to implement a queue unless it is necessary to remove from the middle or scan both ways. 

·         Don’t use a FIFO queue for a free list. 

·         Use lock-free techniques when possible.  For example, claim buffer space with InterlockedAdd and use an S-List for free lists or producer/consumer queues. 
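
 

The cache-line guideline can be illustrated with per-thread counters that are padded so that no two of them share a line.  This is a sketch in portable C11 under stated assumptions: the 64-byte line size, the slot count, and the names padded_counter_t, counter_add and counter_total are all hypothetical, and the real line size should be queried from the hardware.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; check the target hardware */
#define MAX_SLOTS  8    /* assumed number of threads or processors      */

/* Aligning the member to the line size also pads the struct to that size,
 * so adjacent array elements never fall on the same cache line. */
typedef struct {
    alignas(CACHE_LINE) atomic_ulong count;
} padded_counter_t;

static padded_counter_t counters[MAX_SLOTS]; /* static storage: starts at zero */

/* Each thread updates only its own slot, so updates do not collide. */
void counter_add(size_t slot, unsigned long n)
{
    assert(slot < MAX_SLOTS);
    atomic_fetch_add_explicit(&counters[slot].count, n, memory_order_relaxed);
}

/* Roll up the per-slot counts; the total may be slightly out of date
 * if other threads are still adding. */
unsigned long counter_total(void)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < MAX_SLOTS; i++)
        sum += atomic_load_explicit(&counters[i].count, memory_order_relaxed);
    return sum;
}
```

Without the alignment, eight counters of this kind would fit in a single 64-byte line and every update would be treated by the hardware as a collision with the other seven.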

 

The “ABA” problem

The “ABA” problem occurs when InterlockedCompareExchange is used to implement lock-free data structures because what is really needed is the ability to detect any change, even a set of multiple changes that restores the original value.  If 2 items are removed from an S-List and the first is replaced, just comparing values would not detect that the list has changed (and the local copy of the ‘next’ pointer from the first item on the list is “stale”).  The solution is to add a version# to the comparison so that it fails after any change, even one that restored the old value.  That gets tricky in x64 because the size of a pointer is the same as the maximum size that InterlockedCompareExchange can operate on.
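
 

The version# technique can be sketched as follows.  This is not how the Windows S-List is implemented; it is a portable C11 illustration that sidesteps the pointer-size problem by keeping nodes in a fixed pool and linking them by 32-bit index, so that an index and a 32-bit version fit together in one 64-bit compare-exchange.  All names (stack_push, stack_pop, stack_head_raw, POOL_SIZE, NIL) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 16
#define NIL 0xFFFFFFFFu                 /* "empty list" index */

/* next_index[i] is the index of the node below node i on the stack. */
static _Atomic uint32_t next_index[POOL_SIZE];

/* Head layout: version in the high 32 bits, top index in the low 32 bits. */
static _Atomic uint64_t head = NIL;     /* version 0, empty */

static uint64_t pack(uint32_t version, uint32_t index)
{
    return ((uint64_t)version << 32) | index;
}

uint64_t stack_head_raw(void) { return atomic_load(&head); }

void stack_push(uint32_t idx)
{
    assert(idx < POOL_SIZE);
    uint64_t old = atomic_load(&head);
    uint64_t desired;
    do {
        next_index[idx] = (uint32_t)old;                  /* link to current top */
        desired = pack((uint32_t)(old >> 32) + 1, idx);   /* bump the version    */
    } while (!atomic_compare_exchange_weak(&head, &old, desired));
}

uint32_t stack_pop(void)
{
    uint64_t old = atomic_load(&head);
    for (;;) {
        uint32_t top = (uint32_t)old;
        if (top == NIL)
            return NIL;
        uint32_t next = next_index[top];  /* may be stale by the time we swap */
        uint64_t desired = pack((uint32_t)(old >> 32) + 1, next);
        /* Fails if ANY push or pop happened since 'old' was read, even if
         * the same index is on top again, because the version differs. */
        if (atomic_compare_exchange_weak(&head, &old, desired))
            return top;
    }
}
```

After pushing nodes 0 and 1, popping both, and pushing node 1 again, the low half of the head is 1 exactly as before, but the version has advanced, so a thread still holding the earlier head value fails its compare-exchange instead of installing a stale next index.  A 32-bit version can wrap around in principle; this sketch accepts that risk.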

 

The “Updating of Shared Data” problem

The primary concern when handling updates to shared data is to be aware of all the ways an item can be reached.  When removing from the middle of a doubly-linked list, an item becomes unreachable when the next and previous pointers of the adjacent items are nulled.  The tricky part is that a thread may have made a local copy of the “next” pointer from the previous item before it was nulled but hasn’t yet accessed that “next” item.  A “lurking reader” must make its presence known to others by using a refcount or by taking per-item locks in a “crabbing” fashion as it traverses the list.  The refcount can’t be on the target item because the lurking reader hasn’t gotten there yet.  It must be on the pointer that was used to get there or else logically apply to the list as a whole.  The “crabbing” technique is usually too expensive.  It is almost always necessary to have a lock which guards the list.

 

True Lock free Operations

The simple test for a true “lock-free” operation is whether or not a thread can die anywhere during the operation and not prevent other threads from doing the operation.  It is commonly thought that replacing a pointer with a “keep out” indicator is better than using a lock, but it is really just folding the lock and pointer into the same data item.  If the thread dies after inserting the “keep out” indicator, no other thread can get in.

 

Guidelines for good locking practices and things to avoid

 

·         Hash Lookup (cache index or “timer wheel”)

A good general practice is to use an array of synonym list heads with a lock per list (or set of lists), where locks fall on different cache lines.  This design does not increase the number of lock acquires per lookup compared to the single lock implementation while reducing the lock collision rate.  Use a doubly-linked list for synonyms to support removal from the middle (if lookup always precedes removal, a singly-linked list can be used).  If the access pattern is mostly lookups with occasional insert/remove, use a Read/Write lock with a “writer priority” policy to provide fairness.  If the entry is not found, then an exclusive lock is needed to insert it.  If the lock doesn’t support atomic promotion, it is necessary to drop the shared lock, reacquire it exclusively, and rescan the list.  To avoid memory allocations (and possibly waiting) while holding the lock, allocate the new block between the dropping and re-acquiring of the lock.

 

·         N:M Relationship

Suppose a company allows a many-to-many relationship between employees and projects.  That requires a set of intersecting lists to represent the relationships and a single lock can probably be used to guard it.  That solution might be sufficient if most access is read-only, but will become a bottleneck if updates are frequent.  A finer-grain solution is to have a lock on each instance of project and employee.  Adding or removing a relationship requires taking the 2 intersecting locks, which costs slightly more than the single lock implementation.  A number of optimizations are possible to avoid lock-ordering issues:  deferred removal of intersection blocks, one-at-a-time insertion & removal, and InterlockedCompareExchange for removal.

 

·         Lock Convoy

FIFO locks guarantee fairness and forward progress at the expense of causing lock convoys.  The term originally meant several threads executing the same part of the code as a group resulting in higher collisions than if they were randomly distributed throughout the code (much like automobiles being grouped into packets by traffic lights).  The particular phenomenon I’m talking about is worse because once it forms the implicit handoff of lock ownership keeps the threads in lock-step.

 

To illustrate, consider the example where a thread holds a lock and it gets preempted while holding the lock.  The result is all the other threads will pile up on the wait list for that lock.  When the preempted thread (lock owner at this time) gets to run again and releases the lock, it automatically hands ownership of the lock to the first thread on the wait list.  That thread may not run for some time, but the “hold time” clock is ticking.  The previous owner usually requests the lock again before the wait list is cleared out, perpetuating the convoy.

 

·         Producer/Consumer Implementation

The first thing you should ask yourself when setting up a producer/consumer arrangement is what is gained by handing off the work.  The amount of processing done per hand-off must be significantly greater than the cost of a context switch (~3k cycles direct cost, 10x or more indirect cost, depending on cache impact).  The only legitimate reasons for handing off work to another thread are that the isolation of a separate process is needed or that preemption is needed rather than cooperative yielding.

 

The following are code snippets for a Producer/Consumer implementation which will be used to point out things to avoid when doing the design.

 

Producer

WaitForSingleObject(QueueMutex,...);

InsertTailList(&QueueHead, Item);

SetEvent(WakeupEvent);

ReleaseMutex(QueueMutex);

 

Consumer

for(;;) {

  WaitForSingleObject(WakeupEvent,...);

  WaitForSingleObject(QueueMutex,...);

  item = RemoveHeadList( &QueueHead);

  ReleaseMutex( QueueMutex);

  ... process item ...

}

 

1.       Don’t wake the consumer while holding the lock it will need.  In our example, the Producer is holding the QueueMutex when it made the SetEvent call.

2.       Don’t make multiple system calls to process each item even when no scheduling events occur (3 system calls in the producer, 3 in the consumer in this case). 

3.       Using the WaitForMultipleObjects(ALL) call on both the QueueMutex and WakeupEvent seems like a clever solution because it avoids the extra context switch and combines the two WaitForSingleObject system calls into a single call.  It really isn’t much better because each time an event in the set is signaled, the waiting thread is awakened via APC to check the status of all events in the set (resulting in just as many context switches).

4.       Amortize the hand-off cost by not waking the consumer until ‘n’ items are queued, but then a timeout is needed to cap latency. 

5.       It is better to integrate background processing into the main state machine.   

6.       The consumer should be lower priority than the main thread so it runs only when the main thread has nothing to do.  Consumer should not cause preemption when delivering the “finished” notification.

7.       In the example above, consider using the PostQueuedCompletionStatus API rather than SetEvent.  The latter boosts the target thread’s priority by 1, which may cause it to preempt something else.

8.       Don’t use the Windows GUI message mechanism for producer/consumer queues or inter-process communication.  Use it only for GUI messages. 

 

Synchronization Resources

 

Events are strictly FIFO, can be used cross-process by passing the name or handle, and they have no spin option or logic to prevent convoys from forming.  When an event is created, its signaling mode can be set to either auto-reset or manual-reset.  Each signaling mode dictates how APIs like SetEvent and PulseEvent interact with threads.

 

Ø  The SetEvent call on auto-reset events will allow a single thread to pass through, even if no thread is waiting on the event at the time of the call (the event stays signaled until a thread waits on it).  For manual-reset events, the call will allow all threads waiting on the event to pass and “keep the door open” until the event is explicitly reset.

Ø  The PulseEvent call on auto-reset events will allow a single thread to pass through but only if there is one waiting.  On manual-reset events, the call will allow all threads waiting on the event to pass and “leaves the door closed”.

 

Note:  The SignalObjectAndWait call allows a thread to signal another and wait without being preempted by the signaled thread.  The SetEvent call typically boosts the priority of the signaled thread by 1, so it is possible for the signaled thread to preempt the signaling thread before it waits.  This pattern is a very common cause of excess context switches. 

 

Semaphores are also strictly FIFO, can be used cross-process, and they have no spin option and no logic to prevent convoys from forming.  When it is necessary to release a specific number of threads, but not all, using the ReleaseSemaphore call is recommended.

 

Mutexes exposed to user mode support recursion and store the ID of the current owning thread, whereas Events and Semaphores do not.

 

The ExecutiveResource is a reader/writer lock available in user and kernel mode.  It is FIFO, has no spin or anti-convoy logic, but does have a TryToAcquire option and has anti-priority-inversion logic (attempts to boost the owner(s) if acquire takes over 500 milliseconds).  There are options regarding promotion to exclusive, demotion to shared, and whether readers or writers have priority (but not all options are available in both the user and kernel versions).

 

The CriticalSection is the most common lock in user-mode.  It is exclusive, supports recursion, has TryToAcquire and spin options and is convoy-free because it is non-FIFO.  Acquire and Release are very fast and do not enter the kernel if there is no contention.  It cannot be used across processes.  The event is created on first collision and all critical sections in a process are linked together for debugging purposes which requires a process-wide lock on creation and destruction (but there is an option to skip the linking).    

 

Read/Write locks (Slim Read/Write Lock (SRWLock) in user mode and PushLock in kernel mode)

The new, lighter-weight read/write lock is available in Vista and later releases of Windows.  It is similar to the CriticalSection in that it makes no system call if there is no contention.  It is non-FIFO and therefore convoy-free.  The data structure is just a single pointer and the initialized state is 0, so these locks are very cheap to create.  It does not support recursion.  To start using SRWLocks, visit MSDN for a list of the supported APIs.

 

ConditionVariables are also new in Vista.  They are synchronization primitives that allow for threads to wait until a specific condition occurs.  They use SRWLocks or CriticalSections and cannot be shared across processes.  To start using ConditionVariables, visit MSDN for a list of the supported APIs. 

 

 

 

Lock-Free APIs (Recommended when possible)

 

Atomic updates are typically better than acquiring locks, but locking can still happen at the hardware level which causes bus traffic.  It is also possible to have excessive retries (at the hardware or software level).

 

InterlockedCompareExchange compares a 4 or 8 byte data item with a test value and if equal, atomically replaces it with a specified value.  If the comparison fails, the data item is left unchanged and the old value is returned.  The S-List is constructed using InterlockedCompareExchange.  It atomically modifies the head of a singly-linked list to support push, pop and grab operations.  It uses a version# to prevent the “ABA” problem mentioned earlier in the post.  InterlockedCompareExchange can also be used to construct a true FIFO lock-free queue.  It is basically an S-List with a tail pointer that could be slightly out-of-date.  It is not widely used because it has the limitation that elements used in the list can never be re-used for anything else.  This is often worked-around by using surrogates for the list elements.  A common pool of them can be maintained in the application.  It can grow but never shrink.
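
 

The push and grab operations can be sketched with a single compare-exchange loop and a single exchange.  This is a portable C11 illustration, not the Windows S-List implementation: atomic_compare_exchange_weak stands in for InterlockedCompareExchange, and the work_item_t type and list_* names are hypothetical.  Pop is deliberately left out because it is the operation that needs the version#; push and grab never follow a pointer into a node that another thread may have removed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct work_item {
    struct work_item *next;
    int               payload;
} work_item_t;

static _Atomic(work_item_t *) list_head = NULL;

/* Push: link the item to the current head, then swing the head to the
 * item.  Retry if another thread changed the head in between. */
void list_push(work_item_t *item)
{
    assert(item != NULL);
    work_item_t *old = atomic_load(&list_head);
    do {
        item->next = old;
    } while (!atomic_compare_exchange_weak(&list_head, &old, item));
}

/* Grab: detach the whole list with one atomic exchange.  The result is
 * in LIFO order (most recently pushed item first). */
work_item_t *list_grab(void)
{
    return atomic_exchange(&list_head, (work_item_t *)NULL);
}

/* Reverse a detached list to recover FIFO order.  Only the grabbing
 * thread touches the list at this point, so no atomics are needed. */
work_item_t *list_reverse(work_item_t *head)
{
    work_item_t *prev = NULL;
    while (head != NULL) {
        work_item_t *next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```

A consumer that grabs the whole list, reverses it, and then processes the batch pays one atomic operation per batch rather than one per item, at the cost of the extra reversal pass.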

 

InterlockedIncrement, InterlockedDecrement, and InterlockedExchangeAdd are similar, but the new value is derived from the old rather than specified (add or subtract 1, add specified value... use negative to subtract).  These can be better performing than InterlockedCompareExchange because they cannot fail, so retries are eliminated in most cases.  To add variety, InterlockedIncrement and InterlockedDecrement return the new value while InterlockedExchangeAdd returns the old value.
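
 

The difference in return values is easy to get wrong, so here is a small sketch in portable C11.  The C11 atomic_fetch_add and atomic_fetch_sub functions return the old value, as InterlockedExchangeAdd does; the hypothetical increment and decrement wrappers below adjust that result to mimic the new-value convention of InterlockedIncrement and InterlockedDecrement.  The object_t type and object_release function are also illustrative only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_long refs;   /* reference count */
} object_t;

/* Old-value convention, like InterlockedExchangeAdd.
 * Pass a negative delta to subtract. */
long exchange_add(atomic_long *target, long delta)
{
    return atomic_fetch_add(target, delta);
}

/* New-value convention, like InterlockedIncrement / InterlockedDecrement. */
long increment(atomic_long *target) { return atomic_fetch_add(target, 1) + 1; }
long decrement(atomic_long *target) { return atomic_fetch_sub(target, 1) - 1; }

/* Drop one reference.  Returns 1 if the caller released the last one and
 * is therefore responsible for destroying the object, 0 otherwise. */
int object_release(object_t *obj)
{
    assert(obj != NULL);
    return decrement(&obj->refs) == 0;
}
```

The new-value convention suits reference counting because the thread that sees the count reach 0 knows it released the last reference.  The old-value convention suits claiming buffer space because the old value is the start of the claimed region.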

 

The Locking Cookbook

 

The following table provides a list of proposed recommendations which shouldn’t negatively impact performance when working towards meeting a specific set of functional requirements.

 

Functional Requirement                                                   | Performance recommendation

Maintaining reference count, Driving circular buffer, Construct barrier | InterlockedIncrement / InterlockedDecrement

Claim space in buffer, Roll-up per CPU counts, Construct complex locks  | InterlockedExchangeAdd

8-byte mailbox, S-List * Queue                                           | InterlockedCompareExchange

Free List or Producer/Consumer work queue                                | S-List

Implement a true FIFO with no need to reverse                            | Lock free Queue

List that supports traversal and/or removal from middle                  | Conventional lock (if >70% read, use Reader/Writer lock)

Least Recently Used (LRU) list                                           | Use “clock” algorithm or deferred removal

Tree                                                                     | Use lock sub-tree or “crab” downward like traversing a linked list

Table 1:  A locking cookbook
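
The “clock” entry in the table deserves a short illustration.  The sketch below is one common form of the algorithm (second-chance replacement), written under the assumption of a fixed number of slots; the class and method names are made up for the example.  The point is that a cache hit only sets a flag instead of moving an element to the head of an LRU list, so the hot path does not modify the list structure.

```python
class ClockCache:
    """Fixed-size cache that approximates LRU with the clock algorithm."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity        # slot -> key
        self.values = [None] * capacity      # slot -> value
        self.referenced = [False] * capacity # slot -> "used recently" flag
        self.slot_of = {}                    # key -> slot
        self.hand = 0                        # next slot the sweep examines

    def get(self, key):
        slot = self.slot_of.get(key)
        if slot is None:
            return None
        # A hit only sets a flag; nothing is unlinked or moved.
        self.referenced[slot] = True
        return self.values[slot]

    def put(self, key, value):
        slot = self.slot_of.get(key)
        if slot is None:
            slot = self._claim_slot()
            self.keys[slot] = key
            self.slot_of[key] = slot
        self.values[slot] = value
        self.referenced[slot] = True

    def _claim_slot(self):
        # Sweep the hand, clearing flags, until an empty or
        # unreferenced slot is found; that slot is the victim.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.keys[slot] is None:
                return slot
            if self.referenced[slot]:
                self.referenced[slot] = False  # give it a second chance
                continue
            del self.slot_of[self.keys[slot]]
            return slot
```

Eviction still needs to be serialized in a real multi-threaded cache; the saving is that lookups, which are far more frequent, no longer update shared list pointers.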

 

Conclusion

 

In conclusion, the following are common guidelines for designing high-performance applications that rely on locks and the other available synchronization mechanisms to ensure data integrity.

 

·         Minimize the frequency of lock acquisitions and the lock hold time.  Don’t wait on objects, do IO, call SetEvent, or make an RPC while holding a lock.  Don’t call anything that may allocate memory while holding a lock.  Note that taking a lock while holding a lock (i.e. nesting locks) inflates hold time on the outer lock.

  

·         Use a Reader/Writer lock if >70% of operations take the lock shared.  It is incorrect to assume that any amount of shared access will be an improvement over an exclusive lock.  Exclusive operations can be delayed by multiple shared operations even if alternating shared/exclusive fairness is implemented. 

          

·         Break up locks, but not so finely that typical operations need to take most of them.  This is the tricky trade-off.

 

·         Taking multiple locks can cause deadlocks.  The typical solution is to always take locks in a pre-defined order, but in practice, different parts of the application may start with a different lock.  Use TryToAcquire first (it helps eliminate deadlocks because you don’t wait).  If that fails, drop the lock that is held and re-acquire in the anti-convoy order.  Things can change between the drop and the reacquire, so it may be necessary to recheck the data.  Not all locks have a try-to-acquire option.  WaitForMultipleObjects with the wait-all option is another way to solve the deadlock problem without defining a locking order (deadlocks are impossible if all locks are obtained atomically).  The downside is the expense of a wait-all call.

 

·         If you find there is a need to use recursion on locks, then it means you don’t know when the lock is held.  That lack of knowledge makes it impossible to minimize the lock hold time.  This is a common problem with Object-Oriented design.
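
The try-to-acquire guideline above can be sketched as follows.  This is a minimal illustration in Python, whose threading.Lock offers a non-blocking acquire; acquire_pair and the order list are hypothetical names, and the list simply stands in for whatever lock ordering the application has agreed on.

```python
import threading


def acquire_pair(held, wanted, order):
    """Caller already holds `held` and also needs `wanted`.

    `order` is a list of locks in the application's agreed order.
    Returns True if `held` had to be dropped along the way, in which
    case the caller must recheck any data that `held` protects.
    """
    # Try first: if this succeeds there was no wait, so no deadlock risk.
    if wanted.acquire(blocking=False):
        return False
    # The try failed.  Never wait while holding a lock taken out of
    # order: drop it, then take both locks in the agreed order.
    held.release()
    first, second = sorted((held, wanted), key=order.index)
    first.acquire()
    second.acquire()
    return True
```

A caller would write something like `if acquire_pair(b, a, [a, b]): revalidate()` where revalidate is whatever recheck the protected data requires.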

 

NT... TTCP! Network Performance Test Tool Available
03 May 08 08:19 PM | winsrvperf | 1 Comments   

NTttcp (a Windows port of Berkeley's TTCP, a Winsock-based test tool) has officially gone live (http://www.microsoft.com/whdc/device/network/TCP_tool.mspx) on Microsoft.com.  NTttcp is a useful tool to help measure overall Windows networking performance with a multitude of networking adapters in different configurations.  I encourage you to install the tool today and start measuring your network throughput and efficiency.

 

Ahmed Talat
Performance Manager
Windows Server Performance Team

Designing Applications for High Performance - Part 1
25 April 08 05:40 AM | winsrvperf | 1 Comments   

Rick Vicik - Architect, Windows Server Performance Team

 

Now that processors won’t be getting dramatically faster each year, application developers must learn how to design their applications for scalability and efficiency on multiple processor systems. I have spent the last 20 years in SQL Server development and the Windows Server Performance Group looking into multi-processor performance and scalability problems.  Over the years, I have encountered a number of recurring patterns that I would like to get designers to avoid.  In this three part series, I will go over these inefficiencies and provide suggestions to avoid them in order to improve application scalability and efficiency.  The guidelines are oriented towards server applications, but the basic principles apply to all applications.

 

The underlying problem is processors are much faster than RAM and need hardware caches or else they would spend most of their time waiting for memory access.  The effectiveness of any cache depends on locality of reference.  Poor locality can reduce performance by an order of magnitude, even with a single processor.  The problem is worse with multiple processors because data is often replicated in different caches and updates must be coordinated to give the illusion of a single copy (performing the magic of cache coherency is hard).  Also, applications might generate information that needs sharing across processors, which can overload the interconnect mechanism (e.g. bus) and slow down all memory requests, even for “innocent bystanders”.    

 

The following are some of the common pitfalls that can hurt overall performance:

·         Using too many threads and doing frequent updates to shared data.  This results in a high number of context switches due to lock collisions when several threads try to update the protected data. 

·         Cache effectiveness is reduced because thread data seldom has enough time in the cache before getting pushed out of the cache by other threads.

 

These are some of the things application designers can do to reduce the problem:

·         Minimize the need to have multiple threads update shared data through data partitioning across processors and minimize the amount of information that must cross boundaries (OO-design and the desire to have context-free components often results in “chatty” interfaces).

·         Minimize the number of context switches by keeping the number of threads close to the number of processors and minimize the reasons for them to block (locks, handing off work, handling IO-completion, etc.).  

 

To illustrate how partitioning an application would yield optimal performance compared to having shared data and lock contention, I will use a simple, static web server scenario as an example.  The data in this scenario can be characterized as either payload (cached, previously-served pages) or control (work queues, statistics, freelists, etc.).  Figure 1 shows that the combination of updates and shared data must be avoided.  Even when the payload is read-only, the control data is usually update-intensive (e.g. ref-counts).

 

 

The recommendation is to partition everything by processor or by NUMA node.  This can never be fully achieved in real applications, but it guides the design in the right direction.  Ideally, there should be per processor threads and each thread’s affinity gets set to the respective processor.  Each thread should have its own IO completion port and be event-driven.  There should be a network interface card (NIC) for each processor and the interrupts from each NIC should be bound to the corresponding processor by using the IntFilter utility on Windows 2003 or the IntPolicy utility on Windows 2008 and later.  Another alternative is using a NIC that supports Receive Side Scaling (RSS).  An intelligent network switch can perform link or port aggregation to distribute incoming requests to the multiple NICs.  Since the payload (cache of previously-served pages) is read-only, it can be read from any CPU.  Full partitioning (including disk data) would require distributing the requests to the partition that owns the subset of data.  That is beyond the capability of the network switch. 

 

Figure 2 illustrates one proposed design.  Each thread would loop on its completion port, servicing events as they occur (e.g. if a requested web page is not in cache, issue an asynchronous read to bring it in and attach the serving of that page to the IO completion).  The only updated shared data left is for the purpose of managing the cache (ref-counting, updating hash-synonym list, evicting older contents).  Frequently-updated statistical counters should also be kept per processor and rolled-up infrequently.
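
The per-processor statistics idea can be sketched as below.  This is a Python illustration under the assumption that each partition has exactly one writer; PartitionedCounter and run_demo are hypothetical names.  In native code each slot would also be padded to its own cache line, which this sketch does not model.

```python
import threading


class PartitionedCounter:
    """One slot per partition; each slot has a single writer."""

    def __init__(self, partitions):
        self.slots = [0] * partitions

    def add(self, partition, amount=1):
        # Only the owner of `partition` writes this slot, so the
        # frequent path takes no lock and touches no shared total.
        self.slots[partition] += amount

    def roll_up(self):
        # Infrequent path: sum the per-partition values on demand.
        return sum(self.slots)


def run_demo(workers=4, updates=1000):
    """Run one thread per partition and return the rolled-up total."""
    counter = PartitionedCounter(workers)

    def work(partition):
        for _ in range(updates):
            counter.add(partition)

    threads = [threading.Thread(target=work, args=(p,)) for p in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.roll_up()
```

The roll-up may read a slot while its owner is updating it, so a total taken mid-run is approximate; for statistics that is usually acceptable.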

 

An application needs to be aware of the number of processors because it may need to distribute the load if the link or port aggregation technique isn’t good enough.  It may also need to perform a type of load balancing if the requests differ significantly in processing time.  Soft affinity (set via SetThreadIdealProcessor API) may be enough, but if the threads are hard-affinitized to processors (set via SetThreadAffinityMask API), periodic work-stealing logic may be needed to avoid some processors going idle while work queues up on others.  The handling of I/O completion gets trickier, but more details are provided later and I will explain how using Vista can help.   

 

The first part of this series will cover Threads and side effects associated with having too many active threads contending for resources or trying to update the same piece of memory.  It will also provide an overview of some of the improvements that have gone into Vista for thread handling.

 

Threading Issues

 

An application that has too many active threads is a bad thing, especially when shared data is updated frequently because locks are needed to protect the data.  When locks are taken frequently, even if the total time spent holding locks is very small, each thread runs only briefly before blocking on a lock.  By the time any thread runs again, its cache-state has been wiped out.  Also, preemption while holding a lock is more likely.  A good designer never holds a lock while making a call that could block because that inflates lock hold time.  Unfortunately, the designer doesn’t have much control over preemption and page faults, which also inflate lock hold time.

 

  

Guidelines for reducing the number of threads


Applications mainly have too many threads to simplify the code rather than to create parallelism.  The classic anti-pattern for this is handing off work to another thread and waiting for it to complete when the proper approach should be to make a function call. The exception to this rule is if the consumer needs to be in a different process or thread for isolation reasons.  But even then, the operation should always be asynchronous because the consumer may not be responding.

 

Another reason for having too many threads is not using asynchronous IO where appropriate.  It is not necessary to have “lazy-writer” or “read-ahead” threads.  Issue the IO asynchronously and handle the completion in the main state machine. 

 

Other reasons are less under the control of the application designer.  One is the need for separate “input handler” threads when trying to create a unified state machine to handle IO Completion Ports and RPC.  Another is that using multiple components (RPC, COM, etc.) results in multiple thread pools in the application, because some components have their own thread pools and each is unaware of the others when it makes its thread-throttling decisions.

 

Ideally, an application should have one thread per processor and it should never block.  In real practice, it is almost impossible to avoid calling foreign code that can block that single thread.  The compromise is to have a per processor main thread that executes the state machine and never calls code that may block.  Potentially-blocking operations must be handed-off to a thread pool so that if they do block, the main thread can still run.
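
Here is a minimal sketch of that compromise, using Python's standard ThreadPoolExecutor as a stand-in for the application's thread pool.  The function name and the use of a queue as the main thread's completion source are assumptions made for the illustration, and the sketch assumes the handed-off calls do not raise.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_state_machine(requests, blocking_call, pool_size=2):
    """Process a list of requests without blocking on foreign code.

    Results are returned in completion order, not submission order.
    """
    completions = queue.Queue()
    results = []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for request in requests:
            # Hand the potentially blocking call to the pool and move on.
            future = pool.submit(blocking_call, request)
            future.add_done_callback(
                lambda done: completions.put(done.result())
            )
        # The main thread's only wait point is its own completion queue.
        for _ in requests:
            results.append(completions.get())
    return results
```

If one of the handed-off calls stalls, only a pool thread is tied up; the main loop keeps draining whatever completions arrive.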

 

Design recommendations for an application thread pool
 

A well designed thread pool should minimize the active “filler” threads (i.e. those released when the current thread blocks).  The application should have a single thread-pool which throttles “filler” threads by “parking” the excess ones at safe stopping points (i.e. when not holding any locks).  The worker threads should obtain their own work as opposed to having separate distribution threads (or a “listener” thread to set up a new connection).  Load balancing should not require a separate “load balancer” thread.  Idle workers should attempt to “steal” work from others (this should be kept at a minimum because it might cause cross-processor traffic). 
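
The “workers obtain their own work and steal only when idle” policy can be modeled in a few lines.  This Python sketch is single-threaded and shows only the policy; a real pool needs synchronization on each queue.  The names are hypothetical, and taking the newest local item while stealing the oldest remote item is a common convention that I am assuming here, not something the design above requires.

```python
from collections import deque


class WorkQueues:
    """Per-worker queues with stealing as a fallback."""

    def __init__(self, workers):
        self.queues = [deque() for _ in range(workers)]

    def push(self, worker, item):
        self.queues[worker].append(item)

    def next_item(self, worker):
        """Return (item, stolen_from); stolen_from is None for local work."""
        own = self.queues[worker]
        if own:
            # Local work first: the newest item is the most likely
            # to still have its data in this processor's cache.
            return own.pop(), None
        for victim, other in enumerate(self.queues):
            if victim != worker and other:
                # Steal from the opposite end to stay out of the
                # victim's way; this is the cross-processor case.
                return other.popleft(), victim
        return None, None
```

Counting how often stolen_from is not None gives a direct measure of how much cross-processor traffic the load imbalance is causing.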

 

The Vista thread-pool has some improvements that can help.  The input queue is lock-free and thread-agnostic IO completion has eliminated the need for specialized “IO threads”.  It is possible to receive input from IO Completion Ports and ALPC (which eliminates the need for a separate input-handler thread).  The APIs to do this are TpBindFileToDirect and TpBindAlpcToDirect.

 

Common threading practices

 

·         Completion Port Thread Throttling

Each IO Completion Port has an active thread limit and keeps track of the number of active threads associated with the port.  The OS thread scheduler updates the active thread count when a thread blocks or resumes and it releases a “filler” thread to take the place of the blocked one.  The scheduler cannot “take back” a filler thread when the original thread resumes.  This is not an issue if threads hardly ever block on anything except the completion port.  The completion port thread-throttling mechanism cannot automatically “park” excess filler threads because it has no knowledge of the application and doesn’t know when it is safe to do so (it could be holding a lock... could detect if holding system lock, but not user lock).

 

·         Switch Threads between Requests or During a Request

A server application can spin up a thread to service each connection or it can maintain a pool of threads that service a larger number of connections.  Typically the switching of threads among connections occurs on a request boundary, but it could occur during the request (i.e. when it blocks).  No thread-throttling is required because no extra threads are released.  It can be done with setjmp/longjmp-style user stack switching or by queuing a “resume” work packet instead of blocking inside a work packet.

 

·         Handling Multiple Input Signals

It is often necessary for an application to handle input from multiple sources (e.g. shutdown event, registry change, IO completions, device/power notifications, incoming RPC).  Unfortunately there is no unified way to handle all of these.  The WaitForMultipleObjects API can handle some of the cases but it doesn’t cover IO Completion Port and RPC.  Also, WaitForMultipleObjects is limited to 64 objects and has a significant setup/teardown cost.  In many cases, WaitForMultiple(Any) can be replaced with a single event plus a type code in the payload data.  Another optimization is to use RegisterWaitForSingleObject to avoid “burning” a thread which sits waiting on an event.  Instead of having a separate thread that does nothing but wait on a registry change event, RegisterWaitForSingleObject can automatically queue a work item to a thread pool where it gets processed in the main loop along with IO completions, etc.
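
The “single event plus a type code” replacement can be sketched as follows.  Python's queue.Queue plays the role of the single wait point here; the type codes and function name are made up for the example.

```python
import queue

# Hypothetical type codes carried in the payload.
SHUTDOWN, REGISTRY_CHANGE, IO_COMPLETE = range(3)


def run_main_loop(inbox, handlers):
    """Dispatch messages from one wait point until SHUTDOWN arrives."""
    handled = []
    while True:
        kind, payload = inbox.get()  # the only place this thread waits
        if kind == SHUTDOWN:
            return handled
        handled.append(handlers[kind](payload))
```

Every source posts a (type code, payload) pair to the same inbox, so adding a new kind of input means adding a handler rather than another wait handle or another thread.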

 

·         OS Thread Scheduling Basics

The thread is the unit of scheduling and the thread with the highest priority gets to run (not the ‘n’ highest where ‘n’ is the number of processors).  Applications specify “priority class” not actual priority.  Typically, a thread is boosted when readied and the boost decays as processor time is consumed.  If a thread consumes a “quantum” of a processor time without waiting, it must round-robin with equal priority threads.  Prior to Vista, determining when a thread consumed a quantum of processor time was done using timer-based sampling.  Now it is done using the hardware cycle counter and is much more accurate.

 

When a thread is readied, a search is done for a processor to run it.  First, an attempt is made to find an idle processor. While searching, the thread’s “ideal” processor is favored, followed by last processor and current processor.  NUMA nodes and physical vs. hyper-threaded processors are considered during the search (e.g. if ideal processor is not available, try other processors on the same NUMA node; if hyper-threaded, attempt to use idle physical processors first).  Secondly, if an idle processor cannot be found, attempt to preempt the thread’s “ideal” processor.  If thread’s priority is not high enough to preempt, queue it to the “ideal” processor.  A thread’s “Ideal” processor can be set using SetThreadIdealProcessor; otherwise it is assigned by the system in a way to spread the load but keep threads of the same process on the same NUMA node.
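
The search order just described can be written down as a small model.  This is only a simplified restatement of the paragraph above in Python, not the scheduler's actual code: it ignores hyper-threading, and the function name, arguments and return values are all invented for the illustration.

```python
def pick_processor(ideal, last, current, idle, node_of, running_priority, priority):
    """Return (processor, action) for a thread that was just readied.

    idle             -- set of idle processor numbers
    node_of          -- processor number -> NUMA node
    running_priority -- processor number -> priority of what runs there
    priority         -- priority of the readied thread
    """
    # 1. Prefer an idle processor: ideal, then last, then current.
    for cpu in (ideal, last, current):
        if cpu in idle:
            return cpu, "run"
    # 2. Otherwise an idle processor on the ideal processor's NUMA node.
    for cpu in sorted(idle):
        if node_of[cpu] == node_of[ideal]:
            return cpu, "run"
    # 3. Otherwise any idle processor.
    if idle:
        return min(idle), "run"
    # 4. Nothing is idle: preempt the ideal processor if allowed,
    #    else queue the thread to it.
    if priority > running_priority[ideal]:
        return ideal, "preempt"
    return ideal, "queue"
```

The model makes one consequence visible: when no processor is idle, work always lands on the ideal processor, which is why a poorly chosen ideal processor can queue work on one processor while others finish and go idle.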

 

·         HyperThreading Specifics
The thread scheduler is hyper-threading aware and the OS uses the Yield instruction to avoid starving other virtual processors on the same physical processor when spinning.  User code that spins should use the YieldProcessor API for the same reason.  The GetLogicalProcessorInformation API can be used to get information about the relationship among cores and nodes as well as information about the caches such as size, linesize and associativity.    

 

 

Stay tuned for our next installment "Data Structures and Locking Issues" ...

Networking Adapter Performance Guidelines
18 March 08 07:46 AM | winsrvperf | 1 Comments   

Networking performance has increasingly become one of the most important factors in overall system performance. Many of the factors that affect networking performance fall under the following three categories: Network adapter hardware and driver performance, network stack performance, and the way applications interact with the network stack. We will highlight some of the more important networking adapter properties and advanced features available to yield optimal performance.

Introduction

Networking performance is measured in throughput and response time. It is also important to achieve the optimal performance operating point without over-utilizing the system resources. To reach that optimal performance point, network adapter vendors have enhanced their hardware to improve scalability, reduce the time needed to process data, and dynamically adjust hardware and software parameters depending on workload characteristics.

Network Adapter Hardware

Over the past decade, networking speeds have increased by orders of magnitude to keep up with increasingly network-intensive applications, and the load that networking service routines place on the host processors has grown with them.  In light of these changes, it has become increasingly important to consider offloading some of the tasks to the hardware and to further optimize how the hardware interacts with the software to scale and improve performance.  Some of the features include Task Offload, TCP Offload, Interrupt Moderation, dynamic tuning on the hardware, Jumbo Frames, and Receive Side Scaling (RSS).  These are particularly important for the high-end network adapters that will be used in configurations requiring top performance.

Task Offload Features

The Microsoft networking stack can offload one or more of the following tasks to a network adapter that has the support for the offload capabilities.  The following are the supported offload tasks:

Checksum Offload: For most common IPv4 and IPv6 network traffic, offloading the checksum calculation to the network adapter hardware offers a significant performance advantage: it reduces the number of CPU cycles required per byte, which improves overall system performance.  UDP checksum offload support has been added in Windows Server 2008 and was not available in prior releases.
 

Large Send Offload: Applications that send down large amounts of data rely on the network stack to segment the large buffers into smaller, Ethernet-sized buffers.  That size is typically 576 bytes on the Internet and 1500 bytes on LANs.  Large Send Offload (LSO) allows send data to be coalesced into 64KB segments and offloads the segmentation work to the hardware.  Offloading the work reduces the host CPU cycles and can improve overall system performance.  Giant Send Offload (GSO) is a superset of LSO which allows send buffers to be coalesced into segments that are greater than 64KB.
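
A little arithmetic shows where the savings come from.  The Python sketch below counts how many units the host stack has to prepare for a given send, with and without LSO.  The function name is made up, and the default of 1460 payload bytes per frame is an assumption: a 1500-byte LAN frame minus 20-byte IP and 20-byte TCP headers with no options.

```python
def stack_handoffs(send_bytes, payload_per_frame=1460, lso_chunk=64 * 1024):
    """Return (frames_without_lso, chunks_with_lso) for one large send."""
    # Ceiling division using integers only.
    frames = -(-send_bytes // payload_per_frame)
    chunks = -(-send_bytes // lso_chunk)
    return frames, chunks
```

For a 1 MB send, stack_handoffs(1024 * 1024) gives 719 frames for the host to segment without LSO versus 16 chunks handed to the adapter with LSO; the adapter still puts 719 frames on the wire, but the per-frame work has moved off the host CPU.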

IP Security (IPSec) Offload: Windows offers the ability to offload the encryption work of IPSec to the network adapter hardware. Encryption, especially 3DES, has a very high cycles/byte ratio. Therefore, it is no surprise that offloading IPSec to the network adapter hardware has high performance yields depending on the scenario and the workload characteristics.

Interrupt Moderation

A simple network adapter interrupts the host processor upon receiving a packet or when sending a packet is complete. In many scenarios where this causes high processor utilization, it is best to coalesce several packets for each interrupt and reduce the number of times the host processor is interrupted.  Because it is a common mistake to tune interrupt moderation for throughput and hurt response time, most network adapter vendors have implemented dynamic interrupt moderation schemes in their solutions.
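
The throughput versus response time trade-off is easy to quantify.  The Python sketch below assumes the simplest possible scheme, a fixed number of packets per interrupt with no timer; real adapters typically also bound the delay with a timer, so treat the numbers as an illustration of the trend only.  The function name is hypothetical.

```python
def moderation_tradeoff(packets_per_second, packets_per_interrupt):
    """Return (interrupts_per_second, worst_case_added_delay_in_microseconds)."""
    interrupts_per_second = packets_per_second / packets_per_interrupt
    # The first packet of a batch waits for the rest of the batch to arrive.
    worst_case_delay_us = (packets_per_interrupt - 1) * 1_000_000 / packets_per_second
    return interrupts_per_second, worst_case_delay_us
```

At 100,000 packets per second, coalescing 8 packets per interrupt cuts the rate to 12,500 interrupts per second at a cost of 70 microseconds; at 1,000 packets per second the same setting adds up to 7 milliseconds.  A static setting that suits the busy case hurts the quiet one, which is the motivation for dynamic schemes.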

Jumbo Frame Support

Supporting larger Maximum Transmission Units (MTUs) and thus larger frame sizes, specifically Jumbo Frames, will reduce the number of trips needed to send the same amount of data.  This results in a significant reduction of CPU utilization.

Receive Side Scaling

For large-scale applications, being able to simultaneously process networking requests on multiple processors ensures both improved performance through parallelism and distribution of the CPU load among the system processors.  Receive Side Scaling, when supported by the underlying hardware, provides this capability for TCP traffic, and the technology is recommended for Web and File Server scenarios where a server is servicing requests from a large number of connections.
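
The core idea can be modeled in a few lines of Python: hash the connection identity and use the result to index a table of processors, so that every packet of a connection lands on the same processor while different connections spread out.  RSS hardware uses a keyed Toeplitz hash; the CRC32 below is only a stand-in chosen because it is in the standard library, and the function name and table contents are invented for the example.

```python
import zlib


def rss_cpu(src_ip, src_port, dst_ip, dst_port, indirection_table):
    """Map a TCP connection to a processor via a hash and a lookup table."""
    flow = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    # CRC32 is a stand-in for the hardware hash.
    index = zlib.crc32(flow) % len(indirection_table)
    return indirection_table[index]
```

With a table such as [0, 1, 2, 3, 0, 1, 2, 3], rebalancing the load is a matter of rewriting table entries; the hash itself does not change.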

Network Adapter Resources

Most network adapters allow administrators to manually configure send and receive resources through the Advanced Networking tab for the adapter.  The most common ones are the receive and send buffers, which most of the time are configured to the low mark resulting in sub-optimal performance.  A small subset of adapters allow for dynamic adjustment for their networking resources so a manual configuration is unnecessary.

Network Adapter Characteristics

When choosing which network adapters to use, you should always get 64-bit PCI-Express adapters.  Using 32-bit adapters will limit the amount of data the adapter can transfer and will provide sub-optimal performance when copying data into the adapter’s buffers.  Also, an adapter that is not PCI-E will be capped at around 7 gigabits per second of pure data transfer, excluding headers.  This becomes a bottleneck if you are planning to use 10 Gigabit adapters in your setup and would like to get the full bandwidth out of them.

Network Adapter Tuning

The high-performance network adapter features discussed earlier are necessary for achieving the best performance in most scenarios and workloads, in both single- and multiple-processor system configurations.  The following are guidelines for where to put the adapter in the system and how to best optimize the distribution of the network adapter interrupts.

Bus Characteristics

It was mentioned earlier that it is best to place the network adapter in a 64-bit PCI-Express slot.  That guarantees the best performance.  If you will be using PCI or PCI-X slots, then you should try to place your adapter in a slot that does not share the bus with other devices in the system.  Most hardware vendors have diagrams on the inside of the servers or on their websites that describe the layout of the machine.  Sharing a bus with other devices can degrade performance because of latency on the bus and contention.

Interrupt Binding

Interrupts generated by the network adapter can be partitioned to a single or a select group of processors to improve performance.  An example would be partitioning your application threads to run on the same processor where network traffic is processed to preserve cache locality.  In order to bind an adapter’s interrupts to a select group of processors, we recommend using the Interrupt Affinity Policy tool. 

Resources

·         Windows Server 2008 Performance Tuning Guide
http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx

·         High Performance Network Adapters and Drivers
http://www.microsoft.com/whdc/device/network/NetAdapters-Drvs.mspx


Ahmed Talat
Performance Manager
Windows Server Performance Team

Windows Server 2008 - Scalability and Performance Presentation
07 March 08 12:16 AM | winsrvperf | 0 Comments   

Hi all, I thought I’d forward around a link to the WS08 performance presentation we did for the Server 2008 launch.  We cover a number of the areas/roles of the product and provide comparisons against Server 2003 – have a look!

Cheers,
Bill Karagounis
Group Program Manager
Windows Server Performance Team

Hyper-V and Multiprocessor VMs
29 February 08 01:43 AM | winsrvperf | 7 Comments   

Thanks for visiting our blog! I’m a development lead in the Windows Server Performance team and I led the performance effort on Hyper-V for Windows Server 2008 over the past three and a half years.

 

We’ve worked with the product team throughout the Hyper-V development cycle to deliver a competitive product and we’re excited about shipping Hyper-V RTM this year, with the Hyper-V Beta shipping in Windows Server 2008 this week!

 

Architectural Overview

 

Figure: Hyper-V Architecture

Hyper-V uses a hypervisor-based architecture and leverages the driver model of Windows for broad hardware support. The hypervisor partitions a server into containers of CPU and memory. As a micro-kernel, it provides mechanisms for inter-partition communication upon which our new high-performance synthetic I/O architecture is built. The root partition owns physical I/O devices and provides services including I/O implemented by the virtualization stack to the child partitions.

 

The virtualization stack implements emulated I/O devices such as an IDE controller and a DEC 21140A network adapter. However, it is expensive to virtualize such devices. Sending a single I/O might require multiple trips between the virtualization stack and child partition. Instead, Hyper-V exposes synthetic I/O devices that are specially designed for VM environments. These devices are attached to VMBus, which is a plug-and-play capable bus that uses shared memory for efficient inter-partition communication. Windows guests detect the devices on VMBus and load the appropriate drivers.

 

Figure: Hyper-V Synthetic IO

Synthetic I/O in Hyper-V uses a client-server architecture with Virtualization Service Providers (VSPs) in the root and Virtualization Service Clients (VSCs) in the child. This architecture significantly reduces the cost of sending an I/O. Virtual Server customers should observe a major reduction in CPU usage in I/O-intensive loads when they migrate their VMs to Hyper-V.

 

In addition, we developed operating system enlightenments for Windows Server 2008, which make the NT kernel and memory manager smarter in VM environments, again to reduce the cost of virtualization.

 

Multi-Processor Guests

 

For this first blog post, I want to highlight one of the major performance features in Hyper-V: multi-processor virtual machines. Hyper-V supports 4P VMs for Windows Server 2008 guests and 2P VMs for Windows Server 2003 SP2 guests. For more intensive server workloads, you might consider virtualizing them in 2P or 4P VMs on Hyper-V. Of course, you should use multi-processor VMs only if the workload requires it since there is some cost to having additional processors.

 

However, operating system kernels and drivers use spin locks which do not block and spin until the lock is acquired, with the assumption that the lock is held for a short period. Virtualization breaks this assumption as virtual processors (VPs) are time-sliced. If a VP is preempted while holding a spin lock, other VPs may spin for a long time wasting CPU cycles.

 

We developed innovations in the hypervisor and Windows Server 2008 kernel to try to prevent long spin wait conditions and also to efficiently detect and handle them when they do occur. We also designed the hypervisor, including the scheduler and memory virtualization logic, to be lock-free on most critical paths to ensure good scalability on multi-processor systems.

 

As a result, Windows Server 2008 as a 4P guest scales well compared to the physical 4P system. This is one example of Windows Server 2008 as a guest and Hyper-V together providing performance advantages. We plan to continue improving our scalability on multi-processor systems and multi-processor VMs in subsequent releases.

 

Closing Thoughts

 

Thanks for reading this far! I would encourage you to try Hyper-V Beta in Windows Server 2008, which launched this week. And take a look at the Windows Server 2008 and Virtualization web site for more information.

 

I look forward to writing more on our work on Hyper-V performance. Please add our blog to your RSS feeds!

 

Regards,

 

John Sheu

Senior Development Lead

Windows Server Performance Team

Welcome!
07 February 08 05:54 PM | winsrvperf | 0 Comments   

Welcome to the Microsoft Windows Server Performance team blog! As the Group Program Manager for the team, I’m delighted to introduce the team and provide the first post to kick off the blog.

The Windows Server Performance team is a part of the Core Operating System Division at Microsoft. Our charter is to understand and improve the performance of Windows Server. As a matter of engineering, the sort of work we do involves:

·         Measurement of performance;

·         Analysis to identify bottlenecks;

·         Identification and implementation of architectural and code changes to improve performance; and

·         To close the loop, verification that the changes we made did what we expected them to do :)

The work gives us an opportunity to see the OS as a whole, and to study the interaction between software components and between hardware and software.

We tend to focus on core scenarios and capabilities in Windows Server (other teams focus on role-specific performance), and look for ways to improve efficiency and scalability. Some of the areas we cover (and will post about) are virtualization, multi-core/multi-proc scalability, file systems (local and remote), network and disk I/O, and server power. We also plan to discuss OS and server application performance in general, and to share some of what we have learnt over the years.

We’ve already published v1.0 of our performance tuning whitepaper for Windows Server 2008 http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx - have a look and give us feedback (there’s a feedback link in the document). We will also be doing a Server 2008 performance webcast on Feb 8th to talk directly about the performance features and improvements in Server 2008.

Additionally, we’ll have some of the team on hand & presenting at the joint launch of Windows Server 2008, SQL Server 2008 and Visual Studio 2008 in Los Angeles on February 27th. See the launch site for more detail. We hope to see you there!

We are all looking forward to the day when everyone has the opportunity to use the next major release of our OS.

Thanks,
Bill Karagounis
Group Program Manager
Windows Server Performance Team
