
HTTP.SYS / Cryptographic Services / LSASS.EXE deadlock

A recent case I had brought this issue to my attention, so I thought it useful to share the knowledge…

The problem was that a Windows Server 2008 x64 SP2 server running several websites was failing to start several services during startup, and attempts to log on stuck at “Applying User Settings…” indefinitely.
Starting in Safe Mode (or Safe Mode with Networking) worked fine.

By the time the case had been opened the symptom had already been removed by taking away some certificates – the server then started correctly when booted normally (re-adding the certificates did not reintroduce the problem).

 

From a VM image of the server in the problem state I made a complete memory dump when the server had been stuck ~10 minutes during user logon.

From the hang dump I could see the logon was stalled because one of the threads in LSASS.EXE was waiting for the Cryptographic Services service to start – an ALPC message was sent to the Service Control Manager (SCM) to poke the thread when the service was up.

SCM was in the process of starting HTTP.SYS, which held the lock for service startup (preventing the Cryptographic Services service from starting) and HTTP.SYS’s thread requires the services of LSASS.EXE…deadlock.

(Diagram: HTTP.SYS / LSASS.EXE / CryptSvc deadlock)

This is a classic timing issue – HTTP.SYS is slow to load (for some reason relating to the SSL binding information) and is holding the SCM lock long enough to stumble into the “Mexican stand-off”.

We can iron out the kink by putting a dependency for HTTP.SYS on Cryptographic Services (CryptSvc): HTTP.SYS then always waits until CryptSvc is up, so when it calls into LSASS.EXE it is not holding a lock that prevents LSASS.EXE’s dependency from starting.
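To make the wait chain easier to see, here is a minimal Python sketch that models the situation as a wait-for graph and looks for a cycle. The node names and the find_cycle helper are illustrative simplifications of what the dump showed; none of this is Windows code.

```python
def find_cycle(waits_for):
    """Return the chain of waiters forming a cycle, or None if there is none.

    waits_for maps each blocked party to the single thing it is waiting on.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        node = waits_for.get(start)
        while node is not None:
            if node in seen:
                # Found our way back to something already in the chain.
                return path[path.index(node):] + [node]
            path.append(node)
            seen.add(node)
            node = waits_for.get(node)
    return None


# Simplified picture of the hang: each key cannot progress until its value does.
before_fix = {
    "LSASS.EXE": "CryptSvc",               # logon thread waits for the service
    "CryptSvc": "SCM service-start lock",  # cannot start while the lock is held
    "SCM service-start lock": "HTTP.SYS",  # held while HTTP.SYS is starting
    "HTTP.SYS": "LSASS.EXE",               # needs the services of LSASS.EXE
}

# With HTTP.SYS depending on CryptSvc, CryptSvc is already running,
# so LSASS.EXE is not blocked and the chain ends there.
after_fix = {
    "HTTP.SYS": "LSASS.EXE",
}

print(find_cycle(before_fix))
print(find_cycle(after_fix))
```

The first call reports a closed loop of waiters, the second reports none, which is all the dependency change is trying to achieve.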

This workaround is described here: Computer hangs at Applying Computer Settings or All Automatic Services Will Not Start After Reboot on Windows Server 2008

 

Does this mean that the symptom of “Automatic services failing to start” is always caused by this one issue?
NO

The symptom is unfortunately very generic – there is so much going on with a system startup that there are many ways to encounter a “hang” with totally different root causes.

It is very easy to say “I am experiencing the symptoms described in KB article XXX, but the solution/workaround did not work” – this invariably means that you did not have the problem described by the article.

But if you have a server with certificates installed and SSL is configured (most probably for use by IIS), and you’re running Windows Server 2008 then it’s worth knowing about this issue and trivial workaround.

(Windows Server 2008 R2 does not have the problem, by the way.)

Posted by Paul Adams | 0 Comments

Be kind, rewind (but don’t reboot)

One very common belief I have come across is that rebooting Windows somehow “cleans” the system and returns it to normal speed after some performance degradation (and further that reinstalling the OS periodically does some magical cleaning too).

For the most part, this is complete nonsense.

Shutting down Windows will terminate all processes & services and empty the system cache, and starting from clean will cause a system initialization check (was the previous shutdown clean, does the disk need checking or a dump extracting, etc.) followed by a mass struggle for domination of various components wanting to fire up.

See the previous blog entry where I talked about contention – a system startup is a bottleneck, luckily one-off, where all the various parts of the OS plus 3rd party services will want to get started (and then often sit in an idle state for the majority of their lives).

Once this startup procedure is over and the OS is sitting at the authentication prompt (or the desktop, if the user added to the contention by wanting to logon and incur additional load with their logon processes), we now have an empty system cache.

In some (client) environments Superfetch can kick in after the system has been idle a while and start to load in files that it has observed the user requesting in a pattern – this starts to pre-populate the system cache again to remove the delay caused by disk I/O when the file is actually requested.

As file I/O is done, the Cache Manager works with the Memory Manager to keep virtual blocks of files in memory; this allows efficient re-use of files without incurring I/O.
Windows caches on file sections, not entire files, for efficiency – and processes reading the same file will use pointers to the same cached file sections.

So a populated system cache is a good thing – unused memory is wasted.
Pages in the system cache age, and pages get trimmed (paged to disk or freed) based on how long ago they were last accessed (i.e. a cache “hit”). If the system needs to free physical memory to satisfy requests for memory allocations, the cache is checked before processes get their working sets trimmed.

 

In some rare cases, the OS can suffer performance issues after it has been up for a while – however this is not expected or normal, and a reboot to “resolve” this is just masking a problem that should be investigated.

Performance issues most often come from… again… contention.

CPU contention – this can occur if (for example) a multi-processor system has all but 1 CPU stuck in a spinlock state, putting contention on that single CPU (if all CPUs were spinning then the system would be hung, not slow).

There are pools of worker threads in the kernel which deal with queues of work items – if some of these get into a hung state or the queues are backlogged, some, most or all threads in the system can end up in the wait state for much longer than is normal (allowing the queues to build up as time goes by, compounding the problem).

Memory resources come in different flavours for different purposes – I am not talking just physical vs virtual, but things like page table entries (PTEs), paged pool and nonpaged pool.
A system that runs low or out of PTEs will invariably hang.
A system that runs out of paged pool has likely had a leak of some kind in a driver, or the 3GB switch is in use on a busy server, and can result in performance degradation due to constant trimming, a hang or possibly even a crash if a driver requests it with “must succeed”.
Nonpaged pool is similar to paged pool, with the exception that it cannot ever be paged out to disk – typically this is used by drivers at device interrupt level – as with paged pool this can result in a severe performance drop, a hang or a crash in extreme circumstances.

Locks can end up with lots of waiters, if the current owner holds it longer than normal or there are 2 or more threads constantly grabbing and releasing one – locks are a necessary evil so that we ensure coordinated access to data structures and maintain data integrity, but can lead to scalability issues.
In particularly bad cases a deadlock can be encountered – 2 threads that each hold a lock and wait indefinitely for the other's lock to become available. This can hang some parts of the system, tie up worker threads or slow down the entire system until it hangs.

 

When you start to look at the possible things that can go wrong in an OS, it is more surprising that it works most of the time!

So next time you believe your servers to be going slower than you expect, have a look at the nature of the performance issue:

Is it slow to logon?
- What is the message displayed during the longest delays while you wait for the desktop?

Is it slow to start new processes?
- Is it the same for all processes, or just certain ones?
- Are processes that are already running working at normal speed?
- Does Task Manager show the CPUs are all under constant high load?
- Is something grinding the disk? (If so, is it paging or file I/O?)

 

Process Explorer is a great tool for identifying where CPU time is being spent, from hardware interrupts down to individual threads in processes.

Resource Monitor (for Vista onwards) is great for real-time analysis of file I/O – which process is incurring what amount of read or write I/O against which file objects.

Performance Monitor (PerfMon) is great for logging performance to view statistical data on memory usage, CPU time, network throughput, etc.
If you use PerfMon, the valuable counters are current (and very occasionally peak), not average - very rarely is an averaged value going to be of use for performance analysis.
Current CPU and disk queue length can show a backlog of “work pending”, while System counters like Pool Paged Bytes and Pool Nonpaged Bytes only have meaning if you know what the maximum is (based on system configuration) – Free System PTEs is, however, a useful counter (keeping it over 10,000 is a basic rule of thumb).
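As a rough illustration of why averages mislead, here is a small Python sketch using made-up queue length samples (one per second). The sample values and the threshold of 2 are assumptions chosen for the example, not PerfMon guidance.

```python
from statistics import mean

# Hypothetical "Current Disk Queue Length" readings, one per second:
# 18 idle seconds followed by a 2-second burst.
samples = [0] * 18 + [20, 20]

average = mean(samples)   # what an averaged counter would report
peak = max(samples)       # what the user actually felt
backlogged = sum(1 for s in samples if s > 2)  # seconds with a real backlog

print("average:", average)
print("peak:", peak)
print("seconds backlogged:", backlogged)
```

The average over the interval looks unremarkable, while the current values show the disk was badly backlogged for the 2 seconds that mattered.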

 

The next time your system feels a bit sluggish, take a look at what “sluggish” is to you, and try to identify what is currently saturated rather than reach for the “Restart” button on the Start menu.

Identifying and fixing a problem is better than ignoring or working around it.


It’s not what you’ve got, it’s how you use it that counts…

Soapbox time.
“If it ain’t broke, fix it until it is.”

Tuning, tweaking, trimming, optimizing… however you refer to it, you should approach it the same way.
This is not specific to Windows, software or even computers – in order to improve performance of “a system” you must first observe it to identify where the current bottlenecks lie, then find out their roots and plan around that.

Electronic circuits, car design, search algorithms, even business models all have an initial plan and will have weak spots where they can improve – reviewing the systems periodically lets you see where improvements can be made.

Some performance issues appear over time, maybe due to a scalability issue – what works for a team of 5 might be very inefficient for a team of 50, or a system that does a check on a journal will start off clean (and hence quick) but after a period of time the time to process the historical data increases.

Other performance issues can be caused by a change to the original purpose or design – extra bits bolted on, or possibly even some bits removed.

And then of course things can break, leading to all sorts of weirdness :)

 

So how does this all relate to an operating system?

People install software, which is natural as an OS with no programs to run is somewhat useless.
The presence of software by itself is no big deal, unless it makes a system-wide change or adds in background services or startup/logon processes.

This is why one of the common places to check when looking at long startup or logon times is what is scheduled to automatically start with the OS or the arrival of a user – the more that is present in this list, the more contention you have for system resources.

 

Point of Contention

This is the primary cause for performance issues – a single resource that has a bottleneck : the rate of requests coming in is greater than the capacity to service them in a timely manner.

Most commonly – disk access.
Disks are slooooooow devices, compared with other resources (including GigE-speed networks) and so multiple I/O requests to these will lead to a lot of waiting and grinding.

We don’t allow a thread or process to hog access to a disk, that would be horribly unfair and impossible to work out who should get access in the event of contention – a large file I/O would cause every other process to hang while they wait in a queue… and that queue would be getting longer as time goes by.

 

Consider a case where 2 processes want to read large files from a disk, and when there are no other I/O requests the files take 5 seconds each to read.
Process 1 requests its file, then as soon as it is done process 2 requests its file – total time taken is 10 seconds.

Compare that with both processes requesting their files at exactly the same time – the requests are now interleaved, we read a portion of the first file and then switch to the other file read for a while, then back to the first one, then back again, and so on until the files are all read in.
The switching takes time, but also we are talking about a physical disk read/write head that has to seek different parts of the disk surface – if the files are placed at different “ends” of the disk or they are fragmented then this is a lot of time being spent in locating and switching… add all that time together and it is likely to be more than 10 seconds.

The 2 requests contend with each other when they overlap, making both take longer to complete and have a combined completion time that is longer overall – so the order in which requests are made has a big impact on performance.
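The arithmetic can be sketched in a few lines of Python. The chunk count and the 10 ms cost per switch are assumptions picked purely for illustration; real figures depend on the disk, the file layout and the I/O scheduler.

```python
def sequential_time(read_seconds, n_files):
    """Files read one after another: no switching between them."""
    return read_seconds * n_files


def interleaved_time(read_seconds, n_files, chunks_per_file, switch_seconds):
    """Files read in alternating chunks: every change of file costs a seek."""
    total_chunks = n_files * chunks_per_file
    switches = total_chunks - 1
    return read_seconds * n_files + switches * switch_seconds


one_after_another = sequential_time(read_seconds=5, n_files=2)
at_the_same_time = interleaved_time(
    read_seconds=5, n_files=2, chunks_per_file=100, switch_seconds=0.010
)

print(one_after_another)
print(round(at_the_same_time, 2))
```

With these numbers the overlapping requests take roughly 12 seconds in total instead of 10, and the gap widens as the files get more fragmented or further apart on the disk.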

 

The Waiting Game

Another source of perceived performance issues is the “hang” – where something is waiting for an event or a response to some request and in the meantime is blocking something else from occurring.

Sometimes these hangs are short-lived due to the environment (e.g. packet loss on the network leading to retransmissions), sometimes they are a fixed length (e.g. caused by timeouts) and sometimes they never end (e.g. a deadlock or infinite wait).

Hangs are often caused by hooks, addons & plugins in user mode processes, and filter drivers or device drivers in the kernel.

 

Returning to the topic of contention, as this is where performance issues typically appear – at the start I mentioned identifying the bottleneck, and this is the first challenge in understanding what to upgrade/modify/synchronize.

The big 3 are CPU, memory and disk.

A CPU bottleneck is identified with something like Process Explorer – with such a tool you can identify where your processors are spending their time, including on interrupts/DPCs as well as within services.
For a process that has high CPU utilization, it is possible to drill down to the thread level, and if symbols are configured you can even see the call stacks and get an idea of what the threads are actually doing (possibly over and over again).

A memory bottleneck is often exposed through disk I/O, because Windows works with virtual memory – and if a process hits its virtual address space limit then it is more likely to just stop working or crash.
If physical memory is exhausted then we end up paging out aged memory pages to make room, before paging in the data requested (or adding more pages for filling with data) – this is where you see disk I/O as a side effect symptom.

 

A disk bottleneck is the easiest to spot, and working out what is being requested is trivial on Vista and later thanks to the Resource Monitor built into the OS - accessible from the Start menu (under Accessories/System Tools) or indirectly through Task Manager (on the Performance tab, the button at the bottom).

For legacy Windows you can always use Process Monitor which captures all I/O, but this has a hit on performance by its nature.

If you see constant requests for the PAGEFILE.SYS then it is a safe bet your issue is a lack of physical memory or dirty pages being flushed to read in virtual memory for processes.
Whether or not it is the pagefile being accessed, Resource Monitor or Process Monitor will identify the process making the requests so you can see if the behaviour is expected or not.

 

Be all that you can be

Optimizing performance is all about identifying where you are currently spending time on a frequent basis – in a programmer’s world it may be better to shave a fraction of a second off a routine called hundreds of times per second than to reduce the one-off startup time of a process by 5 seconds.

In the same vein, optimizing the boot time for Windows is a pointless exercise as it isn’t done that often – plus for clients it is much better to use hybrid sleep, then your “back to desktop” time is under 2 seconds.

A system will not perform faster because it has loads of free memory pages. The only time this becomes an issue is if there is a sudden and very high demand for memory which exceeds the Free and Standby lists – then working sets need to get trimmed, incurring disk I/O to the pagefile, to free the memory up so it can be allocated.

 

We often get asked “which services can be disabled?” and “what registry tweaks can I use to increase performance?”, but these are impossible to answer without some kind of context for how the machines are to be used. It is also the wrong way round to tune for performance.

People use computers in different ways, and servers have different roles, user loads and usage patterns – without this knowledge a system cannot be spec’d to suit the requirements.

So, start with the purpose of the machine to have an understanding of whether its requirements will be for CPU (number crunching, multiple user sessions), memory (virtual machine host, Exchange or SQL server), disk (every scenario to different degrees).

 

Proof of concept is essential, and where appropriate you need to have or emulate a realistic user load – then you push the system until it starts to creak, and look at where the bottleneck is.
Now, implicitly you will know the answer to your questions like “how many users can I have on my Remote Desktop Server using this system spec?” and “will I benefit from RAIDx?” because you will have the opportunity to test, compare and re-test.

There are no shortcuts, no one else can tell you how the system will perform as soon as a single user is able to make demands on it – people are not predictable, plus their working styles evolve (either through need or gaining familiarity with a system).

Don’t spec based on peak load either – in a perfect world we would have an infinite number of lanes on every road so we are never in a queue, but it won’t help you get down that road any faster if there are no other vehicles on the road anyway.

 

My €0.02

WARNING: Personal opinion following…

The sweet spot for load and capacity is ~75% – you are getting value for money (ROI) and have a little room for the spikes in activity without the system falling over.

Free RAM is wasted RAM, and rebooting a system wipes it all (so don’t reboot servers unless there is a need, and use hybrid sleep for clients instead of shutting down).

Less customization for installs leads to a happier life – if you tweak because you think you will get performance, you will see it because you want to… but further down the line you might introduce other nasty problems.
So next time you want to take ownership of the entire file system on disk, or reduce the footprint of your installation, or disable services, or make registry tweaks that you’ve done for years… remember that what is true today might not be true by SP1 ;)


Booting from USB to install Windows 7 (take 2)

Previously I have mentioned a method to make a USB memory stick bootable in order to install Windows, and now we’ve reached General Availability of Windows 7 I can let you know of a free tool to do just this job :)

Over at the Microsoft Store there is a great walkthrough along with the download link for the Windows 7 USB/DVD Download Tool.

Enjoy :)


Patches – now served hot and cold

Maintenance (updates, bugfixes) for Windows has typically been through hotfix packages, but these are not as “hot” as one might expect – very often we see prompts to restart the computer after a package is installed.

The reason is fairly simple – the update replaces a file on disk but has no control over the processes which may have the old version loaded and hold handles to it. So the installer copies the new version in and sets a registry flag for a post-reboot operation to clean up the old files, and a restart is the easiest way to ensure everything is in sync.

To allow for more dynamic updates that don’t require an immediate restart, hot patching was introduced; this required a bit of a reworking of how the modules appear in memory once they are compiled.

 

Even though entire modules are present in hotfix packages, often it is just 1 or 2 functions inside that have altered.
Each function has an entry point – an address which we jump to when we call it.

The first instruction in any hot-patchable function is a 2-byte instruction: MOV EDI,EDI.
This says to copy the contents of register EDI into register EDI - i.e. do nothing.

Immediately before the entry point are 5 single-byte instructions that also serve no purpose in normal operation (NOP or INT 3; the example below shows INT 3) – as they sit between functions they are never going to be executed, so they are just padding between functions in the module.

 

Here is an example output of debugging CALC.EXE on Windows Server 2008 x86 and unassembling around the function SetRadix:
0:000> u calc!SetRadix-8
calc!SwitchModes+0x133
00476c76 c20c00    ret    0Ch
00476c79 cc        int    3
00476c7a cc        int    3
00476c7b cc        int    3
00476c7c cc        int    3
00476c7d cc        int    3
calc!SetRadix:
00476c7e 8bff      mov    edi,edi
00476c80 55        push   ebp

You can see the previous function was SwitchModes, its exit point is at address 00476c76 – a 3-byte instruction to return to the caller.
From 00476c79 thru 00476c7d are the 5 bytes making up the padding between the functions.
At 00476c7e you can see the entry point for the function SetRadix, and the debugger has conveniently used the symbols to reflect this.

 

So why the 2-byte instruction that “does nothing” after 5 1-byte instructions that “do nothing”?

The reason is the instruction pointer used by the CPU – once it has executed the current instruction it increments the pointer by the length of the instruction – so after executing MOV EDI,EDI the instruction pointer is incremented by 2 (rather than 1 for a NOP).

 

Okay, great, so why is the “do nothing” instruction there at all?
To allow us to dynamically load a modified version of the function somewhere else in memory, then hook the original function.

The hot patch mechanism allows us to copy a fixed version of the function somewhere in the virtual address space of the module being patched – the problem is that we don’t know where this will be, and it could be further away than a near (2-byte) jump can reach (128 bytes back or 127 forward), so we need 5 bytes to perform a far jump.

After the fixed version of the function is in memory, we then replace the 5 bytes with a far jump to the location of our fixed function – this is safe as the instructions are never executed in normal operation, so the instruction pointer can never be looking at code we are modifying.

Then, we replace the MOV EDI,EDI command with a near jump 5 bytes backwards – from now on, future function calls will be trampolined to our fixed version seamlessly.
This is safe even if a context switch occurred immediately after a thread made a call to the original function – the instruction pointer will either be saved pointing to the original location, in which case when we resume the thread the new, fixed version of the function is called, or it gets the chance to execute the dummy instruction and advance by 2, in which case the thread would resume in the original version of the function.

If we used 2 NOPs (1-byte “no operation” instructions)  instead of MOV EDI,EDI then the instruction pointer could be incremented by 1, then we replace the 2 bytes and the instruction pointer is now invalid when the thread resumes, as it would be pointing to half-way through an instruction.
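A small Python sketch makes the difference concrete by listing where a suspended thread's instruction pointer could be sitting. The helper name instruction_boundaries is made up for this illustration; the entry point address is the SetRadix one from the CALC.EXE example.

```python
def instruction_boundaries(entry, lengths):
    """Addresses at which a saved instruction pointer can legally sit."""
    points = [entry]
    for length in lengths:
        points.append(points[-1] + length)
    return set(points)


ENTRY = 0x00476C7E  # calc!SetRadix entry point from the example

with_mov = instruction_boundaries(ENTRY, [2])      # MOV EDI,EDI
with_nops = instruction_boundaries(ENTRY, [1, 1])  # NOP, NOP
after_patch = instruction_boundaries(ENTRY, [2])   # 2-byte jump written over the entry

# Any address a thread might resume at that is no longer the start
# of an instruction once the 2 bytes have been replaced.
stranded_mov = with_mov - after_patch
stranded_nops = with_nops - after_patch

print([hex(a) for a in sorted(stranded_mov)])   # []
print([hex(a) for a in sorted(stranded_nops)])  # ['0x476c7f']
```

With MOV EDI,EDI there is no address a thread could be parked at that ends up in the middle of the new jump; with two NOPs there is exactly one, and one is enough.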

 

So in the above example from CALC.EXE, pretending we loaded a fixed version of SetRadix at address 12345678, we would then replace the 5 bytes in address range 00476c79-00476c7d with a jump to the explicit address, so it would look like this:
00476c79 e9fae9ec11    jmp    12345678

e9 is the opcode for what I am calling a FAR jump (strictly, in Intel’s terminology, a near jump with a 32-bit relative displacement), followed by 32 bits to indicate the (signed) number of bytes to add to the instruction pointer after completing the current (5-byte) instruction.
fae9ec11 comes from the calculation:
destination address - current address - size of current instruction = 12345678 – 00476c79 - 5 = 11ece9fa

(x86 is little-endian, storing the least significant byte first, which is why the bytes get flipped on building the instruction.)

 

Once the 5 bytes have been patched to do the far jump, the 2-byte instruction at 00476c7e gets replaced with a NEAR jump to our FAR jump:
00476c7e ebf9    jmp    calc!SwitchModes+0x136 (00476c79)

eb is the opcode for what I am calling a NEAR jump (Intel calls this one a short jump), followed by 8 bits to indicate the (signed) number of bytes to add to the instruction pointer after completing the current (2-byte) instruction.
f9 comes from the calculation:
destination address - current address - size of current instruction = 00476c79 – 00476c7e - 2 = -7, which is f9 as a signed byte
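Both calculations can be checked with a few lines of Python. The function names rel32_jump and rel8_jump are just labels for this sketch; the addresses are the ones from the CALC.EXE example, with 12345678 still being the pretend location of the fixed function.

```python
import struct


def rel32_jump(at, target):
    """Encode the 5-byte jump (opcode e9) placed at `at` that lands on `target`."""
    displacement = (target - (at + 5)) & 0xFFFFFFFF
    return b"\xe9" + struct.pack("<I", displacement)  # "<" = little-endian


def rel8_jump(at, target):
    """Encode the 2-byte jump (opcode eb) placed at `at` that lands on `target`."""
    displacement = target - (at + 2)
    if not -128 <= displacement <= 127:
        raise ValueError("target is out of range for a 2-byte jump")
    return b"\xeb" + struct.pack("<b", displacement)


long_jump = rel32_jump(at=0x00476C79, target=0x12345678)
short_jump = rel8_jump(at=0x00476C7E, target=0x00476C79)

print(long_jump.hex())   # e9fae9ec11
print(short_jump.hex())  # ebf9
```

The range check in rel8_jump is the reason the 5 padding bytes are needed at all: a signed byte can only move the instruction pointer 128 bytes back or 127 forward.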

Unassembling as we did before, we can see how the code has been altered to insert the trampoline:
0:000> u calc!SetRadix-8
calc!SwitchModes+0x133
00476c76 c20c00        ret    0Ch
00476c79 e9fae9ec11    jmp    12345678
calc!SetRadix:
00476c7e ebf9          jmp    calc!SwitchModes+0x136 (00476c79)
00476c80 55            push   ebp

(The debugger shows the offset of the jump relative to the previous function as the destination is before the entry point for the current one, this is just a display quirk as the debugger is trying to make sense of something we deliberately hacked.)

 

So now someone makes a call to calc!SetRadix and the following occurs:
- the instruction pointer is set to the entry point for the function: 00476c7e
- the instruction at this address is executed (near jump to our trampoline), changing the instruction pointer to 00476c79
- the instruction at this address is executed (far jump to our modified code), changing the instruction pointer to 12345678
- our modified code is now executed at the new location, which needs to ensure it handles the same input and output as the original function

(Opcodes are well-defined, which is how the processor knows eb should be followed by 1 byte while e9 is followed by 4 bytes.)

 

This is the methodology behind the hot patching technique – a hotfix installer package would need to contain the details of what to change in memory in order to achieve this, and be instructed to perform a hot patch in addition to replacing the module on disk with the updated version.


Terminal Services roaming profiles & password change at logon

“If I am forced to change my password at logon, then log onto a Terminal Server, I lose all my settings.”

This is something that has cropped up for a number of customers, depending on their configuration, so here I will try to outline the settings that impact this and the reason behind what is going on…

 

On every user object in Active Directory there is a path that specifies where a central copy of a user profile is held that is to be used when logging onto a Terminal Server.
Terminal Services Profile tab > “Profile Path”

In a normal scenario, where a user logs onto a Terminal Server with the above property set, the following occurs:

1. Windows checks the path specified for a folder named %USERNAME% or %USERNAME%.V2
(%USERNAME% is a variable holding the logon name of the user, e.g. testuser1, not their display name, e.g. “Test User 1”)

2a. If the above path is not found (i.e. first logon), the default user profile is copied to %USERPROFILE%
(%USERPROFILE% is a variable holding the path to the local copy of the user’s profile, e.g. C:\Users\testuser1)

2b. If the path is found and the user has permission to access it, then it is reconciled with %USERPROFILE% instead

 

When the user later logs off, the profile is reconciled back to the central location – so that on the next logon, wherever it is, the user will have all the changes they made to their profile.

“Reconcile” in this context means to copy up the changed contents only, not the entire profile, so that the operation is quicker.

 

Also on each user object in AD there is a flag that indicates the user must change their password at the next logon, this is most commonly used at user creation or if a helpdesk does a password reset because the user forgot it.
Account tab > “User must change password at next logon”

When a user authenticates to Windows, the above flag is checked to see if they must change their password before their profile is reconciled – but the process needs to have a profile as it is running in the context of the user, so if there is no local copy of the profile for the user, the local default user profile is copied to create it.

Here is the problem – if the user does have a Terminal Services profile specified on their AD user object, following the password change it will be checked to see if it is older or newer than the copy we have locally.
As we just created the profile, it is newer than the timestamp on the central copy and so it is not reconciled – so the user sees that they have none of their personalized settings and may go through the “first run” wizards when launching Internet Explorer or Outlook.
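The effect of that timestamp check can be sketched in Python. This is only a model of the behaviour described above, not the actual profile service logic; the function name central_copy_wins and the dates are invented for the example.

```python
from datetime import datetime


def central_copy_wins(central_modified, local_modified):
    """Model of the check: the central profile is only pulled down
    when there is no local copy, or the central copy is the newer one."""
    if local_modified is None:
        return True
    return central_modified > local_modified


central = datetime(2009, 10, 1, 17, 30)   # user's last normal logoff

# Normal logon, no cached local copy: the roaming profile is used.
normal_logon = central_copy_wins(central, None)

# Forced password change: a local profile was created from the default
# profile moments ago, so it looks newer than the central copy.
just_created = datetime(2009, 10, 2, 9, 0)
after_password_change = central_copy_wins(central, just_created)

print(normal_logon)           # True
print(after_password_change)  # False
```

The second case is the one that bites: the freshly created default profile "wins" purely because of its timestamp.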

 

Given the normal sequence of events, if there is no local copy of the user profile and the user has not got a roaming profile yet, this causes no problem – this is the user’s first logon, so it is perfectly fine to have everything “OOB”.

There is a policy setting which can remove a locally cached copy of roaming profiles when the user logs off, to avoid wasting disk space:
Computer Configuration > Policies > Administrative Templates > System > User Profiles > “Delete cached copies of roaming profiles”

If the above policy is enabled on the Terminal Server, every logon for the users with roaming profile has to go through a full reconciliation – so the user is basically guaranteed to hit the problem when forced to change their password during logon.

And worse, when the user logs off their profile is reconciled back to the roaming location as it is considered updated.

 

So how to avoid this problem?

Instead of specifying the unique Terminal Server profile paths on a per-user basis, there is a policy which can be applied to the Terminal Servers which specifies a UNC path where all of the users’ roaming profiles are held:
Computer Configuration > Policies > Administrative Templates > System > User Profiles > “Set roaming profile path for all users logging onto this computer”

So as an example, if the Terminal Services Profile Path for AD user object testuser1 was set to “\\server1\roaming\testuser1”, the equivalent policy path would be “\\server1\roaming\%username%”.

 

As the same path can be specified in multiple locations, and Windows will use the first one it finds, it is important that the settings checked before this one (1 and 2 below) are not defined.

Here is the order in which the path for a valid user profile is checked by Terminal Services:

1. GPO : Computer Configuration > Policies > Administrative Templates > Windows Components > Terminal Services > Terminal Server > Profiles > “Set path for TS Roaming User Profile”

2. AD user object : Terminal Services Profile tab > “Profile Path”

3. GPO : Computer Configuration > Policies > Administrative Templates > System > User Profiles > “Set roaming profile path for all users logging onto this computer”

4. AD user object : Profile tab > “Profile path”
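The "first one found wins" order above can be expressed as a short Python sketch. The function and parameter names are hypothetical and only mirror the four locations in the list; this is not how Terminal Services is implemented, just a way to reason about which setting takes effect.

```python
def effective_profile_path(ts_gpo=None, ts_user=None, machine_gpo=None, user_profile=None):
    """Return (source, path) for the first profile path that is defined."""
    candidates = [
        ("1. GPO: Set path for TS Roaming User Profile", ts_gpo),
        ("2. AD user object: Terminal Services Profile tab", ts_user),
        ("3. GPO: Set roaming profile path for all users logging onto this computer", machine_gpo),
        ("4. AD user object: Profile tab", user_profile),
    ]
    for source, path in candidates:
        if path:
            return source, path
    return None, None


# The per-user TS path is still set, so the machine-wide policy is ignored.
still_per_user = effective_profile_path(
    ts_user=r"\\server1\roaming\testuser1",
    machine_gpo=r"\\server1\roaming\%username%",
)

# Once settings 1 and 2 are cleared, the machine-wide policy takes effect.
policy_applies = effective_profile_path(
    machine_gpo=r"\\server1\roaming\%username%",
)

print(still_per_user)
print(policy_applies)
```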


Capturing network traffic in Windows 7 / Server 2008 R2

Previously a capture filter driver had to be loaded in order to intercept and record all the packets passing through network interfaces (think WinPcap & NetMon filter drivers).

Now, the ability to create a network trace is in-box with Windows 7 & Server 2008 R2, without even a reboot required!

It is covered in detail over at the Network Monitor blog, but the key bits I will cover here as it’s so simple…

 

In the most basic form, this is how you start capturing all network traffic on the machine with the default settings:
netsh trace start capture=yes

An example of the output from this command:
Trace configuration:
-------------------------------------------------------------------
Status:             Running
Trace File:         C:\Users\padams\AppData\Local\Temp\NetTraces\NetTrace.etl
Append:             Off
Circular:           On
Max Size:           250 MB
Report:             Off

As you can see, the default here is a 250MB circular buffer and the file is stored in a temp folder in the user profile.

 

To later stop recording:
netsh trace stop

This performs some cleanup operations and then reports something like this:
Correlating traces ... done
Generating data collection ... done
The trace file and additional troubleshooting information have been compiled as
"C:\Users\padams\AppData\Local\Temp\NetTraces\NetTrace.cab".
File location = C:\Users\padams\AppData\Local\Temp\NetTraces\NetTrace.etl
Tracing session was successfully stopped.

The .CAB file produced contains various configuration diagnostics files, and the .ETL file is the trace file… with a little extra.

 

NetMon 3.2 and later is able to open the .ETL file, but in order to make sense of the data you need to tweak a couple of things…

With NetMon installed, download the Network Monitor Open Source Parsers package and install it.

Launch NetMon, then click on Tools / Options and select the Parser tab.

Select the Windows parser, click the Stubs button (to toggle “Stub” to “Full”).

Click the up arrow then the down arrow, then click Save and Reload Parsers, then click OK.

 

Now you can load your .ETL files created with netsh and the conversations should be readable – if you want to save the file as a regular NetMon .CAP file, you can of course do so.

The ETL format trace will give you a system configuration summary in the first conversation, and the process name and PID associated with each frame, so it provides more than just a pure traffic trace and takes some of the guesswork out of network trace analysis.

If you need to take a trace of the system starting up, you can add “persistent=yes” on the netsh line starting the trace – as soon as you log on you can stop the trace and save the file.
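If you want to script the start and stop steps, a minimal sketch might look like the following. It only builds the command lines from the options mentioned in this post (capture=yes and persistent=yes); the function names are my own, and nothing is executed.

```python
def trace_start_command(persistent=False):
    """Build the argument list for starting a capture with netsh."""
    args = ["netsh", "trace", "start", "capture=yes"]
    if persistent:
        # Keeps the trace running across a reboot, for startup captures.
        args.append("persistent=yes")
    return args


def trace_stop_command():
    """Build the argument list for stopping the capture."""
    return ["netsh", "trace", "stop"]


if __name__ == "__main__":
    # Printed rather than executed; on Windows these lists could be passed
    # to subprocess.run(..., check=True) from an elevated prompt.
    print(" ".join(trace_start_command(persistent=True)))
    print(" ".join(trace_stop_command()))
```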

Posted by Paul Adams | 0 Comments

Network layer tweaks in Windows Server 2008

KB article 951037 describes some of the new features in the OS related to the network layer, some similar to the “Scalable Networking Pack” released for Windows Server 2003 (included in SP2).

Some environments (NICs, switches, routers) do not behave well with these new features and unpredictable symptoms can crop up with no apparent pattern due to this.

The KB article mentioned above has a lot of detail on the options, and I leave it as an exercise to the reader to look up exactly what each feature does, but here is a summary of how to check your current settings and toggle them OFF.

 

NOTE: All netsh commands are executed from an elevated command prompt

To view your current settings:
netsh int tcp show global

This will display something similar to the following:
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
NetDMA State                        : enabled
Direct Cache Acess (DCA)            : disabled
Receive Window Auto-Tuning Level    : normal
Add-On Congestion Control Provider  : none
ECN Capability                      : disabled
RFC 1323 Timestamps                 : disabled

 

To see the valid settings for each option:
netsh int tcp set global /?

To disable the TCP chimney offloading feature:
netsh int tcp set global chimney=disabled

To disable the Receive Side Scaling (RSS) feature:
netsh int tcp set global rss=disabled

To disable the NetDMA feature you need to edit the registry and reboot:
Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Name: EnableTCPA
Type: REG_DWORD
Data: 0
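If the same value has to go onto several servers, one option is to generate a .reg file and import it. The sketch below renders a single REG_DWORD value in the Registry Editor 5.00 text format; the function name is my own and it only produces the text, it does not touch the registry.

```python
def reg_file_text(path, name, value):
    """Render one REG_DWORD value in .reg (Registry Editor 5.00) syntax."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("REG_DWORD must fit in 32 bits")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{path}]\n"
        f'"{name}"=dword:{value:08x}\n'  # DWORD data is written as 8 hex digits
    )


if __name__ == "__main__":
    print(reg_file_text(
        r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
        "EnableTCPA",
        0,
    ))
```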

 

To disable offloading features on the NICs themselves:
You may find offloading features on the properties of the NIC drivers in Device Manager, and as this is determined by the manufacturers there is no standard for naming or number of options, but look for and disable any reference to “offload”.

TCP checksum offloading is a strange beast in itself – I have had customers report better performance when this setting is turned off in the NIC properties, and even had servers stop bugchecking due to NIC driver implementation of it.

The other side effect is that network traces will show “TCP checksum invalid” for all outbound packets, because the NIC is calculating and adding the checksum after the filter driver has “captured” the packet on its way out – this often makes people nervous that they have a hardware problem.

The theory is that giving the checksum calculation to the NIC saves the CPUs some work, but personally I have never seen anything but problems from this feature and never a degradation of performance by turning it off.
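To show why a trace taken on the sending machine reports “TCP checksum invalid”, here is a sketch of the 16-bit one's complement checksum from RFC 1071 applied to the same bytes twice: once with the checksum filled in (what goes on the wire) and once with the field not yet filled in (what the capture sees before the NIC does its work). It is simplified: a real TCP checksum also covers a pseudo-header, and the placeholder left in the field is shown as zero purely for illustration.

```python
def internet_checksum(data):
    """16-bit one's complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verifies(data_with_checksum):
    """Data is valid when the checksum over everything comes out as 0."""
    return internet_checksum(data_with_checksum) == 0


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")  # sample bytes from RFC 1071
    checksum = internet_checksum(payload)
    on_the_wire = payload + checksum.to_bytes(2, "big")
    as_captured = payload + b"\x00\x00"  # field not yet filled in by the NIC
    print(hex(checksum), verifies(on_the_wire), verifies(as_captured))
    # prints: 0x220d True False
```

The receiving machine sees the "on the wire" version, so a trace taken there is the one to trust when deciding whether checksums really are bad.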

 

Now – which to disable, and how to know if it helps?

This is where you have to do some constructive testing – if you are experiencing connectivity problems with a server then you should determine the frequency of the problem, or even better if there is something that makes it predictable/reproducible.

If it is just “slower than expected network throughput” then looking at a network trace for dropped/munged/duplicate packets would be a place to start, as well as seeing what the problem machines have in common (i.e. physical location on the network, is there a router or firewall in common?).

I would strongly recommend noting the current settings before making any changes, and also taking a baseline using a documented test procedure several times (to allow for variance and caching) – then make ONE change from the above options and repeat the test to see if there is a constant, noticeable impact.
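As a rough illustration of comparing a baseline against a single change, the sketch below summarises two sets of repeated measurements. The function name, the sample figures and the "more than twice the run-to-run spread" rule of thumb are all assumptions for the example, not a recommended statistical test.

```python
from statistics import mean, stdev


def compare_runs(baseline, changed):
    """Summarise two sets of repeated measurements (e.g. MB/s per run)."""
    base_avg, new_avg = mean(baseline), mean(changed)
    # Use the larger spread as a rough yardstick for run-to-run noise.
    noise = max(stdev(baseline), stdev(changed))
    delta = new_avg - base_avg
    return {
        "baseline_mean": base_avg,
        "changed_mean": new_avg,
        "delta": delta,
        "noticeable": abs(delta) > 2 * noise,  # arbitrary threshold
    }


if __name__ == "__main__":
    # Made-up throughput figures from four runs before and after one change.
    print(compare_runs([100, 102, 98, 101], [120, 119, 121, 122]))
```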

For different environments, there may be different “sweet spots” so a combination of enabled & disabled features might need to be tested – there isn’t a silver bullet here, unfortunately.

Also make sure to test from multiple clients, as improving performance or resolving issues for one set of clients can sometimes have a negative impact on others, especially in heterogeneous environments.

If a change is made and has no effect, I would recommend returning it to its default setting.

Posted by Paul Adams | 0 Comments

Windows Server 2003 (x86) tuning for performance based on role

Yes, this is rather late in the day to start talking about W2K3 as we’ve had 2 new versions of Windows Server since then, but it’s still a widespread OS and it might be interesting to understand how to make some subtle tweaks to tailor the system resources and behaviour to suit your needs.

The three typical significant roles that I encounter servers in are:
- File Server
- Domain Controller (DC)
- Terminal Server

(The colour coding here is just to help indicate which tweaks are intended for which roles, at a glance.)

There are obviously a large number of other roles such as IIS, Exchange, Hyper-V host, etc. but for each of these I think they are outnumbered vastly by the above roles so I am going to cover those – hopefully with the details of what the changes do, you can determine whether to test altering them on your other servers.

And yes, we are focusing mainly on x86 (32-bit) servers here – once you go 64-bit a lot of these changes become irrelevant as the ceiling is raised implicitly with the extra address space.

 

Get your priorities right

On the context menu of My Computer, click Properties
On the System Properties window presented, select the Advanced tab, click the Settings button under Performance
On the Performance Options window presented, select the Advanced tab

Here you will see Adjust for best performance of:
- Programs
- Background services

What this setting influences is the quantum used for thread execution – how much time they get to run on a processor without interruption from threads at the same or lower priority.

For programs to appear more responsive to the user, a shorter quantum is preferred, so more context switching occurs.
Server services prefer to run without being bothered with so many context switches, so prefer a longer quantum.

A Terminal Server hosts user sessions and has many processes directly accessed by interactive users, so should have the Programs radio button selected.

A file server or DC on the other hand has little direct user interaction, so we want to extend the quantum and optimize for Background services.
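The trade-off between quantum length and context switching can be seen in a toy round-robin model. This is a deliberately simplified sketch: all threads have the same priority and never block, and the millisecond figures are arbitrary rather than the quantum lengths Windows really uses.

```python
from collections import deque


def count_context_switches(work_ms, quantum_ms):
    """Run equal-priority threads round-robin; count switches between them."""
    queue = deque(enumerate(work_ms))  # (thread id, remaining work)
    switches = 0
    while queue:
        thread_id, remaining = queue.popleft()
        remaining -= quantum_ms
        if remaining > 0:
            queue.append((thread_id, remaining))  # quantum expired, requeue
        if queue and queue[0][0] != thread_id:
            switches += 1  # a different thread gets the processor next
    return switches


if __name__ == "__main__":
    threads = [100, 100, 100]  # three threads with 100 ms of work each
    for quantum in (10, 50):
        print(quantum, count_context_switches(threads, quantum))
```

The shorter quantum gives each thread a turn sooner (better responsiveness) at the cost of many more switches; the longer quantum does the same work with far fewer interruptions.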

 

The other radio button selection relating to Memory Usage toggles LargeSystemCache off (tune for programs) and on (tune for system cache) – the default is enabled on Windows Server SKUs, but again Terminal Servers can be considered “multiple user desktop” servers and so would prefer to have the workstation default, to tune for programs instead.

 

Dipping in the pool

For all roles, it can be useful to make the Memory Manager more aggressive when it comes to trimming paged pool allocations – by default this occurs at the 80% watermark, but this can lead to the server being unable to satisfy requests before it gets round to cleaning up – so reducing this watermark to 60% makes the housekeeping kick in earlier:

Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
Name: PoolUsageMaximum
Type: REG_DWORD
Data: 60 (decimal)

 

For Terminal Servers it is useful to have a paged pool that is as big as possible. While an algorithm at startup determines the size of the paged pool region, we do have the option to indicate that we would like it to be given preference (at the cost of Page Table Entries, PTEs):

Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
Name: PagedPoolSize
Type: REG_DWORD
Data: ffffffff (hexadecimal)

(This is the same setting that we recommend if you are getting Srv 2020 events after trying the more aggressive trimming tweak above.)

 

Giving to the givers (File Server & DC specific)

When it comes to file servers and DCs specifically, we want to tune for the Server (LanmanServer) service to get some love as they will be receiving many SMB connections – this can be done through some registry tweaks:

Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Name: MaxWorkItems
Type: REG_DWORD
Data: 65535 (decimal)

65,535 is the maximum you can set, and this value specifies the number of receive buffers that the Server service can allocate at any time – the default is a calculation made based on system resources during startup, so we are influencing this decision to suit our needs.

 

These values set the minimum and maximum number of preallocated connection objects respectively:

Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Name: MinFreeConnections
Type: REG_DWORD
Data: 128 (decimal)

and

Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Name: MaxFreeConnections
Type: REG_DWORD
Data: 1024 (decimal)

(These are the same settings that we recommend if you are getting Srv 2022 events.)

 

Terminal station (Terminal Server specific)

Terminal Servers act and should be treated more as “very busy clients” than servers – think about the probability of concurrent AD user logons, roaming or mandatory profile copying, files opened across the network, applications making connections to mail or database servers, and so on.

 

Resultant Set of Policy (RSoP) is useful for troubleshooting, but it can impact performance during “normal” operation, so it can be turned off by enabling the following group policy:

Computer Configuration / Administrative Templates / System / Group Policy / Turn off Resultant Set of Policy

 

Post-SP1 hotfix from KB319440 (rolled into SP2) gives control of buffering group policy reads which can improve logon times if concurrent logons are causing blocking operations when users are trying to access the same policies:

Path: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Winlogon
Name: BufferPolicyReads
Type: REG_DWORD
Data: 1

 

There is a Workstation (LanManWorkstation) service tweak which increases the number of concurrent outbound network calls:

Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters
Name: MaxCmds
Type: REG_DWORD
Data: 2048 (decimal)

 

Also network related, this tweak makes Explorer more responsive by cutting down on the (metadata) information queries made when browsing network shares, especially those with many, many files or folders:

Path: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer
Name: NoRemoteRecursiveEvents
Type: REG_DWORD
Data: 1

Name: NoRemoteChangeNotify
Type: REG_DWORD
Data: 1

(This tweak can be pushed out to clients in a large environment as it applies to Explorer more than the concurrent user nature of Terminal Services.)
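To keep track of which registry value is meant for which role, the tweaks in this post can be collected into a small lookup table. The structure, role abbreviations and function name below are my own; the value names and data are the ones listed above, and the two Explorer values are filed under Terminal Server simply because that is the section they appear in.

```python
# Summary of the registry tweaks covered in this post, keyed by value name.
# Roles: "FS" = File Server, "DC" = Domain Controller, "TS" = Terminal Server.
TWEAKS = {
    "PoolUsageMaximum":        {"data": 60,         "roles": {"FS", "DC", "TS"}},
    "PagedPoolSize":           {"data": 0xFFFFFFFF, "roles": {"TS"}},
    "MaxWorkItems":            {"data": 65535,      "roles": {"FS", "DC"}},
    "MinFreeConnections":      {"data": 128,        "roles": {"FS", "DC"}},
    "MaxFreeConnections":      {"data": 1024,       "roles": {"FS", "DC"}},
    "BufferPolicyReads":       {"data": 1,          "roles": {"TS"}},
    "MaxCmds":                 {"data": 2048,       "roles": {"TS"}},
    "NoRemoteRecursiveEvents": {"data": 1,          "roles": {"TS"}},
    "NoRemoteChangeNotify":    {"data": 1,          "roles": {"TS"}},
}


def tweaks_for(role):
    """Return {value name: data} for the tweaks that apply to one role."""
    return {name: t["data"] for name, t in TWEAKS.items() if role in t["roles"]}


if __name__ == "__main__":
    for role in ("FS", "DC", "TS"):
        print(role, tweaks_for(role))
```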

 

This is a very brief start at looking at what performance gains you might see on busy servers, or environments with slow/latent networks, or file servers with hundreds of thousands of files being browsed by multiple users.

Any of the registry values can be looked up on MSDN or TechNet if you’re interested in the “official” descriptions of what they do.

And as always, note any changes you make to server configurations and back up beforehand.

Posted by Paul Adams | 0 Comments

Installing Windows 7 from a bootable USB memory stick

During the beta testing of Windows Vista I used DVD-RW discs to burn daily builds every few days to put onto a 32-bit laptop and a 64-bit desktop machine, as there are some things you just can’t see in a virtual environment – I was impressed then that the clean system to desktop time was ~32 minutes thanks to the new “WIM” installation method.

During Windows 7 beta testing, I decided to try out a bootable USB memory stick as the installation source – I was very impressed to see the clean installation time drop to ~15 minutes.

Quick tip – don’t change your BIOS device boot order to put USB before HDD or you will get stuck in a boot loop at the first reboot until you unplug the USB device and restart again.
Instead, many PCs have the option to hit a key during POST to select a one-time boot device – the first boot sequence prepares the partition you selected and copies over the entire source data to continue installation by booting from the HDD – after this boot you don’t need the installation media any more.

Given the incredibly low cost of USB memory sticks, I have one with the 32-bit version and a separate one with the 64-bit version – I can use the extra storage for holding extra installers such as Windows Virtual PC & XP Mode, Windows Live Essentials, VPN and AV software, etc.
It is much easier to maintain the images and software on a fast, small device than to re-burn an entire DVD image which is comparatively slow and subject to scratches (or in my case being borrowed and never returned, so I have to burn a new one).

How to go about setting up a USB memory stick as a Windows installation source?
Rather than reinvent the wheel, Jeff Alexander’s blog has a perfect step-by-step guide on how to prepare a USB memory bootable device for installs so I will just refer you there.

Posted by Paul Adams | 0 Comments

Invisible Windows

This is a quick tip on how to resolve what can be a totally baffling experience, at least the first time you encounter it – it’s been around since day 1 and I just had it occur again on Windows 7 with a RemoteApp .NET application…

You launch an application and you can see it on the task bar (or just in Task Manager in the days before Windows 95) so you know it’s running, but there is no window displayed on the screen – you can select and interact with it through the keyboard, but where is it?

Assuming a single monitor for simplicity, the top-left corner is (0,0) in the 2D plane of the desktop, with the x ordinate increasing to the right and the y ordinate increasing down.

Below is an illustration of a 320x240 physical screen (the grey box) and 3 application windows rendered on the virtual desktop, and their top-left coordinates displayed:

[Illustration: virtual_desktop]

This is to show that the virtual desktop is much bigger than the physical display, and it is possible for ordinates to go negative (otherwise windows would not be draggable off the left or top sides of the screen) – the green application window is the only one visible to the user.

Some applications keep a record of their virtual desktop properties – in particular the coordinates of the top-left corner and the x,y dimensions – which allows them to always open in the same position when the user launches them.

 

So how can windows end up on the virtual desktop in a position not rendered by the physical display?
- application writes incorrectly to the registry to store its position/size
- change of screen resolution reduces the rendered area
- number or position of physical screens changes*

In my case today, the RemoteApp application remembered its position as on the right-hand screen of a dual-display 1600x1200 setup, something like (2000,100) for its top-left corner – but I connected from a client with a single display of 1920x1200 resolution.

So the task appeared on Windows 7’s task bar as running, but I couldn’t see it.
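The situation can be modelled as a rectangle intersection test. In this sketch the window position comes from my example above, but the window size is made up, the two 1600x1200 screens are assumed to sit side by side with the primary on the left, and the function names are hypothetical.

```python
def intersects(window, screen):
    """Rectangles are (left, top, width, height) in virtual desktop pixels."""
    wl, wt, ww, wh = window
    sl, st, sw, sh = screen
    return wl < sl + sw and wl + ww > sl and wt < st + sh and wt + wh > st


def is_visible(window, screens):
    """True if any part of the window falls on any physical screen."""
    return any(intersects(window, s) for s in screens)


if __name__ == "__main__":
    window = (2000, 100, 800, 600)  # size is made up; position is from the text
    dual = [(0, 0, 1600, 1200), (1600, 0, 1600, 1200)]
    single = [(0, 0, 1920, 1200)]
    print(is_visible(window, dual), is_visible(window, single))
    # prints: True False
```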

 

* In a multi-display setup (0,0) is the top-left corner of the primary display, the default monitor used for full-screen apps, but it is perfectly valid to have additional screens above or to the left of the primary – this makes at least 1 ordinate of every pixel on these displays negative.

This is perfectly valid but something to bear in mind, as I have had a couple of apps which assumed their windows and modal forms would have positive offsets, and behaved oddly if either ordinate was negative.

 

How to resolve this, if you can’t click on the title bar of the window to drag it into the display?
(This is not specific to RemoteApps, and has been a trick I have used since Windows 3.1 days.)

For Windows 95-Vista, right-click the task in the task bar and click Move, then tap a directional arrow key on the keyboard once, then move your mouse – you will find the application window is attached to the mouse cursor until you click to drop it.

For Windows 7 you do the same, but right-click the preview window itself (right-clicking the task will display the jump list, which is just “Close window” for RemoteApps).

Just make sure the window is not minimized first ;)

Alternatively, if you know where the application stores its data in the registry, you could edit the position whilst the app is not running – but there is no way to know for sure where this information is stored, or in what format (it could be a tiny part of a giant binary blob, for example).

(Remember that RemoteApps will be looking at the registry on the remote machine, too!)
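If you do go down the route of editing a stored position, the new top-left corner has to be worked out from the screen the window should land on. A minimal sketch, using (left, top, width, height) tuples and a made-up function name:

```python
def clamp_to_screen(window, screen):
    """Return a new (left, top) that keeps the window inside one screen."""
    left, top, width, height = window
    s_left, s_top, s_width, s_height = screen
    # Furthest right/down the top-left corner can go while fully visible;
    # a window larger than the screen is pinned to the screen's top-left.
    max_left = max(s_left, s_left + s_width - width)
    max_top = max(s_top, s_top + s_height - height)
    return (min(max(left, s_left), max_left), min(max(top, s_top), max_top))


if __name__ == "__main__":
    # The off-screen window from the example, moved onto a 1920x1200 display.
    print(clamp_to_screen((2000, 100, 800, 600), (0, 0, 1920, 1200)))
```

The same function works for a screen to the left of or above the primary, because it never assumes the coordinates are positive.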

Posted by Paul Adams | 0 Comments

Windows NT History

As it has just turned 16 years since the release of the first “NT” version of Windows, I thought it would be interesting to have a quick overview of the major releases and service packs to show the evolution of Windows up to today.

I have deliberately omitted some variants to avoid it being too cluttered (e.g. IA64, Home Server, Embedded, Windows for Legacy PC versions), and my exposure to NT started with 4.0 so I don’t know the dates for the service packs in 3.5 and 3.51.

The entries have been roughly grouped by common kernel, and the “Kernel” column is the version at release (build numbers omitted for brevity).

[Illustration: nt_history]

E&OE ;)

Posted by Paul Adams | 0 Comments

Managing a standalone R2 Hyper-V from domain-joined Windows 7 client

After doing an in-place upgrade of my Windows Server 2008 SP2 Hyper-V host to bring it up to Windows Server 2008 R2, I decided to have a bash at setting up the Hyper-V management tools on my Windows 7 client.

My setup is as follows:
w2k8
- Windows Server 2008 R2 x64
- Standalone server
- Hyper-V role installed
- hosts the domain controller VM for the domain I use at home
win7
- Windows 7 Ultimate x64
- Member of the home domain

 

In the previous incarnation of the environment I use at home, the Hyper-V host was running Windows Server 2008 x64 SP2 and the client was running Windows Vista x64 SP2 – I used John Howard’s excellent blog entry to be able to administer the Hyper-V host from my Vista client, with there being no common security authority for them:

http://blogs.technet.com/jhoward/archive/2008/03/28/part-1-hyper-v-remote-management-you-do-not-have-the-requested-permission-to-complete-this-task-contact-the-administrator-of-the-authorization-policy-for-the-computer-computername.aspx

 

A note on the in-place upgrade of the server, I made sure to follow John’s other blog entry to avoid running into any gotchas with the VMs:

http://blogs.technet.com/jhoward/archive/2009/05/05/hyper-v-and-in-place-upgrade-to-windows-server-2008-r2-release-candidate.aspx

 

After the upgrade was done and I had the Windows 7 client clean installed, I installed the RSAT for Windows 7 on the client and then went through the steps detailed in John’s blog.

The only thing to note in my case: most of the server settings were carried across with the in-place upgrade, but I did need to reconfigure the DCOM and reboot.

I am pleased to say that once the instructions had been followed, the Hyper-V Manager console fires up and connects without a problem to the Windows Server 2008 R2 standalone host from the domain member Windows 7 client.

Posted by Paul Adams | 0 Comments

VM Networking Improvements in Hyper-V in Windows Server 2008 R2

On the What's New in Windows Server 2008 R2? page you can see one of the improvements for Hyper-V has been improved virtual networking performance, and specifically “offloading” has been extended through to the child partitions for those physical NICs that offer it.

I did an in-place upgrade of my home server today, from Windows Server 2008 x64 SP2 to Windows Server 2008 x64 R2 – the upgrade itself went flawlessly, and I made sure to adhere to the instructions in KB957256 (basically ensuring none of the Hyper-V VMs had any snapshots or saved states before starting).

Following the upgrade I checked my VMs started okay and connected to them through the Hyper-V Manager console without any hitch, the Integration Services were upgraded and the VMs restarted.

All looked well with the (virtual) world.

Then I tried to access a website running on one of the VMs over the (physical) network, and the browser struggled to render the page – I tried to browse the folders on a file server VM and during a file copy it decided the server was no longer reachable.

I was even unable to establish a Remote Desktop session with any of the VMs – I just got a black screen, then was disconnected after a timeout (also a TermDD

Connecting to the VMs through vmconnect.exe was fine, the machines themselves were not hung, or even laboured… so what was wrong?

The properties of the NIC inside each VM, on the Advanced tab, had a large selection of offloading options, all enabled – once these were disabled the servers were restored to their normal (good) speed.
(The host in question is not a server-class system, and has 2 onboard NICs on the motherboard.)

The issue is the NIC or its driver on the host does not properly employ offloading – I have seen these symptoms on servers before where disabling offloading has had a remarkable improvement for networking performance (one system was even bugchecking when starting a website in IIS when TCP offloading was enabled!).

It may sound counter-intuitive, and it’s not guaranteed to be a panacea for network throughput or disconnection issues, but it might be worth checking out (and not only in Hyper-V VMs on R2) if you encounter such symptoms.

Posted by Paul Adams | 0 Comments

What's So Special About The Pool?

One of the tools we use to troubleshoot pool memory corruption is the gflags.exe option "special pool", but what exactly does it do?

In order to get to this, let's first look at what pool memory is - specifically looking here at versions of Windows before Vista (as this changed dramatically with the new memory management improvements in the NT 6.0 kernel).

Pool memory comes in 2 flavours: paged and nonpaged - the only difference being that the first type can be paged out to disk if the physical memory it is using is needed and has not been touched in a while.


The following examples are all based on the x86 architecture using system defaults (i.e. no /3GB switch)...

The regions of pool memory are subsets of the kernel virtual address space (0x80000000-0xFFFFFFFF) and their sizes are calculated at boot time as they are derived from the physical memory present and registry values that can influence or dictate how they should be.

The absolute maximum size of nonpaged pool is 256MB, and the maximum for paged pool is ~650MB.

The pools are used for dynamic memory allocations by threads running in kernel mode - nonpaged pool is typically used by drivers, as they need to be guaranteed their buffers will be instantly available: code running at elevated IRQL can't incur a page fault (a read from disk to get virtual memory contents back from the pagefile), as this would bugcheck the machine (IRQL_NOT_LESS_OR_EQUAL).


Small memory pages are 4K in size, but pool allocations are very often not a full page - so different drivers can have neighbouring allocations in the same page.

[Illustration: pool_memory_page]

In the above illustration, 3 drivers have made pool allocation requests of ~500 bytes in the order A,B,C,A,C.

For a pool request of X bytes, X+8 bytes are actually required as we need to prepend a header to the data - this allows us to see what areas are freed later for re-use, and by keeping tabs on the previous allocation size we have a basic form of integrity checking that the list is good.

The last 4 bytes of the 8-byte header are for the pooltag - from Windows Server 2003 onwards this is enabled by default and gives us some way to see who made the pool allocation requests.
For example, the pooltags here could be something like DrvA, BBBB and CcCc


So, Driver A makes a pool allocation request and it is granted, it is given the address of where it can write to its buffer of, say, 500 bytes (technically, through rounding up this would be 504 bytes).
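The arithmetic for the 500-byte request can be written out as follows. This is a sketch of the x86 figures described above (an 8-byte header and rounding up to a multiple of 8 bytes); the constant and function names are my own.

```python
POOL_HEADER_BYTES = 8  # header size on x86, as described above
GRANULARITY = 8        # buffers are rounded up to a multiple of 8 bytes


def pool_block_size(requested):
    """Return (usable buffer bytes, total bytes including the header)."""
    data = -(-requested // GRANULARITY) * GRANULARITY  # round up to multiple of 8
    return data, POOL_HEADER_BYTES + data


if __name__ == "__main__":
    print(pool_block_size(500))  # prints: (504, 512)
```

So eight such allocations fit exactly into one 4K page, each one butted up against the header of the next.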

What is in place to make sure the driver doesn't write more than its buffer will hold?
Nothing.

Code running in kernel mode is already privileged enough to do practically anything, so the drivers are trusted to be tidy and play nicely with the other children.

But buffer overruns occur - code is written by humans, and humans are prone to errors.


If Driver A wrote 600 bytes in its buffer, at the 505th byte it would start to overwrite the pool header of Driver B's allocation, corrupting it.
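The overrun can be simulated with a byte array standing in for the 4K page. The header layout here is heavily simplified (4 bytes of size information followed by the 4-byte pool tag, rather than the real bit fields), and the helper names are made up; the point is only to show that A's header survives while B's does not.

```python
import struct

PAGE = 4096
HEADER = 8
DATA = 504  # 500-byte request rounded up to a multiple of 8


def make_page(tags):
    """Lay out back-to-back allocations; return the page and data offsets."""
    page = bytearray(PAGE)
    offsets = []
    pos = 0
    for tag in tags:
        # Simplified header: 4 bytes of size information + 4-byte pool tag.
        struct.pack_into("<I4s", page, pos, HEADER + DATA, tag)
        offsets.append(pos + HEADER)
        pos += HEADER + DATA
    return page, offsets


def header_intact(page, data_offset, tag):
    """Check the header that sits immediately before a buffer."""
    size, found = struct.unpack_from("<I4s", page, data_offset - HEADER)
    return size == HEADER + DATA and found == tag


if __name__ == "__main__":
    page, (a, b) = make_page([b"DrvA", b"BBBB"])
    page[a:a + 600] = b"\xAA" * 600  # Driver A writes 600 bytes into 504
    print(header_intact(page, a, b"DrvA"), header_intact(page, b, b"BBBB"))
    # prints: True False
```

Nothing fails at the moment of the write; the damage is only noticed when somebody next looks at B's header.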

This will not bugcheck the machine when this write is done - the only time we can get a problem is if the data Driver B relies on in its allocation (which it assumes to be good, as it put it there) causes it to do something unexpected.
Or, if Driver B says it is done with its allocation and the memory manager can free it.

These situations may occur AGES after Driver A corrupted the memory, or maybe it could be milliseconds, who knows.

The point is, if the machine bugchecks then as it lies dying and spasming on the ground it will muster its last energy to point at the guy who killed it... Driver B, the innocent one (just another victim).


So, we load up the debugger and look at the memory dump once the system is back up - it lets us know that its opinion is a pool corruption "probably" caused by Driver B, or maybe ntoskrnl.exe, or perhaps win32k.sys (the last 2 are VERY unlikely to be the real cause, but the debugger is a simple beast and doesn't know any better).

It will likely report that this looks like pool corruption, so all bets are off - no way to be certain who made this mess.


So, in comes our friend Special Pool.

When we enable special pool on specific drivers, any pool allocation requests they make are treated differently - by default we assume they could be overrunning their buffers (we can select to check for underruns instead if we believe this to be the case, but these are rare).

Pool allocations made by the marked drivers are now made from a separate, smaller, "special" pool region and they have an entire 4K page all to themselves, regardless of how much they requested.

The pool allocation is made at the end of the page, so the last byte of the buffer is the very last byte of the 4K page.
The next page is marked as a "guard" page - non-writeable - and through the default behaviour of the memory manager if it receives a request to write here, it will bugcheck the machine.

So now, if Driver A writes more than it was allowed to, the machine bugchecks immediately and the OS & debugger would point directly to the real culprit, still holding the smoking gun.

[Illustration: special_pool]

Consider how wasteful this is; a 500-byte pool allocation request has just consumed 2 pages - 8K.
This is the reason it's not enabled by default, and also why you should be selective with which drivers are tagged to use it when troubleshooting.
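A toy model of the special pool layout makes both the immediate failure and the cost visible. This is a sketch under the description above (buffer placed so that it ends on the last byte of the page, followed by a guard page); the class and exception names are made up, and the exception merely stands in for the bugcheck.

```python
PAGE = 4096


class GuardPageHit(Exception):
    """Stands in for the bugcheck raised on a write to the guard page."""


class SpecialPoolAllocation:
    def __init__(self, size):
        if not 0 < size <= PAGE:
            raise ValueError("sketch handles allocations up to one page")
        self.size = size
        self.page = bytearray(PAGE)
        self.start = PAGE - size  # buffer ends on the last byte of the page

    def write(self, offset, data):
        if offset < 0:
            raise ValueError("underruns are not modelled in this sketch")
        end = self.start + offset + len(data)
        if end > PAGE:
            # The overrun is caught at the moment it happens.
            raise GuardPageHit(f"write ran {end - PAGE} bytes into the guard page")
        self.page[self.start + offset:end] = data

    @property
    def bytes_consumed(self):
        return 2 * PAGE  # one data page plus one guard page


if __name__ == "__main__":
    alloc = SpecialPoolAllocation(500)
    alloc.write(0, b"\x00" * 500)      # fits exactly
    print(alloc.start, alloc.bytes_consumed)
    try:
        alloc.write(0, b"\xAA" * 600)  # Driver A's overrun
    except GuardPageHit as hit:
        print("caught:", hit)
```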

If the system runs out of special pool, then it will allocate from regular pool without the guard pages - so it really doesn't do any good to make all drivers use it on busy systems.

Typically we recommend using the custom option when enabling special pool, and turning it on for all drivers that are not from Microsoft or Microsoft Corp. – sort the list by the vendor column and this selection becomes a lot easier (just remember to check before and after “M” ;)

 

The memory manager was redesigned in Vista, so there are no fixed regions for pool memory any more and the limits don’t apply – memory pages are allocated for different uses as needed, on demand.

Posted by Paul Adams | 0 Comments