
Hyper-V VM State Dump Tool

vm2dmp is a newly released tool to create a complete memory dump of a Hyper-V virtual machine:
http://code.msdn.microsoft.com/vm2dmp

Along with the Debugging Tools for Windows, it can be used to view the valid memory pages of a virtual machine’s snapshot state or saved state.

This is very useful if you have a hung VM that was not started in debug mode or configured for a manual dump, or one that is so frozen that even the Integration Services components cannot produce a manual dump from inside the guest.

Also, as it doesn’t actually bugcheck the guest, it can be used in a similar way to ADPlus’ “hang mode”.

Posted by Paul Adams | 0 Comments

Hyper-V Data Exchange Service & “Error 87: The parameter is incorrect” on startup

Recently we came across a case where Windows Server 2008 and Windows Server 2008 R2 servers running under Hyper-V were all failing to start one of the (five) Integration Services offered by the host.

The services created by the Integration Services are:
Hyper-V Data Exchange Service
Hyper-V Guest Shutdown Service
Hyper-V Heartbeat Service
Hyper-V Time Synchronization Service
Hyper-V Volume Shadow Copy Requestor

It was the Hyper-V Data Exchange Service (vmickvpexchange) that was failing to start, and an attempt to start it manually resulted in the error:
”Windows could not start the Hyper-V Data Exchange Service service on Local Computer.
Error 87: The parameter is incorrect.”

 

Through trial & error the problem was isolated to a group policy affecting all of the servers, and specifically the setting controlling the startup and security of the TermService service:

Computer Configuration / Policies / Windows Settings / Security Settings / System Services / Terminal Services

(Note that the display name on W2K8 is “Terminal Services” where on W2K8R2 it is “Remote Desktop Services” – but the underlying service name is consistent so the policy will correctly apply to both versions of the OS.)

Enabling this setting, even when leaving the startup type at the default of Automatic, will cause vmickvpexchange to fail in this way – the reason is that TermService runs under the context of Network Service, whereas vmickvpexchange runs under Local Service.

 

vmickvpexchange needs to be able to verify the state of TermService (but it does not need start/stop control), and by default the Security of the service is set to:
SYSTEM (Allow: Full Control)
Administrators (Allow: Full Control)
INTERACTIVE (Allow: Read)

The simple workaround is to add Local Service with (Allow: Read) permission in the group policy setting’s Security list for Terminal Services / Remote Desktop Services.
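
One way to sanity-check the result is to look at the service's security descriptor in SDDL form (for example as printed by "sc sdshow TermService") and confirm that Local Service can query the service status. The Python sketch below is only an illustration: the SDDL string is a hypothetical rendering of the three default entries listed above, the helper names are my own, and I am assuming that the policy's "Read" permission includes the LC (query status) right, so compare against the real output on your own server.

```python
import re

# Hypothetical SDDL for the three default entries: SYSTEM (SY) and
# Administrators (BA) with full control, INTERACTIVE (IU) with read.
DEFAULT_SDDL = (
    "D:(A;;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;SY)"
    "(A;;CCDCLCSWRPWPDTLOCRSDRCWDWO;;;BA)"
    "(A;;CCLCSWLOCRRC;;;IU)"
)

# Assumed "Read" ACE for Local Service (SDDL alias LS).
LOCAL_SERVICE_READ_ACE = "(A;;CCLCSWLOCRRC;;;LS)"


def allowed_rights(sddl, trustee):
    """Collect the two-letter rights granted to trustee by Allow ACEs in the DACL."""
    # Keep only the DACL part: after "D:" and before any SACL ("S:").
    dacl = sddl.split("D:", 1)[1] if "D:" in sddl else ""
    dacl = dacl.split("S:", 1)[0]
    rights = set()
    for ace in re.findall(r"\(([^)]*)\)", dacl):
        fields = ace.split(";")
        # ACE layout: type;flags;rights;object_guid;inherit_guid;trustee
        if len(fields) >= 6 and fields[0] == "A" and fields[5] == trustee:
            codes = fields[2]
            rights.update(codes[i:i + 2] for i in range(0, len(codes), 2))
    return rights


def can_query_status(sddl, trustee="LS"):
    """For a service, the LC right corresponds to SERVICE_QUERY_STATUS."""
    return "LC" in allowed_rights(sddl, trustee)
```

With the default list, can_query_status(DEFAULT_SDDL) is False; appending LOCAL_SERVICE_READ_ACE makes it True. The parser only understands two-letter rights codes, not hexadecimal access masks.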


Remote Desktop Client & credentials saved in UPN format

Another potential head-scratcher from a recent case…

Consider a client machine running Remote Desktop Client and connecting to an RDS server Remote1 in domain Alpha, where you authenticate using UPN format credentials (e.g. mytestuser@alpha.local) and elect to save the credentials into your vault.

If the machine you are connecting from is not in domain Alpha, but is in another domain (Beta), then the first connection will be successful, but attempts to use the saved credentials on subsequent connections will fail with “the specified user does not exist”.

 

The reason for this is that the initial connection attempt where the credentials are entered will get a negative result from a DC for domain Beta (expected, as it cannot resolve the UPN to a user account), but then allows a fallback to NTLM authentication to establish the connection (i.e. the credentials are passed to the RDS server to resolve against a DC in domain Alpha).

When pulling the credentials from the vault, however, the client assumes that they were validated at storage time, and so when the “user not found” error is returned from the local DC, it does not attempt to fall back to NTLM.

 

The workaround for the issue is to store the credentials for the connection in DOMAIN\USERNAME format instead (e.g. ALPHA\MYTESTUSER) – then the credentials are explicitly for another domain and so are passed through as an NTLM authentication request to the RDS server.

(Note that the client will need to have the effective group policy setting “Allow Saved Credentials with NTLM-only Server Authentication” populated with TERMSRV/Remote1 or a matching wildcard string (e.g. TERMSRV/*) to be able to use credentials stored in this format.)

 

The problem will not present if any of the following are true:
- the client machine is not in a domain
- there is no DC available for Kerberos requests in domain Beta
- the client is logged on with a local user account (instead of a user account in domain Beta)

Note that we are talking about saved credentials, not cached credentials – they are completely different beasts.


HTTP.SYS / Cryptographic Services / LSASS.EXE deadlock [addendum]

This is a quick update to my previous blog entry http://blogs.technet.com/mrsnrub/archive/2009/11/19/http-sys-cryptographic-services-lsass-exe-deadlock.aspx.

Note that there is actually a typo in the Rapid Publishing article I pointed to, in the Resolution section:
”HKLM\CurrentControlSet\Serivces\HTTP”
should read:
”HKLM\SYSTEM\CurrentControlSet\Services\HTTP”

 

Also, for your convenience I have made a quick & dirty PowerShell script to add the dependency to the local registry if it is not present – be aware that you will need to allow the execution of unsigned scripts with “Set-ExecutionPolicy RemoteSigned” before trying to run it.

Use this script at your own risk – I’ve tested it very briefly but there is no error checking or backing up of the key/value performed.

Why a PowerShell script rather than a .reg file to double-click?
This preserves the DependOnService value in case it is already present and contains data, plus it can be modified to run remotely if needed (by modifying $sComputerName).

 

$sComputerName = '.'

# Check the version of the OS is exactly 6.0, or the workaround does not apply
$oWin32OS = Get-WmiObject -class Win32_OperatingSystem -namespace "root\CIMV2" -computername $sComputerName
$sVerMajor = $oWin32OS.Version[0]
$sVerMinor = $oWin32OS.Version[2]
If (($sVerMajor -ne '6') -or ($sVerMinor -ne '0')) {
  Write-Host "This script is intended only for Windows Server 2008 (NT 6.0), aborting."
  Exit
}

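# PowerShell does not treat the backslash as an escape character, so single backslashes are used in the key path
$sKey = "SYSTEM\CurrentControlSet\Services\HTTP"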
$sSvc = "DependOnService"
$sDepend = "CryptSvc"

# Connect to local registry
$reg = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey('LocalMachine', $sComputerName)

# Open HTTP service key
$regKey = $reg.OpenSubKey($sKey, $True)

# Get the current contents of value 'DependOnService', if it exists
$aSvcs = $regKey.GetValue($sSvc)

If ($aSvcs -eq $null) {
  # Value does not exist, we need to create it with our 1 dependency
  [string[]]$aSvcs = @($sDepend)
  $regKey.SetValue($sSvc, $aSvcs, 'MultiString')
}
else
{
  # Value does exist, we need to check if the dependency is already set
  $bDependencyExists = $False
  ForEach ($a in $aSvcs) {
    If ($a -eq $sDepend) { $bDependencyExists = $True }
  }

  # Only if it is not already present do we add it to the array and update the value
  If (!$bDependencyExists) {
    $aSvcs += $sDepend
    $regKey.SetValue($sSvc, $aSvcs, 'MultiString')   
  }
}
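
For anyone who would rather see the merge logic on its own, here is a minimal Python sketch of the same idea: create the list if the value is missing, otherwise append only when the dependency is not already there. The function name add_dependency is my own, and it only works on an in-memory list; it does not touch the registry.

```python
def add_dependency(existing, new_service):
    """Return the DependOnService list with new_service present exactly once.

    existing is None when the value does not exist yet, otherwise a list of
    service names. Service names are compared case-insensitively.
    """
    if existing is None:
        # Value does not exist: create it with the single dependency.
        return [new_service]
    if any(name.lower() == new_service.lower() for name in existing):
        # Dependency already set: leave the existing data untouched.
        return list(existing)
    # Preserve whatever was there and append the new dependency.
    return list(existing) + [new_service]
```

Running it twice gives the same result as running it once, which is the property a double-clicked .reg file cannot offer because it overwrites the whole value.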


End of an era… in 6 months

On July 13th 2010, two significant things happen:
1. Windows 2000 Server is no longer supported
2. Windows Server 2003 enters “extended support”

The first point means that there will not even be security updates produced for W2K any more, and it’s officially “self help” if you encounter issues.

The second point means that W2K3 will only be getting security (GDR) updates, delivered by Windows Update.
Also worth noting is that there will not be a Service Pack 3 for W2K3 (which means XP x64 too, don’t forget).

 

Here is the Microsoft Support Lifecycle policy.

The Windows Server Division Weblog has an entry from last September where the news was announced (this is just a 6-month warning for you).

The CSS SQL Server Engineers blog has an entry which mentions this information, as well as 2 SQL Server related support announcements of a similar vein.

 

To avoid rushing at the last minute to get back into a supported configuration, I would recommend you take time to migrate, upgrade or retire any “important” servers still running W2K.

Upgrading to W2K3 would be a rather short-sighted approach, of course, and bear in mind that W2K8R2 is x64 only – there is no in-place upgrade method from any x86 version of Windows to x64, it’s a clean installation.

Don't say I didn’t warn you! ;)


Pre-mortem debug analysis

We’ve looked at generating dumps of processes, the kernel or the entire set of used physical memory pages – but there is another method to do debug analysis on the target directly rather than with a “snapshot” of what it looked like at one point in time, and sometimes this is very useful.

The “live” debug is where we attach a debugger directly to a process (running on the same machine) or a port (on a different machine) to communicate with the kernel.
The latter is very interesting, and until the days of virtualization was very tricky to set up, and was very slow.

A live kernel debug requires that the target (debuggee) is running in debug mode – this instructs the kernel to listen on the specified port for instructions from a debugger running on another machine.
If a break instruction is received, the debuggee is then frozen and control is passed to the debugger.

 

Virtually serial

A virtual machine running on Hyper-V has 2 virtual COM ports available to it, but these cannot map to physical serial ports on the host (which might not even have any), so what use are they?
You can specify a name for a “named pipe”, through which any process running on the host or even on a remote machine can communicate with the “COM” port of the guest.

For example, if we have a Windows Server 2008 guest machine and set its COM1 to use the name “w2k8”, the named pipe path as accessed by the host itself would be:
\\.\pipe\w2k8

 

Making the kernel listen

Inside the guest machine, the BCD needs to hold the configuration that indicates we want to enable kernel debugging, and the settings to use, so the following 2 commands at an elevated command prompt will enable this:

bcdedit /dbgsettings SERIAL DEBUGPORT:1 BAUDRATE:115200
bcdedit /debug ON

The first command indicates kernel debugging is being done via a serial port, which is COM1, at 115,200 baud.
The second command actually enables kernel debug mode during boot.

(Windows versions before Vista had debug settings configured in BOOT.INI instead.)

 

Taking control

On the Hyper-V host you can now launch an elevated WinDbg, select File / Kernel Debug and point to the named pipe path (ensuring “Pipe” is checked and “Baud Rate” is set to 115200).

So long as the host has access to download symbols, you should now be able to handshake with the debuggee kernel to get the version string, and hitting CTRL-BREAK will suspend the debuggee and turn control over to the debugger (until ‘g’ is entered to resume).

While the debugger is connected you will see all the debug spew that the kernel would normally hide from you (if you are running instrumented binaries then this might be a lot).
While the debugger has control the target machine is effectively inaccessible to everyone else, and you can do all the things you can with a .DMP File and more – if you want to crash the target machine to create a memory dump at any time, enter the command ’.crash’.

 

The real power of live debugging, however, is the ability to set breakpoints – allowing the kernel to run as normal until it meets a certain criterion (a CPU register contains a specific value, a memory address is accessed or a point in a function is reached).
As soon as it hits a breakpoint the debugger is passed control immediately – for you to run further commands to investigate a problem, or to have it automatically script some actions and resume if it will be hit a lot.

The command ‘bl’ will list the currently-set breakpoints that you have defined (lost each time the target is reset), and ‘bp’ sets an unconditional breakpoint at the address passed to it.
e.g. To make the debugger break in every time any file is opened:
bp nt!NtCreateFile

To clear (remove) a breakpoint, use ‘bc’ and the number assigned to it (visible via ‘bl’).
To disable but leave defined a breakpoint, use ‘bd’ and its number.
In either of the above 2 cases, an asterisk in place of the number is a wildcard meaning “all breakpoints”.

 

There are also “events” that the kernel can send to the debugger, which it can use for control instead of explicit breakpoints. These are configurable in WinDbg through Debug / Event Filters – you can monitor the creation and exit of threads & processes, loading and unloading of modules, various normally handled exceptions, and much more.

 

As a live debug has access to the entire still-running system, you can look in the kernel address space or the user-mode address space of a particular process – it is important to force a reload when switching context, and the way I always do this is (for a fictional process object address of 12345678, as .process expects the address of the process object rather than a PID):

.process /p /r 12345678

This will set the debugger context in the user-mode portion of this process so that ‘!peb’ will work and the ‘k’ commands will show the function names (symbols permitting) of the user-mode modules instead of memory addresses.


Kernel-mode dump analysis

I’ve already covered the different types of memory dump in a previous blog entry, so this is a quick dip into how we manually trigger a bugcheck to create a memory dump on demand, and also how we can take a look inside the kernel of a running OS without crashing it.

 

Crash Landing

In the event of a hung server, it may be desirable to generate a memory dump manually – all we do here is to deliberately invoke a bugcheck, and the normal error handling takes place for dumping the contents of physical RAM to the pagefile, then extraction from there to MEMORY.DMP on the next restart.

There are many ways to achieve this, but in the case of a total hang, failure to logon and/or unresponsiveness over the network, it limits the choices somewhat.

 

The classic method is “crash on CTRL-SCROLL”, where a PS/2 keyboard is required (along with a registry setting) and SCROLL LOCK is pressed twice whilst holding down the right-hand CTRL key.
This caused a problem when there was a general shift away from the limited value PS/2 ports towards USB keyboards and mice, as this is no longer the same I/O controller.

To get around this problem for Windows Server 2003 (RTM and SP1) there was a hotfix package created, which replaces KBDHID.SYS to allow this to work – this is included in SP2.
The hotfix did not get produced for XP, and did not get put into Vista either… but it is available for Windows Server 2008 SP2, and from Windows 7 onwards.
Ref: Forcing a System Crash from the Keyboard

 

To enable manual dumps via CTRL-SCROLL LOCK with a PS/2 keyboard:
Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
Name: CrashOnCtrlScroll
Type: REG_DWORD
Data: 1

To enable manual dumps via CTRL-SCROLL LOCK with a USB keyboard (where supported):
Path: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\kbdhid\Parameters
Name: CrashOnCtrlScroll
Type: REG_DWORD
Data: 1

 

Some servers, specifically blade servers, do not have keyboards attached to trigger a dump by this method – but they can have a “Non-Maskable Interrupt” (NMI) button which is a hardware method to achieve the same result – there are also some devices that provide a “virtual NMI button” through a web interface or agent, which gets a kernel mode driver to trigger it manually.

To enable a bugcheck when the NMI button is pressed:
Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
Name: NMICrashDump
Type: REG_DWORD
Data: 1
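
If you need to push these three values to several machines, one option is to generate a .reg file from them. The Python sketch below only builds the file text from the paths and names given above; the function and variable names are my own, and you would still need to save the text to a file (regedit normally expects version 5.00 files in Unicode) and import it yourself.

```python
# (key path, value name, DWORD data) for the three settings described above.
CRASH_SETTINGS = [
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\kbdhid\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl",
     "NMICrashDump", 1),
]


def build_reg_file(settings):
    """Return .reg file text that sets each REG_DWORD value in settings."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key_path, value_name, data in settings:
        lines.append("[" + key_path + "]")
        # DWORD data is written as eight hexadecimal digits.
        lines.append('"{0}"=dword:{1:08x}'.format(value_name, data))
        lines.append("")
    return "\r\n".join(lines)
```

Only include the kbdhid entry on versions where the USB keyboard method is supported, as noted above.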

A bugcheck initiated with CTRL-SCROLL LOCK will have a STOP code of 0xE2 (MANUALLY_INITIATED_CRASH), while one triggered by NMI will be 0x80 (NMI_HARDWARE_FAILURE).

Trying to analyze these dumps for a cause of the crash is obviously pointless, as we know it was done on demand by the user.

Pre-Crash Checks

The bugcheck procedure requires a pagefile to which the contents of physical memory are written. Before Windows Vista this had to be present on the boot volume (%systemdrive%), but this is no longer a limitation. The pagefile must, however, be large enough to accommodate the dump file, and the destination must have the same amount of free disk space.

So a Windows Server 2008 machine with 32GB RAM producing a complete memory dump would require a page file of 32GB+50MB, plus the same amount of free disk space in the destination folder (default is %systemroot%\MEMORY.DMP).

The same server producing a kernel memory dump would probably be okay with 2GB+50MB. This will definitely suffice for an x86 server, as 2GB is the upper limit for the size of the kernel address space and the 50MB is overhead for the dump file. Theoretically, though, an x64 server could have a 128GB kernel address space, so it is possible it would need 32GB+50MB, as the kernel address space could consume almost all the physical memory.

Personally, I’ve not seen a kernel memory dump bigger than 1GB, even from an x64 system.
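
The sizing rules above boil down to simple arithmetic, sketched here in Python. The 50MB allowance and the 2GB / 128GB kernel address space figures are the ones used in this post rather than official constants, and the function names are my own.

```python
MB = 1024 ** 2
GB = 1024 ** 3
DUMP_OVERHEAD = 50 * MB  # allowance for the dump file overhead, as used in the text


def complete_dump_pagefile(ram_bytes):
    """Pagefile size needed for a complete memory dump: all of RAM plus overhead."""
    return ram_bytes + DUMP_OVERHEAD


def kernel_dump_pagefile_worst_case(ram_bytes, kernel_space_limit):
    """Worst-case pagefile size for a kernel dump.

    The kernel cannot occupy more physical memory than is installed, nor
    more than its address space limit, so take the smaller of the two.
    """
    return min(ram_bytes, kernel_space_limit) + DUMP_OVERHEAD


ram = 32 * GB
complete = complete_dump_pagefile(ram)                       # 32GB + 50MB
kernel_x86 = kernel_dump_pagefile_worst_case(ram, 2 * GB)    # 2GB + 50MB
kernel_x64 = kernel_dump_pagefile_worst_case(ram, 128 * GB)  # 32GB + 50MB
```

Remember that the same amount again must be free in the destination folder for MEMORY.DMP.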

A Crash-less Crash Dump

One of the tools that Mark Russinovich created is LiveKd – this allows you to run a kernel debugger locally without crashing the server, so you can take a look inside the kernel even if you’ve not booted in DEBUG mode.

A recent update now allows this tool to run on x64 systems, and systems with >4GB RAM installed.

The tool requires the Debugging Tools for Windows, and as it works by taking a “snapshot” of the kernel and does not freeze it, some of the data cannot be relied upon for accuracy (try tuning a car engine whilst it is being driven!).

The Live Debug

Live debugging requires a separate machine running the debugger, connected to the target machine (debuggee) through one of the supported methods – classically a COM port, but now this is possible over FireWire (IEEE 1394), USB and even named pipes in the case of virtual machines.

Whilst the debugger is in control of the debuggee, the debuggee is frozen – so bear this in mind if you ever do this on a production machine, this server will not respond to anything until it is told to resume by the debugger issuing a ‘g’ (go) command.

The debuggee must be started in DEBUG mode – this is controlled through BOOT.INI before Windows Vista, and through BCDEDIT more recently.

The debugger must have access to symbols for the debuggee, at the absolute minimum it needs to be able to make sense of the “nt” module as this is the kernel – without that there is no chance to work out where the data structures are.

As an example, here is how we would set up a virtual machine in Hyper-V running Windows Server 2008 to be live debugged…

In the Settings of the virtual machine you need to specify a name for the pipe which the host will use to communicate with the VM – here I used the name “w2k8com1” which will lead to the named pipe path “\\.\pipe\w2k8com1” on the host.
(If the debug is being done remotely then the server name is used in place of the dot.)
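
As a small illustration of how the pipe path is put together, here is a Python sketch that builds the local or remote form, plus the equivalent WinDbg command line for those who prefer to skip the File / Kernel Debug dialog. The helper names and the server name "HVHOST1" are made up for the example, and the command line options shown (com:pipe, port, resets, reconnect) should be checked against the documentation for your debugger version.

```python
def pipe_path(pipe_name, server="."):
    """Build the named pipe path; '.' means the local machine (the Hyper-V host)."""
    return "\\\\" + server + "\\pipe\\" + pipe_name


def windbg_command(pipe_name, server="."):
    """Command line to start a kernel debug session over the named pipe."""
    return "windbg -k com:pipe,port={0},resets=0,reconnect".format(
        pipe_path(pipe_name, server))


local_path = pipe_path("w2k8com1")              # \\.\pipe\w2k8com1
remote_path = pipe_path("w2k8com1", "HVHOST1")  # \\HVHOST1\pipe\w2k8com1
```

The only difference between the local and remote forms is the server name in place of the dot.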

Fig 1 - VM settings 

Within the virtual machine itself I now need to enable DEBUG mode and tell the kernel to use COM1 at 115200 baud, so from an elevated command prompt enter the following commands:

bcdedit /debug on
bcdedit /dbgsettings serial debugport:1 baudrate:115200

You can verify debug mode is enabled by typing bcdedit – you should see the debug setting set to Yes in the summary.

You can verify the debug settings by typing bcdedit /dbgsettings – you should see the settings entered above reported.
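
If you are checking this on a number of guests, the verification can be scripted. The Python sketch below parses text in the shape I would expect "bcdedit /dbgsettings" to print; the sample output is an assumption typed from memory rather than captured from a real machine, and the function name is my own.

```python
# Assumed shape of the output from "bcdedit /dbgsettings" (not captured from a real system).
SAMPLE_OUTPUT = """\
debugtype               Serial
debugport               1
baudrate                115200
The operation completed successfully.
"""

EXPECTED_KEYS = ("debugtype", "debugport", "baudrate")


def parse_dbgsettings(text):
    """Return a dict of the serial debug settings found in the given output."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split into name and value on whitespace
        if len(parts) == 2 and parts[0].lower() in EXPECTED_KEYS:
            settings[parts[0].lower()] = parts[1].strip()
    return settings


settings = parse_dbgsettings(SAMPLE_OUTPUT)
matches = (settings.get("debugtype", "").lower() == "serial"
           and settings.get("debugport") == "1"
           and settings.get("baudrate") == "115200")
```

Here matches is True when the reported settings agree with the two commands entered above.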

For a more detailed look at the options available, check out this MSDN page.

Set the symbols path on the host – the easiest method is to create a system environment variable named _NT_SYMBOL_PATH and then enter a string allowing a local cache and the upstream server as Microsoft’s public symbol server:

srv*C:\Symbols.pub*http://msdl.microsoft.com/download/symbols
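
The string is just three parts joined by asterisks: the "srv" keyword, the local cache folder and the upstream symbol server. A minimal Python sketch of how it is composed (the function name is my own):

```python
def symbol_path(cache_dir, server_url):
    """Compose a symbol server path: srv*<local cache>*<upstream server>."""
    return "*".join(["srv", cache_dir, server_url])


public_symbols = symbol_path(
    "C:\\Symbols.pub",
    "http://msdl.microsoft.com/download/symbols",
)
```

Changing the first argument is all that is needed to keep the cache on a different drive.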

 

Now start up an elevated WinDbg on the host (if it is not elevated you get an access denied message trying to connect to the named pipe).

Click File / Kernel Debug
On the COM tab set the fields as below and then click OK:
- Baud Rate = 115200
- Port = \\.\pipe\w2k8com1
- Pipe = [checked]
Click Yes on the prompt to save the workspace, and now the debugger will sit waiting for the debuggee to send messages or for the user to instruct it to break in.

Note that we can’t break into the VM yet as we haven’t started in DEBUG mode – so reboot the VM and watch the debugger window, you will get something like this:

Microsoft (R) Windows Debugger Version 6.12.0001.591 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.
Opened
\\.\pipe\w2k8com1
Waiting to reconnect...
Connected to Windows Server 2008/Windows Vista 6001 x86 compatible target at (Tue Dec 22 10:31:07.543 2009 (UTC + 1:00)), ptr64 FALSE
Kernel Debugger connection established.
Symbol search path is: srv*C:\Symbols.pub*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Server 2008/Windows Vista Kernel Version 6001 MP (1 procs) Free x86 compatible
Built by: 6001.18000.x86fre.longhorn_rtm.080118-1840
Machine Name:
Kernel base = 0x81614000 PsLoadedModuleList = 0x8172bc70
System Uptime: not available
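
The “Built by” line packs quite a lot of information into one string. As a rough guide to reading it, this Python sketch splits it into build number, revision, architecture, free/checked type, branch and build timestamp. The field interpretation is my own reading of the format and the function name is made up, so treat it as a sketch rather than a specification.

```python
from datetime import datetime


def parse_build_string(build_string):
    """Split a string such as 6001.18000.x86fre.longhorn_rtm.080118-1840."""
    build, revision, flavor, branch, stamp = build_string.split(".")
    return {
        "build": int(build),
        "revision": int(revision),
        "arch": flavor[:-3],       # e.g. x86 or amd64
        "type": flavor[-3:],       # fre (free) or chk (checked)
        "branch": branch,
        "built": datetime.strptime(stamp, "%y%m%d-%H%M"),
    }


info = parse_build_string("6001.18000.x86fre.longhorn_rtm.080118-1840")
```

For the target above this gives build 6001, a free x86 build from the longhorn_rtm branch, compiled on 18th January 2008 at 18:40.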

In WinDbg you can either click Debug / Break or hit CTRL-BREAK on the keyboard. This will freeze the debuggee and give you control, with the following message:

Break instruction exception - code 80000003 (first chance)
*******************************************************************************
*                                                                             *
*   You are seeing this message because you pressed either                    *
*       CTRL+C (if you run kd.exe) or,                                        *
*       CTRL+BREAK (if you run WinDBG),                                       *
*   on your debugger machine's keyboard.                                      *
*                                                                             *
*                   THIS IS NOT A BUG OR A SYSTEM CRASH                       *
*                                                                             *
* If you did not intend to break into the debugger, press the "g" key, then   *
* press the "Enter" key now.  This message might immediately reappear.  If it *
* does, press "g" and "Enter" again.                                          *
*                                                                             *
*******************************************************************************
nt!RtlpBreakWithStatusInstruction:
816cc514 cc              int     3

 

Now you can take a look around the system and see what is started, get a summary of the virtual memory, etc. – however be aware that we probably don’t have many of the symbols cached yet, so it is often useful to enable “noisy” symbol loading and then force a reload of all the modules, this way you get the majority of the load delays out of the way right at the start.

You will see the status *BUSY* in the bottom left corner, and each symbol (.PDB) file being sought – depending on the speed of the Internet connection this can take a few minutes:

0: kd> !sym noisy
noisy mode - symbol prompts on
0: kd> .reload /f
Connected to Windows Server 2008/Windows Vista 6001 x86 compatible target at (Tue Dec 22 10:39:05.287 2009 (UTC + 1:00)), ptr64 FALSE
SYMSRV:  ntkrpamp.pdb from
http://msdl.microsoft.com/download/symbols: 1771779 bytes - copied        
DBGHELP: nt - public symbols 
         c:\symbols.pub\ntkrpamp.pdb\37D328E3BAE5460F8E662756ED80951D2\ntkrpamp.pdb
Loading Kernel Symbols
.
SYMSRV:  halmacpi.pdb from
http://msdl.microsoft.com/download/symbols: 74221 bytes - copied        
DBGHELP: hal - public symbols 
         c:\symbols.pub\halmacpi.pdb\E9990686C33F4FF18CB6A56BB9741C471\halmacpi.pdb
.
SYMSRV:  kdcom.pdb from
http://msdl.microsoft.com/download/symbols: 3804 bytes - copied        
DBGHELP: kdcom - public symbols 
         c:\symbols.pub\kdcom.pdb\60CCDB36D40243EC971CFFA9287A8E141\kdcom.pdb

SYMSRV:  peauth.pdb from
http://msdl.microsoft.com/download/symbols: 181517 bytes - copied        
DBGHELP: peauth - public symbols 
         c:\symbols.pub\peauth.pdb\D56829EF703E42D2BCDD8F3C58FBF5532\peauth.pdb
.
SYMSRV:  c:\symbols.pub\secdrv.pdb\7578144C39C4468394EF84F01549113A3\secdrv.pdb not found
SYMSRV: 
http://msdl.microsoft.com/download/symbols/secdrv.pdb/7578144C39C4468394EF84F01549113A3/secdrv.pdb not found
DBGHELP: secdrv.pdb - file not found
*** ERROR: Module load completed but symbols could not be loaded for secdrv.SYS
DBGHELP: secdrv - no symbols loaded
.
SYMSRV:  tcpipreg.pdb from
http://msdl.microsoft.com/download/symbols: 11607 bytes - copied        
DBGHELP: tcpipreg - public symbols 
         c:\symbols.pub\tcpipreg.pdb\F208CF72A7CC4AF490B5B8039AFE2BA31\tcpipreg.pdb

Loading User Symbols

Loading unloaded module list
....

0: kd> !sym quiet
quiet mode - symbol prompts on

Now we have symbols sorted, we can start to look around - !vm will give you a virtual memory overview and the list of processes with their virtual sizes in pages and Kb:

0: kd> !vm
*** Virtual Memory Usage ***
Physical Memory:      130724 (    522896 Kb)
Page File: \??\C:\pagefile.sys
   Current:   1048576 Kb  Free Space:   1048572 Kb
   Minimum:   1048576 Kb  Maximum:      4194304 Kb
Available Pages:       61283 (    245132 Kb)
ResAvail Pages:       106796 (    427184 Kb)
Locked IO Pages:           0 (         0 Kb)
Free System PTEs:     427195 (   1708780 Kb)
Modified Pages:         2429 (      9716 Kb)
Modified PF Pages:      2424 (      9696 Kb)
NonPagedPool Usage:        0 (         0 Kb)
NonPagedPoolNx Usage:   3870 (     15480 Kb)
NonPagedPool Max:      95231 (    380924 Kb)
PagedPool 0 Usage:      3867 (     15468 Kb)
PagedPool 1 Usage:      1906 (      7624 Kb)
PagedPool 2 Usage:        41 (       164 Kb)
PagedPool 3 Usage:        24 (        96 Kb)
PagedPool 4 Usage:        77 (       308 Kb)
PagedPool Usage:        5915 (     23660 Kb)
PagedPool Maximum:    523264 (   2093056 Kb)
Session Commit:         2427 (      9708 Kb)
Shared Commit:          5539 (     22156 Kb)
Special Pool:              0 (         0 Kb)
Shared Process:         1629 (      6516 Kb)
PagedPool Commit:       5922 (     23688 Kb)
Driver Commit:          1801 (      7204 Kb)
Committed pages:       59129 (    236516 Kb)
Commit limit:         382336 (   1529344 Kb)

Total Private:         35993 (    143972 Kb)
         0284 lsass.exe         4883 (     19532 Kb)
         04cc svchost.exe       2889 (     11556 Kb)
         042c svchost.exe       2733 (     10932 Kb)
         01e0 svchost.exe       2718 (     10872 Kb)
         06d8 vmicsvc.exe       2047 (      8188 Kb)
         07a4 ntfrs.exe         1880 (      7520 Kb)
         03f4 LogonUI.exe       1648 (      6592 Kb)
         054c svchost.exe       1544 (      6176 Kb)
         06b4 spoolsv.exe       1245 (      4980 Kb)
         03fc svchost.exe       1171 (      4684 Kb)
         0440 SLsvc.exe         1100 (      4400 Kb)
         0748 dfsrs.exe          953 (      3812 Kb)
         0758 dns.exe            876 (      3504 Kb)
         0474 svchost.exe        716 (      2864 Kb)
         06cc vmicsvc.exe        706 (      2824 Kb)
         0710 vmicsvc.exe        633 (      2532 Kb)
         06ec vmicsvc.exe        633 (      2532 Kb)
         06fc vmicsvc.exe        631 (      2524 Kb)
         039c svchost.exe        600 (      2400 Kb)
         027c services.exe       587 (      2348 Kb)
         0640 taskeng.exe        565 (      2260 Kb)
         0774 ismserv.exe        527 (      2108 Kb)
         0358 svchost.exe        506 (      2024 Kb)
         01b8 svchost.exe        483 (      1932 Kb)
         0354 dfssvc.exe         479 (      1916 Kb)
         0420 svchost.exe        410 (      1640 Kb)
         028c lsm.exe            406 (      1624 Kb)
         01e8 csrss.exe          395 (      1580 Kb)
         0004 System             381 (      1524 Kb)
         0214 csrss.exe          337 (      1348 Kb)
         024c winlogon.exe       321 (      1284 Kb)
         021c wininit.exe        302 (      1208 Kb)
         04b0 svchost.exe        260 (      1040 Kb)
         01bc svchost.exe        222 (       888 Kb)
         020c svchost.exe        134 (       536 Kb)
         01a4 smss.exe            72 (       288 Kb)
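
Two things are worth knowing when reading this list: the first column is the process ID in hexadecimal, and the Kb figure is simply the page count multiplied by the 4 Kb page size of this x86 target. The Python sketch below parses one line of the per-process list and checks that relationship; the regular expression is written against the output shown here and the function name is my own.

```python
import re

PAGE_SIZE_KB = 4  # page size on this x86 target

# Matches lines such as "0284 lsass.exe  4883 ( 19532 Kb)"
PROCESS_LINE = re.compile(
    r"^\s*([0-9a-fA-F]{1,8})\s+(\S+)\s+(\d+)\s+\(\s*(\d+)\s+Kb\)\s*$")


def parse_vm_process_line(line):
    """Return pid, image, pages and kb for one process line, or None if no match."""
    match = PROCESS_LINE.match(line)
    if match is None:
        return None
    pid_hex, image, pages, kb = match.groups()
    return {
        "pid": int(pid_hex, 16),  # the debugger prints the PID in hex
        "image": image,
        "pages": int(pages),
        "kb": int(kb),
    }


entry = parse_vm_process_line("         0284 lsass.exe         4883 (     19532 Kb)")
```

So lsass.exe here is PID 644 in decimal, with 4883 private pages, which is 19532 Kb.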

As this is a live debug, we have access to all physical memory, so we can set the context to a specific process and look in its user mode space – first we need the process object reference, so let’s pick on LSASS.EXE:

0: kd> !process 0 0 lsass.exe
PROCESS 884aed90  SessionId: 0  Cid: 0284    Peb: 7ffdd000  ParentCid: 021c
    DirBase: 1f7c90e0  ObjectTable: 8eddfb28  HandleCount: 943.
    Image: lsass.exe
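
The value we need from this output is the one straight after “PROCESS”, which is the address of the process object; Cid is the process ID and ParentCid is the parent's process ID, both in hexadecimal. As an illustration, this Python sketch pulls those fields out of the header line and builds the context-switch command from the address. The regular expression is written against the output shown here and the function name is my own.

```python
import re

HEADER = re.compile(
    r"PROCESS\s+([0-9a-f]+)\s+SessionId:\s+(\S+)\s+Cid:\s+([0-9a-f]+)"
    r"\s+Peb:\s+([0-9a-f]+)\s+ParentCid:\s+([0-9a-f]+)",
    re.IGNORECASE)


def parse_process_header(line):
    """Extract the process object address and IDs from a !process header line."""
    match = HEADER.search(line)
    if match is None:
        return None
    address, session, cid, peb, parent = match.groups()
    return {
        "address": address,           # process object address, used by .process
        "session": session,
        "pid": int(cid, 16),          # Cid is printed in hex
        "peb": peb,
        "parent_pid": int(parent, 16),
    }


header = parse_process_header(
    "PROCESS 884aed90  SessionId: 0  Cid: 0284    Peb: 7ffdd000  ParentCid: 021c")
switch_command = ".process /p /r " + header["address"]
```

The parent ID 021c matches wininit.exe in the !vm list earlier, which is what you would expect for LSASS.EXE.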

Now we know the process object we can switch the debugger context to this process and force a reload of the symbols at the same time:

0: kd> .process /p /r 884aed90
Implicit process is now 884aed90
.cache forcedecodeuser done
Loading User Symbols
................................................................
....................

We can take a look at the Process Environment Block to see what modules are loaded, the command line, window title and environment the process was started with, using the !peb command:

0: kd> !peb
PEB at 7ffdd000
    InheritedAddressSpace:    No
    ReadImageFileExecOptions: No
    BeingDebugged:            No
    ImageBaseAddress:         00700000
    Ldr                       77164cc0
    Ldr.Initialized:          Yes
    Ldr.InInitializationOrderModuleList: 00361500 . 03959380
    Ldr.InLoadOrderModuleList:           00361480 . 03959370
    Ldr.InMemoryOrderModuleList:         00361488 . 03959378
            Base TimeStamp                     Module
          700000 47918d7c Jan 19 06:41:16 2008 C:\Windows\system32\lsass.exe
        770a0000 4791a7a6 Jan 19 08:32:54 2008 C:\Windows\system32\ntdll.dll
        76d40000 4791a76d Jan 19 08:31:57 2008 C:\Windows\system32\kernel32.dll
        76800000 4791a64b Jan 19 08:27:07 2008 C:\Windows\system32\ADVAPI32.dll

        73700000 4549bda2 Nov 02 10:42:58 2006 C:\Windows\system32\rasadhlp.dll
        73580000 4791a74a Jan 19 08:31:22 2008 C:\Windows\system32\rpchttp.dll
        734d0000 4791a6ba Jan 19 08:28:58 2008 C:\Windows\system32\dssenh.dll
        736e0000 4791a775 Jan 19 08:32:05 2008 C:\Windows\system32\cscapi.dll
    SubSystemData:     00000000
    ProcessHeap:       00360000
    ProcessParameters: 00360d90
    CurrentDirectory:  'C:\Windows\system32\'
    WindowTitle:  'C:\Windows\system32\lsass.exe'
    ImageFile:    'C:\Windows\system32\lsass.exe'
    CommandLine:  'C:\Windows\system32\lsass.exe'
    DllPath:      'C:\Windows\system32;C:\Windows\system32;C:\Windows\system;C:\Windows;.;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\'
    Environment:  003607e8
        ALLUSERSPROFILE=C:\ProgramData
        CommonProgramFiles=C:\Program Files\Common Files
        COMPUTERNAME=W2K8
        ComSpec=C:\Windows\system32\cmd.exe
        FP_NO_HOST_CHECK=NO
        OS=Windows_NT
        Path=C:\Windows\System32
        PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
        PROCESSOR_ARCHITECTURE=x86
        PROCESSOR_IDENTIFIER=x86 Family 6 Model 23 Stepping 6, GenuineIntel
        PROCESSOR_LEVEL=6
        PROCESSOR_REVISION=1706
        ProgramData=C:\ProgramData
        ProgramFiles=C:\Program Files
        PUBLIC=C:\Users\Public
        SystemDrive=C:
        SystemRoot=C:\Windows
        TEMP=C:\Windows\TEMP
        TMP=C:\Windows\TEMP
        USERNAME=SYSTEM
        USERPROFILE=C:\Windows\system32\config\systemprofile
        windir=C:\Windows

 

To understand what difference it makes to set the context to a specific process, here is the output for 1 thread returned by !process 884aed90…

Before (in the kernel context):

        THREAD 888dc030  Cid 0284.0878  Teb: 7ff8e000 Win32Thread: 00000000 WAIT: (Executive) UserMode Non-Alertable
            8879839c  NotificationEvent
        IRP List:
            88477590: (0006,01d8) Flags: 00060070  Mdl: 00000000
        Not impersonating
        DeviceMap                 8b008748
        Owning Process            884aed90       Image:         lsass.exe
        Attached Process          N/A            Image:         N/A
        Wait Start TickCount      14673          Ticks: 71 (0:00:00:01.109)
        Context Switch Count      311            
        UserTime                  00:00:00.000
        KernelTime                00:00:00.000
        Win32 Start Address 0x74de1b33
        Stack Init 926fc000 Current 926fbb80 Base 926fc000 Limit 926f9000 Call 0
        Priority 11 BasePriority 9 PriorityDecrement 0 IoPriority 2 PagePriority 5
        ChildEBP RetAddr  Args to Child             
        926fbb98 816cb3bf 888dc030 888dc0b8 8170c920 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
        926fbbdc 81668cf8 888dc030 88798340 88477590 nt!KiSwapThread+0x44f
        926fbc30 8186058d 8879839c 00000000 00000001 nt!KeWaitForSingleObject+0x492
        926fbc64 81860cba 00000103 88798340 049bea00 nt!IopSynchronousServiceTail+0x251
        926fbd00 8184a98e 882df448 88477590 00000000 nt!IopXxxControlFile+0x6b7
        926fbd34 8166ba7a 00000c08 00000f48 00000000 nt!NtDeviceIoControlFile+0x2a
        926fbd34 770f9a94 00000c08 00000f48 00000000 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 926fbd64)
WARNING: Frame IP not in any known module. Following frames may be wrong.
        049beb10 00000000 00000000 00000000 00000000 0x770f9a94

 

After (in the process context):

        THREAD 888dc030  Cid 0284.0878  Teb: 7ff8e000 Win32Thread: 00000000 WAIT: (Executive) UserMode Non-Alertable
            8879839c  NotificationEvent
        IRP List:
            88477590: (0006,01d8) Flags: 00060070  Mdl: 00000000
        Not impersonating
        DeviceMap                 8b008748
        Owning Process            884aed90       Image:         lsass.exe
        Attached Process          N/A            Image:         N/A
        Wait Start TickCount      14673          Ticks: 71 (0:00:00:01.109)
        Context Switch Count      311            
        UserTime                  00:00:00.000
        KernelTime                00:00:00.000
        Win32 Start Address netlogon!NlWorkerThread (0x74de1b33)
        Stack Init 926fc000 Current 926fbb80 Base 926fc000 Limit 926f9000 Call 0
        Priority 11 BasePriority 9 PriorityDecrement 0 IoPriority 2 PagePriority 5
        ChildEBP RetAddr 
        926fbb98 816cb3bf nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
        926fbbdc 81668cf8 nt!KiSwapThread+0x44f
        926fbc30 8186058d nt!KeWaitForSingleObject+0x492
        926fbc64 81860cba nt!IopSynchronousServiceTail+0x251
        926fbd00 8184a98e nt!IopXxxControlFile+0x6b7
        926fbd34 8166ba7a nt!NtDeviceIoControlFile+0x2a
        926fbd34 770f9a94 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 926fbd64)
        049bea08 770f8444 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
        049bea0c 74f41f34 ntdll!ZwDeviceIoControlFile+0xc (FPO: [10,0,0])
        049beb10 76431693 mswsock!WSPSelect+0x364 (FPO: [Non-Fpo])
        049beb90 753c6b2c WS2_32!select+0x494 (FPO: [Non-Fpo])
        049becc8 753c70c4 DNSAPI!Recv_Udp+0xd7 (FPO: [Non-Fpo])
        049bedb0 753c7591 DNSAPI!Send_AndRecvUdpWithParam+0x1d8 (FPO: [Non-Fpo])
        049bee5c 753c7478 DNSAPI!Send_AndRecv+0x95 (FPO: [Non-Fpo])
        049beee4 753c7334 DNSAPI!Query_Wire+0xed (FPO: [Non-Fpo])
        049beefc 753c30c6 DNSAPI!Query_SingleNamePrivate+0x83 (FPO: [Non-Fpo])
        049bef08 753c4c4b DNSAPI!Query_SingleName+0x1d (FPO: [Non-Fpo])
        049bef2c 753c4929 DNSAPI!Query_AllNames+0xa9 (FPO: [Non-Fpo])
        049bef50 753ca956 DNSAPI!Query_Main+0x7b (FPO: [Non-Fpo])
        049bef6c 753ca8b7 DNSAPI!Query_InProcess+0x68 (FPO: [Non-Fpo])
        049befc8 753c3b34 DNSAPI!Query_PrivateExW+0x30c (FPO: [Non-Fpo])
        049bf004 753cb3a0 DNSAPI!Query_Shim+0xbb (FPO: [Non-Fpo])
        049bf02c 753cbc67 DNSAPI!Query_Private+0x1f (FPO: [Non-Fpo])
        049bf068 753cbc18 DNSAPI!Faz_PrivateEx+0x51 (FPO: [Non-Fpo])
        049bf084 753cbbe6 DNSAPI!Faz_Private+0x18 (FPO: [Non-Fpo])
        049bf0a0 753d97ef DNSAPI!Faz_Simple+0x30 (FPO: [Non-Fpo])
        049bf260 753d3e61 DNSAPI!Faz_CollapseDnsServerListsForUpdate+0x45 (FPO: [Non-Fpo])
        049bf5b8 753dcb94 DNSAPI!Update_Private+0x105 (FPO: [Non-Fpo])
        049bf690 753dcc1c DNSAPI!modifyRecordsInSetPrivate+0x112 (FPO: [Non-Fpo])
        049bf6b8 74da8a9d DNSAPI!DnsModifyRecordsInSet_UTF8+0x20 (FPO: [Non-Fpo])
        049bf8bc 74da8d32 netlogon!NlDnsUpdate+0x252 (FPO: [Non-Fpo])
        049bf8d8 74da9f10 netlogon!NlDnsRegisterOne+0x1c (FPO: [Non-Fpo])
        049bf900 74daccde netlogon!NlDnsScavengeOne+0x77 (FPO: [Non-Fpo])
        049bf94c 74de1bd1 netlogon!NlDnsScavengeWorker+0x1a7 (FPO: [Non-Fpo])
        049bf964 76d84911 netlogon!NlWorkerThread+0x9e (FPO: [Non-Fpo])
        049bf970 770de4b6 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
        049bf9b0 770de489 ntdll!__RtlUserThreadStart+0x23 (FPO: [Non-Fpo])
        049bf9c8 00000000 ntdll!_RtlUserThreadStart+0x1b (FPO: [Non-Fpo])

I marked the first stack frame in kernel mode in both cases so you can see the difference in the debugger’s behaviour – from the “kernel only” view the virtual address 0x770f9a94 is in the user-mode address space of a process (remember 0x00000000-0x7fffffff is user-mode on x86 systems by default).

When we set the context to a specific process we are then able to show its user-mode portion, and the stack looks completely different.
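As a rough illustration of the address split mentioned above, here is a minimal Python sketch that classifies a 32-bit virtual address. It assumes the default 2 GB/2 GB x86 layout (no /3GB-style tuning), and the function and constant names are made up for this example.

```python
# Default x86 layout: 0x00000000-0x7fffffff is user-mode, the rest is kernel-mode.
# This is an assumption; boot options can move the boundary.
USER_MODE_LIMIT_X86 = 0x7FFFFFFF


def address_mode(address: int) -> str:
    """Classify a 32-bit virtual address as 'user' or 'kernel'."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("not a 32-bit address")
    return "user" if address <= USER_MODE_LIMIT_X86 else "kernel"


if __name__ == "__main__":
    # The return address the kernel-only view could not resolve, and one that it could.
    for addr in (0x770F9A94, 0x8166BA7A):
        print(f"{addr:#010x} -> {address_mode(addr)}")
```

The first address is the one the debugger flagged with "Frame IP not in any known module", which is what you would expect for a user-mode address when no process context is set.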

It is possible for thread stacks to be paged out, both user and kernel-mode, in which case the debugger will report this to you as “stack not resident”.

Posted by Paul Adams | 0 Comments

Analyzing User Mode Dumps

So you’ve managed to get a dump from a process… now what?

Dump analysis is a skill that requires a bit of knowledge of how processors work, how to read assembly language, how functions are called, what stacks and heaps are, and so on – it’s way beyond the scope of a blog to give you this set of skills.
But I can show you the guaranteed requirements to make any sense of the .DMP files…

 

Debugger It

A .DMP file is food for a debugger, and only a debugger.
A more correct term for “debugger” is “debugging tool” – as it is you playing the part of the debugger, such a task is way beyond the capacity of a program.

Microsoft have a free debugger for Windows – the Debugging Tools for Windows comes in x86 and x64 flavours depending on your architecture.
In this suite of debuggers the one most commonly used interactively is Windbg.exe, as it has a Windows-style UI and can have its appearance modified to suit through the use of ‘workspaces’.

The debugger itself doesn’t go out of its way to help you as it can be used for so many things – the typical crash dump analysis first step, though, is to enter the command “!analyze -v” to get a summary of the exception information and state of the CPU registers when the program crashed.

If a hang mode dump was produced, I hope it is obvious that !analyze cannot give you anything useful as there is no exception to interpret.

 

Whether you load a crash or hang dump, Windbg will sit looking at you, waiting for input at a prompt looking something like this:
0:001>

Similar to a command prompt, you have to give it something to do in order to get any use out of it… the inevitable question is “what do I type?” which is countered with the question “what do you want to look at?”.

Before we dig into that, the other vital pre-requisite to perform debugging is to have symbol files…

 

Symbolic Representation

While processors work on a limited set of (very, very simplistic) instructions, most programming is not done at this level – it would take too long to produce anything useful and maintenance of the code, however well commented, would be practically impossible.

There are plenty of languages that can be used to create programs – for those which are compiled into a native instruction set (e.g. Intel x86/IA32, AMD64, Itanium) there is the option to create a set of “helper” files, called symbols.

A compiled binary does not often contain much to assist with debugging – it just has the raw (native) code and some initialized variables (such as text strings). It is not easy to work your way through a compiled version of even a simple program and work out what it is doing, because of how we branch or jump depending on the results of various calls or instructions.

Conversely, the source code written by the programmer gives a perfect idea of the flow of execution – even better if it has comments to let the reader know what was going through the programmer’s mind when he wrote it.

 

What the symbols provide is a way to debug the compiled binary with some information about where the entry points for functions are, and optionally what arguments they expect to be passed.
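To make the idea concrete, here is a minimal Python sketch of how a table of function entry points can turn a raw address into a "module!function+offset" label. The three entry points are copied from the 'x' output shown later in this post; the lookup rule (nearest entry point at or below the address) is a simplification of what a real debugger does, and the function name is invented for this example.

```python
from bisect import bisect_right

# Entry points taken from the 'x calc!...' output in this post, sorted by address.
SYMBOLS = sorted([
    (0xFF0CB9B8, "calc!WinMainCRTStartup"),
    (0xFF0CD31C, "calc!WinMain"),
    (0xFF10D08C, "calc!LDunscale"),
])
ADDRESSES = [address for address, _ in SYMBOLS]


def resolve(address: int) -> str:
    """Label an address with the nearest entry point at or below it."""
    index = bisect_right(ADDRESSES, address) - 1
    if index < 0:
        return f"{address:#x}"  # below every known symbol: leave it as a number
    start, name = SYMBOLS[index]
    offset = address - start
    return name if offset == 0 else f"{name}+{offset:#x}"


if __name__ == "__main__":
    print(resolve(0xFF0CD320))
```

It also shows why mismatched symbols are misleading: if the entry points in the table do not belong to the binary that is actually loaded, every label is computed against the wrong starting addresses.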

The symbols do not help with the execution of the program, so they do not have a place inside the binary – this would only bloat it.

In the following examples, the commands entered into the debugger can be seen along with a slightly edited version of their output.
Any colouring present is just for emphasis – the default output is rather dull and monochromatic.

When doing a user-mode debug:

“~*” means “for each thread do the following”
“k” means “show the call stack”

 

Here is an attempt to look at the threads in a debug of calc.exe with no symbols:

0:004> ~* k

   0  Id: f00.1794 Suspend: 1 Teb: 000007ff`fffdd000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`001ed4d8 00000000`76c8c95e USER32!SfmDxSetSwapChainStats+0x1a
00000000`001ed4e0 00000000`ff0b1a8c USER32!GetMessageW+0x2a
00000000`001ed510 00000000`ff0ca00f calc+0x1a8c
00000000`001efc00 00000000`76d8f56d calc+0x1a00f
00000000`001efcc0 00000000`76ec3281 kernel32!BaseThreadInitThunk+0xd
00000000`001efcf0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

   2  Id: f00.9e0 Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`041dfb88 000007fe`fcfe10ac ntdll!ZwWaitForSingleObject+0xa
00000000`041dfb90 00000000`ff0b227e KERNELBASE!WaitForSingleObjectEx+0x9c
00000000`041dfc30 00000000`76d8f56d calc+0x227e
00000000`041dfc70 00000000`76ec3281 kernel32!BaseThreadInitThunk+0xd
00000000`041dfca0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

 

Now, after we set a symbols path and then force a reload, we can make a bit more sense of the thread stacks:

0:004> .sympath srv*C:\Symbols.pub*http://msdl.microsoft.com/download/symbols
0:004> .reload /f
0:004> ~* k

   0  Id: f00.1794 Suspend: 1 Teb: 000007ff`fffdd000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`001ed4d8 00000000`76c8c95e USER32!NtUserGetMessage+0xa
00000000`001ed4e0 00000000`ff0b1a8c USER32!GetMessageW+0x34
00000000`001ed510 00000000`ff0ca00f calc!WinMain+0x1dca
00000000`001efc00 00000000`76d8f56d calc!LDunscale+0x1ea
00000000`001efcc0 00000000`76ec3281 kernel32!BaseThreadInitThunk+0xd
00000000`001efcf0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

   2  Id: f00.9e0 Suspend: 1 Teb: 000007ff`fffd9000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`041dfb88 000007fe`fcfe10ac ntdll!NtWaitForSingleObject+0xa
00000000`041dfb90 00000000`ff0b227e KERNELBASE!WaitForSingleObjectEx+0x79
00000000`041dfc30 00000000`76d8f56d calc!CTimedCalc::WatchDogThread+0x2e
00000000`041dfc70 00000000`76ec3281 kernel32!BaseThreadInitThunk+0xd
00000000`041dfca0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 

The symbols path is read from left to right, and each location is checked for a matching symbols file for each binary the debugger requests.
In the above example we have a local cache of public symbols in C:\Symbols.pub – if a symbols file is not found here then the debugger requests it from http://msdl.microsoft.com/download/symbols and then copies it into the local cache for next time.
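The structure of that path can be sketched in a few lines of Python. This is only an illustration of the "srv*cache*server" form used above, with elements separated by semicolons; it is not how the debugger itself parses the path, and the function name and dictionary keys are invented for the example.

```python
def parse_symbol_path(path: str) -> list:
    """Split a symbol path into its elements, in the order they are searched."""
    elements = []
    for raw in path.split(";"):
        raw = raw.strip()
        if not raw:
            continue
        parts = raw.split("*")
        if parts[0].lower() == "srv":
            if len(parts) == 3:
                # srv*<local cache>*<symbol server>
                elements.append({"kind": "symbol server",
                                 "cache": parts[1],
                                 "server": parts[2]})
            else:
                # Other srv forms exist; this sketch does not interpret them.
                elements.append({"kind": "unparsed", "text": raw})
        else:
            elements.append({"kind": "directory", "path": raw})
    return elements


if __name__ == "__main__":
    example = r"srv*C:\Symbols.pub*http://msdl.microsoft.com/download/symbols"
    for element in parse_symbol_path(example):
        print(element)
```

A plain folder placed in front of the srv element (for instance a hypothetical C:\MySymbols holding your own symbols) would simply appear first in the list, and therefore be searched first.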

Symbols are also specific to versions of binaries, as a change to the source code almost always modifies the entry points for functions – so attempting to use the wrong symbols doesn’t help, and is likely to lead to more confusion.

With symbols present it is possible to “browse” the modules, set breakpoints or unassemble the code using labels rather than meaningless memory addresses.

 

The ‘x’ command is “examine symbols”, and takes a module name followed by a ‘!’ (bang) and a function name – it is fine to use multiple wildcards either side of the bang if you’re not sure of where a function is or its exact name:

0:004> x calc!LDunscale
00000000`ff10d08c calc!LDunscale = <no type information>

0:004> x calc!Win*
00000000`ff0cd31c calc!WinMain = <no type information>
00000000`ff10b0cc calc!WinSqmAddToStream = <no type information>
00000000`ff0c9918 calc!WinSqmAddToStreamEx = <no type information>
00000000`ff10b0d8 calc!WinSqmIncrementDWORD = <no type information>
00000000`ff0cb9b8 calc!WinMainCRTStartup = <no type information>

0:004> x calc!*angle*
00000000`ff125ad0 calc!CContainer::m_iAngle = <no type information>
00000000`ff0c59cc calc!IdcSetAngleTypeDecMode = <no type information>
00000000`ff112080 calc!_imp_GdipFillRectangleI = <no type information>
00000000`ff10b0b4 calc!GdipFillRectangleI = <no type information>
00000000`ff10adf4 calc!cosanglerat = <no type information>
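If it helps to see the wildcard matching spelled out, here is a small Python sketch that reproduces the filtering above with the standard fnmatch module. The symbol list is just the names from the output above, and the case-insensitive comparison is an assumption based on "*angle*" matching "m_iAngle"; the function name is invented for the example.

```python
from fnmatch import fnmatchcase

# Symbol names copied from the 'x' output above.
SYMBOLS = [
    "calc!LDunscale",
    "calc!WinMain",
    "calc!WinSqmAddToStream",
    "calc!WinSqmAddToStreamEx",
    "calc!WinSqmIncrementDWORD",
    "calc!WinMainCRTStartup",
    "calc!CContainer::m_iAngle",
    "calc!IdcSetAngleTypeDecMode",
    "calc!_imp_GdipFillRectangleI",
    "calc!GdipFillRectangleI",
    "calc!cosanglerat",
]


def examine(pattern: str, symbols=SYMBOLS) -> list:
    """Return the symbols matching a module!function wildcard pattern."""
    wanted = pattern.lower()
    return [name for name in symbols if fnmatchcase(name.lower(), wanted)]


if __name__ == "__main__":
    for name in examine("calc!*angle*"):
        print(name)
```

Note how "*angle*" also picks up the two "Rectangle" functions, exactly as in the debugger output, so a wide pattern can return more than you bargained for.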

 

Without symbols there is very little sense to be made of dump files, hence why we can’t provide much help when 3rd party programs are crashing or behaving strangely – without the source or at least the symbols there is not much to go on (especially if code has been optimized for speed or size, making it “not so obvious” to work out what it is doing).

 

Tools of the Trade

The above information isn’t much, but it gives you the environment to be able to start debugging – without any examples of crashed or hung processes that are not contrived, it is tricky to go much further.

One book I can recommend, however, for those interested in the topic:

Advanced Windows Debugging
Mario Hewardt, Daniel Pravat

It has been a few years since I learned assembly language, so I don’t have any specific recommendations for books on this foundation topic, but an introduction to the Intel assembly language would be useful – having a good understanding of the way an Intel-compatible CPU handles code execution and data pointers is essential to be able to debug.

Posted by Paul Adams | 0 Comments

User-mode dump creation (Vista onwards)

The ADPlus method of creating dumps is still valid after Windows Server 2003, however there is an easier way to have the OS create the same data which was introduced in Windows Vista…

 

Hung Jury

For processes that are hung or consuming lots of CPU time, you can use Task Manager to create hang mode dumps – on the Processes tab you simply right-click on the process and from the context menu select “Create Dump File” and wait for the message to appear telling you where the dump was created.

Just like ADPlus in hang mode, this does not terminate the process – the threads are suspended whilst a copy of the process’ user-mode virtual address space is dumped to disk, then they are resumed.

Create Dump File on Win7

 

Is there a Dr (Watson) in the house?

Windows Error Reporting (WER) has replaced Dr Watson as the default user-mode post mortem debugger, and it is configured through the registry (or group policy).

Here are example registry values to make WER retain up to 25 complete application crash dumps in C:\Dumps:

Path: HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps

Name: DumpCount
Type: REG_DWORD
Data: 25

Name: DumpType
Type: REG_DWORD
Data: 2

Name: DumpFolder
Type: REG_EXPAND_SZ
Data: C:\Dumps

After rebooting, when an application crashes you know exactly where to look for the dump files.
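If you would rather script these values than click through regedit, the following Python sketch builds the equivalent "reg add" command lines as plain strings. It assumes the values live under the LocalDumps subkey, as Microsoft's "Collecting User-Mode Dumps" documentation describes, and the function name is invented for this example; it only prints the commands, it does not touch the registry.

```python
# Assumption: the per-machine WER dump settings are read from this subkey.
LOCAL_DUMPS_KEY = r"HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def wer_local_dump_commands(folder: str, count: int = 25, dump_type: int = 2) -> list:
    """Build 'reg add' command lines for the three values described above."""
    values = [
        ("DumpCount", "REG_DWORD", str(count)),
        ("DumpType", "REG_DWORD", str(dump_type)),  # 2 = full dump
        ("DumpFolder", "REG_EXPAND_SZ", folder),
    ]
    return [
        f'reg add "{LOCAL_DUMPS_KEY}" /v {name} /t {kind} /d "{data}" /f'
        for name, kind, data in values
    ]


if __name__ == "__main__":
    for command in wer_local_dump_commands(r"C:\Dumps"):
        print(command)
```

The printed lines can be reviewed and then run from an elevated command prompt. A folder name ending in a backslash would need extra care with the quoting, which this sketch does not handle.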

Posted by Paul Adams | 0 Comments

User-mode dump creation (pre-Vista)

For applications that are crashing or hanging, you will need to have the Debugging Tools for Windows present on the machine, and use the script ADPlus.vbs to attach the command line debugger (cdb.exe) to create dump files.

To keep the examples simple I will assume the tools were installed in the folder C:\Debuggers, and that before entering any adplus.vbs commands you have entered the command:

cd /d C:\Debuggers

NOTE: Do not copy/paste any of the commands from here or from formatted emails, as hyphens can get replaced with different characters which are not recognised and the commands will fail – type out the commands as they appear.
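If you do end up with a pasted command and want to check it, a few lines of Python can spot and repair the substituted characters. This is only a sketch: the list of dash look-alikes is my own selection of common Unicode characters, and the function names are invented for the example.

```python
# Characters that formatted text commonly substitutes for an ASCII hyphen:
# hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar, minus sign.
TYPOGRAPHIC_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
_DASH_TABLE = {ord(ch): "-" for ch in TYPOGRAPHIC_DASHES}


def has_typographic_dash(command: str) -> bool:
    """True if the command contains a dash look-alike that a shell will not accept."""
    return any(ch in TYPOGRAPHIC_DASHES for ch in command)


def normalize_dashes(command: str) -> str:
    """Replace every dash look-alike with a plain ASCII hyphen."""
    return command.translate(_DASH_TABLE)


if __name__ == "__main__":
    pasted = "adplus.vbs \u2013hang \u2013pn FOO.EXE"
    print(has_typographic_dash(pasted))
    print(normalize_dashes(pasted))
```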

 

Crash & Burn

For crashing processes, you attach the debugger and wait for the exception to occur, then the OS passes the exception to it before terminating the process and releasing its resources.

To attach the debugger to a running process FOO.EXE in crash mode, open a command prompt and enter the following:

adplus.vbs -crash -ctcf -pn FOO.EXE -quiet -o C:\Dumps

When the exception is thrown, CDB will catch it and create a new folder in the output folder specified (C:\Dumps in the above example) – the name of the folder will contain the process name, PID and time of the dump so it will always be unique.

 

In some situations “first chance” exceptions are used as an internal message-passing system and we need to ignore these as we are specifically interested in the unhandled (“second chance”) exceptions, meaning the process is definitely on its way down.

The Print Spooler service (spoolsv.exe) is one such case, and we use another switch to ignore the first chance exceptions that get thrown all the time:

adplus.vbs -crash -ctcf -NoDumpOnFirst -pn SPOOLSV.EXE -quiet -o C:\Dumps

 

Hang ’em High

For processes that hang, there is no exception to catch and so attaching the debugger beforehand will not help – instead you need to wait until the hang is occurring and then open a command prompt to enter this command to create a dump of hung process FOO.EXE:

adplus.vbs -hang -ctcf -pn FOO.EXE -quiet -o C:\Dumps

 

Sometimes it can be useful to take several dumps a few seconds apart, as a hung UI might not indicate a hung process – the threads might be very busy in the background and the main thread is not sending any updates for the user until a big job is complete (think of apps which become “Not Responding” after an action is performed such as loading a file, but then after a while return to life).

The “-r” switch takes 2 parameters: the number of dumps to produce, and the number of seconds to wait between dumps.
So the following example will produce 3 dumps at 10-second intervals from FOO.EXE:

adplus.vbs -hang -ctcf -r 3 10 -pn FOO.EXE -quiet -o C:\Dumps
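For anyone who wants to generate these command lines from a script rather than type them, here is a minimal Python sketch that assembles a hang mode command with plain ASCII hyphens. It uses only the switches shown in this post; the function name and parameters are invented for the example, and nothing is executed.

```python
def adplus_hang_command(process_name: str, output_dir: str, repeat=None) -> str:
    """Build an adplus.vbs hang mode command line.

    repeat, if given, is a (number_of_dumps, seconds_between_dumps) pair
    that becomes the two parameters of the -r switch.
    """
    args = ["adplus.vbs", "-hang", "-ctcf"]
    if repeat is not None:
        quantity, interval = repeat
        args += ["-r", str(quantity), str(interval)]
    args += ["-pn", process_name, "-quiet", "-o", output_dir]
    return " ".join(args)


if __name__ == "__main__":
    print(adplus_hang_command("FOO.EXE", r"C:\Dumps"))
    print(adplus_hang_command("FOO.EXE", r"C:\Dumps", repeat=(3, 10)))
```

Building the line from a list keeps the hyphens under your control, which sidesteps the copy/paste problem mentioned in the note earlier.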

 

The other scenario in which you might want to take multiple “hang” dumps is for processes consuming a lot of CPU time (therefore quite clearly not hung) – though typically a better starting position for these cases is to use Process Explorer to look at the call stacks for the busy threads whilst the problem is present and see which modules (and possibly APIs) are involved.

As dumps are like photographs, they can’t give you a good idea of what a thread or process was doing before, or will do after the picture was taken – bear this in mind when trying to make sense of CPU-bound process dumps.

 

Taking the PIDs

If there are multiple instances of a particular process and only 1 is experiencing a crash/hang, then you can specify the unique Process ID (PID) instead of the executable name.

To find out the PID for a particular process you can use any of the following:
- tasklist.exe (at a command prompt)
- Task Manager
- Process Explorer

PIDs are generated at process creation time, so they are not consistent between machines, or even between runs of the same program on the same machine.
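As a sketch of the tasklist route, the Python below pulls the PIDs for a given image name out of the CSV output produced by "tasklist /fo csv /nh". The sample rows are made up for illustration (real output will differ on every machine), and the function name is invented for the example.

```python
import csv
import io

# Hypothetical sample of 'tasklist /fo csv /nh' output.
# Columns: image name, PID, session name, session number, memory usage.
SAMPLE_TASKLIST = '''"svchost.exe","812","Services","0","5,432 K"
"svchost.exe","1234","Services","0","12,288 K"
"lsass.exe","644","Services","0","9,876 K"
'''


def pids_for_image(image_name: str, tasklist_csv: str) -> list:
    """Return the PIDs of every process whose image name matches (case-insensitive)."""
    wanted = image_name.lower()
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 2 and row[0].lower() == wanted:
            pids.append(int(row[1]))
    return pids


if __name__ == "__main__":
    print(pids_for_image("svchost.exe", SAMPLE_TASKLIST))
```

On a real machine the text could come from something like subprocess.run(["tasklist", "/fo", "csv", "/nh"], capture_output=True, text=True).stdout, after which you still have to decide which of the matching PIDs is the one misbehaving.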

e.g. To take a hang dump of process with PID 1234:

adplus.vbs -hang -ctcf -p 1234 -quiet -o C:\Dumps

 

This can be useful in the case of svchost.exe dumps, for example, when there are always going to be multiple instances.

Posted by Paul Adams | 0 Comments

Goodness gracious, great walls of fire

Ask most people what the default rules should look like for a network firewall and they will likely say “drop” or “stealth” – i.e. if the source address:port & destination address:port combination is not matched then the traffic is silently ignored.

This is often perceived as being more secure than rejecting the connection attempts, based on the premises that if you don’t reply:
- no one knows you are there
- the potential attacker is wasting their time with their port scanning

 

The first argument is flawed in that if it is a targeted attack or the address was found through a DNS query then your presence is already known – also there are methods to tell the difference between a lack of response (a firewall is stealthing the target) and a response indicating the packet could not be routed (the address you sent a packet to does not exist).

The second argument is somewhat pointless as intruders (in deference to true “hackers” I will not use that term) are not sitting at their machines madly mashing the keyboard like in sci-fi movies – a program is simply rotating through IP address ranges and probing specific ports for access, so making this take 1 minute to complete an iteration rather than 1 second makes no difference in the end.

 

So, is stealth pointless?

For “dirty” (public) interfaces, I would still use drop rather than reject, mainly as it costs nothing and saves upstream bandwidth (albeit a tiny amount).

However, for internal interfaces (even on perimeter firewalls) I would always, always recommend reject and never drop.

The reason being that the level of security is identical, but your machines’ performance is not adversely affected – you would be treating your own clients & servers as “potential intruders” if you use stealth on the inside, so an incorrectly configured firewall (it happens a lot) or software misconfiguration can lead to long delays caused by timeouts.

 

It also makes troubleshooting network traces from machines having communication issues that much easier – a packet being sent but getting no reply does not tell you whether it never made it there (routing or firewall issue) or the receiving end was having a problem (resource exhaustion, hung server, etc.).

Conversely, a packet instantly getting a rejection from a gateway tells you your rules are incorrect, and allows the thread that caused the packet to be sent to continue without a delay.
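The difference is easy to see from the client side. Here is a minimal Python sketch of a TCP probe that distinguishes an accepted connection, an immediate refusal and silence. It assumes the rejection arrives as a TCP reset (which Python reports as ConnectionRefusedError); a firewall that rejects with an ICMP message instead would surface as a different OSError, which this sketch deliberately lets propagate. The function name is invented for the example.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connection and report how the other end (or a firewall) responded."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # An explicit refusal came back: the caller knows straight away.
        return "rejected"
    except socket.timeout:
        # Nothing came back before the timeout: dropped, lost or a dead host.
        return "no reply"


if __name__ == "__main__":
    # Port 9 on the local machine is usually closed, so this normally refuses at once.
    print(probe("127.0.0.1", 9, timeout=1.0))
```

With a reject rule the call returns almost instantly, whereas with a drop rule the calling thread sits there for the full timeout, which is exactly the delay your internal clients end up paying.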

 

The next time you’re setting up internal firewalls in your infrastructure, consider this and it might save you a lot of performance issues on your clients or delays waiting for your sysadmin to navigate the intricacies of the change control management scheme your company set up.

Bear in mind also that some firewalls use “Stateful Packet Inspection” (SPI) to look at the traffic passing through and potentially block it based on it “not looking right” even if the port rules would allow the connection – changes to protocols, including additional functionality, can make certain types of traffic (RPC, I’m looking at you) suddenly stop at the border until you get the firewall updated.

Posted by Paul Adams | 0 Comments

Windows System Resource Manager (WSRM) – does exactly what it says on the tin

Originally introduced in Enterprise and Datacenter editions of Windows Server 2003, this feature is now in-box for Standard and upwards SKUs of Windows Server 2008.

As with other features, it is added through Server Manager / Features > Add Feature, and is cunningly named “Windows System Resource Manager” – note that it has a pre-requisite of the “Windows Internal Database” feature which it will install for you too.

A shortcut is also placed in the Administrative Tools folder – when you fire it up you specify which server you are managing (default is Local Computer).

 

So what does it do?

The service uses Windows Management Instrumentation to monitor running processes and keep an eye on their virtual and physical (working set) memory size, plus the amount of overall processor power each can use as an upper limit (note this is not a reservation, but a QoS-type guarantee).

There is one Managing Policy which is active on the system at any given time – you can, however, create as many as you like under “Resource Allocation Policies” in the MMC.

A policy contains, on its Resources tab, references to “Process Matching Criteria” values – in the properties of each of these references there are tabs for General and Memory which are of interest.

The General tab determines how the processor power not considered “residual” is to be shared assuming everything wants to use all available power – to prevent hanging the server or crippling services beyond belief the Residual criteria may not be removed and has a minimum of 1% (this comes from the implicit inclusion of “Equal_Per_Process”).

On the Memory tab of the resource allocation you can specify the maximum committed memory (virtual size in MB) and have the choice of either logging an event or stopping the application if it hits the threshold – it is also possible to specify the maximum working set for the process in MB too, which will make the Memory Manager kick in earlier to start it paging to disk (rather than killing the process).

 

The “Process Matching Criteria” I referred to are a set of rules defining files/command line and users/groups to which they apply (specifying no user or group will make the rule apply to all).

Too much theory, not enough screenshots… here is the UI with the properties of a Process Matching Criteria viewed…

fig1 - WSRM Process Matching Criteria

…and here are the (General) properties of the Resource Allocation Policy associated with this criteria (also visible from the Dependencies tab of the criteria itself, so you can see who refers to it)…

fig2 - WSRM Resource Allocation Policy - General

…and the Memory tab looks like this…

fig3 - WSRM Resource Allocation Policy - Memory

So in the somewhat pointless example above, the policy Limit_VM_256MB will give preference to the CPU utilization of calc.exe, for members of Users only, up to 99% of available, but stop the process if its virtual size hits 256MB.

(Note that CPU utilization management only kicks in when the total reaches 70% – there is no point in restricting this resource if it’s far from being in contention.)

The screenshots were taken from Windows Server 2008 R2.

A more detailed look into WSRM is available on TechNet – and if you have added the feature then the help topics are pretty useful from within the MMC itself (for instance I didn’t go into the Profiling Policy, Calendar, Accounting, Computer Groups).

Posted by Paul Adams | 0 Comments

Hyper-V Virtual Networks

The most common questions that I get on Hyper-V setups relate to the networking configuration, and it seems to be a common thing to get wrong, so I’ll try to go through the 3 types of virtual network we have, and how they differ.

 

A private network can only be used by the child partitions, so consider this a “switch for a purely virtual environment”.

An internal network is the same as a private network, except that the parent partition (physical host) acquires a virtual network adapter which is automatically connected to this virtual switch.

Neither of these 2 types of virtual network requires a physical network adapter – so if you are working with a lab or test environment then it’s perfectly fine to have no NICs whatsoever.

 

The third type of virtual network is the external network – this requires a physical NIC which Hyper-V will now take ownership of and unbind every protocol except the “Microsoft Virtual Network Switch Protocol”.

This network type allows communication between any partition connected to it and physical machines on the network connected to the physical NIC.

The physical NIC has now effectively been converted into a switch, which is why there should be nothing other than the “Microsoft Virtual Network Switch Protocol” bound to it – a common mistake here is to think “ah, the host doesn’t have any IPv4 settings, I’ll manually re-enable this…” – do not do this.

Similar to the internal network, when you create an external network the parent is given a new virtual network adapter which is connected to this virtual switch.

The automatically-created virtual network adapters for external networks will be given the original protocol configuration of the physical NIC, so in this respect the network adapter as seen by the parent partition OS has been “virtualized”.

If the external network is dedicated for child partitions (recommended configuration – the host should have its own management interface which is not associated with Hyper-V at all) then it is perfectly safe to disable the virtual network adapter associated with the external network (note, do not disable the adapter which is the external network).

 

Take a look at the following diagram, with some explanatory text below:

Hyper-V Virtual Networks

The 1 physical server here, Doc, is the Hyper-V host/parent partition which has 2 physical NICs present: NIC1 and NIC2.

NIC1 has IP settings bound on it and it is not used with any external network – this would be the management interface so we can communicate with the parent partition even if we are reconfiguring Hyper-V or have the hypervisor not running.

NIC2 has the icon of a switch because it has been taken over by Hyper-V and now (just like a regular switch) does not have any IP protocol bound to it – the parent partition has had a virtual NIC created which is connected to this network (this is the one safe to disable if the interface should be for the child partitions only).

Virtual machines Sneezy and Happy are able to communicate with the real world through the external network.

 

In addition to the external network, the blue switch represents an internal network – this creates a new virtual network adapter in the parent partition which is used to allow communication between Sleepy, Bashful, Doc and Dopey (as I decided to multi-home Dopey on an internal and private network in this example).

 

The red switch is for a private network – Doc does not get to connect to this switch so direct communication with Grumpy is not possible except from Dopey.

If Dopey doesn’t have any kind of routing or proxy service present, nothing other than Dopey can talk to Grumpy, and vice versa.

This means the parent partition OS sees a total of 4 network adapters – the 2 representing the physical NICs, 1 for the external network and 1 for the internal network.
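The diagram can also be written down as a small Python model, which makes the adapter count and the "who can talk to whom" rules easy to check. This is a sketch of the example above only: it assumes the parent's virtual adapter on the external network is left enabled, it ignores physical machines beyond NIC2, and the class and function names are invented for the example.

```python
from dataclasses import dataclass, field

PARENT = "Doc"
PHYSICAL_NICS = ["NIC1", "NIC2"]


@dataclass
class VirtualNetwork:
    name: str
    kind: str  # "private", "internal" or "external"
    guests: set = field(default_factory=set)

    def parent_connected(self) -> bool:
        # Internal and external networks give the parent a virtual adapter;
        # a private network does not.
        return self.kind in ("internal", "external")


NETWORKS = [
    VirtualNetwork("external", "external", {"Sneezy", "Happy"}),
    VirtualNetwork("internal", "internal", {"Sleepy", "Bashful", "Dopey"}),
    VirtualNetwork("private", "private", {"Dopey", "Grumpy"}),
]


def parent_adapter_count() -> int:
    """Physical NICs plus one virtual adapter per network the parent joins."""
    return len(PHYSICAL_NICS) + sum(net.parent_connected() for net in NETWORKS)


def direct_peers(machine: str) -> set:
    """Machines sharing at least one virtual switch with the given machine."""
    peers = set()
    for net in NETWORKS:
        members = set(net.guests)
        if net.parent_connected():
            members.add(PARENT)
        if machine in members:
            peers |= members - {machine}
    return peers


if __name__ == "__main__":
    print(parent_adapter_count())
    print(sorted(direct_peers("Grumpy")))
```

Grumpy's only peer is Dopey, and Doc never appears on the private switch, which matches the description above.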

 

Now for some screenshots from my home (W2K8R2) setup, where the host has 2 NICs: 1 for management of the host and 1 dedicated for the child partitions:

fig1 - Summary of NICs in parent

Note that I gave the network adapters in the host OS meaningful names BEFORE I created any virtual networks – this is personal taste but makes administration so much easier than dealing with “Local Area Connection #4” and trying to figure out what that is and if it’s safe to disable it.

You can see I disabled the virtual NIC which Hyper-V created for the parent when I made the external network, as it’s dedicated for child partition use.

fig2 - Hyper-V Virtual Network Manager

When I created the external network, I named it based on the NIC which is associated with the network (luckily the machine has 2 different brand NICs onboard to make this easier).

fig3 - Properties of NIC1

Here you can see the physical NIC owned by Hyper-V for the external network should have nothing other than the “Microsoft Virtual Network Switch Protocol” bound to it – this is “just a switch” now.

fig4 - Properties of NIC2

Here are the properties of the other physical NIC which is used for management of the host.

 

Do not toggle the binding for the “Microsoft Virtual Network Switch Protocol” on any interface manually. The Hyper-V Virtual Network Manager UI uses this flag to see if a physical NIC is already associated with an external network before attempting to create one. This has tripped up several people in the past, and what you get is something like the following error:

Error applying new virtual network changes
Setup switch failed
The switch could not bind to {interface name} because it is already bound to another switch.

If you get the option to create an external network on an interface but get this error, this is the only time you should remove that binding manually and retry.

 

This is all just about the creation of virtual networks – you still have to go into the settings of each of the virtual machines and give them a network adapter (legacy or synthetic) for each of the virtual networks it should connect with simultaneously, selected through a drop-down list.

Neither Hyper-V nor Virtual Server before it allows “host drive connection”, “USB device redirection” or “drag & drop of files”, even with Integration Services (or VM Additions) installed – such features would be a potential security hole, so you need to configure a common network between the parent and child partitions if you want to transfer files in & out.

 

Possible workarounds if you really don’t want to set up a network to transfer files between the parent and child partitions:

- create an ISO on the parent with the files to copy into the child, and mount it as a virtual DVD

- with the VM shut down, mount the VHD file from the child partition on Windows 7 / Server 2008 R2 directly through Disk Management (this is also useful for getting MEMORY.DMP files out of VMs that are bugchecking during boot)

 

A final note on Hyper-V networking – there is no virtual DHCP server, so you need to either set up your own, use an existing one (if using an external network) or assign IPv4 addresses manually.

I tend to use the 3 different private subnets to easily identify which type of network the machine is meant to connect with, and also avoid potential disasters if I accidentally connect to the wrong one:
- 172.16.0.1 thru 172.31.255.254 for PRIVATE networks
- 192.168.0.1 thru 192.168.255.254 for INTERNAL networks
- 10.0.0.1 thru 10.255.255.254 for EXTERNAL networks
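
As a rough illustration of that convention, the Python sketch below maps an IPv4 address to the type of virtual network it is meant for. Only the three ranges come from the list above; the function name and the labels are invented for this example.

```python
import ipaddress

# The three private address blocks, labelled per the convention above.
NETWORK_TYPES = {
    "PRIVATE": ipaddress.ip_network("172.16.0.0/12"),
    "INTERNAL": ipaddress.ip_network("192.168.0.0/16"),
    "EXTERNAL": ipaddress.ip_network("10.0.0.0/8"),
}


def network_type(address):
    """Return the label for an IPv4 address, or None if it is in none of the ranges."""
    ip = ipaddress.ip_address(address)
    for label, network in NETWORK_TYPES.items():
        if ip in network:
            return label
    return None


if __name__ == "__main__":
    for addr in ("172.20.1.5", "192.168.10.20", "10.1.2.3", "8.8.8.8"):
        print(addr, "->", network_type(addr))
```

A check like this could be run against a VM's configured addresses to spot an adapter that has been connected to the wrong type of network.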

 

One last point: just as with physical machines, if you multi-home a VM then check your protocol bindings, network adapter order and gateway/route settings to make sure you avoid performance or security issues.

Posted by Paul Adams | 0 Comments

HTTP.SYS / Cryptographic Services / LSASS.EXE deadlock

A recent case I had brought this issue to my attention, so I thought it useful to share the knowledge…

The problem was that a Windows Server 2008 x64 SP2 server running several websites was failing to start several services during startup, and attempts to logon were stuck at “Applying User Settings…” indefinitely.
Starting in Safe Mode (or Safe Mode with Networking) worked fine.

By the time the case had been opened the symptom had already been removed by taking away some certificates – the server then started correctly when booted normally (and re-adding the certificates did not reintroduce the problem).

 

From a VM image of the server in the problem state I made a complete memory dump when the server had been stuck ~10 minutes during user logon.

From the hang dump I could see the logon was stalled because one of the threads in LSASS.EXE was waiting for the Cryptographic Services service to start – an ALPC message was sent to the Service Control Manager (SCM) to poke the thread when the service was up.

SCM was in the process of starting HTTP.SYS, which held the lock for service startup (preventing the Cryptographic Services service from starting) and HTTP.SYS’s thread requires the services of LSASS.EXE…deadlock.

http-lsass-cryptsvc_deadlock
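
The circular wait can be written down as a small "wait-for" graph. The Python sketch below is only a model of the situation described above, not something extracted from the dump: the node names are labels chosen for this example, and the cycle search is a plain depth-first walk.

```python
# Who is waiting on whom, following the description of the hang above.
# The node names are illustrative labels, not real object names.
WAITS_FOR = {
    "LSASS.EXE thread": ["CryptSvc start"],
    "CryptSvc start": ["SCM service-start lock"],
    "SCM service-start lock": ["HTTP.SYS start"],
    "HTTP.SYS start": ["LSASS.EXE thread"],
}


def find_cycle(graph):
    """Return one cycle in the wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in visiting:
            # We came back to a node on the current path: that is a deadlock.
            return visiting[visiting.index(node):] + [node]
        if node in done:
            return None
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print(" -> ".join(find_cycle(WAITS_FOR)))
```

Removing any one edge breaks the cycle. For instance, if Cryptographic Services is already running, the LSASS.EXE thread has nothing to wait for and `find_cycle` returns `None`.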

This is a classic timing issue – HTTP.SYS is slow to load (for some reason relating to the SSL binding information) and is holding the SCM lock long enough to stumble into the “Mexican stand-off”.

We can iron out the kink by making HTTP.SYS depend on Cryptographic Services (CryptSvc): HTTP.SYS will then always wait for CryptSvc to start first, so when it makes a call into LSASS.EXE it is not holding the lock that prevents LSASS.EXE’s dependency from starting.

This workaround is described here: Computer hangs at Applying Computer Settings or All Automatic Services Will Not Start After Reboot on Windows Server 2008

 

Does this mean that the symptom of “Automatic services failing to start” is always caused by this one issue?
NO

The symptom is unfortunately very generic – there is so much going on with a system startup that there are many ways to encounter a “hang” with totally different root causes.

It is very easy to say “I am experiencing the symptoms described in KB article XXX, but the solution/workaround did not work” – this invariably means that you did not have the problem described by the article.

But if you have a server with certificates installed and SSL is configured (most probably for use by IIS), and you’re running Windows Server 2008 then it’s worth knowing about this issue and trivial workaround.

(Windows Server 2008 R2 does not have the problem, by the way.)

 

Posted by Paul Adams | 1 Comments

Be kind, rewind (but don’t reboot)

One very common belief I have come across is that rebooting Windows somehow “cleans” the system and returns it to normal speed after some performance degradation (and further that reinstalling the OS periodically does some magical cleaning too).

For the most part, this is complete nonsense.

Shutting down Windows will terminate all processes & services and empty the system cache, and starting from clean will cause a system initialization check (was the previous shutdown clean, does the disk need checking or a dump extracting, etc.) followed by a mass struggle for domination of various components wanting to fire up.

See the previous blog entry where I talked about contention – a system startup is a bottleneck, luckily one-off, where all the various parts of the OS plus 3rd party services will want to get started (and then often sit in an idle state for the majority of their lives).

Once this startup procedure is over and the OS is sitting at the authentication prompt (or the desktop, if the user added to the contention by wanting to logon and incur additional load with their logon processes), we now have an empty system cache.

In some (client) environments Superfetch can kick in after the system has been idle a while and start to load in files that it has observed the user requesting in a pattern – this starts to pre-populate the system cache again to remove the delay caused by disk I/O when the file is actually requested.

As file I/O is done, the Cache Manager works with the Memory Manager to keep virtual blocks of files in memory, which allows efficient re-use of files without incurring I/O.
Windows caches file sections, not entire files, for efficiency – and processes reading the same file will use pointers to the same cached file sections.

So a populated system cache is a good thing – unused memory is wasted.
Pages in the system cache age, and pages will get trimmed (paged to disk or freed) based on how long ago they were last accessed (i.e. since the last cache “hit”). If the system needs to free physical memory to satisfy requests for memory allocations then the cache is checked before processes get their working sets trimmed.
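
The idea of ageing and trimming can be shown with a toy model in Python. This is a deliberately simplified sketch and not how the Cache Manager or Memory Manager is implemented: the class, its method names and the (file, offset) keys are all invented for illustration.

```python
from collections import OrderedDict


class ToyCache:
    """Keeps cached file sections ordered by last access; trims the oldest first."""

    def __init__(self):
        self.pages = OrderedDict()  # (file, offset) -> data, oldest access first

    def access(self, key, load):
        """Return the section for key, calling load() only on a cache miss."""
        if key in self.pages:
            self.pages.move_to_end(key)  # a hit makes the section "young" again
            return self.pages[key]
        self.pages[key] = load()  # a miss incurs the (simulated) disk I/O
        return self.pages[key]

    def trim(self, count):
        """Release up to count of the least recently accessed sections."""
        trimmed = []
        while self.pages and len(trimmed) < count:
            key, _ = self.pages.popitem(last=False)
            trimmed.append(key)
        return trimmed


if __name__ == "__main__":
    cache = ToyCache()
    for key in [("a.dll", 0), ("b.dll", 0), ("c.dll", 0), ("a.dll", 0)]:
        cache.access(key, lambda: b"section data")
    print("trimmed:", cache.trim(2))
    print("still cached:", list(cache.pages))
```

In the usage above, the second access to `("a.dll", 0)` is a hit, so when memory is needed it is the sections of `b.dll` and `c.dll` that get trimmed.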

 

In some rare cases, the OS can suffer performance issues after it has been up for a while – however this is not expected or normal, and a reboot to “resolve” this is just masking a problem that should be investigated.

Performance issues most often come from… again… contention.

CPU contention – this can occur if (for example) a multi-processor system has all but 1 CPU stuck in a spinlock state, putting contention on that single CPU (if all CPUs were spinning then the system would be hung, not slow).

There are pools of worker threads in the kernel which deal with queues of work items – if some of these get into a hung state or the queues are backlogged, some, most or all threads in the system can end up in the wait state for much longer than is normal (allowing the queues to build up as time goes by, compounding the problem).

Memory resources come in different flavours for different purposes – I am not talking just physical vs virtual, but things like page table entries (PTEs), paged pool and nonpaged pool.
A system that runs low or out of PTEs will invariably hang.
A system that runs out of paged pool has likely had a leak of some kind in a driver, or has the 3GB switch in use on a busy server. The result can be performance degradation due to constant trimming, a hang, or possibly even a crash if a driver requests pool with “must succeed”.
Nonpaged pool is similar to paged pool, with the exception that it cannot ever be paged out to disk – typically this is used by drivers at device interrupt level. As with paged pool, running out can result in a severe performance drop, a hang or a crash in extreme circumstances.

Locks can end up with lots of waiters, if the current owner holds it longer than normal or there are 2 or more threads constantly grabbing and releasing one – locks are a necessary evil so that we ensure coordinated access to data structures and maintain data integrity, but can lead to scalability issues.
In particularly bad cases a deadlock can be encountered – 2 threads each hold a lock and wait indefinitely for the other’s lock to become available. This can hang some parts of the system, tie up worker threads or slow down the entire system until it hangs.

 

When you start to look at the possible things that can go wrong in an OS, it is more surprising that it works most of the time!

So next time you believe your servers to be going slower than you expect, have a look at the nature of the performance issue:

Is it slow to logon?
- What is the message displayed during the longest delays while you wait for the desktop?

Is it slow to start new processes?
- Is it the same for all processes, or just certain ones?
- Are processes that are already running working at normal speed?
- Does Task Manager show the CPUs are all under constant high load?
- Is something grinding the disk? (If so, is it paging or file I/O?)

 

Process Explorer is a great tool for identifying where CPU time is being spent, from hardware interrupts down to individual threads in processes.

Resource Monitor (for Vista onwards) is great for real-time analysis of file I/O – which process is incurring what amount of read or write I/O against which file objects.

Performance Monitor (PerfMon) is great for logging performance to view statistical data on memory usage, CPU time, network throughput, etc.
If you use PerfMon, the valuable counters are current (and very occasionally peak), not average - very rarely is an averaged value going to be of use for performance analysis.
Current CPU and disk queue length can show a backlog of “work pending”, while System counters like Pool Paged Bytes and Pool Nonpaged Bytes only have meaning if you know what the maximum is (based on system configuration) – Free System PTEs is, however, a useful counter (keeping it over 10,000 is a basic rule of thumb).
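
To see why an averaged value hides what matters, here is a small Python sketch. The sample values are made up for illustration and do not come from a real PerfMon log, and the threshold is an arbitrary choice for the example.

```python
from statistics import mean

# Hypothetical disk queue length samples, one per logging interval.
QUEUE_LENGTH = [0, 0, 1, 0, 0, 24, 31, 0, 0, 0, 1, 0]


def summarize(samples, threshold):
    """Return the average, the peak, and how many samples exceeded threshold."""
    return {
        "average": mean(samples),
        "peak": max(samples),
        "over_threshold": sum(1 for s in samples if s > threshold),
    }


if __name__ == "__main__":
    print(summarize(QUEUE_LENGTH, threshold=2))
```

The average of these samples is under 5, which looks harmless, while the individual samples show two intervals with a serious backlog. Those two intervals are where the investigation should start.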

 

The next time your system feels a bit sluggish, take a look at what “sluggish” is to you, and try to identify what is currently saturated rather than reach for the “Restart” button on the Start menu.

Identifying and fixing a problem is better than ignoring or working around it.

Posted by Paul Adams | 0 Comments