Mark's Blog

  • Hunting Down and Killing Ransomware

    Scareware, a type of malware that mimics antimalware software, has been around for a decade and shows no sign of going away. The goal of scareware is to fool a user into thinking that their computer is heavily infected with malware and that the most convenient way to clean the system is to pay for the full version of the scareware software that graciously brought the infection to their attention. I wrote about it back in 2006 in my blog post The Antispyware Conspiracy, and the fake antimalware of today doesn’t look much different than it did back then, often delivered as kits that franchisees can skin with their own logos and themes. There’s even one labeled Sysinternals Antivirus:

    image

    A change that’s been occurring in the scareware industry over the last few years is that most scareware today also classifies as ransomware. The examples in my 2006 blog post merely nagged you that your system was infected, but otherwise let you continue to use the computer. Today’s scareware prevents you from running security and diagnostic software at the minimum, and often prevents you from executing any software at all. Without advanced malware cleaning skills, the only thing a victim can do with a ransomware-infected system is give in to the blackmailer’s demand for payment.

    In this blog post I describe how different variants of ransomware lock the user out of their computer, how they persist across reboots, and how you can use Sysinternals Autoruns to hunt down and kill most current ransomware variants from an infected system.

    The Prey

    Before you can hunt effectively, you must first understand your prey. Fake-antimalware-type scareware, by far the most common type of ransomware, usually aims at being constantly annoying rather than completely locking a user out of their system. The prevalent strains use built-in lists of executables to determine what they will block, which usually includes most antimalware and even the primary Sysinternals tools. They customarily let the user run most built-in software like Paint, but sometimes will block some of those as well. When they block an executable they display a dialog falsely claiming that it was blocked because of an infection:

    image

    But malware has gotten even more aggressive in the last couple of years, not even pretending to be anything other than the ransomware it is. Take this example, which completely takes over a computer, blocking all access to anything except its own window, and demands an unlock code that the user must purchase by calling the number specified (in this case one with a Russian country code):

    image

    Here’s one that similarly takes over the computer, but forces the user to do some online shopping to redeem the computer’s use (I haven’t investigated to see what amount of purchasing returns the use of the computer):

    image

    And here’s one that I suppose can also be called scareware, because it informs the user that their system harbors child pornography, something that would be horrifying news to most people. The distributor must believe that the fear of having to defend against charges of child pornography will dissuade victims from going to the authorities and convince them to instead pay the requested fee.

    image

    Some ransomware goes so far as to present itself as official government software. Here’s one supposedly from the French police that informs users that pirated movies reside on their computer and they must pay a fine as punishment:

    image

    As far as how these malefactors lock users out of their computer, there are many different techniques in practice. One commonly used by the fake-antimalware variety, like the Security Shield malware shown in an earlier screenshot, is to block the execution of other programs by simply watching for the appearance of new windows and forcibly terminating the owning process. Another technique, used by the online shopping ransomware example pictured above, is to hide any windows not belonging to the malware, thus technically enabling you to launch other software but not to interact with it. A similar approach is for malware to create a full-screen window and to constantly raise the window to the top of the window order, obscuring all other application windows behind it. I’ve also seen more devious tricks, like one sample that creates a new desktop and switches to it, similar to the way Sysinternals Desktops works – but while your programs are still running, you can’t switch to their desktop to interact with them.

    Finding a Position from Which to Hunt

    The first step for cleaning a system of the tenacious grip of ransomware is to find a place from which to perform the cleaning. All of the lock-out techniques make it impossible to interact with a system from the infected account, which is typically its primary administrative account. If the victim system has another administrative account and the malware hasn’t hijacked a global autostart location that infects all accounts, then you’ve gotten lucky and can clean from there. 

    Unfortunately, most systems only have one administrative account, removing the alternate account option. The fallback is to try Safe Mode, which you can reach by pressing F8 during the boot process (reaching Safe Mode is a little more difficult in Windows 8). Most ransomware configures itself to automatically start by creating an entry in the Run or RunOnce key of HKCU\Software\Microsoft\Windows\CurrentVersion (or the HKLM variants), which Safe Mode doesn’t process, so Safe Mode can provide an effective platform from which to clean such malware. A growing number of ransomware samples modify HKCU\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\Shell (or the HKLM location), however, which both Safe Mode and Safe Mode with Networking execute. Safe Mode with Command Prompt overrides the registry shell selection, so it circumvents the startup of the majority of today’s ransomware and is the next fallback position:

    image
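The relationship between these autostart locations and the Safe Mode variants can be captured in a few lines. Here is an illustrative Python sketch, not real registry code: the table of boot modes is an assumption I built from the behavior described above (Run/RunOnce entries are skipped by every Safe Mode variant, while the Winlogon Shell value runs in Safe Mode and Safe Mode with Networking but is overridden by Safe Mode with Command Prompt).

```python
# Illustrative model only: which boot modes process which autostart
# locations, per the behavior described above. Not actual Windows code.
NORMAL = "Normal"
SAFE = "Safe Mode"
SAFE_NET = "Safe Mode with Networking"
SAFE_CMD = "Safe Mode with Command Prompt"

AUTOSTART_MODES = {
    # Run/RunOnce entries are skipped by every Safe Mode variant.
    r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run": {NORMAL},
    r"HKCU\Software\Microsoft\Windows\CurrentVersion\RunOnce": {NORMAL},
    # The Winlogon Shell value runs in Safe Mode and Safe Mode with
    # Networking; Safe Mode with Command Prompt overrides the shell choice.
    r"HKCU\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\Shell":
        {NORMAL, SAFE, SAFE_NET},
}

def would_start(location: str, mode: str) -> bool:
    """Return True if an autostart entry in `location` runs in boot `mode`."""
    return mode in AUTOSTART_MODES[location]
```

So malware hooked only into a Run key never starts in any Safe Mode variant, while a hijacked Shell value survives everything short of Safe Mode with Command Prompt.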

    Finally, if the malware is active even in Safe Mode with Command Prompt, you’ll have no choice but to go offline hunting in an alternate Windows installation. There are a number of options available. If you have Windows 8, creating a Windows To Go installation is ideal, since it is literally a full version of Windows. An alternative is to boot the Windows Setup media and press Shift+F10 to open a command prompt when you reach the first graphical screen:

    image

    You won’t have access to Internet Explorer, and many applications won’t work properly in Windows Setup’s stripped-down environment, but you can run many of the Sysinternals tools. Finally, you can create Windows Preinstallation Environment (WinPE) boot media, an environment similar to that of Windows Setup and the one the Microsoft Diagnostics and Recovery Toolset (MSDaRT) uses.

    The Hunt

    Now that you’ve found your hunting spot, it’s time to select your weapon. The easiest to use is of course off-the-shelf antimalware software. If you’re logged in to an alternate account or Safe Mode you can use standard online-scanning products, many of which are free, like Microsoft’s own Windows Defender. If you’re booted into a different Windows installation, however, then you’ll need to use an offline scanner, like Windows Defender Offline. If the antimalware engine isn’t able to detect or clean the infection, you’ll have to turn to a more precise and manual weapon.

    One utility that enables you to rip the malware’s tendrils off the system is Sysinternals Autoruns. Autoruns is aware of over a hundred places where malware can configure itself to automatically start when Windows boots, a user logs in, or a specific built-in application launches. The way you need to run it depends on what environment you’re hunting from, but in all cases you should run it with administrative rights. Also, Autoruns automatically starts scanning when you start it; you should abort the initial scan by pressing the Esc key, then open the Filter dialog and select the options to verify signatures and to hide all Microsoft entries so that malware will appear more prominently, and restart the scan:

    image

    If you’re logged into a different account from the one that’s infected, then you need to point Autoruns at the infected account by selecting it from the User menu. In this example Autoruns is running in the Fred account, but the one that’s infected is Abby, so I’ve selected the Abby profile:

    image

    If you’ve booted into a different operating system then you need to use Autoruns offline support, which requires you to specify the root of the target Windows installation and the target user profile. Open the Analyze Offline System dialog from the File menu and enter the appropriate directories:

    image

    After Autoruns has scanned the system, you have to spot the malware. As I explain in my Malware Hunting with the Sysinternals Tools presentations, malware often exhibits the following characteristics:

    image

    Of course, since Autoruns just shows autostart configuration and not running processes, some of these attributes are not relevant. Nevertheless, I’ve found in my examination of several dozen variants of current ransomware that all of them satisfy more than one, most commonly by not having a description or company name and by having a random or suspicious file name.
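To make those characteristics concrete, here is a minimal Python sketch of the triage you perform mentally while scanning the Autoruns list. The entry records and the entropy threshold are my own assumptions for illustration; Autoruns itself does no such scoring.

```python
import math
from collections import Counter

def entropy(s: str) -> float:
    """Shannon entropy in bits per character; random names score higher."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def red_flags(entry: dict) -> int:
    """Count the suspicious traits described above for one autostart entry.
    `entry` is a hypothetical record with 'name', 'description', 'company'."""
    flags = 0
    if not entry.get("description"):
        flags += 1          # no description
    if not entry.get("company"):
        flags += 1          # no company name
    base = entry["name"].rsplit(".", 1)[0]
    # Long names with high character entropy look machine-generated.
    if len(base) >= 8 and entropy(base) > 3.5:
        flags += 1          # random-looking file name
    return flags
```

An entry like a fully described, signed Microsoft binary scores zero, while a nameless, description-free executable with a random file name trips every check.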

    One downside to offline scanning is that signature verification doesn’t work properly. This is because Windows uses catalog signing, as opposed to direct image signing, storing signatures in separate catalog files rather than in the images themselves. Autoruns doesn’t process offline catalog files (I’ll probably add that support in the near future), so all catalog-signed images will show up as unverified and highlighted in red. Since most malware doesn’t pretend to be from Microsoft, you can try an initial scan with the option to verify code signatures unchecked. Here’s the result of an offline scan, with signature verification disabled, of a ransomware infection that takes over two autostart locations - see if you can spot them:

    image

    If you are unsure about an image, you can try uploading it to Virustotal.com for analysis by around 40 of the most popular antivirus engines, searching the Web for information, and looking at the strings embedded in the file using the Sysinternals Strings utility.

    The Kill

    Once you’ve determined which entries belong to malware, the next step is to disable them by deselecting the checkboxes of their autostart entries. This allows you to re-enable the entries later if you discover you made a mistake. It doesn’t hurt to also move the malware files, along with any other suspicious files in the same directory as the ones configured to autostart, to another directory. Moving all the files makes it more likely that you’ll break the malware even if you miss an autostart location.

    Next, check to see if your prey is dead by booting the system and logging into the account that was infected. If you still see signs of an infection, you might have missed something in your Autoruns analysis, so repeat the steps. If that doesn’t yield success, the malware may be a more sophisticated strain, for example one that infects the Master Boot Record or that persists across reboots in some other unconventional way. There is also ransomware that goes further and encrypts files, but such variants are relatively rare. Fortunately, ransomware authors are lazy and generally don’t need to go to such lengths to be effective, so a quick analysis with Autoruns is virtually always lethal.

    Happy hunting!

    If you liked this post, you’ll like my two highly-acclaimed cyberthriller novels, Zero Day and Trojan Horse. Watch their exciting video trailers, read sample chapters and find ordering information on my personal site at http://russinovich.com

  • The Case of the Unexplained FTP Connections

    A key part of any cybersecurity plan is “continuous monitoring”, or enabling auditing and monitoring throughout a network environment and configuring automated analysis of the resulting logs to identify anomalous behaviors that merit investigation. This is part of the new “assumed breach” mentality that recognizes no system is 100% secure. Unfortunately, the company at the heart of this case didn’t have a comprehensive monitoring system, so had been breached for some time before updated antimalware signatures cleaned their infection and brought the breach to their attention. Besides highlighting just how weak cybersecurity is at many companies, this case highlights the use of several Sysinternals Process Monitor features, including the Process Tree dialog and one feature many people aren’t aware of, Process Monitor’s ability to monitor network activity.

    The case opened when a network administrator at a South African company contacted Microsoft Services Premier Support and reported that their corporate Exchange server, running on Windows Server 2008 R2, appeared to be making outbound FTP connections. He noticed this only because the company’s installation of Microsoft Forefront Endpoint Protection (FEP) alerted him that it had cleaned a piece of malware it found on the server. Concerned that their network might still be compromised despite the fact that FEP claimed the system was malware-free, he examined the company’s perimeter firewall logs. To his horror, he discovered FTP connections that numbered in the hundreds per day and dated back several weeks. Instead of attempting a forensic examination on his own, he called on Microsoft’s security consulting team, which specializes in helping customers clean up after an attack.

    The Microsoft support engineer assigned the case began by capturing a five-minute Process Monitor trace of the Exchange server. After stopping the trace he opened the Process Tree dialog (under the Tools menu), which shows the parent-child relationships of all the processes that existed at any point in the current trace. He quickly found that around 20 FTP processes had been launched during the collection, each of them short-lived, except for one, which was still active (process 7324 below):

    image

    The engineer looked at the command lines for the FTP processes by selecting them in the tree so that their details appeared at the bottom of the Process Tree dialog. The command lines for half of them bizarrely included just the “-?” argument, which simply brings up FTP help:

    image

    The other half were more interesting, including “-i” and “-s” switches:

    image

    The –i switch has FTP turn off prompting for multiple file transfers, and –s directs FTP to execute the FTP commands listed in a file, in this case a file named “j”. Setting out to discover what “j” contained, he clicked the “Include Process” button at the bottom of the Process Tree dialog so that he could find the process’s file events:

    image

    He searched the resulting filtered trace for “j” and found the file’s location in several of the events:

    image

    He navigated to the C:\Windows\System32\i4333 directory, but the “j” file was gone. That being a dead end, he turned his attention to the FTP process’s parent, Cmd.exe, and looked at its command line. The line was too long and convoluted to easily understand:

    image

    He selected it, typed Ctrl+C to copy it to the clipboard, pasted it into Notepad, and decomposed it into its constituent components, each of which was separated by a “&”. The result looked like this:

    SNAGHTML143cd3bc

    The first instruction has the command prompt create a directory named i4333 and then start creating the contents of the “j” file. The commands it writes into “j” instruct FTP to connect to NUXZb.in.into4.info, log in with the user name “New” and the password “123”, then download all the files on the FTP server that end with “.exe”. After FTP has processed the file, the command prompt deletes “j” and then creates a batch file that executes the downloaded files, first using the Shell to launch them (“start”) and then the Command Prompt.
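The decomposition step itself is mechanical and easy to sketch. The Python below splits a chained Cmd.exe command line on “&” the way the engineer did in Notepad; the sample chain is a simplified, hypothetical one in the same shape as the malware’s (the host name is a placeholder, not the attacker’s server), and a real parser would also need to handle “&&”, “||”, quoting, and escaping.

```python
def decompose(cmdline: str) -> list[str]:
    """Naively split a chained Cmd.exe command line into its constituent
    commands by breaking on '&' and trimming whitespace."""
    return [part.strip() for part in cmdline.split("&") if part.strip()]

# Hypothetical chain with the same structure as the one described above:
# make a working directory, build an FTP script, run it, then clean up.
chain = ("mkdir i4333 & echo open example.invalid>>j & echo bye>>j"
         " & ftp -i -s:j & del j")
```

Running `decompose(chain)` yields the five individual commands, in order, ready to be read one at a time.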

    A quick detour to Whois showed the engineer that the NUXZb hostname was issued by Protected Name Services and didn’t reveal any useful information. The engineer toggled off Process Monitor’s network name resolution and found the outbound FTP connection in the trace to see the IP address the name had resolved to:

    image

    An IP address location lookup on the Web pinpointed the IP address at an ISP in Chicago (the name now resolves to a different IP address), so he concluded the connection was to a server that was also compromised or one the attacker had hosted at the ISP. Finished analyzing the command line, he looked at the contents of the resulting script, D.bat, which was still in the directory and contained this single command:

    image

    Not coincidentally, 134.exe was the executable Forefront had flagged as a remote access Trojan (RAT) in the alerts that the administrator first responded to. The script could therefore not find it, making it seem that the attack – or at least this part of it - had been neutralized by FEP. It also implied that the attack was automated and stuck in a loop trying to activate.

    The engineer next set out to determine how the command-prompt processes were being launched. Looking at their parent processes in the process tree, he learned they were all launched from Sqlserver.exe:

    image

    This obviously wasn’t a good sign, but it wasn’t the worst of it: examining SQL Server’s network activity in the trace, he saw many incoming connections:

    image

    Lookups of the IP address locations placed them in China, Tunisia, Taiwan, and Morocco:

    image

    The SQL Server was being used by an attacker or multiple attackers from around the world in countries known for being cybercriminal safe havens. It was clearly time to flatten the server, but before calling the administrator to give him the bad news and advise him to immediately disconnect the server from the network, he thought he’d spend a few minutes examining the security of the SQL Server. Understanding what had led to the compromise could help the company avoid being compromised the same way again.

    He launched a Microsoft support batch file that checks various SQL Server security settings. The tool ran for a few seconds and then printed its discouraging results: the server had an administrator account with a blank password, was configured for mixed-mode authentication, and allowed stored procedures to launch command prompts via the enablement of the “xp_cmdshell” feature:

    image

    That meant that anyone on the Internet could logon to the server without a password and execute executables – like FTP – to infect the system with their own tools.
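The checks that support batch file performed are straightforward to approximate. This Python sketch takes a hypothetical settings snapshot (the dictionary keys are my invention for illustration, not a real SQL Server interface) and reports the same three findings:

```python
def audit_sql_settings(cfg: dict) -> list[str]:
    """Report the risky settings described above. `cfg` is a hypothetical
    snapshot of server configuration, not an actual SQL Server API."""
    findings = []
    if cfg.get("sa_password") == "":
        findings.append("administrator account has a blank password")
    if cfg.get("auth_mode") == "mixed":
        findings.append("mixed-mode authentication enabled")
    if cfg.get("xp_cmdshell_enabled"):
        findings.append("xp_cmdshell lets stored procedures run commands")
    return findings
```

Fed the configuration found on this server, it returns all three findings; a server with a strong password, Windows authentication, and xp_cmdshell disabled returns an empty list.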

    With the help of Process Monitor and some discussion with the company’s administrator, the support engineer had a solid theory for what had happened: an administrator at the company had installed SQL Server on the company’s Exchange server several weeks prior to the incident. Not realizing the server was on the perimeter, they had opened the SQL Server’s port in the local firewall, left it with a blank admin account, and enabled xp_cmdshell. It goes without saying that even if the server wasn’t on the Internet, that configuration leaves a server without any network security. Not long after, automated malware scanning the Internet for exposed targets had stumbled across the open SQL port, infected the server with malware, and likely enlisted it in a Botnet. FEP signatures for the new malware variant were delivered to the server some time later and removed the infection. The Botnet-enlisting malware was still trying to reintegrate the server when the case with Microsoft support was opened. While the company can’t know how much – if any – of its corporate data was pilfered during the infection, this was a very loud and clear wakeup call.

    You can test your own cybersecurity knowledge by taking my Operation Desolation cybersecurity quiz.

  • Windows Azure Host Updates: Why, When, and How

    Windows Azure’s compute platform, which includes Web Roles, Worker Roles, and Virtual Machines, is based on machine virtualization. It’s the deep access to the underlying operating system that makes Windows Azure’s Platform-as-a-Service (PaaS) uniquely compatible with many existing software components, runtimes and languages, and of course, without that deep access – including the ability to bring your own operating system images – Windows Azure’s Virtual Machines couldn’t be classified as Infrastructure-as-a-Service (IaaS).

    The Host OS and Host Agent

    Machine virtualization of course means that your code - whether it’s deployed in a PaaS Worker Role or an IaaS Virtual Machine - executes in a Windows Server Hyper-V virtual machine. Every Windows Azure server (also called a Physical Node or Host) hosts one or more virtual machines, called “instances”, scheduling them on physical CPU cores, assigning them dedicated RAM, and granting and controlling access to local disk and network I/O.

    The diagram below shows a simplified view of a server’s software architecture. The host partition (also called the root partition) runs the Server Core profile of Windows Server as the host OS and you can see the only difference between the diagram and a standard Hyper-V architecture diagram is the presence of the Windows Azure Fabric Controller (FC) host agent (HA) in the host partition and the Guest Agents (GA) in the guest partitions. The FC is the brain of the Windows Azure compute platform and the HA is its proxy, integrating servers into the platform so that the FC can deploy, monitor and manage the virtual machines that define Windows Azure Cloud Services. Only PaaS roles have GAs, which are the FC’s proxy for providing runtime support for and monitoring the health of the roles.

    image

    Reasons for Host Updates

    Ensuring that Windows Azure provides a reliable, efficient and secure platform for applications requires patching the host OS and HA with security, reliability and performance updates. As you would guess based on how often your own installations of Windows get rebooted by Windows Update, we deploy updates to the host OS approximately once per month. The HA consists of multiple subcomponents, such as the Network Agent (NA) that manages virtual machine VLANs and the Virtual Machine virtual disk driver that connects Virtual Machine disks to the blobs containing their data in Windows Azure Storage. We therefore update the HA and its subcomponents at different intervals, depending on when a fix or new functionality is ready.

    The steps we can take to deploy an update depend on the type of update. For example, almost all HA-related updates apply without rebooting the server. Windows OS updates, though, almost always have at least one patch, and usually several, that necessitate a reboot. We therefore have the FC “stage” a new version of the OS, which we deploy as a VHD, on each server and then the FC instructs the HAs to reboot their servers into the new image.

    PaaS Update Orchestration

    A key attribute of Windows Azure is its PaaS scale-out compute model. When you use one of the stateless virtual machine types in your Cloud Service, whether Web or Worker, you can easily scale the role out and back in just by updating the role’s instance count in your Cloud Service’s configuration. The FC automatically does all the work to create new virtual machines when you scale out and to shut down and remove virtual machines when you scale down.

    What makes Windows Azure’s scale-out model unique, though, is the fact that it makes high-availability a core part of the model. The FC defines a concept called Update Domains (UDs) that it uses to ensure a role is available throughout planned updates that cause instances to restart, whether they are updates to the role applied by the owner of the Cloud Service, like a role code update, or updates to the host that involve a server reboot, like a host OS update. The FC’s guarantee is that no planned update will cause instances from different UDs to be offline at the same time. A role has five UDs by default, though a Cloud Service can request up to 20 UDs in its service definition file. The figure below shows how the FC spreads the instances of a Cloud Service’s two roles across three UDs.

    image

    Role instances can call runtime APIs to determine their UD and the portal also shows the mapping of role instances to UDs. Here’s a cloud service with two roles having two instances each, so each UD has one instance from each role:

    image
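The FC’s spreading behavior can be sketched as a simple round-robin. This is a deliberate simplification of the real allocation algorithm, which also weighs server placement and Fault Domains; the role names are just examples. For the portal example above, two roles with two instances each each land one instance per UD:

```python
def spread_across_uds(roles: dict[str, int],
                      ud_count: int = 5) -> dict[str, list[int]]:
    """Assign each role's instances to update domains round-robin.
    Returns role -> list of UD numbers, one per instance. A simplified
    model of FC placement, not the actual allocator."""
    return {role: [i % ud_count for i in range(count)]
            for role, count in roles.items()}
```

With the default of five UDs, a role only begins reusing UDs once it has more than five instances, which is what bounds how many of its instances a planned update can take down at once.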

    The behavior of the FC with respect to UDs differs for Cloud Service updates and host updates. When the update is one applied by a Cloud Service, the FC updates all the instances of each UD in turn. It moves to a subsequent UD only when all the instances of the previous have restarted and reported themselves healthy to the GA, or when the Cloud Service owner asks the FC via a service management API to move to the next UD.

    Instead of proceeding one UD at a time, the order and number of instances of a role that get rebooted concurrently during host updates can vary. That’s because the placement of instances on servers can prevent the FC from rebooting the servers on which all instances of a UD are hosted at the same time, or even in UD-order. Consider the allocation of instances to servers depicted in the diagram below. Instance 1 of Service A’s role is on server 1 and instance 2 is on server 2, whereas Service B’s instances are placed oppositely. No matter what order the FC reboots the servers, one service will have its instances restarted in an order that’s the reverse of their UDs. The allocation shown is relatively rare since the FC allocation algorithm optimizes by attempting to place instances from the same UD - regardless of what service they belong to - on the same server, but it’s a valid allocation because the FC can reboot the servers without violating the promise that it not cause instances of different UDs of the same role (of a single service) to be offline at the same time.

    image

    Another difference between host updates and Cloud Service updates is that when the update is to the host, the FC must ensure that one instance doesn’t indefinitely stall the forward progress of server updates across the datacenter. The FC therefore allots instances at most five minutes to shut down before proceeding with a reboot of the server into a new host OS and at most fifteen minutes for a role instance to report that it’s healthy from when it restarts. It takes a few minutes to reboot the host, then restart VMs, GAs and finally the role instance code, so an instance is typically offline anywhere between fifteen and thirty minutes depending on how long it and any other instances sharing the server take to shut down, as well as how long it takes to restart. More details on the expected state changes for Web and Worker roles during a host OS update can be found here. Note that for PaaS services the FC manages the OS servicing for guests as well, so a host OS update is typically followed by a corresponding guest OS update (for PaaS services that have opted into updates), which is orchestrated by UD like other cloud service updates.

    IaaS and Host Updates

    The preceding discussion has been in the context of PaaS roles, which automatically get the benefits of UDs as they scale out. Virtual Machines, on the other hand, are essentially single-instance roles that have no scale-out capability. An important goal of the IaaS feature release was to enable Virtual Machines to achieve the same high availability in the face of host updates and hardware failures, and the Availability Sets feature does just that. You can add Virtual Machines to Availability Sets using PowerShell commands or the Windows Azure management portal. Here’s an example cloud service with virtual machines assigned to an availability set:

    image

    Just like roles, Availability Sets have five UDs by default and support up to twenty. The FC spreads instances assigned to an Availability Set across UDs, as shown in the figure below. This allows customers to deploy Virtual Machines designed for high availability, for example two Virtual Machines configured for SQL Server mirroring, to an Availability Set, which ensures that a host update will cause a reboot of only one half of the mirror at a time as described here (I don’t discuss it here, but the FC also uses a feature called Fault Domains to automatically spread instances of roles and Availability Sets across servers so that any single hardware failure in the datacenter will affect at most half the instances).

    image

    More Information

    You can find more information about Update Domains, Fault Domains and Availability Sets in my Windows Azure conference sessions, recordings of which you can find on my Mark’s Webcasts page here. Windows Azure MSDN documentation describes host OS updates here and the service definition schema for Update Domains here.

  • The Case of the Veeerrry Slow Logons

    This case is my favorite kind of case, one where I use my own tools to solve a problem affecting me personally.  The problem at the root of it is also one you might run into, especially if you travel, and demonstrates the use of some Process Monitor features that many people aren’t aware of, making it an ideal troubleshooting example to document and share.

    The story unfolds the week before last when I made a trip to Orlando to speak at Microsoft’s TechEd North America conference. While I was there I began to experience five minute black-screen delays when I logged on to my laptop’s Windows 7 installation:

    image

    I’d typically chalk up an isolated delay like this to networking issues, common at conferences and with hotel WiFi, but I hit the issue consistently switching between the laptop’s Windows 8 installation, where I was doing testing and presentations, and the Windows 7 installation, where I have my development tools. Being locked out of your computer for that long is annoying to say the least.

    The first time I ran into the black screen I forcibly rebooted the system after a couple of minutes because I thought it had hung, but when the delay happened a second time I was forced to wait it out and face the disappointing reality that my system was sick. When I logged off and back on again without a reboot in between, though, I didn’t hit the delay. It only occurred when logging on after a reboot, which I was doing as I switched between Windows 7 and Windows 8. What made the situation especially frustrating was that whenever I rebooted I was always in a hurry to get ready for my next presentation, so had to suffer with the inconvenience for several days before I finally had the opportunity to investigate.

    Once I had a few spare moments, I launched Sysinternals Autoruns, an advanced auto-start management utility, to disable any auto-starting images that were located on network shares. I knew from previous executions of Autoruns on the laptop that Microsoft IT configures several scheduled tasks to execute batch files that reside on corporate network shares, so suspected that timeouts trying to launch them were to blame:

    image

    I logged off and logged back on with fingers crossed, but the delay was still there. Next, I tried logging into a local account to see if this was a machine-wide problem or one affecting just my profile. No delay. That was a positive sign since it meant that whatever the issue was, it would probably be relatively easy to fix once identified.

    My goal now was to determine what was holding up the switch to the desktop. I had to somehow get visibility into what was going on during a logon immediately following a boot. The way that immediately jumped to mind as the easiest was to use Sysinternals Process Monitor to capture a trace of the boot process. Process Monitor, a tool that monitors system-wide file system, registry, process, DLL and network operations, has the ability to capture activity from very early in the boot, stopping its capture only when the system shuts down or you run the Process Monitor user interface. I selected the boot logging entry from the Options menu and opened the boot logging dialog:

    SNAGHTML274998d

    The dialog lets you direct Process Monitor to collect profiling events while it’s monitoring the boot, which are periodic samples of thread stacks. I enabled one-second profiling, hoping that even if I didn’t spot operations that explained the delay, I could get a clue from the stacks of the threads that were active just before or during the delay.

    After I rebooted, I logged on, waited for five minutes looking at a black screen, then finally got to my desktop, where I ran Process Monitor again and saved the boot log. Instead of scanning the several million events that had been captured, which would have been like looking for a needle in a haystack, I used this Process Monitor filter to look for operations that took more than one second, and hence might have caused the slowdown:

    SNAGHTML27bf247

    Unfortunately, the filter cleared the display, dashing my hopes for quickly finding a clue.
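    Even though the filter came up empty in this case, duration-based filtering is a broadly useful triage technique: discard everything that completed quickly and look only at what took a long time. Process Monitor applies it natively; as a rough illustration, the same idea can be sketched in a few lines of Python over hypothetical event records (the field names below are illustrative, not Procmon’s actual format):

    ```python
    # Sketch of a duration-based event filter. Process Monitor does this
    # natively; the record fields below are illustrative, not Procmon's format.

    def slow_events(events, threshold_seconds=1.0):
        """Return only the events that took longer than the threshold."""
        return [e for e in events if e["duration"] > threshold_seconds]

    trace = [
        {"op": "RegQueryKey", "path": r"HKLM\Software",    "duration": 0.0001},
        {"op": "CreateFile",  "path": r"\\server\share\x", "duration": 4.2},
        {"op": "ReadFile",    "path": r"C:\Windows\a.dll", "duration": 0.0003},
    ]

    for e in slow_events(trace):
        print(e["op"], e["path"], e["duration"])  # → CreateFile \\server\share\x 4.2
    ```

    A threshold of one second is generous for most file system and registry operations, which normally complete in microseconds, so anything the filter keeps is worth a close look.
    
    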

    Wondering if perhaps the sequence of processes starting during the logon might reveal something, I opened the Process Tree dialog from the Tools menu. The dialog shows the parent-child relationships of all the processes active during a capture, which in the case of a boot trace means all the processes that executed during the boot and logon process. Focusing my attention on Winlogon.exe, the interactive logon manager, I noticed that a process named Atbroker.exe launched around the time I entered my credentials, and then Userinit.exe executed at the time my desktop finally appeared:

    image

    The key to solving the mystery lay in the long pause in between. I knew that Logonui.exe simply displays the logon user interface and that Atbroker.exe is just a helper for transitioning from the logon user interface to a user session, which ruled them out, at least initially. The black screen disappeared when Userinit.exe had started, so Userinit’s parent process, Winlogon.exe, was the remaining suspect. I set a filter to include just events from Winlogon.exe and added the Relative Time column to easily see when events occurred relative to the start of the boot. When I looked at the resulting items I could easily see the delay was actually about six minutes, but there was no activity in that time period to point me at a cause:

    image 

    Profiling events are excluded by default, so I clicked on the profile event filter button in the toolbar to include them, hoping that they might offer some insight:

    image

    In order to minimize log file sizes, Process Monitor’s profiling only captures a thread’s stack if the thread has executed since the last time it was sampled. I was therefore expecting to have to look at the thread profile events at the start of the delay, but my eye was drawn to a pattern of the same four threads sampled every second throughout the entire black-screen period:

    image
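    That space-saving behavior can be approximated in a short sketch (a simplification for illustration, not Process Monitor’s real implementation): keep a per-thread CPU counter, and record a stack sample only for threads whose counter advanced since the previous tick:

    ```python
    # Simplified sketch of Process Monitor's profiling heuristic (an
    # approximation, not its real implementation): a thread is sampled only
    # if its cumulative CPU time advanced since the previous sample.

    def threads_to_sample(cpu_times, last_seen):
        """Return IDs of threads that executed since the last sampling tick."""
        active = []
        for tid, cpu in cpu_times.items():
            if cpu != last_seen.get(tid):
                active.append(tid)
            last_seen[tid] = cpu
        return active

    history = {}
    print(threads_to_sample({101: 0.50, 102: 0.00}, history))  # → [101, 102]
    print(threads_to_sample({101: 0.50, 102: 0.10}, history))  # → [102]
    ```

    A thread that blocks for the whole interval never advances its counter, which is exactly why a thread that issued one long-running call and then went dormant would not keep showing up in the samples.
    
    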

    I was fairly certain that whatever thread was holding things up had executed some function at the beginning of the interval and was dormant throughout, so I was skeptical that any of these active threads were related to the issue, but it was worth spending a few seconds to look at them. I opened the event properties dialog for one of the samples by double-clicking on it and switched to its Stack page, on the off chance that the names of the functions on the stack had an answer.

    When I first run Process Monitor on a system I configure it to pull symbols for Windows images from the Microsoft public symbol server using the Debugging Tools for Windows debug engine DLL, so I can see descriptive function names in the stack frames of Windows executables, rather than just file offsets:

    SNAGHTML3eeaff2

    The first thread’s stack identified the thread as a core Winlogon “state machine” thread waiting for some unknown notification, yielding no clues:

    image

    The next thread’s stack was just as unenlightening, showing the thread to be a generic worker thread:

    image

    The stack of the third thread was much more interesting. It was many frames deep, including calls into functions of the Multiple UNC Provider (MUP) and Distributed File System Client (DFSC) drivers, both related to accessing file servers:

    image

    I scrolled down to see the frames higher on the stack and the name of one of the functions, WLGeneric_ActivationAndNotifyStartShell_Execute, pretty much confirmed the thread to be the one responsible for the problem, since it implied that it was supposed to start the desktop shell:

    image

    The next frame’s function, WNetRestoreAllConnectionsW, combined with the deeper calls into file server functions, led me to conclude that Winlogon was trying to restore file server drive letter mappings before proceeding to launch my shell and give me access to the desktop. I quickly opened Explorer, recalling that I had two drives mapped to network shares hosted on computers inside the Microsoft network, one to my development system and another to the internal Sysinternals share where I publish pre-release versions of the tools. While at the conference I was not on the intranet, so Winlogon was unable to reconnect them during the logon and was eventually – after many minutes – giving up:

    image

    Confident I’d solved the mystery, I right-clicked on each share and disconnected it. I rebooted the laptop to verify my fix (workaround, to be precise), and to my immense satisfaction, the logon proceeded to the desktop within a few seconds. The case was closed! As for why the delays were unusually long, I haven’t had the time – or need – to investigate further. The point of this story isn’t to highlight this particular issue, but to illustrate the use of the Sysinternals tools and troubleshooting techniques to solve problems.

    TechEd Europe, which took place in Amsterdam last week, gave me another chance to reprise the talks I’d given at TechEd US. I delivered the same Case of the Unexplained troubleshooting session I had at TechEd US, but this time I had the pleasure of sharing this very fresh and personal case. You can watch it and my other TechEd sessions either by going to my webcasts page, which lists all of my major sessions posted online, or by following these links directly:

    Windows Azure Virtual Machines and Virtual Networks
    Windows Azure Internals
    Malware Hunting with the Sysinternals Tools
    Case of the Unexplained 2012

    And you can see all of both events’ sessions online at their webcast sites:

    TechEd North America 2012 On-Demand Recordings
    TechEd Europe 2012 On-Demand Recordings

    I hope you enjoyed this case!

  • Announcing Trojan Horse, the Novel!

    Many of you have read Zero Day, my first novel. It’s a cyberthriller that features Jeff Aiken and the beautiful Daryl Haugen, computer security experts who save the world from a devastating cyberattack. Its reviews and sales exceeded my expectations, so I’m especially excited about the sequel, Trojan Horse, which I think is even more timely and exciting. Trojan Horse, like Zero Day, is an action-packed cyberthriller on a global scale, pitting Jeff and Daryl against international forces in a fight for world security and their lives. Instead of telling you more, I’ll let the Trojan Horse video trailer, below, show you instead.

    Trojan Horse will be published on September 4, but you can preorder it now from your favorite online book seller (in the US only for now, but Zero Day’s Korean publisher has already purchased foreign publishing rights). Find the ordering links, read more about Trojan Horse, see my other books, check out my book blog and find out where I’m speaking on my new website, markrussinovich.com. Preorder Trojan Horse now and tell your friends!

  • The Case of My Mom’s Broken Microsoft Security Essentials Installation

    As a reader of this blog I suspect that you, like me, are the IT support staff for your family and friends. And I bet many of you performed system maintenance duties when you visited your family and friends during the recent holidays. Every time I’m visiting my mom, I typically spend a few minutes running Sysinternals Process Explorer and Autoruns, as well as the Control Panel’s Program Uninstall page, to clean the junk that’s somehow managed to accumulate since my last visit.

    This holiday, though, I was faced with more than a regular checkup. My mom had recently purchased a new PC, so as a result I spent a frustrating hour removing the piles of crapware the OEM had loaded onto it (now I would recommend getting a Microsoft Signature PC, which comes crapware-free). I say frustrating because of the time it took and because even otherwise simple applications were implemented as monstrosities with complex and lengthy uninstall procedures. Even the OEM’s warranty and help files were full-blown installations. Making matters worse, several of the craplets failed to uninstall successfully, either throwing error messages or leaving behind stray fragments that forced me to hunt them down and execute precision strikes.

    As my cleaning was drawing to a close, I noticed that the antimalware the OEM had put on the PC had a 1-year license, after which she’d have to pay to continue service. With excellent free antimalware solutions on the market, there’s no reason for any consumer to pay for antimalware, so I promptly uninstalled it (which of course was a multistep process that took over 20 minutes and yielded several errors). I then headed to the Internet to download what I – not surprisingly, given my affiliation – consider the best free antimalware solution, Microsoft Security Essentials (MSE). A couple of minutes later the setup program was downloaded and the installation wizard launched. After clicking through the first few pages it reported it was going to install MSE, but then immediately complained that an “error has prevented the Security Essentials setup wizard from completing successfully”:

    SNAGHTMLfe55b5c

    The suggestion to “restart your computer and try again” is intended to deal with failures caused by interference from an unfinished uninstall of existing antimalware (or a hope that whatever unexpected error condition caused the problem is transient). I’d just rebooted, so it didn’t apply. Clicking the “Get help on this issue” link provided some generic troubleshooting steps, like uninstalling other antimalware, ensuring that the Windows Installer service is configured and running (though by default it isn’t running on Windows 7 since it’s demand-start), and if all else fails, contacting customer support.

    I suspected that whatever I’d run into was rare enough that customer support wouldn’t be able to help (and what would they say if they knew Mark Russinovich was calling for tech support?), especially when I found no help on the web for error code 0x80070643. My brother-in-law, who is also a programmer and the tech support for his neighborhood, was watching over my shoulder to pick up some tips, so the pressure was on to fix the problem. Out came my favorite troubleshooting tool, Sysinternals Process Monitor (remember, “when in doubt, run Process Monitor”).

    I reran the MSE setup while capturing a trace with Process Monitor. Then I opened Process Monitor’s process tree view to find what processes were involved in the attempted install and identified Msiexec.exe (Windows Installer) and a few launcher processes. I also saw that Setup.exe launched Wermgr.exe, the Windows Error Reporting Manager, presumably to upload an error report to Microsoft:

    image

    I turned my attention back to the trace output and configured a filter that excluded everything but these processes of interest. Then I began the arduous job of working my way through tens of thousands of operations, hoping to find the needle in the haystack that revealed why the setup choked with error 0x80070643.

    As I scanned quickly to get an overall view, I noticed some writes to log files:

    image

    However, the messages in them revealed nothing more than the cryptic error message shown in the dialog.

    After a few minutes I decided I should work my way back from the point in the trace where the error occurred, so I returned to the tree, selected Wermgr.exe, and clicked “Go to event”:

    image

    This would ideally be just after the setup encountered the fatal condition. Then I paged up in the trace, looking for clues. After several more minutes I noticed a pattern that accounted for almost all the operations up to that point: Setup.exe was enumerating all the system’s installed applications. I determined that by observing that it queried multiple installer-related registry locations, and I could see the names of the applications it found in the Details column for some of them. Here, for example, is one of the OEM’s programs, another help-file-as-an-application, that I hadn’t bothered to uninstall:

    image

    I could now move quickly through the trace by scanning for application names. A minute later I stopped short, spotting something I shouldn’t have seen: “Microsoft Security Essentials”:

    image

    I knew I hadn’t seen it listed in the installed programs list in the Control Panel in my earlier uninstall-fest, which I confirmed by rechecking.

    Why were there traces of MSE when it hadn’t been installed, and in fact wouldn’t install? I don’t know for sure, but after pondering this for a few minutes I came to the conclusion that the software my mother had used to transfer files and settings from her old system had copied parts of the MSE installation she had on the old PC. She likely had used whatever utility the OEM put on the PC, though I would recommend using Windows Easy Transfer. But at this point the reason didn’t really matter, only getting MSE to install successfully did, and I believed I had found the problem. I deleted the keys, reran the setup, and… hit the same error.

    Not ready to give up, I captured another trace. Suspecting that setup was tripping on other fragments of the phantom installation, I searched for “security essentials” in the new trace and found another reference before the setup bailed. To avoid repeating this step over and over, I went to the registry and performed the same search, deleting about two dozen other keys that had “security essentials” somewhere in them.
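    The search itself was done with Regedit’s Find dialog, but the operation it performs, recursively visiting every key and matching names against a substring, is simple to model. This sketch runs over a nested dict standing in for a registry hive (on a live Windows system you would walk real keys with the winreg module instead):

    ```python
    # Sketch of the "find every key containing a substring" step, modeled over
    # a nested dict standing in for a registry hive. The real search used
    # Regedit's Find; on Windows you could walk keys with the winreg module.

    def find_keys(tree, needle, path=""):
        """Yield paths of keys whose names contain the needle (case-insensitive)."""
        for name, subkeys in tree.items():
            key_path = f"{path}\\{name}" if path else name
            if needle.lower() in name.lower():
                yield key_path
            yield from find_keys(subkeys, needle, key_path)

    hive = {
        "SOFTWARE": {
            "Microsoft Security Essentials": {},
            "Contoso": {"Security Essentials Helper": {}},
        }
    }
    for key in find_keys(hive, "security essentials"):
        print(key)
    ```

    Note that a name-based search, like Regedit’s, only finds keys and values that mention the string explicitly, which is exactly why it could miss fragments such as a GUID-named UpgradeCodes key.
    
    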

    I held my breath and ran the installer again, but no go:

    image

    The error code was different so I had apparently made some progress, but a web search still didn’t yield any clues. I captured yet another trace and began poring through the operations. The install made it well past the installed-application enumeration, generating tens of thousands more operations. I scanned back from where Wermgr.exe launched, but was quickly overwhelmed. I just couldn’t spot what had made it unhappy, and that was assuming that whatever it was would even be visible in the trace. My brother-in-law was growing skeptical, but I told him I wasn’t done. I was motivated by the challenge as much as by the fact that I couldn’t let him tell his work buddies that he’d watched me fail.

    I decided I needed the guidance of a successful installation’s trace so that I could find where things went astray. When it’s an option, like it was here, side-by-side trace comparison is a powerful troubleshooting technique. I switched to my laptop, launched a Windows 7 virtual machine, and generated a trace of MSE’s successful installation on a clean system. I then copied the log from my mom’s computer and opened both traces in separate windows, one on the top of the screen and one on the bottom.

    Scrolling through the traces in tandem, I was able to synchronize them simply by looking at the shapes that the operation paths make in the window and occasionally ensuring that they were indeed in sync by looking closely at a few operations. Though it was laborious, I progressed through the trace, at times losing sync but then gaining it back. One trace being from a clean system and the other with lots of software installed caused relatively minor differences I could discount.
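    The tandem comparison was done by eye, but the core of the technique, walking two traces in lockstep until they part ways, can be partially automated. This sketch uses made-up operation strings rather than Process Monitor’s export format:

    ```python
    # Sketch of the side-by-side comparison: walk two traces in lockstep and
    # report the first position where they diverge. The operation strings are
    # made up; a real comparison would load Process Monitor CSV exports.
    from itertools import zip_longest

    def first_divergence(trace_a, trace_b):
        """Return (index, a_event, b_event) of the first mismatch, or None."""
        for i, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
            if a != b:
                return i, a, b
        return None

    working = ["RegOpenKey HKCR\\Installer", "RegQueryValue Codes", "CreateFile C:\\setup.log"]
    failing = ["RegOpenKey HKCR\\Installer", "RegQueryValue Codes", "RegOpenKey UpgradeCodes"]
    print(first_divergence(working, failing))
    ```

    In practice an exact-match comparison is too strict, since process IDs, timestamps, and unrelated background activity differ between runs, which is why eyeballing the overall shape of the traces and resynchronizing manually worked better here.
    
    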

    Finally, after about 10 minutes, I found an operation that differed in what seemed to be a significant way: an open of the registry key HKCR\Installer\UpgradeCodes\11BB99F8B7FD53D4398442FBBAEF050F returned SUCCESS in the failing trace:

    image 

    but NAME NOT FOUND in the working one:

    image

    Another bit of the broken installation it seemed, but without any reference to MSE, so one that hadn’t shown up in my registry search. I deleted the key, and with some forced confidence told my brother-in-law that I had solved the problem. Then I crossed my fingers and launched the setup again, praying that it would work and I could get back to the holiday festivities that were in full swing downstairs.

    Bingo, the setup chugged along for a few seconds and finished by congratulating me on my successful install:

    SNAGHTMLfe4b7cd

    Another seemingly unsolvable problem conquered with Sysinternals, application of a few useful troubleshooting techniques, and some perseverance. My brother-in-law was suitably impressed and had a good story for when he returned to the office after the break, and my mother had a faster PC with free antimalware service.

    I followed up with the MSE team and they are working on improving the error codes and making the setup program more robust against these kinds of issues. They also pointed me at some additional resources in case you happen to run into the same kind of problem. First, there’s a Microsoft support tool, MscSupportTool.exe, that extracts the MSE installation log files, which might give some additional information. There’s also a Microsoft ‘fix-it tool’ that addresses some installation corruption problems.

    I hope that your holiday troubleshooting met with similar success and wish that your 2012 is free of computer problems!

  • The Case of the Installer Service Error

    This case unfolds with a network administrator charged with the rollout of the Microsoft Windows Intune client software on their network. Windows Intune is a cloud service that manages systems on a corporate network, keeping their software up to date and enabling administrators to monitor the health of those systems from a browser interface. It requires a client-side agent, but on one particular system the client software failed to install, reporting this error message:

    image

    The dialog’s error message translates to “The Windows Installer Service could not be accessed. This can occur if the Windows Installer is not correctly installed. Contact your support personnel for assistance.”

    The administrator, having seen one of my Case of the Unexplained presentations where I advised, “when in doubt, run Process Monitor,” downloaded a copy from Sysinternals.com and captured a trace of the failed install. He followed the standard troubleshooting technique of looking backward from the end of the trace for operations that might be the cause, but after about a half hour of analysis he gave up and switched to a different approach. Instead of looking for clues in the trace, he thought he might be able to find clues by comparing the trace of the failing system to another captured on a system where the client installed successfully.

    A few minutes later he had a second trace to compare side-by-side. He set a filter to include only events generated by Msiexec.exe, the client setup program, and proceeded through the events in the trace from the problematic system, correlating them with corresponding events on the working one. He eventually got to a point where the two traces diverged. Both traces have references to HKLM\System\CurrentControlSet\Services\BFE, but the failed trace then has registry queries of the ComputerName registry key:

    image

    The working system’s trace, on the other hand, continues with operations from a new instance of Msiexec.exe, something he noticed because the process ID of the subsequent Msiexec.exe operations was different from the earlier ones:

    image

    It still wasn’t clear from the failed trace what caused the divergence, however. After pondering the situation for a few minutes he was just about to give up when the thought crossed his mind that the answer might lie in the operations that the filter was hiding. He removed the Msiexec.exe-only filter from both traces and resumed comparing them from the point of divergence.

    He immediately saw that the trace from the working system had many events from a Svchost.exe process that weren’t present in the failed trace. Working under the assumption that the Svchost.exe activity was unrelated, he added a filter to exclude it. Now the traces lined up again with events from Services.exe matching in both traces:

    image image

    The matching operations didn’t go on for very long, however. Only a dozen or so operations later the trace from the failing system had a reference to HKLM\System\CurrentControlSet\Services\Msiserver\ObjectName with a NAME NOT FOUND error:

    image

    The trace from the working system had the same operation, but with a SUCCESS result:

    image

    Sure enough, right-clicking on the path and selecting “Jump to…” from Process Monitor’s context menu confirmed that not only was the ObjectName value missing from the failing system’s Msiserver key, but the entire key was empty. On the working system it was populated with the registry values required to configure a Windows service:

    image

    Somehow the service registration for the MSI service had been corrupted, something the initial error dialog had stated, but without guidance on how to fix it. How the service had become corrupted would likely remain a mystery forever, but the important thing now was fixing it. To do so, he simply used Regedit’s export functionality to save the contents of the key from the working system to a .reg file and then imported the file on the corrupted system. After the import, he reran the Windows Intune installer and it succeeded without any issues.

    With the help of Process Monitor and some diligence, he’d spent about forty-five minutes fixing a problem that would have ended up costing him several hours if he’d had to reimage the system and restore its applications and configuration.

    You can find more tips on running Process Monitor, as well as additional illustrative troubleshooting cases, in my Windows Sysinternals Administrator’s Reference, a book I recently published with Aaron Margosis. If you’ve read it, please leave a review on Amazon.com.

  • Fixing Disk Signature Collisions

    Disk cloning has become common as IT professionals virtualize physical servers using tools like Sysinternals Disk2vhd and use a master virtual hard disk image as the base for copies created for virtual machine clones. In most cases, you can operate with cloned disk images unaware that they have duplicate disk signatures. However, on the off chance you attach a cloned disk to a Windows system that has a disk with the same signature, you will suffer the consequences of disk signature collision, which renders unbootable any of the disk’s installations of Windows Vista and newer. Reasons for attaching a disk include offline injection of files, offline malware scanning, and – somewhat ironically – repairing a system that won’t boot. This risk of corruption is the reason I warn in Disk2vhd’s documentation not to attach a VHD produced by Disk2vhd to the system that generated it using the native VHD support added in Windows 7 and Windows Server 2008 R2.

    I’ve gotten emails from people who have run into the disk signature collision problem, and I can see from a Web search that there’s little clear help for fixing it. So in this post, I’ll give you easy repair steps you can follow if you’ve got a system that won’t boot because of a disk signature collision. I’ll also explain where disk signatures are stored, how Windows uses them, and why a collision makes a Windows installation unbootable.

    Disk Signatures

    A disk signature is a four-byte identifier stored at offset 0x1B8 in a disk’s Master Boot Record, which is written to the first sector of a disk. This screenshot of a disk editor shows that the signature of my development system’s disk is 0xE9EB3AA5 (the value is stored in little-endian format, so the bytes are stored in reverse order):

    image
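    Reading the signature programmatically is just a little-endian unpack at offset 0x1B8; this sketch runs against a synthetic 512-byte sector rather than a real disk:

    ```python
    # Reading the four-byte, little-endian disk signature at MBR offset 0x1B8,
    # demonstrated against a synthetic 512-byte sector rather than a real disk.
    import struct

    def mbr_signature(sector: bytes) -> int:
        """Extract the disk signature from a 512-byte MBR sector."""
        return struct.unpack_from("<I", sector, 0x1B8)[0]

    sector = bytearray(512)
    sector[0x1B8:0x1BC] = bytes.fromhex("A53AEBE9")  # byte order as seen on disk
    print(f"0x{mbr_signature(bytes(sector)):08X}")   # → 0xE9EB3AA5
    ```

    The same little-endian interpretation applies anywhere the signature is stored, including the copies the BCD keeps in its registry values.
    
    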

    Windows uses disk signatures internally to map objects like volumes to their underlying disks, and starting with Windows Vista, it also uses disk signatures in its Boot Configuration Database (BCD), which is where it stores the information the boot process uses to find boot files and settings. When you look at a BCD’s contents using the built-in Bcdedit utility, you can see the three places that reference the disk signature:

    image

    The BCD actually has additional references to the disk signature in alternate boot configurations, like the Windows Recovery Environment, resume from hibernate, and the Windows Memory Diagnostic boot, that don’t show up in the basic Bcdedit output. Fixing a collision requires knowing a little about the BCD structure, which is actually a registry hive file that Windows loads under HKEY_LOCAL_MACHINE\BCD00000:

    image

    Disk signatures show up at offset 0x38 in registry values called Element under keys named 0x11000001 (Windows boot device) and 0x21000001 (OS load device):

    image

    Here’s the element corresponding to one of the entries seen in the Bcdedit output, where you can see the same disk signature that’s stored in my disk’s MBR:

    image

    Disk Signature Collisions

    Windows requires the signatures to be unique, so when you attach a disk that has a signature equal to one already attached, Windows keeps the disk in “offline” mode and doesn’t read its partition table or mount its volumes. This screenshot shows how the Windows Disk Management administrative utility presents an offline disk that I caused when I attached the VHD Disk2vhd created for my development system to that system:

    image

    If you right-click on the disk, the utility offers an “Online” command that will cause Windows to analyze the disk’s partition table and mount its volumes:

    image

    When you choose the Online menu option, Windows will, without warning, generate a new random disk signature and assign it to the disk by writing it to the MBR. It will then be able to process the MBR and mount the volumes present, but when Windows updates the disk signature, the BCD entries become orphaned, still linked to the previous disk signature rather than the new one. When booting from the disk, the boot loader will fail to locate the specified disk and boot files, give up, and report the following error:

    image

    Restoring a Disk Signature

    One way to repair a disk signature corruption is to determine the new disk signature Windows assigned to the disk, load the disk’s BCD hive, and manually edit all the registry values that store the old disk signature. That’s laborious and error-prone, however. In some cases, you can use Bcdedit commands to point the device elements at the new disk signature, but that method doesn’t work on attached VHDs and so is unreliable. Fortunately, there’s an easier way. Instead of updating the BCD, you can give the disk its original disk signature back.

    First, you have to determine the original signature, which is where knowing a little about the BCD becomes useful. Attach the disk you want to fix to a running Windows system. It will come online and Windows will assign drive letters to the volumes on the disk, since there’s no disk signature collision. Load the BCD off the disk by launching Regedit, selecting HKEY_LOCAL_MACHINE, and choosing Load Hive from the File menu:

    image

    Navigate to the disk’s hidden \Boot directory in the file dialog, which resides in the root directory of one of the disk’s volumes, and select the file named “BCD”. If the disk has multiple volumes, find the Boot directory by just entering x:\boot\bcd, replacing the “x:” with each of the volume’s drive letters in turn. When you’ve found the BCD, pick a name for the key into which it loads, select that key, and search for “Windows Boot Manager”. You’ll find a match under a key named 12000004, like this:

    image

    Select the key named 11000001 under the same Elements parent key and note the four-byte disk signature located at offset 0x38 (remember to reverse the order of the bytes).

    With the disk signature in hand, open an administrative command prompt window and run Diskpart, the command-line disk management utility. Enter “select disk 2”, replacing “2” with the disk ID that the Disk Management utility shows for the disk. Now you’re ready for the final step, setting the disk signature to its original value with the command “uniqueid disk id=e9eb3aa5”, substituting the ID with the one you saw in the BCD:

    image

    When you execute this command, Windows will immediately force the disk and its corresponding volumes offline to avoid a signature collision. Avoid bringing the disk online again or you’ll undo your work. You can now detach the disk and because the disk signature matches the BCD again, Windows installations on the disk will boot successfully. You might find yourself in a situation where you have no choice but to cause a collision and have Windows update a disk signature, but at least now you know how to repair it when you do.

    You can find out more about Disk2vhd in the Sysinternals Administrator’s Reference by me and Aaron Margosis.

  • The Case of the Mysterious Reboots

    This case opens when a Sysinternals power user, who also works as a system administrator at a large corporation, had a friend report that their laptop had become unusable. Whenever the friend connected it to a network, their laptop would reboot. The power user, upon getting hold of the laptop, first verified the behavior by connecting it to a wireless network. The system instantly rebooted, first into safe mode, then again back into a normal Windows startup. He tried booting the laptop into safe mode directly, hoping that whatever was causing the problem would be inactive in that mode, but logging on only resulted in an automatic logoff. Returning to a normal boot, he noticed that Microsoft Security Essentials (MSE) was installed and tried to launch it. Double-clicking the icon had no effect, however, and double-clicking its entry in the Programs and Features section of the Control Panel resulted in an error message:

    image

    Hovering his mouse over the MSE icon in the start menu gave the explanation: the link was pointing at a bogus location likely created by malware:

    image

    Because he couldn’t get to the network, he couldn’t easily repair the corrupted MSE installation. Wondering if the Sysinternals tools might help, he copied Process Explorer and Autoruns to a USB key, and then copied them from the key to the laptop, which he was now convinced was infected. Launching Process Explorer, he was greeted with the following process tree:

    image

    In my Blackhat presentation, Zero Day Malware Cleaning with the Sysinternals Tools, I present a list of characteristics commonly exhibited by malicious processes. They include having no icon, company name, or description, residing in the %Systemroot% or %Userprofile% directories, and being “packed” (encrypted or compressed). While there’s a class of sophisticated malware that doesn’t have any of these attributes, most malware still does. This case is a great example. Process Explorer looks for the signatures of common executable compression tools like UPX, as well as heuristics that include Portable Executable image layouts used by compression engines, and highlights matches in a “packed” highlight color. The default color, fuchsia, is visible on about a dozen processes in the process view. Further, every single one of the highlighted processes lacks a description and company name (though a few have icons).
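    Process Explorer’s packer detection looks for compressor signatures and suspicious PE layouts, as described above; a related heuristic often used for the same purpose (a stand-in for illustration here, not Process Explorer’s actual algorithm) is byte entropy, since compressed or encrypted sections look nearly random:

    ```python
    # A common packed-image heuristic (not necessarily Process Explorer's exact
    # method): compressed or encrypted sections have near-maximal byte entropy,
    # approaching 8 bits per byte.
    import math
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        """Bits of entropy per byte (0.0 to 8.0)."""
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

    plain  = b"\x00" * 4096          # highly repetitive: entropy 0
    packed = bytes(range(256)) * 16  # uniform byte distribution: entropy 8

    print(shannon_entropy(packed) > 7.0, shannon_entropy(plain) > 7.0)  # → True False
    ```

    Ordinary compiled code typically measures well below 7 bits per byte, so sections that approach the 8-bit maximum are a strong hint that a packer or crypter has been applied.
    
    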

    Many of them also have names that are the same as, or similar to, those of legitimate Windows system executables. The one highlighted below has a name that matches the Windows Svchost.exe executable, but has an icon (borrowed from Adobe Flash) and resides in a nonstandard directory, C:\Windows\Update.1:

    image

    Another process with a name not matching that of any Windows executable, but whose name, Svchostdriver.exe, is similar enough to confuse someone not intimately familiar with Windows internals, actually has TCP/IP sockets listening for connections, presumably from a botmaster:

    image

    There was no question that the computer was severely infected. Autoruns revealed malware using several different activation points, and explained why even Safe Mode with Command Prompt didn’t work properly: a bogus executable called Services32.exe (another legitimate-looking name) had registered itself as the Safe Mode AlternateShell, which is by default Cmd.exe (the command prompt):

    image
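    The check Autoruns is effectively making here is simple: the AlternateShell value, which lives under HKLM\SYSTEM\CurrentControlSet\Control\SafeBoot, should name the stock command prompt and nothing else. A minimal sketch of that comparison (the function name is my own; on a live system you would read the value with reg.exe or Python’s winreg module):

    ```python
    def alternate_shell_suspicious(value: str) -> bool:
        """Flag a SafeBoot AlternateShell value that isn't the default Cmd.exe.

        Malware like the Services32.exe in this case replaces the value so
        that logging on to Safe Mode with Command Prompt runs the malware
        instead of a command prompt.
        """
        return value.strip().lower() != "cmd.exe"
    ```
    
    
    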

    My recommendation for cleaning malware is to first leverage antimalware utilities if possible. Antimalware might address some or all of an infection, so why do work if you don’t have to? But this system couldn’t connect to the Internet, preventing him from easily repairing the MSE installation or downloading other antimalware like the Microsoft Malicious Software Removal Tool (MSRT). The power user had seen me use Process Explorer’s suspend functionality at a conference to suspend malware processes so that they couldn’t restart one another as they were terminated during cleaning. Maybe if he suspended all the processes that looked malicious, he’d be able to connect to the network without the system rebooting? It was worth a shot.

    Right-clicking on each malicious process in turn, he selected Suspend from the context menu to put the process into a state of limbo:

    image

    When he was done, the process tree looked like this, with suspended processes colored grey:

    image

    Now to see if the trick worked: he connected to the wireless network. Bingo, no reboot. Now connected to the Internet, he proceeded to download MSE, install it, and perform a thorough scan of the system. The engine cranked along, reporting discovered infections as it went. When it finished, it had found four separate malware strains, Trojan:Win32/Teniel, Backdoor:Win32/Bafruz.C, Trojan:Win32/Malex.gen!E, and Trojan:Win32/Sisron:

    image

    After rebooting, which was noticeably faster than before, he connected to the network without trouble. As a final check, he launched Process Explorer to see if any suspicious processes remained. To his relief, the process tree looked clean:

    image

    Another case solved with the help of the Sysinternals tools! The new Windows Sysinternals Administrator’s Reference, authored by Aaron Margosis and me, covers all the tools and their features in detail, giving you the techniques required to solve problems related to sluggish performance, misleading error messages, and application crashes. And if you’re interested in cyber-security, be sure to get a copy of my technothriller Zero Day.

  • The Case of the Hung Game Launcher

    I love the cases people send me where the Sysinternals tools have helped them successfully troubleshoot, but nothing is more satisfying than using them to solve my own cases. This case in particular was fun because, well, solving it helped me get back to having fun.

    When I have time, I occasionally play PC games to let off steam (pun intended, as you’ll see). One of my favorites over the last few years was the puzzle game, Portal. I enjoyed the first Portal so much that I pre-ordered Portal 2 on Valve’s Steam network when it became available and played through it within a few hours of its release. Since then, I’ve been playing community-developed maps. Last Saturday I started a particularly fun map, a winner from a community map contest, but didn’t have time to finish it in one sitting. The next morning I returned to my PC, double-clicked on the Portal 2 desktop icon, and got the standard Steam launch dialog. The game normally launches in a couple of seconds, but this time the dialog just sat there:

    image

    I killed Steam and double-clicked again, but again the dialog hung. I captured a Process Monitor trace and looked at the stacks of Steam’s threads in Process Explorer, but didn’t see any clues. Figuring that perhaps Portal 2’s configuration or installation had somehow been corrupted, I deleted Portal 2, re-downloaded it, and reinstalled it. That didn’t fix the problem, though. With Portal 2 reset to a clean state, that left either Steam or some general Windows issue to blame. The next step was therefore to reinstall Steam.

    I first went to the Uninstall or Change a Program page in the Control Panel, but double-clicking on the Steam entry brought up a dialog asking me to confirm uninstalling it and warning that doing so would delete all local content. I didn’t want to risk losing my game settings or having to reinstall all my games, so I aborted the uninstall. Most Windows Installer (MSI)-based installers have a repair option that reinstalls the application without deleting user data or configuration, so I went to the Steam home page and downloaded and executed the Steam installer. Sure enough, the install wizard offered the repair option:

    image

    When I pressed the Next button, though, I was greeted with an obviously misleading error message that reported a network error while trying to read from a local file:

    image

    I turned to Process Monitor again and captured a trace of the failed repair operation. The error message referred to a file named SteamInstall[1].msi, so I searched the log file for that string. The first hit was the data value read in a query of a registry value under HKCR\Installer\Products named PackageName:

    image

    The next hits, a few operations later, were attempts by the installer to read from the file location printed in the error dialog:

    image

    That the installer was reading the file name from an existing registry key and the file’s location was in Internet Explorer’s (IE’s) download cache suggested that it was trying to launch the copy of itself that had performed the initial install. Because I had originally launched the installer via IE directly from the Valve web site, just like I was doing now, the download location was in IE’s download cache, but the file must have aged out and been deleted since then.

    The Process Monitor trace revealed that the installer was reading the original location from the registry, so if I pointed the registry at the installer’s new download location, I could trick it into launching itself, rather than looking for the previous copy that was now missing. I scanned the log for the new download location by searching for Steaminstall.msi and found its path, another download cache location:

    image

    I then went back to the registry query’s entry, right-clicked on it, and selected “Jump To” from the context menu. That caused Process Monitor to launch Regedit and navigate directly to the registry key, where I updated the LastUsedSource and PackageName values to reflect the new download location:

    image
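    The registry fix amounts to rewriting two values under the product’s SourceList key so that Windows Installer resolves the cached package from the new location. A sketch of that edit using a hypothetical in-memory dictionary as a stand-in for the HKCR\Installer\Products\<GUID>\SourceList key (the LastUsedSource and PackageName value names are the real ones edited in Regedit; the function and paths are illustrative):

    ```python
    def repoint_msi_source(source_list: dict, new_dir: str, new_package: str) -> None:
        """Point a Windows Installer product at a freshly downloaded package copy."""
        source_list["LastUsedSource"] = new_dir   # directory the installer will search
        source_list["PackageName"] = new_package  # file name it expects to find there

    key = {"LastUsedSource": "C:\\old\\download\\cache",
           "PackageName": "SteamInstall[1].msi"}
    repoint_msi_source(key, "C:\\new\\download\\cache", "SteamInstall.msi")
    ```
    
    
    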

    Next, I dismissed the installer’s error dialog, which I had left open, and pressed the wizard’s Next button to try the repair again. This time, Steam proceeded to reinstall and the wizard concluded with a message of success:

    image

    Crossing my fingers, I launched Portal 2. Steam’s ‘Preparing to Launch’ dialog flashed for a second and then Portal 2’s splash screen took over: case closed. Uninstalling and then reinstalling Steam and all the games would likely have led to the same conclusion, but Process Monitor surely saved me a lot of time and possibly even my saved game state and configurations. In just a few minutes I was back to solving puzzles of a different kind.

    Check out the new Windows Sysinternals Administrator’s Reference by Aaron Margosis and me for more tips on using all 70+ Sysinternals tools to troubleshoot and manage your Windows systems! Buy a copy by August 15, email the receipt to me at markruss@microsoft.com, and I’ll enter you in a drawing for one of 10 signed copies of Zero Day I’m giving away.

    Mark Russinovich is a Technical Fellow on the Windows Azure team at Microsoft and is author of Windows Internals, The Windows Sysinternals Administrator’s Reference, and the cyberthriller Zero Day: A Novel. You can contact him at markruss@microsoft.com.

  • Troubleshooting with the New Sysinternals Administrator’s Reference

    image

    Aaron Margosis and I are thrilled to announce that the long-awaited, and some say long-overdue, official guide to the Sysinternals tools is now available! I’ve always had the idea of writing a book on the tools in the back of my mind, but it wasn’t until a couple of years ago that Dave Solomon, my coauthor on Windows Internals, convinced me to pursue it. After a few false starts, I decided that a coauthor would help get the book done more quickly, and turned to Aaron, a good friend of mine who’s also a long-time user and expert on the tools at his day job in the Federal Division of Microsoft Consulting Services. It was a great choice and I’m proud to put the Sysinternals brand on the book.

    Whether you’re new to the tools or have been using them since Bryce Cogswell (my Sysinternals and Winternals Software cofounder, now retired) and I released NTFSDOS in 1996, you’re sure to take away new insights that will give you the edge when tackling tough problems and managing your Windows systems.

    The book covers all 70+ tools, with chapters dedicated to the major tools like Process Explorer, Process Monitor, and Autoruns. For each we provide a thorough tour of all of the tool’s features, show how to use the tool, and include our favorite tips and techniques. There’s no better way to learn than by example, though. The last section of the book will be familiar to anyone who’s read this blog or watched my Case of the Unexplained conference sessions, because it presents 17 real-world cases that show how Windows power users and administrators like you solved otherwise impossible-to-solve problems by using the tools.

    The book is available for purchase on Amazon.com, from O'Reilly in four ebook formats, or you can read it online through Safari.

    The eBook has only been out for a couple of weeks and we’ve already heard from someone who bought the book and immediately used what he learned to solve a case that was literally ruining his sleep. I thought it only appropriate to include it here in the blog post announcing the book.

    Let us know what you think of the book by dropping us an email, and as I say in my dedication to you, my fellow Windows troubleshooters, at the front of the book: never give up, never surrender!

    The Case of the Mysterious Sounds

    The case opened several weeks ago when a user started hearing sounds from the computer in his bedroom. The sound, a simple short tone, came randomly, sometimes only once per day, other times a few times in an hour. Every time he heard it, he’d jump to the computer, open Process Explorer, and look for clues as to what might be responsible, but the sounds persisted even when he had no applications open. On a few occasions he was woken from sleep and learned to mute the speaker before heading to bed. His life began to unravel from his lack of sleep and growing frustration. Work suffered, he was short with his friends, and he started to wonder if he had a ghost.

    Then last week he saw the announcement that the Sysinternals book was available. He had been a casual user of the tools and thought that getting a deeper understanding might help his IT management responsibilities at work. When he reached the chapter on Process Monitor, he read that many years ago Dave Solomon found Process Monitor so useful at uncovering the root causes of such a wide array of problems that he coined the phrase “when in doubt, run Process Monitor.” With little to lose, he decided to give the advice a try on his haunted home system.

    He configured a filter for files ending in .WAV, hypothesizing that the sound was stored in that common format. Since he didn’t know how long it would take for a sound to recur, he needed to leave Process Monitor running for many hours. So that it wouldn’t exhaust the system’s virtual memory or fill up the disk, he used its “drop filtered events” feature to record only events matching the active filter. He left Process Monitor running and went to work. When he arrived home, he eagerly went to the computer to see if the culprit had been caught. Almost collapsing with relief, he saw eight operations had matched the filter:

    image

    The tooltip clearly revealed that the wireless adapter’s applet had played a sound. Then it all clicked: the computer was just in range of the wireless base station, so while it had a decent connection most of the time, occasionally the connection would drop. He suspected that the applet chimed to announce when the connection was restored. Expecting that it would offer an option to disable the notification, he right-clicked on the tray icon. Sure enough, “Enable Internet Connected Notification” was checked:

    image

    Since he unchecked it, the computer hasn’t made any unexpected noises and the case was closed. As a result, his sleep has returned to normal, he’s getting along with his friends, and his use of what he’s learned from the Sysinternals Administrator’s Reference has made him a star at work.

    Mark Russinovich is a Technical Fellow on the Windows Azure team at Microsoft and is author of Windows Internals, The Windows Sysinternals Administrator’s Reference, and the cyberthriller Zero Day: A Novel. You can contact him at markruss@microsoft.com.

  • Analyzing a Stuxnet Infection with the Sysinternals Tools, Part 3

    In the first post of this series, I used Autoruns, Process Explorer and VMMap to statically analyze a Stuxnet infection on Windows XP. That phase of the investigation revealed that Stuxnet infected multiple processes, launched infected processes that appeared to be running system executables, and installed and loaded two device drivers. In the second phase, I turned to the Process Monitor trace I had captured during the infection and learned that Stuxnet had launched several additional processes during the infection. The trace also uncovered the fact that Stuxnet had dropped four files with the .PNF extension into the C:\Windows\Inf directory. In this concluding post, I use the Sysinternals tools to try to determine the purpose of the PNF files and to look at how Stuxnet used a zero-day vulnerability on Windows 7 (since fixed) to elevate itself to run with administrator rights.

    The .PNF Files

    My first step in gathering clues about the .PNF files was to just see how large they were. Tiny files would probably be data and larger ones code. The four .PNF files in question are the following, listed with the sizes in bytes I observed in Explorer:

    MDMERIC3.PNF 90
    MDMCPQ3.PNF 4,943
    OEM7A.PNF 498,176
    OEM6C.PNF 323,848

    I also dumped the printable characters contained within the files using the Sysinternals Strings utility, but saw no legible words. That wasn’t surprising, however, because I expected the files to be compressed or encrypted.

    I thought that by looking at the way Stuxnet references the .PNF files, I might find additional clues regarding their purpose. To get a more complete view of their usage, I captured a Process Monitor boot log of the system rebooting after the infection. Boot logging, which you configure by selecting Enable Boot Logging in the Options menu, has Process Monitor capture activity from very early in the next boot and stop capturing either when you run Process Monitor again, or when the system shuts down:

    image_thumb4

    After capturing a boot log that included me logging back into the system, I loaded the boot log into one Process Monitor window and the initial infection trace into a second Process Monitor window. Then I reset the filters in both traces, removed the advanced filter that excludes System process activity, and added an inclusion filter for Mdmeric3.pnf to see all activity directed at the first file. The infection trace had the events related to the initial creation of the file and nothing more, and the file wasn’t referenced at all in the boot log. It appeared that Stuxnet didn’t leverage the file during the initial infection or in its subsequent activation. The file’s small size, 90 bytes, implies that it is data, but I couldn’t determine its purpose based on the little evidence I saw in the logs. In fact, the file may serve no useful purpose since none of the published Stuxnet reports have anything further to say about the file other than that it’s a data file.

    Next, I repeated the same filtering exercise for Mdmcpq3.pnf. In the infection log, I had seen the Services.exe process write the file’s contents three times during the initial infection, but there were no accesses afterward. In the boot trace, I could see Services.exe read the file immediately after starting:

    image_thumb11

    The fact that Stuxnet writes the file during the infection and reads it once when it activates during a system boot, coupled with the file’s relatively small size, hints that it might be Stuxnet configuration data, and that’s what formal analysis by antivirus researchers has concluded.

    The third file, Oem7a.pnf, is the largest of the files. I saw during my analysis of the infection log in the last post that after the rogue Lsass.exe writes the file during the infection, one of the other rogue Lsass.exe instances reads it in its entirety, as does the infected Services.exe process. An examination of the boot log showed that Services.exe reads the entire file when it starts:

    image

    What’s unusual is that the read operations are the very first performed by Services.exe, even before the Ntdll.dll system DLL loads. Ntdll.dll loads before any user-mode code executes, so seeing activity before then can only mean that kernel-mode code is responsible. The stack shows that they are actually initiated by Mrxcls.sys, one of the Stuxnet drivers, from kernel mode:

    image_thumb3

    The stack shows that Mrxcls.sys is invoked by the PsCallImageNotifyRoutines kernel function. That means Mrxcls.sys called PsSetLoadImageNotifyRoutine so that Windows would call it whenever an executable image, such as a DLL or device driver, is mapped into memory. Here, Windows was notifying the driver that the Services.exe image file was loading into memory to start the Services.exe process. Stuxnet clearly registers with the callback so that it can watch for the launch of Services.exe. Ironically, Process Monitor also uses this callback functionality to monitor image loads.
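    The callback mechanism at work here can be modeled in a few lines. This is a toy sketch, not Windows kernel code: the class and variable names are invented, but the shape mirrors the real flow, where a driver registers a routine via PsSetLoadImageNotifyRoutine and the kernel invokes every registered routine through PsCallImageNotifyRoutines whenever an image maps:

    ```python
    class ImageLoadNotifier:
        """Toy model of the kernel's image-load notification list."""
        def __init__(self):
            self._routines = []

        def register(self, routine):
            # Analogous to PsSetLoadImageNotifyRoutine
            self._routines.append(routine)

        def image_loaded(self, image_name):
            # Analogous to PsCallImageNotifyRoutines firing on every image map
            for routine in self._routines:
                routine(image_name)

    infections = []

    def mrxcls_style_routine(image_name):
        # Like Mrxcls.sys, watch for Services.exe mapping and act before any
        # user-mode code (even Ntdll.dll) has run in the new process.
        if image_name.lower().endswith("services.exe"):
            infections.append(image_name)

    kernel = ImageLoadNotifier()
    kernel.register(mrxcls_style_routine)
    kernel.image_loaded(r"C:\Windows\System32\ntdll.dll")
    kernel.image_loaded(r"C:\Windows\System32\services.exe")
    ```
    
    
    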

    These observations point at Mrxcls.sys as the driver that triggers the infection of user-mode processes when the system boots after the infection. Further, the size of the file, 498,176 bytes (487 KB), almost exactly matches the size of the virtual memory region, 488 KB, from where we saw Stuxnet operations initiate in Part 1 of the investigation. That region held an actual DLL, so it appears that Oem7a.pnf is the encrypted on-disk form of the main Stuxnet DLL, a hypothesis that’s confirmed by antimalware researchers.

    The final file, Oem6c.pnf, is not referenced at all in the boot trace. The only accesses in the infection trace are writes from the initial Lsass.exe process that also writes the other files. Thus, this file is written during the initial infection, but apparently never read. There are several potential explanations for this behavior. One is that the file might be read under specific circumstances that I haven’t reproduced in my test environment. Another is that it is a log file that records information about the infection for collection and review by Stuxnet developers at a later point. It’s not possible to tell from the traces, but antimalware researchers believe that it is a log file.

    Windows 7 Elevation of Privilege

    Many of the operations performed by Stuxnet, including the infection of system processes like Services.exe and the installation of device drivers, require administrative rights. If Stuxnet failed to infect systems with users lacking those rights, its ability to spread would have been severely hampered, especially into the sensitive networks it seems to have been targeting where most users likely run with standard user rights. To gain administrative rights from standard-user accounts, Stuxnet took advantage of two zero-day vulnerabilities.

    On Windows XP and Windows 2000, Stuxnet used an index-checking bug in Win32k.sys that could be triggered by loading specially crafted keyboard layout files (fixed in MS10-073). The bug allowed Stuxnet to inject code into kernel mode and run with kernel privileges. On Windows Vista and newer, Stuxnet used a flaw in the access protection of scheduled task files that enabled it to give itself administrative rights (fixed in MS10-092). Standard users can create scheduled tasks, but those tasks should only be able to run with the same privileges as the user that created them. Before the bug was fixed, Windows would create the file storing a task with permissions that allowed standard users to modify the file. Stuxnet took advantage of the hole by creating a new task, setting the flag in the resulting task file that specifies that the task should run in the System account, which has full administrative rights, and then launching the task.
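    The essence of the flaw can be sketched with ordinary file permissions. This is a simplified model, not the actual Windows ACL or task-file format: the .job file name, the "RunAs=SYSTEM" string, and POSIX-style modes stand in for the real task XML and its security descriptor.

    ```python
    import os
    import stat
    import tempfile

    def world_writable(path):
        """True if any user on the machine can modify the file."""
        return bool(os.stat(path).st_mode & stat.S_IWOTH)

    # Model of the pre-MS10-092 bug: the Task Scheduler creates the task file
    # with a protection loose enough for the standard user to edit afterward.
    fd, task_file = tempfile.mkstemp(suffix=".job")
    os.close(fd)
    os.chmod(task_file, 0o666)        # the overly permissive protection
    flawed = world_writable(task_file)

    with open(task_file, "w") as f:
        f.write("RunAs=SYSTEM")       # attacker flips the run-as account flag

    with open(task_file) as f:
        content = f.read()
    os.remove(task_file)
    ```

    The fix was to stop trusting the task file's contents for the privilege decision, so even a writable file could no longer grant elevation.
    
    
    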

    To watch Stuxnet exploiting the Windows 7 bug, I started by uninstalling the related patch on a test system and monitored a Stuxnet infection with Process Monitor. After capturing the trace, I followed the same steps I described in the last post of setting a filter that discarded all operations except those that modify files and registry keys (“Category Is Write”), and then methodically excluding unrelated events. When I was finished the Process Monitor window looked like this:

    image

    The first events are Stuxnet dropping the temporary files that it later copies to PNF files in the C:\Windows\Inf directory. Those are followed by Svchost.exe events that are clearly related to the Task Scheduler service. The Svchost.exe process creates a new scheduled task file in C:\Windows\System32\Tasks and then sets some related registry values. Stack traces of the events show that Schedsvc.dll, the DLL that implements the Task Scheduler service, is responsible:

    image

    A few operations later, Explorer writes some data to the new task file:

    image

    This is the operation that shouldn’t be possible, since a standard user account should not be able to manipulate a system file. We saw in the last post that the <unknown> frames in the stack of the operation show that Stuxnet is at work:

    image

    The final operations in the trace associated with the task file are those of the Task Scheduler deleting the file, so Stuxnet apparently modifies the task, launches it, and then deletes it:

    image

    To verify that the Task Scheduler in fact launches the task, I removed the write filter and applied another filter that included only references to the task file. That made an event appear in the display that shows Svchost.exe read the file after Stuxnet wrote to the file:

    image

    As a final confirmation, I looked at the operation’s stack and saw the Task Scheduler service’s SchRpcEnableTask function, whose name implies that it’s related to task activation:

    image

    Stuxnet Revealed by the Sysinternals Tools

    In this concluding segment of my Stuxnet investigation, I was able to use Process Monitor’s boot logging feature to gather clues pointing to the purpose of the various files Stuxnet drops on a system at the time of infection. Process Monitor also revealed the method by which Stuxnet used a flaw in the Task Scheduler service on Windows 7 to give itself administrative rights.

    This blog post series shows how the Sysinternals tools can provide an overview of malware infection and subsequent operation, as well as present a guide for cleaning an infection. They showed many of the key aspects of Stuxnet’s behavior with relative ease, including the launching of processes, dropping of files, installation of device drivers and elevation of privilege via the task scheduler. As I pointed out at the beginning of Part 1, a professional security researcher’s job would be far from done at this point, but the view given by the tools provides an accurate sketch of Stuxnet’s operation and a framework for further analysis. Static analysis alone would make gaining this level of comprehension virtually impossible, certainly within the half hour or so it took me using the Sysinternals tools.

    Mark Russinovich is a Technical Fellow on the Windows Azure team at Microsoft and is author of Windows Internals, The Windows Sysinternals Administrator’s Reference, and the cyberthriller Zero Day: A Novel. You can contact him at markruss@microsoft.com.

  • The Zero Day Book Trailer

    I just got back the finished version of the video trailer for my new cyber thriller Zero Day, which I think came out awesome! It’s not hard to imagine what a Zero Day movie trailer would look like. Let me know what you think.

    Zero Day Book Trailer
  • Analyzing a Stuxnet Infection with the Sysinternals Tools, Part 2

    In Part 1 I began my investigation of an example infection of the infamous Stuxnet worm with the Sysinternals tools. I used Process Explorer, Autoruns and VMMap for a post-infection survey of the system. Autoruns quickly revealed the heart of Stuxnet, two device drivers named Mrxcls.sys and Mrxnet.sys, and it turned out that disabling those drivers and rebooting is all that’s necessary to disable Stuxnet (barring a reinfection). With Process Explorer and VMMap we saw that Stuxnet injected code into various system processes and created processes running system executables to serve as additional hosts for its payload. By the end of the post I had gotten as far as I could with a snapshot-based view of the infection, however. In this post I continue the investigation by analyzing the Process Monitor log I captured during the infection to gain deeper insight into Stuxnet’s impact on an infected system and how it operates (incidentally, if you like these blog posts, cybersecurity, and books by Tom Clancy and Michael Crichton, be sure to check out my new cyberthriller, Zero Day).

    Filtering to Find Relevant Events

    Process Monitor captured around 30,000 events while monitoring the infection, which is an overwhelming number to inspect individually for clues. Most of the trace actually consists of background Windows activity and operations related to Explorer navigating to a new folder, none of which is directly related to the infection. Even though Process Monitor by default excludes advanced events (paging file, internal IRP functions, System process, and NTFS metadata operations), the status bar indicates that it is still showing over 10,000 events:

    image

    The key to using Process Monitor effectively when you don’t know exactly what you’re looking for is to narrow the amount of data to something manageable. Filters are a powerful way to do that, and Process Monitor has a filter tailor-made for these kinds of scenarios: one that excludes all events except those that modify files or registry keys. You can configure this filter, “Category is Write then Include,” using the Filter dialog:

    SNAGHTML32d6f75

    Events generated by the System process are typically not relevant in troubleshooting cases, but I know that Stuxnet has kernel-mode components, so to be thorough I had to include events executed in the context of the System process, which is the process in which some device drivers execute system threads. You can remove the default filters by checking the Enable Advanced Output option on the filter menu, but I didn’t want to remove the other default filters that omit pagefile and NTFS metadata operations, so I removed just the System exclusion filter (the second one in the above filter list). The event count was down to 600:

    image

    The next step was to exclude events I knew weren’t related to the infection. Recognizing irrelevant events takes experience because it requires familiarity with typical Windows activity. For example, the first few hundred events of the remaining operations consisted of Explorer referencing values under the HKCU\Software\Microsoft\Windows\ShellNoRoam\BagsMRU registry key:

    image

    This key is where Explorer stores state for its windows, so I could exclude them. I did so by using Process Monitor’s “quick filters” feature: I right-clicked on one of the registry paths to bring up the quick filter context menu, and selected the Exclude filter:

    image

    Because I wanted to exclude any references to the key’s subkeys or values, I opened the newly created filter in the Filter dialog, double-clicked on it to move it to the filter editor, and changed “is” to “begins with”:

    image

    That reduced the event count to 450, which is a more reasonable number, but I saw still more events that I could exclude. The next set of events were the System process reading and writing registry hive files. Hive files store registry data, but it’s the registry operations themselves that are interesting, not the underlying reads and writes to the hive files. Excluding those reduced the event count to 350. I continued looking through the log, adding additional filters to exclude other extraneous events. After I was done filtering out all the background operations, the Filter dialog looked like this (some of the filters I added aren’t visible in the screenshot):

    image

    Now there were only 133 events and a quick glance through them confirmed that they were all probably related to Stuxnet. It was time to start deciphering them.
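    The filtering logic described above, a category-based include plus path-prefix excludes, can be sketched as a simple predicate over (operation, path) pairs. The WRITE_OPS set below is an illustrative subset of the operations Process Monitor classifies as writes, not its actual category table:

    ```python
    # Illustrative subset of operations in Process Monitor's "Write" category
    WRITE_OPS = {"WriteFile", "CreateFile", "RegSetValue",
                 "RegCreateKey", "RegDeleteValue"}

    def passes_filter(event, exclude_prefixes):
        operation, path = event
        if operation not in WRITE_OPS:      # "Category is Write then Include"
            return False
        # "Path begins with ... then Exclude" quick filters
        return not any(path.lower().startswith(p.lower())
                       for p in exclude_prefixes)

    events = [
        ("RegQueryValue", r"HKLM\SOFTWARE\Microsoft\Windows NT"),
        ("RegSetValue",   r"HKCU\Software\Microsoft\Windows\ShellNoRoam\BagsMRU\0"),
        ("WriteFile",     r"C:\WINDOWS\system32\Drivers\mrxcls.sys"),
    ]
    excluded = [r"HKCU\Software\Microsoft\Windows\ShellNoRoam"]
    kept = [e for e in events if passes_filter(e, excluded)]
    ```

    Only the driver write survives: the read is dropped by the category filter and the Explorer window-state write by the prefix exclusion, which is exactly how the trace shrank from 10,000-plus events to a few hundred.
    
    
    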

    Stuxnet System Modifications

    The first event in the remaining list shows Stuxnet, operating in the context of Explorer, apparently overwriting the first 4K of one of its two initial temporary files:

    image

    To verify that the write was indeed initiated by Stuxnet and not Explorer.exe, I double-clicked on the operation to open the Event Properties dialog and switched to the Stack page. The stack frame directly above the NtWriteFile API shows “<unknown>” as the Module name, which is Process Monitor’s indication that the stack address doesn’t lie in any of the DLLs loaded into the process:

    image

    If you are looking at stacks with third-party code you may also see <unknown> entries when the code doesn’t use standard calling conventions, because that interferes with the algorithm used by the stack tracing API on which Process Monitor relies. However, when I looked at Explorer’s address space with VMMap, I found a data region containing the unknown stack address 0x2FA24D5 that has both write and execution permissions, a telltale sign of virus-injected code:

    image
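    Both checks performed here, flagging writable-and-executable regions and mapping a raw stack address back to the region that contains it, are mechanical enough to sketch. The Region class and thresholds below are my simplified stand-ins for VMMap's view of an address space, not its real data model:

    ```python
    from dataclasses import dataclass

    @dataclass
    class Region:
        base: int
        size: int
        protect: str  # simplified protection string, e.g. "RW", "RX", "RWX"

    def injection_suspects(regions):
        # Regions that are both writable and executable are the telltale sign:
        # legitimately loaded images are mapped read/execute only.
        return [r for r in regions if "W" in r.protect and "X" in r.protect]

    def region_containing(regions, address):
        """Map a raw stack address (like the <unknown> 0x2FA24D5) to its region."""
        return next((r for r in regions
                     if r.base <= address < r.base + r.size), None)

    regions = [
        Region(0x00400000, 0x00100000, "RX"),   # a normally mapped image
        Region(0x02FA0000, 0x00010000, "RWX"),  # the suspicious data region
    ]
    ```
    
    
    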

    The operations following Explorer.exe’s are those of an Lsass.exe process creating four files - ~Dfa.tmp, ~Dfb.tmp, ~Dfc.tmp and ~Dfd.tmp - in the account’s temporary directory. Many components in Windows create temporary files, so I had to verify that these were related to Stuxnet and not to standard Windows activity. A strong hint that Stuxnet was behind them is the fact that the process ID (PID) of the Lsass.exe process, 300, doesn’t match the PID of the system’s actual Lsass.exe process, which I identified in Part 1. In fact, the PID doesn’t match any of the three Lsass.exe processes that were running after the infection, confirming that it’s another rogue Lsass.exe process launched by Stuxnet.

    To see how this Lsass.exe process relates to the others, I typed Ctrl+T to open the Process Monitor process treeview dialog (it can also be opened from the Tools menu). The process tree reveals that three additional Lsass.exe processes executed during the infection, including the one with a PID of 300. Their greyed icons in the treeview indicate that they exited before the Process Monitor capture stopped:

    image

    I now knew that this was a rogue Lsass.exe process, but I had to verify that these temporary files weren’t just created by routine Lsass.exe activity. Again, I looked at their stacks and saw the <unknown> module marker like I had seen in the Explorer.exe operation’s stack.

    The next batch of entries in the trace are where things really get interesting, because we see Lsass.exe drop one of the two Stuxnet drivers, MRxCls.sys, in C:\Windows\System32\Drivers and create its corresponding registry keys:

    image

    I double-clicked on the WriteFile operation to see its stack and observed that the call to the CopyFileEx API meant that Stuxnet copied the driver’s contents from another file:

    image

    To see the file that served as the source of the copy, I temporarily disabled the write category exclusion filter by unchecking it in the filter dialog:

    image

    That revealed references to the ~DFD.tmp file that was created earlier, so I knew that file contained a copy of the driver:

    image

    A few operations later the System process loads Mrxcls.sys, activating the driver:

    image

    Next, Stuxnet prepares and loads its second driver, Mrxnet.sys. The trace shows Stuxnet writing the driver first to ~DFE.tmp, copying that file to the destination Mrxnet.sys file, and defining the Mrxnet.sys registry values:

    image

    A few operations later the System process loads the driver like it loaded Mrxcls.sys.

    The final modifications made by the virus include the creation of four additional files in the C:\Windows\Inf directory: Oem7a.pnf, Mdmeric3.pnf, Mdmcpq3.pnf and Oem6c.pnf.  The file creations are visible together after I set a filter that includes only CreateFile operations:

    image

    PNF files are precompiled INF files, and INF files are device driver installation information files. The C:\Windows\Inf directory stores a cache of these files and usually has a PNF file for each INF file. Unlike the other PNF files in the directory, Stuxnet’s PNF files have no matching INF files, but their names make them blend in with the other files in that directory. As with the operations writing the driver files, the stacks of these operations also have references to CopyFileEx, and disabling the write-exclusion filter shows that their source files are also the temporary files Stuxnet initially created. Here you can see Stuxnet copying ~Dfa.tmp to Oem7a.pnf:

    image
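
    Spotting PNF files that lack a matching INF is simple to automate. A minimal sketch, using only Python’s standard library; the helper name orphan_pnf_files is hypothetical:

```python
from pathlib import Path

def orphan_pnf_files(inf_dir):
    """Return .pnf files in inf_dir that have no .inf file with the same
    base name. Windows normally keeps a precompiled .pnf alongside each
    cached .inf, so an orphan .pnf deserves a closer look."""
    inf_stems = {p.stem.lower() for p in Path(inf_dir).glob("*.inf")}
    return sorted(
        p.name for p in Path(inf_dir).glob("*.pnf")
        if p.stem.lower() not in inf_stems
    )
```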

    All of the writes to these files are performed by the Lsass.exe process with the exception of a few writes to Mdmcpq3.pnf by the infected Services.exe process:

    image

    When done with the copies, Stuxnet takes additional steps to make the files blend in by setting their timestamps to match those of other PNF files in the directory, which on the sample system is November 4, 2009. The SetBasicInformationFile operation here sets the creation time on Oem7a.pnf:

    image
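
    The timestamp-cloning trick itself is trivial to express. A sketch under the assumption of a portable API: os.utime copies the access and modification times, though only a Windows-specific call like the SetBasicInformationFile operation shown above can also set the creation time:

```python
import os

def clone_timestamps(source, target):
    """Copy access and modification times from source onto target, the
    same blend-in technique Stuxnet applies to its PNF files. Creation
    time is Windows-specific and isn't settable through os.utime."""
    st = os.stat(source)
    os.utime(target, (st.st_atime, st.st_mtime))
```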

    Once Stuxnet has set the timestamps, it cleans up after itself by marking the temporary files it created for deletion when it closes them (the operations deleting the other temporary files are in other parts of the trace):

    image

    It’s odd that Stuxnet writes temporary files and then makes copies of them, but it doesn’t appear to be a significant aspect of its execution since no Stuxnet research summary even mentions the temporary files.

    One operation in the trace that I can’t account for, and for which I’ve seen no explanation in any of the published Stuxnet analyses, is an attempt to delete a registry value named HKLM\System\CurrentControlSet\Services\Network\FailoverConfig:

    image

    That registry value and even the Network key referenced are not used by Windows or any component I could find. A search of the executables under the C:\Windows directory didn’t yield any hits. Perhaps Stuxnet creates the value under certain circumstances as a marker and this code automatically runs to delete it.

    Next Steps

    So far, our analysis of the Stuxnet infection with several Sysinternals tools has documented Stuxnet’s system impact at the time of infection and its method of reactivation at subsequent boots, and has provided a complete recipe for disabling and cleaning Stuxnet off a compromised system. In Part 3 I’ll wrap up my look at Stuxnet with the Sysinternals tools by examining how Stuxnet uses each of the four PNF files it created in order to gain some idea as to their purpose. I’ll also analyze a trace of a Windows 7 Stuxnet infection to show the method by which Stuxnet took advantage of a zero-day vulnerability on Windows 7 (which has since been patched) to gain administrative rights when it was first activated with standard user rights. Continued in Part 3.

    Mark Russinovich is a Technical Fellow on the Windows Azure team at Microsoft and is author of Windows Internals, The Windows Sysinternals Administrator’s Reference, and the cyberthriller Zero Day: A Novel. You can contact him at markruss@microsoft.com.

  • Analyzing a Stuxnet Infection with the Sysinternals Tools, Part 1

    Though I didn’t realize what I was seeing, Stuxnet first came to my attention on July 5 last summer when I received an email from a programmer that included a driver file, Mrxnet.sys, that they had identified as a rootkit. A driver that implements rootkit functionality is nothing particularly noteworthy, but what made this one extraordinary is that its version information identified it as a Microsoft driver and it had a valid digital signature issued by Realtek Semiconductor Corporation, a legitimate PC component manufacturer (while I appreciate the programmer entrusting the rootkit driver to me, the official way to submit malware to Microsoft is via the Malware Protection Center portal).

    I forwarded the file to the Microsoft antimalware and security research teams and our internal review into what became the Stuxnet saga began to unfold, quickly making the driver I had received one of the most infamous pieces of malware ever created. Over the course of the next several months, investigations revealed that Stuxnet made use of four “zero day” Windows vulnerabilities to spread and to gain administrator rights once on a computer (all of which were fixed shortly after they were revealed) and was signed with certificates stolen from Realtek and JMicron. Most interestingly, analysts discovered code that reprograms Siemens SCADA (Supervisory Control and Data Acquisition) systems used in some centrifuges, and many suspect Stuxnet was specifically designed to destroy the centrifuges used by Iran’s nuclear program to enrich uranium, a goal the Iranian government reported the virus at least partially accomplished.

    As a result, Stuxnet has been universally acknowledged as the most sophisticated piece of malware ever created. Because of its apparent motives and clues found in the code, some researchers believe that it’s the first known example of malware used for state-sponsored cyber warfare. Ironically, I present several examples of malware targeting infrastructure systems in my recently-published cyber-thriller Zero Day, scenarios that seemed a bit of a stretch when I wrote the book several years ago. Stuxnet has proven them to be much more plausible than I had thought (by the way, if you’ve read Zero Day, please leave a review on Amazon.com).

    Malware and the Sysinternals Tools

    My last several blog posts have documented cases of the Sysinternals tools being used to help clean malware infections, but malware researchers also commonly use the tools to analyze malware. Professional malware analysis is a rigorous and tedious process that requires disassembling malware to reverse engineer its operation, but systems monitoring tools like Sysinternals Process Monitor and Process Explorer can help analysts get an overall view of malware operation. They can also provide insight into malware’s purpose and help to identify points of execution and pieces of code that require deeper inspection. As the previous blog posts hint, those findings can also serve as a guide for creating malware cleaning recipes for inclusion in antimalware products.

    I therefore thought it would be interesting to show the insights the Sysinternals tools give when applied to the initial infection steps of the Stuxnet virus (note that no centrifuges were harmed in the writing of this blog post). I’ll show a full infection of a Windows XP system and then uncover the way the virus uses one of the zero-day vulnerabilities to elevate itself to administrative rights when run from an unprivileged account on Windows 7. Keep in mind that Stuxnet is an incredibly complex piece of malware. It propagates and communicates using multiple methods and performs different operations depending on the version of operating system infected and the software installed on the infected system. This look at Stuxnet just scratches the surface and is intended to show how with no special reverse engineering expertise, Sysinternals tools can reveal the system impact of a malware infection. See Symantec’s W32.Stuxnet Dossier for a great in-depth analysis of Stuxnet’s operation.

    The Stuxnet Infection Vector

    Stuxnet spread last summer primarily via USB keys, so I’ll start the infection with the virus installed on a key. The virus consists of six files: four malicious shortcut files with names based on “Copy of Shortcut to.lnk” and two files with names that make them look like common temporary files. I’ve used just one of the shortcut files for this analysis, since they all serve the same purpose:

    image

    In this infection vector, Stuxnet begins executing without user interaction by taking advantage of a zero-day vulnerability in the Windows Explorer Shell (Shell32.dll) shortcut parsing code. All the user has to do is open a directory containing the Stuxnet files in Explorer. To let the infection succeed, I first uninstalled the fix for the Shell flaw, KB2286198, that was pushed out by Windows Update in August 2010. When Explorer opens the shortcut file on an unpatched system to find the shortcut’s target file so that it can helpfully show the icon, Stuxnet infects the system and uses rootkit techniques to hide the files, causing them to disappear from view.

    Stuxnet on Windows XP

    Before triggering the infection, I started Process Monitor, Process Explorer and Autoruns. I configured Autoruns to perform a scan with the “Hide Microsoft and Windows Entries” and “Verify Code Signatures” options checked:

    image

    This removes any entries that have Microsoft or Windows digital signatures so that Autoruns shows only entries populated by third-party code, including code signed by other publishers. I saved the output of the scan so that I could have Autoruns compare against it later and highlight any entries added by Stuxnet. Similarly, I paused the Process Explorer display by pressing the space bar, which would enable me to refresh it after the infection and cause it to show any processes started by Stuxnet in the green background color Process Explorer uses for new processes. With Process Monitor capturing registry, file system, and DLL activity, I navigated to the USB key’s root directory, watched the temporary files vanish, waited a minute to give the virus time to complete its infection, stopped Process Monitor and refreshed both Autoruns and Process Explorer.

    After refreshing Autoruns, I used the Compare function in the File menu to compare the updated entries with the previously saved scan. Autoruns detected two new device driver registrations, Mrxnet.sys and Mrxcls.sys:

    image

    Mrxnet.sys is the driver that the programmer originally sent me and that implements the rootkit that hides files, and Mrxcls.sys is a second Stuxnet driver file that launches the malware when the system boots. Stuxnet’s authors could easily have extended Mrxnet’s cloak to hide these files from tools like Autoruns, but they apparently felt confident that the valid digital signatures from a well-known hardware company would cause anyone that noticed them to pass them over. It turns out that Autoruns has told us all we need to know to clean the infection, which is as easy as deleting or disabling the two driver entries.

    Turning my attention to Process Explorer, I also saw two green entries, both instances of the Local Security Authority Subsystem (Lsass.exe) process:

    image

    Note the instance of Lsass.exe immediately beneath them that’s highlighted in pink: a normal Windows XP installation has just one instance of Lsass.exe that the Winlogon process creates when the system boots (Wininit creates it on Windows Vista and higher). The process tree reveals that the two new Lsass.exe instances were both created by Services.exe (not visible in the screenshot), the Service Control Manager, which implies that Stuxnet somehow got its code into the Services.exe process.

    Process Explorer can also check the digital signatures on files, which you initiate by opening the process or DLL properties dialog and clicking on the Verify button, or by selecting the Verify Image Signatures option in the Options menu. Checking the rogue Lsass processes confirms that they are running the stock Lsass.exe image:

    image

    The two additional Lsass processes obviously have some mischievous purpose, but the main executable and command lines don’t reveal any clues. But besides running as children of Services.exe, another suspicious characteristic of the two superfluous processes is the fact that they have very few DLLs loaded, as shown by the Process Explorer DLL view:

    image

    The real Lsass has many more:

    image

    No non-Microsoft DLLs show up in the loaded-module lists for Services.exe, Lsass.exe or Explorer.exe, so they are probably hosting injected executable code. Studying the code would require advanced reverse engineering skills, but we might be able to determine where the code resides in those processes, and hence what someone with those skills would analyze, by using the Sysinternals VMMap utility. VMMap is a process memory analyzer that visually displays the address space usage of a process. To execute, code must be stored in memory regions that have Execute permission, and because injected code will likely be stored in memory that’s normally for data and therefore not usually executable, it might be possible to find the code just by looking for memory not backed by a DLL or executable that has Execute permission. If the region has Write permission, that makes it even more suspicious, because the injection would require Write permission and probably isn’t concerned with removing the permission once the code is in place. Sure enough, the legitimate Lsass has no executable data regions, but both new Lsass processes have regions with Execute and Write permissions in their address spaces at the same location and same size:

    image
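
    The W+X heuristic just described reduces to a simple filter over a process’s memory regions. This sketch uses made-up region tuples standing in for what VirtualQueryEx would report on Windows; the addresses are illustrative:

```python
def suspicious_regions(regions):
    """regions: (base, size, protection, backing) tuples. Flag regions
    that are both writable and executable yet not backed by an image
    (EXE/DLL) file - the telltale combination for injected code."""
    return [
        (base, size)
        for base, size, prot, backing in regions
        if "w" in prot and "x" in prot and backing != "image"
    ]

regions = [
    (0x00400000, 0x1000, "rx", "image"),      # normal code, file-backed
    (0x02F90000, 0x7A000, "rwx", "private"),  # injected: W+X, no backing
    (0x7FFD0000, 0x1000, "rw", "private"),    # normal writable data
]
print(suspicious_regions(regions))  # flags only the rwx private region
```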

    VMMap’s Strings dialog, which you open from the View menu, shows any printable strings in a selected region. The 488K region has the string “This program cannot be run in DOS mode” at its start, which is a standard message stored in the header of every Windows executable. That implies that the virus is not just injecting a code snippet, but an entire DLL:

    image

    The region is almost devoid of any other recognizable text, so it’s probably compressed, but the Windows API strings at the end of the region are from the DLL’s import table:

    image
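
    The DOS-stub message that VMMap surfaced is easy to test for programmatically. A hedged sketch: a mapped PE image begins with the “MZ” signature, and the stub message sits within the first few hundred bytes; the function name is hypothetical:

```python
DOS_STUB = b"This program cannot be run in DOS mode"

def looks_like_pe_image(region_bytes):
    """Heuristic check that a memory region holds a complete PE image
    rather than a bare code snippet: the 'MZ' DOS header signature at
    offset 0 plus the standard DOS-stub message near the top."""
    return region_bytes[:2] == b"MZ" and DOS_STUB in region_bytes[:512]
```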

    Explorer.exe, the initially infected process, and Services.exe, the process that launched the Lsass processes, likewise have no suspicious DLLs loaded, but they do have unusual executable data regions:

    image

    The two Mrx drivers are also visible in the loaded driver list, which you can see in the DLL view of Process Explorer for the System process. The only reason they stand out at all is that their version information reports them to be from Microsoft, but their signatures are from Realtek (the certificates have been revoked, but since the test system is disconnected from the Internet, it is unable to query the Certificate Revocation List servers):

    image

    Looking Deeper

    At this point we’ve gotten about as far as we can with Autoruns and Process Explorer. What we know so far is that Stuxnet drops two driver files on the system, registers them to start when the system boots, and starts them. It also infects Services.exe and creates two Lsass.exe processes that run until system shutdown, the purpose of which can’t be determined by their command-lines or loaded DLLs. However, VMMap has given us pointers to injected code and Autoruns has given us an easy way to clean the infection. The Process Monitor trace from the infection has about 30,000 events, and from that we’ll be able to gain further insight into what happens at the time of the infection, where the injected code is stored on disk, and how Stuxnet activates the code at boot time. Read more in Part 2.

    Mark Russinovich is a Technical Fellow on the Windows Azure team at Microsoft and is author of Windows Internals, The Windows Sysinternals Administrator’s Reference, and the cyberthriller Zero Day: A Novel. You can contact him at markruss@microsoft.com.

  • Zero Day is Here!

    image

    I’m excited to announce that my first novel, a cyber thriller entitled Zero Day, is now available at all major book retailers!

    Zero Day is a book in the style of Crichton and Clancy, weaving technical fact into the story. If you like the Sysinternals tools or the articles I post on this blog, are interested in computer security, or just enjoy a heart-stopping thriller, you’ll like Zero Day. You can read a synopsis and a sample chapter, as well as find pointers to on-line book sellers, at the Zero Day web site.

    I’m really pleased by the initial reviews, which have been very positive. Here is just a sampling:

    "Zero Day is an addictive read that will stay with you for a long time to come. It is a MUST READ!"
    http://yougottareadreviews.blogspot.com/2011/01/review-zero-day-by-mark-russinovich.html

    "If you aren't a computer geek, some of the lingo and explanations are going to pass right by you; but there's enough information and ever-developing, terrifying plot developments to keep you riveted to every page."
    http://crystalbookreviews.blogspot.com/2011/01/zero-day-novel-by-mark-russinovich.html

    "The entertaining story line is linear yet exhilarating and frightening especially since author Mark Russinovich is an expert on the topic as his résumé brings a scary possibility to the cyber attack that the thriller focuses on."
    http://genregoroundreviews.blogspot.com/2011/01/zero-day-mark-russinovich.html

    "The novel is more plot than characters, but it is a very frightening, fast moving narrative that reveals how interconnected we all are through the internet."
    http://bookgarden.blogspot.com/2011/01/zero-day-by-mark-russinovich.html

    You can read all the reviews I’ve collected so far on the Praise for Zero Day page at the Zero Day Web site.

    If you’re curious about the novel publishing process, you can read my three-part blog post describing my experience with Zero Day, from the initial idea, to finding an agent, signing a publisher, and final publication: The Road to Zero Day

    Buy the book, leave a review on Amazon.com, follow me on Twitter, meet me at a book signing, and send a note sharing your thoughts to markrussinovich@hotmail.com.

    I hope you enjoy the book and look forward to hearing from you!

  • The Case of the Unusable System

    This post continues in the malware hunting theme of the last couple of posts as Zero Day availability draws near (it’s available tomorrow!). It began when a friend of mine at Microsoft told me that a neighbor of hers had a laptop that malware had rendered unusable and asked if as a favor I’d be willing to take a look. Her friend was desperate because she had important files, including documents and pictures, on the laptop and had no backup.

    Unlike most people in the computer industry who view the requests of friends and family for troubleshooting help as a burden to be avoided, I embrace the challenge. When fixing a system or application problem, it’s me against the computer and success is satisfying and always a learning experience. But that success also has an academic feel. With malware, it becomes personal, pitting me against the minds of criminal hackers. Defeating malware is a victory of good over evil. I should print a t-shirt that says “Yes, I will fix your computer!”. I immediately agreed and we made arrangements to get the laptop dropped off at my office.

    When I had a few free minutes the next day I powered on the laptop, logged in, and within seconds was greeted with a torrent of warning dialogs announcing that the computer was infested with malware and that it was under attack from the Internet:

    image

    I also saw a barrage of warnings that various applications had been stopped from launching because they were infected:

    image

    I hadn’t seen scareware this aggressive. After a minute the appearance of new warnings ceased and I began my investigation. Starting with the insertion of a USB key containing the Sysinternals tools, I tried launching Process Explorer. However, I found that trying to run anything - whether part of Windows or third-party - resulted not in the execution of the application, but in the display of the same “Security Warning” dialog reporting that the application was infected. This system was truly unusable.

    The infected account was the only one configured, so that ruled out trying to clean from a different account in the hope that it might not be infected. I was afraid that cleaning the malware might require off-line access to the system via a boot CD installed with the Microsoft Diagnostic and Repair Toolset (the Microsoft product that’s the descendant of ERD Commander, the product I created at Winternals Software). My MSDaRT CD was at home and I’d have to burn a new one. But I had noticed when I logged on that it was 5-10 seconds before the first popups started appearing. If the malware didn’t block running applications during that time window, either because it was initializing or just letting the first few logon applications run so that Explorer could fully start, I might be able to sneak Process Explorer and Autoruns in before the lockdown. That would save me the time and trouble of burning a CD. It was worth a try.

    Before logging off, I copied Process Explorer and Autoruns to the desktop for easy access. I logged on and double-clicked the icons in quick succession. There was a short pause and then both applications appeared. It had worked! I had to wait for the avalanche of warning dialogs to stop and then turned my attention to Process Explorer. Sure enough, one process stood out, hgobsysguard.exe:

    image

    I explain the common characteristics of malware in my Advanced Malware Cleaning presentation and this sample had all the telltale signs:

    • Random or unusual name: hgobsysguard.exe seems like it might be legitimate, but I had never seen or heard of it and the name revealed nothing of its purpose or origin
    • No company name or description: legitimate software almost always includes a company name and description in the version resource of their executables. Malware often omits this since most users never run tools that show this information.
    • Installed somewhere other than the \Program Files directory: you should add software not installed in the \Program Files directory to the list of suspects for closer inspection. In this case the executable was installed in the user’s profile, another sign of malware.
    • Encrypted or compressed: In order to avoid detection by antivirus software and make analysis more difficult, malware authors often encrypt their executables. Process Explorer uses heuristics to try to identify encrypted executables, which it refers to as “packed”, and it highlights them in purple like it did for this one.
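
    Taken together, these telltales make a quick triage checklist. A sketch, assuming a simple dictionary of process attributes; the field names are illustrative, not a Process Explorer API:

```python
def malware_telltales(proc):
    """Return the list of telltale signs a process exhibits."""
    signs = []
    if not proc.get("company") and not proc.get("description"):
        signs.append("no company name or description")
    path = proc.get("path", "").lower()
    if path and "\\program files" not in path and "\\windows" not in path:
        signs.append("installed outside Program Files")
    if proc.get("packed"):
        signs.append("packed/encrypted image")
    return signs

suspect = {
    "name": "hgobsysguard.exe",
    "path": r"C:\Users\victim\AppData\Local\hgobsysguard.exe",
    "company": "", "description": "", "packed": True,
}
print(malware_telltales(suspect))  # all three checks fire
```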

    I carefully studied the other running executables, including the services running within the various Svchost.exe hosting processes, but I didn’t see anything else that looked suspicious. Sometimes malware employs the “buddy system”, where it uses two processes, each watching the other so that if either terminates, the other restarts it, making it virtually impossible to terminate them. When I see that I use Process Explorer’s suspend feature to put both to sleep and then kill them (which is also arguably more humane). Here all I had was one malicious process, so I just terminated it. It didn’t reappear, which was a good sign that there wasn’t a buddy lurking within another process as a DLL. I then navigated to the malware’s install directory and deleted its files.

    With the process and executables out of the way, the next step was to determine how the malware activated and delete its autostart entries. I switched to Autoruns, which had finished its scan in the meantime, and spotted two entries pointing at the malware’s executable. Both entries had names that appeared to have been randomly generated, consistent with typical malware:

    image

    I deleted the entries, studied the rest in case there was some other component that wasn’t so obvious, and did some standard crapware cleanup while I was there. I rebooted the system and logged back on to confirm it was clean. This time there were no popups, I was able to run software as normal, and neither Process Explorer nor Autoruns showed any sign of more infection. I had spent a total of five minutes and had some fun outwitting the malware to avoid offline cleaning. Case closed.

  • The Case of the Sysinternals-Blocking Malware

    Continuing the theme of focusing on malware-related cases (last week I posted The Case of the Malicious Autostart) as a lead up to the publication on March 15 of my novel Zero Day, this post describes one submitted to me by a user that took a unique approach to cleaning an infection when faced with the apparent inability to run Sysinternals utilities.

    More and more often, malware authors target antivirus products and Sysinternals utilities in an effort to maintain their grip on a conquered system. This case began when the user’s friend asked if he’d take a look at his computer, which had begun taking an unusually long time to boot and log on. The friend, already suspecting that malware might be the cause, had tried to run a Microsoft Security Essentials (MSE) scan, but the scan would never complete. He also hadn’t spotted anything in Task Manager.

    The user, familiar with Sysinternals, tried following the malware cleaning recipe I presented in my Advanced Malware Cleaning presentation. However, double-clicking on Process Explorer resulted in a brief flash of the Process Explorer UI followed by the termination of the Process Explorer process. He turned to Autoruns next, but the result was the same. Process Monitor behaved the same way, and at this point he became convinced the malware was responsible.

    Malware can use numerous techniques to identify software that it wants to disable. For example, it can use the hash of the software’s executables, look for specific text in the executable images, or scan process memory for keywords. The fact that any small unique attribute is all that’s needed is the reason I haven’t bothered implementing mechanisms aimed at preventing identification. It’s a game I can’t win so I leave it to the ingenuity of the user to figure out a workaround. If the malware is simply keying off the names of executables, for instance, the user could simply rename the tools.
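
    Hash-based identification is the case where renaming doesn’t help: a hash covers only the file’s contents, so the name is irrelevant. A sketch using Python’s standard hashlib module:

```python
import hashlib

def sha256_of(path):
    """Hash a file's contents in chunks. A rename leaves this value
    unchanged, which is why renaming a tool defeats only name-based
    blocking, not hash-based blocking."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

    Changing even one byte of the executable produces a completely different digest, so defeating a hash check requires modifying the file itself, not just renaming it.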

    What makes this case somewhat ironic is that malware authors have long used various Sysinternals tools themselves. For example, the Clampi trojan, which spread in early 2009, used the Sysinternals PsExec utility to automatically spread. Coreflood, a virus that stole passwords in mid-2008, also used PsExec. More recently, Chinese hackers used Sysinternals tools to attack oil refineries. Malware authors even hijacked the Sysinternals brand by releasing a “scareware” product – malware that presents fake security dialogs to lure you into buying fake antimalware – named Sysinternals Antivirus:

    image

    Back to the case, the user, wondering if the malware was looking for particular processes or simply scanning for windows with certain keywords in their title bars, opened notepad, typed some text, and saved it to a file named “process explorer.txt”. Sure enough, when he double-clicked on the new text file, Notepad made a brief appearance before exiting.

    Locked out of his usual troubleshooting tools, he wondered if there might be some other Sysinternals utility that he could leverage, browsed to the Sysinternals utilities index and scanned the list. Just a few tools down, the Desktops utility caught his attention. Desktops lets you create up to three additional virtual desktops for running your applications and use hotkeys or the Desktops taskbar dialog to quickly switch between them. Maybe the malware would ignore windows on alternate desktops? He launched Desktops using its Sysinternals Live link (which lets you execute the utilities off the Web without even having to download them) and created a second desktop. Holding his breath, he double-clicked on the Process Explorer icon – and it launched!

    image

    This particular malware presumably has a timer-based routine that queries window title text and terminates processes that have titles with blocked keywords like “process explorer”, “autoruns”, “process monitor” and likely the names of other advanced malware-hunting tools and common antivirus products. Because a window enumeration only returns the windows on a process’s current desktop, the malware was not able to see the Sysinternals tools running on the second desktop.
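
    The presumed blocking logic amounts to a case-insensitive substring match over window titles, which the “process explorer.txt” Notepad experiment supports. A sketch; any keywords beyond the three tool names mentioned above are speculation:

```python
BLOCKED_KEYWORDS = ("process explorer", "autoruns", "process monitor")

def should_terminate(window_title):
    """Mimic the malware's presumed check: kill any process owning a
    window whose title contains a blocked keyword. Because window
    enumeration only sees the current desktop, windows on a second
    desktop escape this check entirely."""
    title = window_title.lower()
    return any(keyword in title for keyword in BLOCKED_KEYWORDS)

print(should_terminate("process explorer.txt - Notepad"))  # True
print(should_terminate("report.txt - Notepad"))            # False
```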

    He didn’t spot anything unusual in the Process Explorer process list, so he launched Process Monitor (I would have tried Autoruns next). He let Process Monitor capture a couple of minutes of activity and then began examining the trace. His eye was immediately drawn to thousands of Winlogon registry operations, something he normally didn’t observe when he ran Process Monitor. Guessing that it was related to the malware, he set a filter to just include Winlogon and took a closer look:

    image

    Most of the operations were registry queries of values under a key with a bizarre name, HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon Notify\acdcacaeaacbafbeaa. In order to start every time Windows boots, the malware appeared to have registered itself as a Winlogon notification DLL. Winlogon notification DLLs are commonly used by software that monitors logon, logoff and password change events, but are also often hijacked by malware. To confirm his suspicion and find the name of the DLL, he right-clicked on one of the entries and selected “Jump To” from the Process Monitor context menu. In response, Process Monitor executed Regedit and navigated to the referenced key:

    image

    The DLLName value pointed at the malicious DLL, which had the same name - probably randomly generated – as the registry key. He knew at this point that the malware was probably interfering with the MSE scan, but armed with the name of the DLL, he wondered if MSE might be able to clean that specific file. First he tried a full scan, weakly hoping that the malware wouldn’t detect the execution on the second desktop, but was unsuccessful. He then launched MSE again and navigated the file-scan dialog to the DLL. A couple of seconds later MSE completed the analysis and reported that it both knew the malware and was able to automatically clean it:

    image

    He pressed the recommended action button and MSE quickly dispatched the malware. As a final check, he rebooted the system. Sure enough, the system booted quickly and the logon was fast. He was able to run the Sysinternals tools on the main desktop, and Process Monitor’s trace was devoid of the malicious activity. With the help of the Sysinternals tools, he had vanquished the Sysinternals-blocking malware and successfully closed the case.

  • The Case of the Malicious Autostart

    Given that my novel, Zero Day, will be published in a few weeks and is based on malware’s use as a weapon by terrorists, I thought it appropriate to post a case that deals with malware cleanup with the Sysinternals tools. This one starts when Microsoft support got a call from a customer representing a large US hospital network reporting that they had been hit with an infestation of the Marioforever virus. They discovered the virus when their printers started getting barraged with giant print jobs of garbage text, causing their network to slow and the printers to run out of paper. Their antivirus software identified a file named Marioforever.exe in the %SystemRoot% folder of one of the machines spewing files to the printers as suspicious, but deleting the file just resulted in it reappearing at the subsequent reboot. Other antivirus programs failed to flag the file at all.

    The Microsoft support engineer assigned to the case started looking for clues by seeing if there were additional suspicious files in the %SystemRoot% directory of one of the infected systems. One file, a DLL named Nvrsma.dll, had a recent timestamp, and although it was named similarly to Nvidia display driver components, the computer in question didn’t have an Nvidia display adapter. When he tried to delete or rename the file, he got a sharing violation error, which meant that some process had the file open and was preventing others from opening it. There are several Sysinternals tools that will list the processes that have a file open or a DLL loaded, including Process Explorer and Handle. Because the file was a DLL, though, the engineer decided on the Sysinternals ListDLLs utility, which showed that the DLL was loaded by one process, Winlogon:

    image

    Winlogon is the core system process responsible for managing interactive logon sessions, and in this case was also the host for a malicious DLL. The next step was to determine how the DLL was configured to load into Winlogon. That had to be via an autostart location, so he ran the Autoruns utility, but there was no sign of Nvrsma.dll and all the autostart entries were either Windows components or legitimate third-party components. That appeared to be a dead end.

    If he could watch Winlogon’s startup with a file system and registry monitoring utility like Process Monitor, he might be able to determine the magic that got Winlogon to load Nvrsma.dll. Winlogon starts during the boot process, however, so he had to use Process Monitor’s boot logging feature. When you configure Process Monitor to log boot activity, it installs its driver so that the driver loads early in the boot process and begins monitoring, recording activity to a file named %SystemRoot%\Procmon.pmb. The driver stops logging data to the file either when someone launches the Process Monitor executable or when the system shuts down.

    After configuring Process Monitor to capture boot activity and rebooting the system, the engineer ran Process Monitor and loaded the boot log. He searched for “nvrsma” and found this query by Winlogon of the registry value HKLM\Software\Microsoft\Windows NT\CurrentVersion\bwpInit_DLLs that returned the string “nvrsma”:

    image

    The engineer had never seen a value named bwpInit_DLLs, but the name was strikingly similar to an autostart entry point he did know of, AppInit_DLLs. The AppInit_DLLs value is one that User32.dll, the main window manager DLL, reads when it loads into a process. User32.dll loads any DLLs referenced in the value, so any Windows application that has a user-interface (as opposed to being command-line oriented) loads the DLLs listed in the value. Sure enough, a few operations later in the trace he saw Winlogon load Nvrsma.dll:

    image

    Its power to cause a DLL to get loaded into virtually every process has made AppInit_DLLs a favorite of malware authors. In fact, it’s become such a nuisance that in Windows 7 the default policy requires that DLLs listed in the value be code-signed to be loaded.

    The boot trace had no reference to AppInit_DLLs, making it obvious that the malware had somehow coerced User32.dll into querying the alternate location. It also explained why the entry hadn’t shown up in the Autoruns scan. One question he had was why no other process had Nvrsma.dll loaded into it, but further into the trace he saw that an attempt to load the DLL by another process resulted in the same sharing violation error he’d encountered:

    image

    Simply loading a DLL won’t leave a handle open that would cause this kind of error, so he searched backward, looking for other CreateFile operations on the DLL that had no corresponding CloseFile operation. The last such operation before the sharing violation was performed by Winlogon:

    image
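    The matching he did by eye, pairing CreateFile operations with later CloseFile operations, can be sketched as a small Python routine. This is a deliberate simplification (real Process Monitor events carry handles, results, stacks, and much more), intended only to show the pairing logic:

```python
def last_unclosed_open(events, path):
    """events: chronological (process, operation, path) tuples from a
    simplified trace. Returns the process whose most recent CreateFile
    on `path` has no later matching CloseFile, or None if every open
    was paired with a close."""
    open_by = None
    for process, operation, event_path in events:
        if event_path != path:
            continue
        if operation == "CreateFile":
            open_by = process
        elif operation == "CloseFile" and process == open_by:
            open_by = None
    return open_by
```

In the trace above, Winlogon’s CreateFile was the last one left unpaired, pointing at the process holding the DLL open.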

    The stack of the operation, which he viewed by double-clicking on the operation to open the properties dialog and then clicking on the Stack tab, showed that it was Nvrsma.dll itself that opened the file, presumably to protect itself from being deleted and to prevent itself from loading into other processes:

    image

    Now he had to determine how User32.dll was compromised. User32.dll is one of the system “Known DLLs”, which means that as a performance optimization Windows creates a file mapping at boot time that can then be used by any process that loads the DLL. These known DLLs are listed in a registry key that Autoruns lists in the KnownDLLs tab, so the engineer went back to the Autoruns scan to take a closer look. The most effective way to spot potential malware when using Autoruns is to run it with the Verify Code Signatures option set, which has Autoruns check the digital signature of the images it finds. Upon closer inspection, the engineer noticed that User32.dll, unlike the rest of the Known DLLs, did not have a valid digital signature:

    image

    The compromised User32.dll behaved almost identically to the genuine one (otherwise applications with user interfaces would fail), but it was evidently different enough to query the alternate registry location. To verify this, he ran the Sysinternals Sigcheck utility on the tweaked copy and on one from a different, uninfected system running the same release of Windows. A side-by-side comparison of the output, which includes MD5, SHA-1 and SHA-256 cryptographic hashes of the file, confirmed they were different:

    image
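    The hash comparison Sigcheck enables can be reproduced with Python’s standard hashlib module. This sketch computes the same MD5, SHA-1 and SHA-256 digests, so running it against the two copies of User32.dll and comparing the results would reveal the same mismatch:

```python
import hashlib

def file_hashes(path):
    """Compute the MD5, SHA-1 and SHA-256 hashes that Sigcheck reports,
    reading in chunks so large binaries don't need to fit in memory."""
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Comparing `file_hashes(suspect_copy) == file_hashes(clean_copy)` answers the “are these really the same file?” question in one expression.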

    As a final check to make sure that the difference was indeed responsible for the changed behavior, the engineer decided to scan the strings in the DLL. Any registry keys and values, as well as file names, used by an executable are stored in the executable’s image file and are visible to a string-scanning tool. He tried the Sysinternals Strings utility, but the sharing violation error prevented Strings from opening the compromised User32.dll, so he turned to Process Explorer. When you open the DLL view for a process and open the properties of a DLL, Process Explorer shows the printable strings on the Strings tab. The results, which revealed the modified AppInit_DLLs string, validated his theory:

    image    image
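    What a strings scanner does is straightforward to sketch. This minimal Python version reports runs of printable ASCII of a minimum length, the same basic technique the Sysinternals Strings utility and Process Explorer’s Strings tab use (the real tools also find UTF-16 strings):

```python
import re

def ascii_strings(data, min_len=4):
    """Report runs of at least min_len printable ASCII characters in
    raw binary data -- the core technique of a strings scanner."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.group().decode("ascii")
            for match in re.finditer(pattern, data)]
```

Scanning the compromised DLL’s bytes with something like this would surface the modified registry value name among the printable strings.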

    With the knowledge of exactly how the malware’s primary DLL activated, the engineer set out to clean the malware off the system. Because User32.dll would be locked by the malware whenever Windows was running (otherwise he could simply have renamed the file and replaced it, which is what the malware itself had done), he booted the Windows Preinstallation Environment (WinPE) off a CD-ROM and from there copied a clean User32.dll over the malicious version. Then he deleted the associated malware files he’d discovered in his investigation. When finished, he rebooted the system and verified that it was clean. He closed the case by giving the hospital network administrators the cleaning steps he’d followed and submitted the malware to the Microsoft antimalware team so that they could incorporate automated cleaning into Forefront and the Malicious Software Removal Tool. He’d solved a seemingly impossible case by applying several Sysinternals utilities and helped the hospital get back to normal operation.

  • The Cases of the Blue Screens: Finding Clues in a Crash Dump and on the Web

    image

    My last couple of posts have looked at the lighter side of blue screens by showing you how to customize their colors. Windows kernel-mode code reliability has gotten better with every release, to the point that many users never experience the infamous BSOD. But if you have had one (one that you didn’t purposefully trigger with Notmyfault, that is), as I explain in my Case of the Unexplained presentations, spending a few minutes to investigate might save you the inconvenience and possible data loss caused by future occurrences of the same crash. In this post I first review the basics of crash dump analysis. In many cases, this simple analysis leads to a buggy driver for which there’s a newer version available on the web, but sometimes the analysis is ambiguous. I’ll share two examples administrators sent me where a Web search with the right key words led them to a solution.

    Debugging a crash starts with downloading the Debugging Tools for Windows package (part of the Windows SDK – note that you can do a web install of just the Debugging Tools instead of downloading and installing the entire SDK), installing it, and configuring it to point at the Microsoft symbol server so that the debugger can download the symbols for the kernel, which are required for it to be able to interpret the dump information. You do that by opening the symbol configuration dialog under the File menu and entering the symbol server URL along with the name of a directory on your system where you’d like the debugger to cache symbol files it downloads:

    image

    The next step is loading the crash dump into the debugger with the Open Crash Dump entry in the File menu. Where Windows saves dump files depends on what version of Windows you’re running and whether it’s a client or server edition. There’s a simple rule of thumb you can follow that will lead you to the dump file regardless, though: first check for a file named Memory.dmp in the %SystemRoot% directory (typically C:\Windows); if you don’t find it, look in the %SystemRoot%\Minidump directory and load the newest minidump file (assuming you want to debug the latest crash).
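    The rule of thumb is mechanical enough to sketch in Python. The function below is my own illustration of the lookup order, not part of any Microsoft tool:

```python
import os
from pathlib import Path

def find_latest_dump(system_root=None):
    r"""Rule of thumb from the post: prefer %SystemRoot%\Memory.dmp
    (a full or kernel dump); otherwise fall back to the newest file
    in %SystemRoot%\Minidump."""
    root = Path(system_root or os.environ.get("SystemRoot", r"C:\Windows"))
    full_dump = root / "Memory.dmp"
    if full_dump.exists():
        return full_dump
    minidumps = sorted((root / "Minidump").glob("*.dmp"),
                       key=lambda p: p.stat().st_mtime, reverse=True)
    return minidumps[0] if minidumps else None
```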

    When you load a dump file into the debugger, the debugger uses heuristics to try and determine the cause of the crash. It points you at the suspect by printing a line that says “Probably caused by:” followed by the name of the driver, Windows component, or type of hardware issue. Here’s an example that correctly identifies the problematic driver responsible for the crash, myfault.sys:

    image

    In my talks, I also show you that clicking on the !analyze -v hyperlink will dump more information, including the kernel stack of the thread that was executing when the crash occurred. That’s often useful when the heuristics fail to pinpoint a cause, because you might see a reference to a third-party driver that, by being active around the site of the crash, might be the guilty party. Checking for a newer version of any third-party drivers displayed in this basic analysis often leads to a fix. I documented a troubleshooting case that followed this pattern in a previous blog post, The Case of the Crashed Phone Call.

    When you don’t find any clues, perform a Web search with the textual description of the crash code (reported by the !analyze -v command) and any key words that describe the machine or software you think might be involved. For example, one administrator was experiencing intermittent crashes across a Citrix server farm. He didn’t realize he could even look at a crash dump file until he saw a Case of the Unexplained presentation. After returning to his office from the conference, he opened dumps from several of the affected systems.  Analysis of the dumps yielded the same generic conclusion in every case, that a driver had not released kernel memory related to remote user logons (sessions) when it was supposed to:

    image

    Hoping that a Web search might offer a hint and not having anything to lose, he entered “session_has_valid_pool_on_exit and citrix” in the browser search box. To his amazement, the very first result was a Citrix Knowledge Base fix for the exact problem he was seeing, and the article even displayed the same debugger output he was seeing:

    image

    After downloading and installing the fix, the server farm was crash-free.

    In another example, an administrator saw a server crash three times within several days. Unfortunately, the analysis didn’t point at a solution; it just seemed to say that the crash occurred because some internal watchdog timer hadn’t fired within its time limit:

    image

    Like the previous case, the administrator entered the crash text into the search engine and to his relief, the very first hit announced a fix for the problem:

    image

    The server didn’t experience any more crashes subsequent to the application of the listed hotfix.

    These cases show that troubleshooting is really about finding clues that lead you to a solution or a workaround, and those clues might be obvious, require a little digging, or take some creativity. In the end it doesn’t matter how or where you find the clues, so long as you find a solution to your problem.

  • Announcing Zero Day, the Novel!

    You’ve seen the news if you’re my friend on Facebook, follow me on Twitter, or subscribe to the Sysinternals blog: I’m proud to announce that my first novel, a cyberthriller entitled Zero Day, is due to be published by St. Martin’s Press in mid-March. If you like the Sysinternals tools, the articles I post on this blog, are interested in computer security, or just enjoy a heart-stopping thriller, I think you’ll like Zero Day. You can find out more and pre-order on the Zero Day web site and I've started a Zero Day blog there that will focus exclusively on book and cybersecurity news and tips. Pre-order now to guarantee a copy on release day and pass the word to your friends!

  • “Blue Screens” in Designer Colors with One Click

    My last blog post described how to use local kernel debugging to change the colors of the Windows crash screen, also known as the “blue screen of death”. No doubt many of you thought that showing off a green screen of death or red screen of death to your friends and family would be fun, but the steps involved were too complicated.

    Alex Ionescu, one of my coauthors on Windows Internals, 5th Edition (he’s also coauthoring the 6th edition with me and Dave Solomon, which covers Windows 7 and Windows Server 2008 R2 – scheduled for release this summer), suggested that we make it easy for people to enjoy blue screens of any color. We did so by modifying Notmyfault, a buggy driver demonstration tool that I wrote for the book and my crash dump analysis presentations. Simply make your color selection in the new BSOD color picker dialog, press the “Do Bug” button, and enjoy your creation:

    image

    Here’s the “blue screen” that results from the above color choice:

    image

    It’s as easy as that - there’s no need to tweak large-page settings or perform any other system configuration changes like those described in my last blog post.

    How does it work? We extended Notmyfault’s kernel-mode driver (named Myfault.sys, as seen on the crash screen, to highlight the fact that user-mode code cannot directly cause a system crash) to register a “bugcheck callback”. When the system crashes it invokes driver-registered callbacks so that they can add data to the crash dump that can help troubleshooters get information about device or driver state at the time of a crash. The Myfault.sys callback executes just after the blue screen paints and changes the colors to the ones passed to it by Notmyfault by changing the default VGA palette entries used by the Boot Video driver.

    Now with no awkward and error-prone fiddling in a kernel debugger, you can impress your friends and family with a blue screen painted in your favorite colors (though they might be even more impressed if you change the colors by fiddling in the kernel debugger)!

    To download the latest copy of Notmyfault (both 32-bit and 64-bit versions) click here.

  • A Bluescreen By Any Other Color

    Note: for an easier way to customize the blue screen’s colors, see my next blog post, “Blue Screens in Designer Colors with One Click”.

    Seeing a bluescreen that’s not blue is disconcerting, even for me, and based on the reaction of the TechEd audiences, I bet you’ll have fun generating one in a color you pick and showing it off to your techy friends. I first saw Dan Pearson do this in a crash dump troubleshooting talk he delivered with Dave Solomon a couple of years ago and now close my Case of the Unexplained presentations with a bluescreen of the color the audience chooses (you can hear the audience’s response at the end of this recording, for example). Note that the steps I’m going to share for changing the color of the bluescreen are manual and only survive a boot session, so are suitable for demonstrations, not for general bluescreen customization. Be sure to check out the special holiday bluescreen I’ve prepared for you at the end of the post.

    Preparing the System

    Because you’re going to modify kernel code, the first step is to enable the ability to edit kernel code in memory if it’s not already enabled. Windows systems with less than 2 GB of RAM use 4 KB pages to store kernel code, and so can protect each page with the protection most suitable for its contents. For instance, kernel data pages should allow both read and write access while kernel code should only allow read and execute access. As an optimization that helps improve the speed of virtual address translations, Windows uses large pages (4 MB on x86, 2 MB on PAE and x64 systems) on larger systems. That means that if both code and data are stored in the same page, the page must allow read, write and execute access, so to ensure that you can edit a page, you have to encourage Windows to use large pages. If your system is Windows XP or Server 2003 and has less than 256 MB of RAM, or is Windows Vista or higher and has 2 GB or less of RAM, create a REG_DWORD value called LargePageMinimum that’s set to 1 under HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management:

    image

    So that you don’t have to rush to show off your handiwork before Windows automatically reboots after the crash, change the auto-reboot setting. On Windows XP and Server 2003, right-click on My Computer, select the Advanced Tab, and press the Settings button in the “Startup and Recovery” section. On Windows Vista and higher, right-click on Computer in the Start Menu, select properties to open the Properties dialog, click Advanced System Settings, select the Advanced tab and press the Settings button in the “Startup and Recovery” section. Finally, uncheck the “Automatically restart” checkbox:

    SNAGHTML5fc0cb41_thumb2

    If you’re running 64-bit Windows Vista or higher, you need to boot the system in Debug mode so that you can run the kernel debugger in “local” mode. You can do that either by selecting F8 during the system boot and choosing the Debug boot or by checking the Debug checkbox in the System Configuration (Msconfig) utility:

    image_thumb31

    Next, reboot the system and start the debugger with administrator rights (if UAC is on, run it as administrator). Point the debugger at the Microsoft symbol server by opening the Symbol Search Path dialog under the File menu and enter this string: srv*c:\symbols*http://msdl.microsoft.com/download/symbols (replace c:\symbols with whatever local directory in which you want the debugger to store cached symbols). Next, open the Kernel Debugging dialog from the File menu, click the Local page, and press OK:

    image_thumb33

    The subsequent steps vary depending on whether you’re running 32-bit or 64-bit Windows and whether it’s Windows Vista or newer.

    32-bit Windows XP and Windows Server 2003

    The function that displays the bluescreen on these operating systems is KeBugCheck2. You’re looking for the place where the function passes the color value to the function that fills the screen background, InbvSolidColorFill. Enter the command “u kebugcheck2” to list the start of the function, then enter the “u” command to dump additional pages of the function’s code until you see the reference to InbvSolidColorFill (after entering “u” once, you can just press enter to repeat the command). You’ll need to dump 30-40 pages before you come across the one with the call:

    image

    Preceding the call, you’ll see an instruction that has the number 4 as its argument (“push 4”), as you can see above. Copy the code address of that instruction by selecting it from the address column on the left and typing Ctrl+C. Then in the debugger command window, type “eb “, then Ctrl+V to paste the address, then “+1”, then enter. The debugger will go into memory editing mode, starting with the address of the color value. Now you can choose the color you want. 1 is red, 2 is green, and you can experiment if you want a different color. Simply enter the number and press enter twice to commit it and exit editing mode. Here’s what the screen should look like after you’re done:

    image_thumb38
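    The reason for appending “+1” is that “push 4” assembles to two bytes: the 6A opcode followed by the one-byte immediate, so the color value lives one byte past the instruction’s address. This hypothetical Python snippet illustrates the byte-level patch (the real edit, of course, happens in kernel memory via the debugger’s eb command):

```python
def patch_push_imm8(code, instr_offset, new_color):
    """'push 4' assembles to 6A 04: opcode byte, then immediate byte.
    Editing instruction address + 1 changes just the immediate -- the
    color value later passed to InbvSolidColorFill."""
    patched = bytearray(code)
    assert patched[instr_offset] == 0x6A, "expected a push imm8 opcode"
    patched[instr_offset + 1] = new_color & 0xFF
    return bytes(patched)
```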

    64-bit Versions of Windows and 32-bit Windows Vista and Higher

    On these versions of Windows, the core bluescreen drawing function is KiDisplayBlueScreen. Type “u kidisplaybluescreen” and then continue entering “u” commands to dump pages of the function until you see the call to InbvSolidColorFill. On 32-bit versions of Windows, continue by following the instructions given in the Windows XP/Server 2003 section to find and edit the color value. On 64-bit versions of these operating systems, the instruction preceding the call to InbvSolidColorFill is the one that passes the color, so copy its address (the number in the left column) and enter this command to edit it: “eb <address>+4”. The debugger will go into memory editing mode and you can change the value (e.g. 1 for red, 2 for green):

    image_thumb42

    Viewing the Result

    You’re now ready to crash the system. If you’re running 64-bit Windows, you might get a crash without doing anything additional. That’s because Kernel Patch Protection will notice the modification and crash the system as a deterrent to ISVs that might consider modifying the kernel’s code to change its behavior. There might be a delay of up to several minutes before that happens, though. To generate a crash on demand, run the Notmyfault tool (you can download it from the Windows Internals book page) and press the “Do Bug” button (to avoid data loss, make sure you’ve saved any work and closed all other applications):

    image_thumb45

    You’ll now get a bluescreen in the color you picked, in this case the red screen of death:

    image_thumb47

    The Holiday Bluescreen

    In the spirit of the holiday season, I took things one step further to generate a holiday-themed bluescreen: not only did I modify the background color, but the text color as well. To do this on 64-bit versions of Windows Vista or higher, note the call to InbvSetTextColor immediately following the one to InbvSolidColorFill and the address of the instruction that passes the text color to the function, “mov ecx, 0Fh”:

    image 

    The 0Fh parameter represents white, but you can change it using the same editing technique. Use the “eb” command, passing the address of the instruction plus 1. Here I set the color to red (which is a value of 1):

    image

    And here’s the festive bluescreen I produced:

    image

    Happy holidays! And remember, if you have any troubleshooting cases you want to share, please send me screenshots (.PNG preferred) and log files.

  • The Case of the Slow Project File Opens

    If you’ve seen one of my Case of the Unexplained presentations (like the one I delivered at TechEd Europe last month that’s posted for on-demand viewing), you know that I emphasize how thread stacks are a powerful troubleshooting tool for diagnosing the root cause of performance problems, buggy behavior, crashes and hangs (I provide a brief explanation of what a stack is in the TechEd presentation). That’s because the explanation for a process’s behavior often lies in the code it loads, either explicitly, as with the DLLs it depends on, or implicitly, as with processes that host extensions. This case is another demonstration of successful stack troubleshooting. It also shows how a little time spent gathering a couple of clues can quickly lead to a solution.

    The case opened when the customer, a network administrator, contacted Microsoft support because a user reported that Microsoft Project files located on a network share were taking up to a minute to open and about 1 in 10 times the open resulted in an error:

    image

    The administrator verified the issue and checked networking settings and latency to the file server, but could not find anything that would explain the problem. The Microsoft support engineer assigned to the case asked the administrator to capture Process Monitor and Network Monitor traces of a slow file open. After receiving the logs a short time later, he opened the Process Monitor trace and set a filter to include only operations issued by the Project process and another to include only paths that referenced the target file share, \\DBG.ADS.COM\LON-USERS-U. The File Summary dialog, which he opened from Process Monitor’s Tools menu, showed significant time spent in file operations accessing files on the share, shown in the File Time column:

    image

    The paths in the trace revealed that the user profiles were stored on the file server and that the launch of Project caused heavy access of the profile’s AppData subdirectory. If many users had their profiles stored on the same server via folder redirection and were running similar applications that stored data in AppData, that would surely account for at least some of the delays the user was experiencing. It’s well known that redirecting the AppData directory can result in performance problems, so based on this, the support engineer arrived at his first recommendation: for the company to configure their roaming user profiles to not redirect AppData and to sync the AppData directory only at logon and logoff as per the guidance found in this Microsoft blog post:

    Special considerations for AppData\Roaming folder:
    If the AppData folder is redirected, some applications may experience performance issues because they will be accessing this folder over the network. If that is the case, it is recommended that you configure the following Group Policy setting to sync the AppData\Roaming folder only at logon and logoff and use the local cache while the user is logged on. While this may have an impact on logon/logoff speeds, the user experience may be better since applications will not freeze due to network latency.

    User configuration>Administrative Templates>System>User Profiles>Network directories to sync at Logon/Logoff.

    If applications continue to experience issues, you should consider excluding AppData from Folder Redirection – the downside of doing so is that it may increase your logon/logoff time.

    Next, the engineer examined the trace to see whether Project itself or an add-in was responsible for all the traffic to files like Global.MPT. This is where the stack trace was indispensable. After setting a filter to show just accesses to Global.MPT, the file that accounted for most of the I/O time as shown by the summary dialog, he noticed that it was opened and read multiple times. First, he saw 5 or 6 long runs of small random reads:

    image

    The stacks for these operations showed that Project itself was responsible, however:

    image

    He also saw sequences of large, non-cached reads. The small reads he looked at first were cached, so there would be no network access after the first read caused the data to cache locally, but non-cached reads would go to the server every time, making them much more likely to impact performance:

    image

    To make matters worse, he saw this sequence six times in the trace, which you can see with a filter set to just show the initial read of each sequence:

    image

    The stacks for these reads revealed them to be the result of a third-party driver, evident from the fact that the stack trace dialog, which he’d configured to obtain symbols from Microsoft’s public symbol servers, showed no symbol information for those frames:

    image

    Further, the stack frames higher up the same stack showed that the sequence of reads was being performed within the context of Project opening the file, which is a behavior common to on-access virus scanners:

    image

    Sure enough, double-clicking on one of the SRTSP64.SYS lines in the stack dialog confirmed that it was Symantec AutoProtect that was repeatedly performing on-access virus detection each time Project opened the file with certain parameters:

    image

    Typically, administrators configure antivirus on file servers, so there’s no need for clients to scan files they reference on servers; client-side scanning simply results in duplicative scans. This led to the support engineer’s second recommendation: that the administrator set an exclusion filter on their client antivirus deployment for the file share hosting user profiles.

    In less than fifteen minutes the engineer had written up his analysis and recommendations and sent them back to the customer. The network monitor trace merely served as confirmation of what he observed in the Process Monitor trace. The administrator proceeded to implement the suggestions and a few days later confirmed that the user was no longer experiencing long file loads or the errors they had reported. Another case closed with Process Monitor and thread stacks.

  • LiveKd for Virtual Machine Debugging

    When Dave Solomon and I were writing Inside Windows 2000, the 3rd edition of the Windows Internals book series, back in 1999, we pondered whether there was a way to enable kernel debuggers like Windbg and Kd (part of the free Debugging Tools for Windows package that’s available in the Windows Platform SDK) to provide a local interactive view of a running system. Dave had introduced kernel debugger experiments in the 2nd edition, Inside Windows NT, that solidified the concepts presented by the book. For example, the chapter on memory management describes the page frame database, the data structure the system uses to keep track of the state of every page of physical memory, and an accompanying experiment shows how to view the actual data structure definition and contents of PFN entries on a running system using the kernel debugger. At the time, however, the only way to use Windbg and Kd to view kernel information was to attach a second computer with a serial “null modem” cable to the target system booted in debugging mode. The inconvenience of having to purchase an appropriate serial cable and configure two systems for kernel debugging meant that many readers skipped the experiments who otherwise might have followed along and deepened their understanding had it been easier.

    After giving it some thought, I realized that I could fool the debuggers into thinking they were looking at a crash dump file by implementing a file system filter driver that presented a “virtual” crash dump file the debuggers could open. Since a crash dump file is simply a file header followed by the contents of physical memory, the driver could satisfy reads of the virtual dump file with the contents of physical memory, which it could easily read from the \Device\PhysicalMemory section object the memory manager creates. A couple of weeks later, LiveKd was born. We expanded the number of kernel debugger experiments in the book and began using LiveKd in our live Windows Internals seminars and classes as well. LiveKd’s usage went beyond merely being an educational tool, and over time it became an integral part of the troubleshooting toolkits of IT pros and Microsoft support engineers. Microsoft even added local kernel debugging capability to Windows XP, but LiveKd can still do a few things that the native support can’t, like saving a copy of the system’s state to a dump file that can be examined on a different system, and it works on Windows Vista/Server 2008 and higher without requiring the system to be booted in debug mode.
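    The offset translation at the heart of that trick is simple enough to sketch. The following is a minimal illustration of the idea, not LiveKd’s actual driver code; the header size and function name are hypothetical stand-ins, since the real dump header layout is defined by the debugger and varies by architecture:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical header size for illustration; the real crash dump
     * header is a fixed-size structure defined by the debuggers. */
    #define DUMP_HEADER_SIZE 0x1000

    /* Translate an offset within the virtual dump file into the
     * physical memory offset whose contents satisfy the read. Reads
     * that fall inside the header would instead be served from a
     * synthesized header buffer (signaled here by returning -1). */
    int64_t dump_offset_to_physical(int64_t file_offset)
    {
        if (file_offset < DUMP_HEADER_SIZE)
            return -1; /* served from the synthesized header */
        return file_offset - DUMP_HEADER_SIZE;
    }

    int main(void)
    {
        /* A read at file offset 0x5000 maps to physical offset 0x4000. */
        assert(dump_offset_to_physical(0x5000) == 0x4000);
        /* Reads within the first page come from the header buffer. */
        assert(dump_offset_to_physical(0x200) == -1);
        printf("ok\n");
        return 0;
    }
    ```

    Everything past the header is a straight pass-through, which is why the filter driver approach is so compact: no copy of memory is ever made, and each debugger read is satisfied on demand.
    
    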

    Virtual Machine Troubleshooting

    The rise of virtualization has introduced a new scenario for live kernel debugging: troubleshooting virtual machines. While LiveKd works just as well inside a virtual machine as on a native installation, the ability to examine a running virtual machine without having to install and run LiveKd inside it would be more convenient, and it would make it possible to troubleshoot virtual machines that are unresponsive or otherwise unable even to launch LiveKd. Over the last few years I received requests from Microsoft support engineers for the feature and had started an initial investigation of the approach I’d take to add the support to LiveKd, but I hadn’t gotten around to finishing it.

    Then a couple of months ago, I came across Matthieu Suiche’s LiveCloudKd tool, which enables Hyper-V virtual machine debugging and showed that there was general interest in the capability. We were so impressed that we invited Matthieu to speak about live kernel debugging and LiveCloudKd at this year’s BlueHat Security Briefings, which are held every year on Microsoft’s campus and are taking place this week, where I met him. Spurred on by LiveCloudKd, I decided it was time to finish the LiveKd enhancements and sent an email to Ken Johnson, formerly Skywing of Uninformed.org and now a developer in Microsoft’s security group (he had published several articles revealing holes in the “Patchguard” kernel tampering protection of 64-bit Windows, so we hired him to help make Windows more secure), asking if he was interested in collaborating. Ken had previously contributed code to LiveKd that enabled it to run on 64-bit Windows Vista and Windows 7 systems, so working with him was certain to speed the project – little did I know by how much. He responded that he’d prototyped a tool for live virtual machine debugging a year before and thought he could incorporate it into LiveKd in a few days. Sure enough, a few days later the beta of LiveKd 5.0 was finished, complete with the Hyper-V live debugging feature.

    We picked this week to publish it to highlight Matthieu’s tool, which offers some capabilities not present in LiveKd. For example, just as it does for local machine debugging, LiveKd provides a read-only view of the target virtual machine, whereas LiveCloudKd lets you modify it as well.

    LiveKd Hyper-V Debugging

    LiveKd’s Hyper-V support introduces three new command line switches, -p, -hv, and -hvl:

    image 

    When you want to troubleshoot a virtual machine, use -hvl to list the names and IDs of the ones that are active:

    image

    Next, use the -hv switch to specify the one you want to examine. You can use either the GUID or the virtual machine’s name, but it’s usually more convenient to use the name if it’s unique:

    image

    And that’s all there is to it. You can now run the same commands as you can when using LiveKd on a native system: listing processes and threads, dumping memory, and generating crash dump files for later analysis.
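    Put together, a session might look roughly like the following transcript. The VM name and dump path here are placeholders, and the listing output (shown in the screenshots above) will vary with your Hyper-V configuration; !process and .dump are standard Windbg/Kd commands:

    ```
    C:\>livekd -hvl
    C:\>livekd -hv TestVM
    kd> !process 0 0
    kd> .dump /f C:\temp\TestVM.dmp
    ```

    Here !process 0 0 lists the processes running inside the virtual machine, and .dump /f writes a full dump of the VM’s memory that you can analyze later on another system.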

    The final switch, -p, pauses the virtual machine while LiveKd is connected. Normally, LiveKd reads pages of physical memory as they’re referenced by the debugger, which means that different pages can represent different points in time. That can lead to inconsistencies: for example, when you view a data structure on one page and then later view one the structure references, the second structure might have been deleted in the interim. The pause option simply automates the Pause operation you can perform in the Hyper-V virtual machine management interface, giving you a frozen-in-time view of the virtual machine while you poke around.

    Have fun debugging virtual machines and please share any troubleshooting success stories that make use of LiveKd’s new capabilities.