Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
As you’ve probably surmised by my blog posts and other writings, I like knowing exactly what my systems are doing. I want to know if a process is running away with the CPU, causing memory pressure, or hitting the disk. Besides keeping my computers running smoothly, my vigilance sometimes helps me spot performance and reliability problems in Windows and third-party code.
The main way I keep tabs on things is to configure Process Explorer to run automatically when I log in. Whenever I configure a new computer, I add a shortcut to Process Explorer to my profile’s Startup directory that includes the /t (minimize) switch. Process Explorer then runs otherwise hidden, with a tray icon that shows a small historical view of CPU activity level. Because I want access to detailed information about system processes, as well as my own, I also specify the /e option on Vista, which causes Windows to present a UAC prompt on logon that allows me to grant Process Explorer administrative rights.
Because I keep an eye out for CPU spikes in Process Explorer’s tray icon, which show up as green or red for user-mode (application) and kernel-mode (operating system and drivers) CPU usage, respectively, I’ve identified several application bugs over the last few months. In this post, I’ll share how I used both Process Explorer and another tool, Kernrate, to identify a problem with a third-party driver and followed the problem through to a fix by the vendor.
Not long after I got a new laptop several months ago, I noticed that the system sometimes felt sluggish. Process Explorer’s tray icon corroborated my perception by displaying a mini-graph of red CPU activity. The icon opens a tooltip that reports the name of the process consuming the most CPU when you move the mouse over it, and in this case the tooltip showed the System process as being responsible:
The first few times I noticed the problem, it resolved itself shortly after and I didn’t have a chance to troubleshoot. However, I could see by opening Process Explorer’s System Information dialog that the CPU spikes were significant:
The System process is special because it doesn’t host an executable image like other processes. It exists solely to host operating system threads for the memory manager, cache manager, and other subsystems, as well as device driver threads. These threads execute entirely in kernel mode, which is why System process CPU usage shows up as red in Process Explorer’s graphs.
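The user-mode versus kernel-mode split that Process Explorer draws in green and red is the same accounting the operating system keeps for every process. The following is a minimal Python sketch of that idea, not a description of how Process Explorer itself works. It uses the standard `os.times()` call to compare the user and kernel CPU time a process consumes. The helper names (`cpu_split`, `user_mode_work`, `kernel_heavy_work`) and the loop counts are illustrative assumptions.

```python
import os


def cpu_split(work):
    """Run work() and return (user_seconds, kernel_seconds) charged to this process."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def user_mode_work():
    # Pure arithmetic: the CPU time is spent in application (user-mode) code.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def kernel_heavy_work():
    # Many small file-system queries: each one calls into the operating system,
    # so a larger share of the time is charged as kernel-mode CPU.
    for _ in range(20_000):
        os.stat(".")


if __name__ == "__main__":
    for label, fn in (("arithmetic", user_mode_work), ("stat calls", kernel_heavy_work)):
        user, kernel = cpu_split(fn)
        print(f"{label}: user={user:.3f}s kernel={kernel:.3f}s")
```

The exact numbers depend on the machine and timer resolution, but the second workload typically shows a much higher kernel share than the first. A process like System, whose threads never leave kernel mode, would show up as kernel time only.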
I suspected that a third-party device driver was the cause of the problem, so the first step in my investigation was to figure out which thread was using CPU, which would hopefully point me at the guilty party. I watched vigilantly for signs of trouble every time I switched networks and jumped the first time I saw one. Process Explorer shows the threads running in a process on the Threads page of the Process Properties dialog, so I double-clicked on the System process and switched to the Threads page the next time I noticed the CPU spike:
The “ntkrnlpa.exe” prefix on each thread’s start address identified the ones I saw at the top of the CPU usage sort order as operating system threads (Ntkrnlpa.exe is the version of the kernel loaded on 32-bit client systems that have no-execute memory protection or server systems that need to address more than 4 GB of memory). Because I had previously configured Process Explorer to retrieve symbols for operating system images from the Microsoft public symbol server, the thread list also showed the names of the thread start functions. The most active threads began in the ExpWorkerThread function, which means that they were worker threads that perform work on behalf of the system and device drivers. Instead of creating dedicated threads that consume memory resources, the system and drivers can throw work at the shared pool of operating system worker threads.
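The shared-pool idea has a loose user-mode analogy in Python's standard `concurrent.futures.ThreadPoolExecutor`. This sketch is only an analogy, not the kernel's worker-thread mechanism, and the names `work_item` and `run_on_shared_pool` are made up for illustration. It shows the property that made the investigation harder: many unrelated callers queue work, and a small fixed set of threads executes it, so the thread alone does not tell you whose work is running.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def work_item(requester):
    # Record which pool thread ended up running the work for this requester.
    return requester, threading.current_thread().name


def run_on_shared_pool(requesters, workers=2):
    """Queue one work item per requester onto a small shared pool of threads."""
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="worker") as pool:
        # map() returns results in the same order the work was submitted.
        return list(pool.map(work_item, requesters))


if __name__ == "__main__":
    for requester, thread_name in run_on_shared_pool(
        ["driver A", "driver B", "cache manager", "driver A"]
    ):
        print(f"{requester!r} ran on {thread_name}")
```

Four requests share at most two threads here, which is why looking at a busy worker thread's start address says nothing about the component that queued the work.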
Unfortunately, knowing that worker threads were causing the CPU usage didn’t get me any closer to identifying a root cause. I really needed to know what functions the worker threads were calling, because the functions would be inside the device driver or operating system component on whose behalf the threads were running. One way to look inside a thread’s execution is to look at the thread’s stack with Process Explorer. The stack is a memory region that stores function invocations, and Process Explorer will show you a thread’s stack when you select the thread and press the Stack button or double-click on the thread. On Vista, however, you get this error when you try to look at the stack for threads in the System process:
The System process is a special type of process on Vista called a “protected process” that doesn’t allow any access to its threads or memory. Protected processes were introduced to support Digital Rights Management (DRM) so that high-definition content providers can store content encryption keys with a reduced risk of an administrative user using DRM-stripping tools to reach into the process and read the keys.
That approach foiled, I had to find another way to see what the worker threads were doing. For that, I turned to Kernrate, a command-line profiling tool that’s a free download from Microsoft. Kernrate can profile user-mode processes and kernel-mode threads. It uses the sample-based profiling facility that was introduced in the first release of Windows NT, which records the unique addresses at which the CPU is executing when the profiling interval timer fires. When you stop a profile capture, Kernrate retrieves the information from the kernel, maps the addresses to the loaded device drivers into which they fall, and can even use the symbol engine to report the names of functions.
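The address-to-module mapping step can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not Kernrate's actual implementation. The module table, base addresses, sizes, and sample addresses are all hypothetical values chosen for the example, and `otherdrv` is an invented placeholder driver.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical loaded-module table: (base address, size in bytes, name).
MODULES = [
    (0x80000000, 0x00400000, "ntkrnlpa"),
    (0x8A000000, 0x00040000, "b57nd60x"),
    (0x8B000000, 0x00020000, "otherdrv"),
]


def attribute_samples(samples, modules):
    """Count how many sampled instruction addresses fall inside each module."""
    ordered = sorted(modules)
    bases = [base for base, _size, _name in ordered]
    hits = Counter()
    for addr in samples:
        # Find the last module whose base address is <= the sampled address.
        i = bisect_right(bases, addr) - 1
        if i >= 0:
            base, size, name = ordered[i]
            if addr < base + size:
                hits[name] += 1
                continue
        hits["<unknown>"] += 1
    # Highest hit count first, like a profiler's summary table.
    return hits.most_common()


if __name__ == "__main__":
    # Hypothetical addresses captured each time the profiling timer fired.
    samples = [0x80001000, 0x80002000, 0x80003000,
               0x8A000010, 0x8A000020, 0x8B000000, 0x00000010]
    for name, count in attribute_samples(samples, MODULES):
        print(f"{name:10} {count}")
```

Because the result is only a per-module hit count, no symbols are needed to see which driver dominates. Symbols matter only when you want to break a module's hits down by function.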
I wouldn’t need symbols if the trace identified a device driver, so I ran Kernrate without passing it any arguments. Despite the fact that there’s no officially supported version of Kernrate for Vista, the version for Windows XP, Kernrate_i386_XP.exe, works on Vista 32-bit (you can also use the recently-released xperf tool to perform similar profiling - xperf requires Vista or Server 2008, but works on 64-bit versions). I let the profile run through heavy bursts of CPU and then hit Ctrl+C to print the results to the console window:
In first place were hits in the kernel, but in second was a driver that I didn’t recognize, b57nd60x. Most driver files are located in the %systemroot%\system32\drivers directory, so I could have opened that folder and viewed the file’s properties in Explorer, but I had Process Explorer open so a quicker way to check the driver’s vendor and version was to open the DLL view for the System process. The DLL view shows the DLLs and files mapped into the address space of user-mode processes, but for the System process it shows the kernel modules, including drivers, loaded on the system. The DLL view revealed that the driver was for my laptop’s NIC, was from Broadcom, and was version 10.10:
Now that I knew that the Broadcom driver was causing the CPU usage, the next step was to see if there was a newer version available. I went to Dell’s download page for my system, but didn’t find anything. Suspecting that what I noticed might not be a known issue, I decided to notify Broadcom. I used contacts on the hardware ecosystem team here at Microsoft to find the Broadcom driver representative and emailed him a detailed description of the symptoms and my investigation. He forwarded my email to the driver developer, who acknowledged that they didn’t know the cause and within a few days sent me a debug version of the driver with symbols so that I could capture a Kernrate profile that would tell them what functions in the driver were active during the spikes. The problem reoccurred a few days later and I sent back the Kernrate output with function information.
The developer explained that my trace revealed that the driver didn’t efficiently interact with the PCIe bus when processing specific queries and the problem seemed to be exacerbated by my particular hardware configuration. He gave me a new driver to try and, after a few weeks of monitoring my laptop closely for issues, I confirmed that the problem appeared to be resolved. The updated driver has not yet been posted to Dell’s support site, but I expect it to show up there in the near future. Another case closed, this time with Process Explorer, Kernrate, and a helpful Broadcom driver developer.
If you like these troubleshooting blog posts, you’ll enjoy the webcast of my “Case of the Unexplained…” session from TechEd/ITforum. Its 75 minutes are packed with real-world troubleshooting examples, including the one written up in this post and others, as well as some that I haven’t documented. At the end of the session I ask the audience to send me screenshots, log files and descriptions of their own troubleshooting success stories, in return for which I’ll send back a signed copy of Windows Internals. The offer stands, so remember to document your investigation and you can get a free book. I’ve gotten a number of great examples and my next blog post will be a guest post by someone who watched the webcast and used Process Monitor to solve a problem with their web server.
Finally, if you want to see me speak live, come to TechEd US/IT Pro in June in Orlando where I’ll be delivering “The Case of the Unexplained…”, “Windows Server 2008 Kernel Advances”, and “Windows Security Boundaries”. Hope to see you there!
Love this kind of investigation, thanks Mark!
> Because I want access to detailed information about system processes, as well as my own, I also specify the /e option on Vista, which causes Windows to present a UAC prompt on logon that allows me to grant Process Explorer administrative rights
Somewhat off-topic, but what's your opinion of using Vista's Task Scheduler to launch applications at startup/login with elevated privileges?
I use this technique to start Task Manager* elevated without facing a UAC prompt.
*: No offence, but I start Process Explorer for the heavy stuff and leave task manager running always.
Nice and detailed story again! I was at your demo/presentation in Barcelona and suggest that people download and watch the webcast, really cool stuff ;)
Imagine that you tried to get this fixed through the support form on the vendor's web site. That way we would have known if tech-savvy people who don't have a "hardware ecosystem team" to turn to also could manage to get something like this resolved in such a delightful way as you did.
Thanks for these posts on your blog. This is THE best info I have seen for troubleshooting. I love the toolset you created. They are the first tools I go for when there is a problem. They have solved countless problems for me.
The /minimize switch for Process Explorer is actually /t, at least for the current version :).
Great post, BTW!
Wow, I read this and was surprised to find the exact problems I've been having on my Dell laptop. I had narrowed down the offending problem to the System "process" and was about to dig further. I guess now I'll just wait for the new driver and see if that fixes the problem.
Thanks for the article, not only did it help me fix my problem, it also was very informative in tracking down these kinds of problems.
Excellent Article!! Thanks Mark!
You said you always have Process Explorer running in the background whenever you start Windows. Do you have any idea what kind of additional stress this puts on the system? I want to do this as well, but I'm concerned that it will eat away at my CPU cycles and eventually slow down the system.
Thanks for the great article, especially the kernrate and xperf tools.
I quite like ThreadMaster as a tool/workaround for flattening out CPU spikes, but it's not a match for quality debugging.
Now I've just got to work out how I'm going to hook up kernrate/xperf via Powershell to my MRTG graphs...
>...The System process is a special type of process on Vista called a “protected process”...
Mark, restrictions of running inside protected processes requires that these processes be specially signed.
Ooops...What about malware running as a protected process!!
But does the process need to be signed, or is it the binary image on disk that needs to be signed?
the restrictions of running inside a protected process, the operating system requires that these processes be specially signed
Thanks! Now I know what's causing my problems and was able to repro with the same results as you. Feel like sharing that driver? :)
Thanks, Mark--great article!! Your passion to troubleshoot these issues--and then share in clear, concise detail how you did it--is a fantastic benefit to the community.
I have to echo an earlier comment about your support "backdoor," though--most of us are used to stonewalling and levels of isolation when trying to resolve these types of issues with a vendor (sometimes it feels like communicating with Prince--you have to pass notes to the official correspondent, hope it passes their own filter, pray it's communicated correctly, then perhaps get a response). Then, it seems like it takes a miracle to make it onto the Dell support site--I'm personally waiting on a driver update for an easily reproducible issue with Intel's integrated graphics driver (OptiPlex 745) for Vista. I've reported the issue and followed up several times, but still there's no update. The updated driver has been available from Intel since April 2007, and is even on Windows Update (but not in a way that can be mass-deployed as far as I've tried).
In other words, these articles are brilliant... but unfortunately, even with such skills, it's often exceedingly hard to get a vendor response.
What about KernRate for Windows Server 2003 x64 ?
Perhaps you can find it on Microsoft.com
I found http://www.nynaeve.net/?p=132
Great article, but it raises the question of how to get these sorts of problems resolved if you don't work for Microsoft.