Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
As you’ve probably surmised by my blog posts and other writings, I like knowing exactly what my systems are doing. I want to know if a process is running away with the CPU, causing memory pressure, or hitting the disk. Besides keeping my computers running smoothly, my vigilance sometimes helps me spot performance and reliability problems in Windows and third-party code.
The main way I keep tabs on things is to configure Process Explorer to run automatically when I log in. Whenever I configure a new computer, I add a shortcut to Process Explorer to my profile’s Startup directory that includes the /t (minimize) switch. Process Explorer then runs hidden except for a tray icon that shows a small historical view of CPU activity level. Because I want access to detailed information about system processes, as well as my own, I also specify the /e option on Vista, which causes Windows to present a UAC prompt on logon that allows me to grant Process Explorer administrative rights.
Because I keep an eye out for CPU spikes in Process Explorer’s tray icon, which show up as green or red for user-mode (application) and kernel-mode (operating system and drivers) CPU usage, respectively, I’ve identified several application bugs over the last few months. In this post, I’ll share how I used both Process Explorer and another tool, Kernrate, to identify a problem with a third-party driver and followed the problem through to a fix by the vendor.
Not long after I got a new laptop several months ago, I noticed that the system sometimes felt sluggish. Process Explorer’s tray icon corroborated my perception by displaying a mini-graph of red CPU activity. The icon opens a tooltip that reports the name of the process consuming the most CPU when you move the mouse over it, and in this case the tooltip showed the System process as being responsible:
The first few times I noticed the problem, it resolved itself shortly after and I didn’t have a chance to troubleshoot. However, I could see by opening Process Explorer’s System Information dialog that the CPU spikes were significant:
The System process is special because it doesn’t host an executable image like other processes. It exists solely to host operating system threads for the memory manager, cache manager, and other subsystems, as well as device driver threads. These threads execute entirely in kernel mode, which is why System process CPU usage shows up as red in Process Explorer’s graphs.
I suspected that a third-party device driver was the cause of the problem, so the first step in my investigation was to figure out which thread was using CPU, which would hopefully point me at the guilty party. I watched vigilantly for signs of trouble every time I switched networks and jumped the first time I saw one. Process Explorer shows the threads running in a process on the Threads page of the Process Properties dialog, so I double-clicked on the System process and switched to the Threads page the next time I noticed the CPU spike:
The “ntkrnlpa.exe” prefix on each thread’s start address identified the ones I saw at the top of the CPU usage sort order as operating system threads (Ntkrnlpa.exe is the version of the kernel loaded on 32-bit client systems that have no-execute memory protection or server systems that need to address more than 4GB of memory). Because I had previously configured Process Explorer to retrieve symbols for operating system images from the Microsoft public symbol server, the thread list also showed the names of the thread start functions. The most active threads began in the ExpWorkerThread function, which means that they were worker threads that perform work on behalf of the system and device drivers. Instead of creating dedicated threads that consume memory resources, the system and drivers can throw work at the shared pool of operating system worker threads.
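As a loose user-mode analogy for the shared worker pool, here is a minimal Python sketch. It is not how the Windows kernel implements ExpWorkerThread; the component names, the handle_request function, and the pool size are all hypothetical and chosen only for illustration. It shows several components queuing work to one small pool rather than each creating its own thread.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def handle_request(component, item):
    """Stand-in for a unit of deferred work queued by some component (hypothetical)."""
    return (component, item, threading.current_thread().name)

def run_on_shared_pool(work_items, workers=2):
    """Run every work item on one shared pool instead of a thread per component."""
    with ThreadPoolExecutor(max_workers=workers, thread_name_prefix="worker") as pool:
        futures = [pool.submit(handle_request, c, i) for c, i in work_items]
        # Collect results in submission order.
        return [f.result() for f in futures]

# Made-up work items: (component that queued the work, item number).
WORK = [("cache manager", 1), ("nic driver", 2), ("nic driver", 3), ("memory manager", 4)]

if __name__ == "__main__":
    for component, item, thread in run_on_shared_pool(WORK):
        print(f"{component} item {item} ran on {thread}")
```

Note that the thread name alone says nothing about which component queued the work, which is the same limitation you hit when all you know is that a generic worker thread is busy.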
Unfortunately, knowing that worker threads were causing the CPU usage didn’t get me any closer to identifying a root cause. I really needed to know what functions the worker threads were calling, because the functions would be inside the device driver or operating system component on whose behalf the threads were running. One way to look inside a thread’s execution is to look at the thread’s stack with Process Explorer. The stack is a memory region that stores function invocations, and Process Explorer will show you a thread’s stack when you select the thread and press the Stack button or double-click on the thread. On Vista, however, you get this error when you try to look at the stack for threads in the System process:
The System process is a special type of process on Vista called a “protected process” that doesn’t allow any access to its threads or memory. Protected processes were introduced to support Digital Rights Management (DRM) so that high-definition content providers can store content encryption keys with a reduced risk of an administrative user using DRM-stripping tools to reach into the process and read the keys.
That approach foiled, I had to find another way to see what the worker threads were doing. For that, I turned to Kernrate, a command-line profiling tool that’s a free download from Microsoft. Kernrate can profile user-mode processes and kernel-mode threads. It uses the sample-based profiling facility that was introduced in the first release of Windows NT, which records the unique addresses at which the CPU is executing when the profiling interval timer fires. When you stop a profile capture, Kernrate retrieves the information from the kernel, maps the addresses to the loaded device drivers into which they fall, and can even use the symbol engine to report the names of functions.
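To make the sampling idea concrete, here is a minimal Python sketch of the address-to-module bucketing step. It is not Kernrate’s actual code, and the module base addresses, sizes, the “otherdrv” name, and the sample addresses are all made up for illustration. Each sample is the address the CPU was executing when the timer fired, and the summary counts how many samples fall inside each loaded module’s address range.

```python
import bisect
from collections import Counter

# Hypothetical loaded-module list: (base address, size, name).
MODULES = [
    (0x80400000, 0x00200000, "ntkrnlpa"),
    (0x8A000000, 0x00040000, "b57nd60x"),
    (0x8B000000, 0x00010000, "otherdrv"),
]

def build_index(modules):
    """Sort modules by base address so lookups can use binary search."""
    ordered = sorted(modules)
    bases = [base for base, _, _ in ordered]
    return bases, ordered

def module_for(address, bases, ordered):
    """Return the name of the module containing address, or 'unknown'."""
    i = bisect.bisect_right(bases, address) - 1
    if i >= 0:
        base, size, name = ordered[i]
        if address < base + size:
            return name
    return "unknown"

def summarize(samples, modules):
    """Count profile hits per module, most hits first."""
    bases, ordered = build_index(modules)
    hits = Counter(module_for(a, bases, ordered) for a in samples)
    return hits.most_common()

# Made-up instruction-pointer samples, one per profiling timer tick.
SAMPLES = [0x80412345, 0x8A001000, 0x80499999, 0x8A002000, 0x80400010, 0x90000000]

if __name__ == "__main__":
    for name, count in summarize(SAMPLES, MODULES):
        print(f"{name:10s} {count}")
```

A real profiler would go one step further and resolve each address to a function name using symbols, but the per-module counts are already enough to point at a suspect driver.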
I wouldn’t need symbols if the trace identified a device driver, so I ran Kernrate without passing it any arguments. Despite the fact that there’s no officially supported version of Kernrate for Vista, the version for Windows XP, Kernrate_i386_XP.exe, works on Vista 32-bit (you can also use the recently-released xperf tool to perform similar profiling - xperf requires Vista or Server 2008, but works on 64-bit versions). I let the profile run through heavy bursts of CPU and then hit Ctrl+C to print the results to the console window:
In first place were hits in the kernel, but in second was a driver that I didn’t recognize, b57nd60x. Most driver files are located in the %systemroot%\system32\drivers directory, so I could have opened that folder and viewed the file’s properties in Explorer, but I had Process Explorer open so a quicker way to check the driver’s vendor and version was to open the DLL view for the System process. The DLL view shows the DLLs and files mapped into the address space of user-mode processes, but for the System process it shows the kernel modules, including drivers, loaded on the system. The DLL view revealed that the driver was for my laptop’s NIC, was from Broadcom, and was version 10.10:
Now that I knew that the Broadcom driver was causing the CPU usage, the next step was to see if there was a newer version available. I went to Dell’s download page for my system, but didn’t find anything. Suspecting that what I noticed might not be a known issue, I decided to notify Broadcom. I used contacts on the hardware ecosystem team here at Microsoft to find the Broadcom driver representative and email him a detailed description of the symptoms and my investigation. He forwarded my email to the driver developer, who acknowledged that they didn’t know the cause and within a few days sent me a debug version of the driver with symbols so that I could capture a Kernrate profile that would tell them what functions in the driver were active during the spikes. The problem reoccurred a few days later and I sent back the Kernrate output with function information.
The developer explained that my trace revealed that the driver didn’t efficiently interact with the PCIe bus when processing specific queries and the problem seemed to be exacerbated by my particular hardware configuration. He gave me a new driver to try and after a few weeks of monitoring my laptop closely for issues, I confirmed that the problem appeared to be resolved. The updated driver has not yet been posted to Dell’s support site, but I expect it to show up there in the near future. Another case closed, this time with Process Explorer, Kernrate, and a helpful Broadcom driver developer.
If you like these troubleshooting blog posts, you’ll enjoy the webcast of my “Case of the Unexplained…” session from TechEd/ITforum. Its 75 minutes are packed with real-world troubleshooting examples, including the one written up in this post and others, as well as some that I haven’t documented. At the end of the session I ask the audience to send me screenshots, log files and descriptions of their own troubleshooting success stories, in return for which I’ll send back a signed copy of Windows Internals. The offer stands, so remember to document your investigation and you can get a free book. I’ve gotten a number of great examples and my next blog post will be a guest post by someone that watched the webcast and used Process Monitor to solve a problem with their web server.
Finally, if you want to see me speak live, come to TechEd US/IT Pro in June in Orlando where I’ll be delivering “The Case of the Unexplained…”, “Windows Server 2008 Kernel Advances”, and “Windows Security Boundaries”. Hope to see you there!
Helpful article - thanks!
I wonder if the tools can help in the same way to show the reason for the mysterious 90-100% CPU Windows Explorer hits that come and go for long periods in Vista. Can these be due to 3rd party code in worker threads too?
Mark, thanks for this, just the advice on getting Process Explorer running on boot was invaluable; a 450 MHz Pentium III is at full stretch with XP.
For the person who asked earlier about the processor usage of ProEx, it is 1%-2% on my rather elderly machine.
Great article! But...who is running Vista?
Awesome article! Unfortunately, we are all stretched so thin in IT that we rarely have the time to become such experts in this area. Mark makes it sound so easy (that is why he has a Ph.D.), but any systems administrator knows how difficult such a problem is to isolate, and even more so if it is intermittent.
San Jose, CA
I really liked this article, but what's killing me is the commit history section, where I can't tell what is taking up all my space. Is there some way to see the memory hogs and kill them like you can the CPU hogs?
I run Process Explorer as a service when the machine boots by using Microsoft's SrvAny.
It not only ensures that Process Explorer is running every time the machine boots, it also allows me to log on as a normal user and have Process Explorer running as NT AUTHORITY\SYSTEM.
I also have an AT command set to run it at 7am every day, so if it exited the previous day it's running again each morning.
Mark, can you ask Broadcom if the changes to the driver had anything to do with TCP/Offload or TOE?
I work for a large hosting company and we have had issues with Broadcom drivers on both DELL and HP servers.
If \HKEY_USERS\...\Sysinternals\Handle\EulaAccepted = 0 in the Registry, then handle.exe /accepteula does not work correctly.
@Jose: And how hard it is to get support from the NIC vendor (since support is usually through a computer manufacturer whose support, honestly, isn't geared toward troubleshooting NIC driver problems).
It would be nice if we could all look up a dev in a hardware manufacturer database and send them an email...
Thanks for the interesting post.
I had the same problem and following your article I was able to detect that my b57nd60a.sys driver is causing the same problem. I have a HP laptop, but apparently it has the same network card by Broadcom. I have version 10.10 of the driver and I am now waiting for HP to release a new driver.
Since I have a 64bit Vista, I used xperf. This is what I have done:
1) Installed Windows Performance Tools Kit from the link you provided
2) ran: C:\Program Files\Microsoft Windows Performance Toolkit>xperf -start -on diag
3) ran: C:\Program Files\Microsoft Windows Performance Toolkit>xperf -stop -d c:\temp\result.etl
4) Opened the ETL file in the viewer.
5) Clicked "Summary Table" on a CPU graph.
6) Expanded until I found the problematic item.
Very interesting indeed.
It was a great read! I can just say that maybe, to avoid this kind of investigation, one might just use Linux...
I am sending out a prayer.
All I ask is that Windows 7's top priority is speed. It should be faster than XP.
Vista was never even an option for me after I discovered lower benchmarks across the board.
Within my circle of friends, not a single one even considers Vista an option due to its performance problems.
Don't release Windows7 until you've got a codebase that is faster than XP, period.
Great post as usual. I'm a frequent user of the tools that you have written. Process Explorer and Process Monitor have helped me troubleshoot quite a few problems. I have started documenting some of my ordeals too. You have asked us to share our success stories with you, but you haven't mentioned how. Psst... what's your email address?
I found the article very helpful, so thanks Mark. Looking beyond desktop machines, can this be implemented to troubleshoot servers as well? Working in an organisation that serves internet web applications, is there any way to run these tools remotely rather than from the environment itself?
Look forward to your next article.
Great article! We happen to be analyzing resource spikes (CPU, RAM, HD) in client PCs with layered security built from several best-of-breed security products, so this is very helpful to us. Hopefully we'll be able to tweak layered security performance to eliminate unnecessary resource hogs but still maintain comprehensive security protection.