Welcome to TechNet Blogs Sign in | Join | Help

The Case of the System Process CPU Spikes

As you’ve probably surmised by my blog posts and other writings, I like knowing exactly what my systems are doing. I want to know if a process is running away with the CPU, causing memory pressure, or hitting the disk. Besides keeping my computers running smoothly, my vigilance sometimes helps me spot performance and reliability problems in Windows and third-party code.

The main way I keep tabs on things is to configure Process Explorer to run automatically when I log in. Whenever I configure a new computer, I add a shortcut to Process Explorer to my profile’s Start directory that includes the /t (minimize) switch. Process Explorer runs otherwise hidden with tray icon that shows a small historical view of CPU activity level. Because I want access to detailed information about system processes, as well as my own, I also specify the /e option on Vista, which causes Windows to present a UAC prompt on logon that allows me to grant Process Explorer administrative rights.

Because I keep an eye out for CPU spikes in Process Explorer’s tray icon, which show up as green or red for user-mode (application) and kernel-mode (operating system and drivers) CPU usage, respectively, I’ve identified several application bugs over the last few months. In this post, I’ll share how I used both Process Explorer and another tool, Kernrate, to identify a problem with a third-party driver and followed the problem through to a fix by the vendor.

Not long after I got a new laptop several months ago, I noticed that the system sometimes felt sluggish. Process Explorer’s tray icon corroborated my perception by displaying a mini-graph of red CPU activity. The icon opens a tooltip that reports the name of the process consuming the most CPU when you move the mouse over it, and in this case the tooltip showed the System process as being responsible:

 The first few times I noticed the problem, it resolved itself shortly after and I didn’t have a chance to troubleshoot. However, I could see by opening Process Explorer’s System Information dialog that the CPU spikes were significant:

 

The System process is special because it doesn’t host an executable image like other processes. It exists solely to host operating system threads for the memory manager, cache manager, and other subsystems, as well as device driver threads.  These threads execute entirely in kernel mode, which is why System process CPU usage shows up as red in Process Explorer’s graphs.

I suspected that a third-party device driver was the cause of the problem, so the first step in my investigation was to figure out which thread was using CPU, which would hopefully point me at the guilty party. I watched vigilantly for signs of trouble every time I switched networks and jumped the first time I saw one. Process Explorer shows the threads running in a process on the Threads page of the Process Properties dialog, so I double-clicked on the System process and switched to the Threads page the next time I noticed the CPU spike:

The “ntkrnlpa.exe” prefix on each thread’s start address identified the ones I saw at the top of the CPU usage sort order as operating system threads (Ntkrnlpa.exe is the version of the kernel loaded on 32-bit client systems that have no execute memory protection or server systems that need to address more than 4GB of memory). Because I had previously configured Process Explorer to retrieve symbols for operating system images from the Microsoft public symbol server, the thread list also showed the names of the thread start functions. The most active threads began in the ExpWorkerThread function, which means that they were worker threads that perform work on behalf of the system and device drivers. Instead of creating dedicated threads that consume memory resources, the system and drivers can throw work at the shared pool of operating system worker threads.

Unfortunately, knowing that worker threads were causing the CPU usage didn’t get me any closer to solving identifying a root cause. I really needed to know what functions the worker threads were calling, because the functions would be inside the device driver or operating system component on whose behalf the threads were running. One way to look inside a thread’s execution is to look at the thread’s stack with Process Explorer. The stack is a memory region that stores function invocations and Process Explorer will show you a thread’s stack when you select the thread press the Stack button or double-click on the thread. On Vista, however, you get this error when you try and look at the stack for threads in the System process:

The System process is a special type of process on Vista called a “protected process” that doesn’t allow any access to its threads or memory. Protected processes were introduced to support Digital Rights Management (DRM) so that hi-definition content providers can store content encryption keys with a reduced risk of an administrative user using DRM-stripping tools to reach into the process and read the keys.

That approach foiled, I had to find another way to see what the worker threads were doing. For that, I turned to KernRate, a command-line profiling tool that’s a free download from Microsoft. KernRate can profile user-mode processes and kernel-mode threads. It uses the sample-based profiling facility that was introduced in the first release of Windows NT, which records the unique addresses at which the CPU is executing when the profiling interval timer fires. When you stop a profile capture, Kernrate retrieves the information from the kernel, maps the addresses to the loaded device drivers into which the fall, and can even use the symbol engine to report the names of functions.

I wouldn’t need symbols if the trace identified a device driver, so I ran Kernrate without passing it any arguments. Despite the fact that there’s no officially supported version of Kernrate for Vista, the version for Windows XP, Kernrate_i386_XP.exe, works on Vista 32-bit (you can also use the recently-released xperf tool to perform similar profiling - xperf requires Vista or Server 2008, but works on 64-bit versions). I let the profile run through heavy bursts of CPU and then hit Ctrl+C to print the results to the console window:

In first place were hits in the kernel, but in second was a driver that I didn’t recognize, b57nd60x. Most driver files are located in the %systemroot%\system32\drivers directory, so I could have opened that folder and viewed the file’s properties in Explorer, but I had Process Explorer open so a quicker way to check the driver’s vendor and version was to open the DLL view for the System process. The DLL view shows the DLLs and files mapped into the address space of user-mode processes, but for the System process it shows the kernel modules, including drivers, loaded on the system. The DLL view revealed that the driver was for my laptop’s NIC, was from Broadcom, and was version 10.10:

 

Now that I knew that the Broadcom driver was causing the CPU usage, the next step was to see if there was a newer version available. I went to Dell’s download page for my system, but didn’t find anything. Suspecting that what I noticed might not be a known issue, I decided to notify Broadcom. I used contacts on the hardware ecosystem team here at Microsoft to find the Broadcom driver representative and email him a detailed description of the symptoms and my investigation. He forwarded my email to the driver developer, who acknowledged that they didn’t know the cause and within a few days sent me a debug version of the driver with symbols so that I could capture a Kernrate profile that would tell them what functions in the driver were active during the spikes. The problem reoccurred a few days later and I sent back the kernrate output with function information.

The developer explained that my trace revealed that the driver didn’t efficiently interact with the PCIe bus when processing specific queries and the problem seemed to be exacerbated by my particular hardware configuration. He gave me new driver for me to try and after a few weeks of monitoring my laptop closely for issues, I confirmed that the problem appeared to be resolved. The updated driver has not yet been posted to Dell’s support site, but I expect it to show up there in the near future. Another case closed, this time with Process Explorer, Kernrate, and a helpful Broadcom driver developer.

If you like these troubleshooting blog posts, you’ll enjoy the webcast of my “Case of the Unexplained…” session from TechEd/ITforum. Its 75 minutes are packed with real-world troubleshooting examples, including the one written up in this post and others, as well as some that I haven’t documented. At the end of the session I ask the audience to send me screenshots, log files and descriptions of their own troubleshooting success stories, in return for which I’ll send back a signed copy of Windows Internals. The offer stands, so remember to document your investigation and you can get a free book. I’ve gotten a number of great examples and my next blog post will be a guest post by someone that watched the webcast and used Process Monitor to solve a problem with their web server.

Finally, if you want to see me speak live, come to TechEd US/IT Pro in June in Orlando where I’ll be delivering “The Case of the Unexplained…”, “Windows Server 2008 Kernel Advances”, and “Windows Security Boundaries”. Hope to see you there!

Inside Vista SP1 File Copy Improvements

Windows Vista SP1 includes a number of enhancements over the original Vista release in the areas of application compatibility, device support, power management, security and reliability. You can see a detailed list of the changes in the Notable Changes in Windows Vista Service Pack 1 whitepaper that you can download here. One of the improvements highlighted in the document is the increased performance of file copying for multiple scenarios, including local copies on the same disk, copying files from remote non-Windows Vista systems, and copying files between SP1 systems. How were these gains achieved? The answer is a complex one and lies in the changes to the file copy engine between Windows XP and Vista and further changes in SP1. Everyone copies files, so I thought it would be worth taking a break from the “Case of…” posts and dive deep into the evolution of the copy engine to show how SP1 improves its performance.

 

Copying a file seems like a relatively straightforward operation: open the source file, create the destination, and then read from the source and write to the destination. In reality, however, the performance of copying files is measured along the dimensions of accurate progress indication, CPU usage, memory usage, and throughput. In general, optimizing one area causes degradation in others. Further, there is semantic information not available to copy engines that could help them make better tradeoffs. For example, if they knew that you weren’t planning on accessing the target of the copy operation they could avoid caching the file’s data in memory, but if it knew that the file was going to be immediately consumed by another application, or in the case of a file server, client systems sharing the files, it would aggressively cache the data on the destination system.

 

File Copy in Previous Versions of Windows

In light of all the tradeoffs and imperfect information available to it, the Windows file copy engine tries to handle all scenarios well. Prior to Windows Vista, it took the straightforward approach of opening both the source and destination files in cached mode and marching sequentially through the source file reading 64KB (60KB for network copies because of an SMB1.0 protocol limit on individual read sizes) at a time and writing out the data to the destination as it went. When a file is accessed with cached I/O, as opposed to memory-mapped I/O or I/O with the no-buffering flag, the data read or written is stored in memory, at least until the Memory Manager decides that the memory should be repurposed for other uses, including caching the data of other files.

 

The copy engine relied on the Windows Cache Manager to perform asynchronous read-ahead, which essentially reads the source file in the background while Explorer is busy writing data to a different disk or a remote system. It also relied on the Cache Manager’s write-behind mechanism to flush the copied file’s contents from memory back to disk in a timely manner so that the memory could be quickly repurposed if necessary, and so that data loss is minimized in the face of a disk or system failure. You can see the algorithm at work in this Process Monitor trace of a 256KB file being copied on Windows XP from one directory to another with filters applied to focus on the data reads and writes:

 

 

Explorer’s first read operation at event 0 of data that’s not present in memory causes the Cache Manager to perform a non-cached I/O, which is an I/O that reads or writes data directly to the disk without caching it in memory, to fetch the data from disk at event 1, as seen in the stack trace for event 1:

 

 

In the stack trace, Explorer’s call to ReadFile is at frame 22 in its BaseCopyStream function and the Cache Manager invokes the non-cached read indirectly by touching the memory mapping of the file and causing a page fault at frame 8.

 

Because Explorer opens the file with the sequential-access hint (not visible in trace), the Cache Manager’s read-ahead thread, running in the System process, starts to aggressively read the file on behalf of Explorer at events 2 and 3. You can see the read-ahead functions in the stack for event 2:

 

 

You may have noticed that the read-ahead reads are initially out of order with respect to the original non-cached read caused by the first Explorer read, which can cause disk head seeks and slow performance, but Explorer stops causing non-cached I/Os when it catches up with the data already read by the Cache Manager and its reads are satisfied from memory.  The Cache Manager generally stays 128KB ahead of Explorer during file copies.

 

At event 4 in the trace, Explorer issues the first write and then you see a sequence of interleaved reads and writes. At the end of the trace the Cache Manager’s write-behind thread, also running in the System process, flushes the target file’s data from memory to disk with non-cached writes.

 

Vista Improvements to File Copy

During Windows Vista development, the product team revisited the copy engine to improve it for several key scenarios. One of the biggest problems with the engine’s implementation is that for copies involving lots of data, the Cache Manager write-behind thread on the target system often can’t keep up with the rate at which data is written and cached in memory. That causes the data to fill up memory, possibly forcing other useful code and data out, and eventually, the target’s system’s memory to become a tunnel through which all the copied data flows at a rate limited by the disk.  

 

Another problem they noted was that when copying from a remote system, the file’s contents are cached twice on the local system: once as the source file is read and a second time as the target file is written. Besides causing memory pressure on the client system for files that likely won’t be accessed again, involving the Cache Manager introduces the CPU overhead that it must perform to manage its file mappings of the source and destination files.

 

A limitation of the relatively small and interleaved file operations is that the SMB file system driver, the driver that implements the Windows remote file sharing protocol, doesn’t have opportunities to pipeline data across high-bandwidth, high-latency networks like WLANs. Every time the local system waits for the remote system to receive data, the data flowing across the network drains and the copy pays the latency cost as the two systems wait for the each other’s acknowledgement and next block of data.

 

After studying various alternatives, the team decided to implement a copy engine that tended to issue large asynchronous non-cached I/Os, addressing all the problems they had identified. With non-cached I/Os, copied file data doesn’t consume memory on the local system, hence preserving memory’s existing contents. Asynchronous large file I/Os allow for the pipelining of data across high-latency network connections, and CPU usage is decreased because the Cache Manager doesn’t have to manage its memory mappings and inefficiencies of the original Vista Cache Manager for handling large I/Os contributed to the decision to use non-cached I/Os. They couldn’t make I/Os arbitrarily large, however, because the copy engine needs to read data before writing it, and performing reads and writes concurrently is desirable, especially for copies to different disks or systems. Large I/Os also pose challenges for providing accurate time estimates to the user because there are fewer points to measure progress and update the estimate. The team did note a significant downside of non-cached I/Os, though: during a copy of many small files the disk head constantly moves around the disk, first to a source file, then to destination, back to another source, and so on.

 

After much analysis, benchmarking and tuning, the team implemented an algorithm that uses cached I/O for files smaller than 256KB in size. For files larger than 256KB, the engine relies on an internal matrix to determine the number and size of non-cached I/Os it will have in flight at once. The number ranges from 2 for files smaller than 2MB to 8 for files larger than 8MB. The size of the I/O is the file size for files smaller than 1MB, 1MB for files up to 2MB, and 2MB for anything larger.

 

To copy a file 16MB file, for example, the engine issues eight 2MB asynchronous non-cached reads of the source file, waits for the I/Os to complete, issues eight 2MB asynchronous non-cached writes of the destination, waits again for the writes to complete, and then repeats the cycle. You can see that pattern in this Process Monitor trace of a 16MB file copy from a local system to a remote one:

 

 

While this algorithm is an improvement over the previous one in many ways, it does have some drawbacks. One that occurs sporadically on network file copies is out-of-order write operations, one of which is visible in this trace of the receive side of a copy:

 

 

Note how the write operation offsets jump from 327,680 to 458,752, skipping the block at offset 393,216. That skip causes a disk head seek and forces NTFS to issue an unnecessary write operation to the skipped region to zero that part of the file, which is why there are two writes to offset 393,216. You can see NTFS calling the Cache Manager’s CcZeroData function to zero the skipped block in the stack trace for the highlighted event:

 

 

A bigger problem with using non-cached I/O is that performance can suffer in publishing scenarios. If you copy a group of files to a file share that represents the contents of a Web site for example, the Web server must read the files from disk when it first accesses them. This obviously applies to servers, but most copy operations are publishing scenarios even on client systems, because the appearance of new files causes desktop search indexing, triggers antivirus and antispyware scans, and queues Explorer to generate thumbnails for display on the parent directory’s folder icon.

 

Perhaps the biggest drawback of the algorithm, and the one that has caused many Vista users to complain, is that for copies involving a large group of files between 256KB and tens of MB in size, the perceived performance of the copy can be significantly worse than on Windows XP. That’s because the previous algorithm’s use of cached file I/O lets Explorer finish writing destination files to memory and dismiss the copy dialog long before the Cache Manager’s write-behind thread has actually committed the data to disk; with Vista’s non-cached implementation, Explorer is forced to wait for each write operation to complete before issuing more, and ultimately for all copied data to be on disk before indicating a copy’s completion. In Vista, Explorer also waits 12 seconds before making an estimate of the copy’s duration and the estimation algorithm is sensitive to fluctuations in the copy speed, both of which exacerbate user frustration with slower copies.

 

SP1 Improvements

During Vista SP1’s development, the product team decided to revisit the copy engine to explore ways to improve both the real and perceived performance of copy operations for the cases that suffered in the new implementation. The biggest change they made was to go back to using cached file I/O again for all file copies, both local and remote, with one exception that I’ll describe shortly. With caching, perceived copy time and the publishing scenario both improve. However, several significant changes in both the file copy algorithm and the platform were required to address the shortcomings of cached I/O I’ve already noted.

 

The one case where the SP1 file copy engine doesn't use caching is for remote file copies, where it prevents the double-caching problem by leveraging support in the Windows client-side remote file system driver, Rdbss.sys. It does so by issuing a command to the driver that tells it not to cache a remote file on the local system as it is being read or written. You can see the command being issued by Explorer in the following Process Monitor capture:

 

 

Another enhancement for remote copies is the pipelined I/Os issued by the SMB2 file system driver, srv2.sys, which is new to Windows Vista and Windows Server 2008. Instead of issuing 60KB I/Os across the network like the original SMB implementation, SMB2 issues pipelined 64KB I/Os so that when it receives a large I/O from an application, it will issue multiple 64KB I/Os concurrently, allowing for the data to stream to or from the remote system with fewer latency stalls.

 

The copy engine also issues four initial I/Os of sizes ranging from 128KB to 1MB, depending on the size of the file being copied, which triggers the Cache Manager read-ahead thread to issue large I/Os. The platform change made in SP1 to the Cache Manager has it perform larger I/O for both read-ahead and write-behind. The larger I/Os are only possible because of work done in the original Vista I/O system to support I/Os larger than 64KB, which was the limit in previous versions of Windows. Larger I/Os also improve performance on local copies because there are fewer disk accesses and disk seeks, and it enables the Cache Manager write-behind thread to better keep up with the rate at which memory fills with copied file data. That reduces, though not necessarily eliminates, memory pressure that causes active memory contents to be discarded during a copy. Finally, for remote copies the large I/Os let the SMB2 driver use pipelining. The Cache Manager issues read I/Os that are twice the size of the I/O issued by the application, up to a maximum of 2MB on Vista and 16MB on Server 2008, and write I/Os of up to 1MB in size on Vista and up to 32MB on Server 2008.

 

This trace excerpt of a 16MB file copy from one SP1 system to another shows 1MB I/Os issued by Explorer and a 2MB Cache Manager read-ahead, which is distinguished by its non-cached I/O flag:

 

 

Unfortunately, the SP1 changes, while delivering consistently better performance than previous versions of Windows, can be slower than the original Vista release in a couple of specific cases. The first is when copying to or from a Server 2003 system over a slow network. The original Vista copy engine would deliver a high-speed copy, but, because of the out-of-order I/O problem I mentioned earlier, trigger pathologic behavior in the Server 2003 Cache Manager that could cause all of the server’s memory to be filled with copied file data. The SP1 copy engine changes avoid that, but because the engine issues 32KB I/Os instead of 60KB I/Os, the throughput it achieves on high-latency connections can approach half of what the original Vista release achieved.

 

The other case where SP1 might not perform as well as original Vista is for large file copies on the same volume. Since SP1 issues smaller I/Os, primarily to allow the rest of the system to have better access to the disk and hence better responsiveness during a copy, the number of disk head seeks between reads from the source and writes to the destination files can be higher, especially on disks that don’t avoid seeks with efficient internal queuing algorithms.

 

One final SP1 change worth mentioning is that Explorer makes copy duration estimates much sooner than the original Vista release and the estimation algorithm is more accurate.

 

Summary

File copying is not as easy as it might first appear, but the product team took feedback they got from Vista customers very seriously and spent hundreds of hours evaluating different approaches and tuning the final implementation to restore most copy scenarios to at least the performance of previous versions of Windows and drastically improve some key scenarios. The changes apply both to Explorer copies as well as to ones initiated by applications using the CopyFileEx API and you’ll see the biggest improvements over older versions of Windows when copying files on high-latency, high-bandwidth networks where the large I/Os, SMB2’s I/O pipelining, and Vista’s TCP/IP stack receive-window auto-tuning can literally deliver what would be a ten minute copy on Windows XP or Server 2003 in one minute. Pretty cool.

The Case of the Missing AutoPlay

 I’ve been presenting talks on Windows Vista kernel changes since TechEd US in the summer of 2006 and one of the features I cover in the session is ReadyBoost, a write-through disk caching technology that can potentially improve system performance by leveraging flash media as a disk cache. I explain ReadyBoost in depth in my TechNet Magazine article, “Inside the Windows Vista Kernel: Part 2”, but the basic idea is that, since flash has significantly better random access latency than disk, ReadyBoost intercepts disk accesses and directs random-access reads to its cache when the cache holds the data, but sends sequential access to directly to the disk. During my presentation, I insert a USB key, whereupon Windows displays an AutoPlay dialog that includes an option to configure the device for ReadyBoost caching:

 

The first time I gave the talk, the demonstration went flawlessly, but in subsequent deliveries I didn’t get the AutoPlay experience. I would notice the lack of AutoPlay as I ran through the demonstrations before a session, but was always pressed for time and so couldn’t investigate. As a workaround, I would manual open the properties dialog of the device’s volume after insertion to show the ReadyBoost page that’s displayed when you click on the “Speed up my system” link on the AutoPlay dialog.

The last time I presented the session, at TechEd/ITforum in Barcelona in November, I had some extra time beforehand so I decided to find out why AutoPlay wasn’t working. The first thing I did was to check the AutoPlay settings, which you configure in the AutoPlay section of the Control Panel’s Hardware and Sound page. Some of the entries were set to “Ask me every time”, which shouldn’t have had any effect, and even after resetting to the defaults, AutoPlay still didn’t work:

 

At this point I had to look under the hood at an insertion’s associated Registry and file system activity to see if that would reveal the reason why Explorer wasn’t honoring the Control Panel’s AutoPlay settings. I ran Process Monitor, configured the filter to include Explorer’s Registry operations, and re-inserted the key. Then I stopped the capture and looked at what Process Monitor had collected.

A staggering 22,000 events meant that scanning through the trace event-by-event would take hours and there were no obvious error codes to search for, so I had to think of some keyword that might lead me to the relevant lines. I first searched for “autoplay”, but came up empty. I knew that Explorer looks for a file named Autorun.inf in the root directory of removable media volumes, which can contain pointers to an icon to show for the volume and an executable that launches when the user double-clicks on the volume, so I next searched for “autorun”. The first hit didn’t look interesting because it referred to the volume’s mount-point GUID, information that Windows generates dynamically when it notices a new volume:

 

The next hits were just a few entries later and all referred to values that store Group Policy settings:

 

The queries of the first two locations resulted in NAME NOT FOUND errors, indicating that the policies weren’t defined, but a query of HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer\NoDriveTypeAutoRun was successful. Process Monitor showed the value Explorer had read in the Details column:

 

I didn’t know how to interpret a setting of 255, so I executed a Web search for “nodrivetypeautorun” and found a page in the Windows 2000 Resource Kit that describes the value as a bitmask specifying which device types have AutoPlay disabled. A value of 255 decimal (0xFF hexadecimal) disables AutoPlay on all devices:

 

I used Process Monitor’s Jump-To functionality to launch Regedit and navigate directly to the value, opened the value editor, and changed the setting to 0 to enable AutoPlay on all devices. Next I had to test the change. I removed and reinserted the key and, to my satisfaction, the AutoPlay dialog appeared. Note that on Windows Vista, AutoPlay no longer means "automatically execute what's in Autorun.inf", but rather, "show me my options", so I wasn't introcuding a potential security issue.

The case was almost closed, but I had one detail to wrap up. AutoPlay was disabled on my system by the Group Policy configuration of the Microsoft domain to which the system is joined. That explained why the demonstration had worked for the first few times: my first deliveries of the session were before I had joined Microsoft. It also meant that the value would get restored to its previous setting the next time I logged on and Group Policy reapplied the domain’s configuration. If I happened to logon before the session the demonstration would break again.

There’s no way to opt out of Group Policy updates short of removing the system from the domain or never connecting to the domain. However, because I have local administrative rights, I realized that I could prevent Group Policy from changing the value by setting the permissions on the policy’s key such that Group Policy wouldn’t have permission to do so. Group Policy processing occurs in the Local System account, so I opened Regedit’s permissions editor and removed write access for the Local System account:

 

I was now confident that the demonstration would work for my current delivery of the Vista Kernel Changes session, as well as any future ones, and I closed the case. Besides highlighting Process Monitor’s usefulness for uncovering a root cause, this example also illustrates the power of local administrative rights. A local administrator is the master of the computer and is able to do anything they want, including circumventing domain policies, something I covered in a previous blog post, and that's just one more reason enterprises should strive to have their end users run as standard users.

The Case of the Frozen Clock Gadget

Besides Aero Glass, one of the most visible features of Windows Vista is the Sidebar with its set of default Gadgets, like the clock, RSS feed, and photo viewer. The convenience of having frequently-accessed information on the desktop and the ease of their development has led to the availability of literally thousands of third-party Gadgets through sites like the Windows Vista Gadget Gallery. I’ve downloaded and installed a few out of curiosity, and in some cases kept them in my Sidebar’s standard configuration, and never experienced a problem.  A few days after installing a batch of new Gadgets, however, I noticed that a third-party clock Gadget had stopped updating, and so I set out to investigate.

My system was otherwise functioning normally, so my first step was to see if something was amiss with the Sidebar’s configuration. I right-clicked on the Sidebar screen area and selected the Properties menu item, but instead of displaying the Sidebar configuration dialog, the Sidebar crashed:

Gadgets run inside of shared Sidebar processes, so my first thought was that memory corruption in the Sidebar process had caused the clock to stop and subsequent crash, and verifying that theory required that I analyze the crash. The Windows Error Reporting (WER) service creates a crash-dump file, which is the saved state of a faulting process, in case you agree to send information to Microsoft about a problem. I clicked open the View Details area to see where Windows had saved the dump:

The last path displayed by the dialog, WERD8EE.tmp.mdmp, is a dump file, so I launched the Microsoft Debugging Tools for Windows Windbg utility and opened the file. When you open a dump file, Windbg automatically shows you the instruction that ultimately lead to the crash. In this case, it was a memory copy operation in Msvcrt, the Microsoft C Runtime:

The right side of the line showing the instruction indicates that the target address of the copy is 0. When a memory resource is exhausted, memory-allocation functions typically return address 0, also known as a NULL pointer, because that’s an illegal address by default for a Windows process (an application can manually create read/write memory at address zero, but in general it’s not done). The fact that Sidebar referenced address 0 didn’t conclusively mean the crash was due to low-memory instead of corruption, but it appeared likely.

I next looked at the code that led to the crash, which would tell me if it was a Gadget or the Sidebar itself that had passed a NULL pointer to the C Runtime. To do so, I opened Windbg’s stack dialog:

I had previously configured Windbg’s symbol path to point at the Microsoft symbol server so that Windbg reports names of internal functions in Windows images, because knowing function names can often make understanding a dump file easier. The functions listed in the stack trace implied that Sidebar was querying the version of a “package” when it crashed. I’m not sure what the Sidebar calls a package, but the trace did seem to show that Sidebar was the culprit, not a Gadget.

So had Sidebar run out of memory? There are several types of resource exhaustion that can cause a memory allocation to fail. For example, the system could have run out of committable memory, the process could have consumed all the memory in its own address space, or an internal heap could have reached its maximum size.

I started by checking the committed memory, since that was quick. Total commitable memory, also known as the commit limit, is the sum of the paging file(s) and most of physical memory. When commitable memory runs low, Windows Vista’s low-resource watchdog warns you by presenting a list of processes consuming the most memory and gives you the option of terminating them to relieve the memory pressure. I hadn’t seen a warning, so I doubted that this was the cause, but opened Process Explorer’s System Information dialog to check anyway:

 

As I suspected, there was plenty of available Committable memory. I next looked at Sidebar’s virtual memory usage. Memory leaks are caused when a process allocates virtual memory, stores some data in it, uses the data, but doesn’t free the memory when it’s done with the data. Virtual memory that processes allocate to store their own data is called Private Bytes, so I opened Process Explorer and added the Private Bytes column:

On a 32-bit Windows system, processes have 2 GB of address space available to them by default, so the highest possible Private Bytes value is close to 2 GB, which is exactly what the Sidebar process with process ID 4680 had consumed. That confirmed it: a memory leak in Sidebar caused it to run out of address space, which in turned caused a memory allocation to fail, which finally caused a NULL-pointer reference and a crash. I suspect that the clock stopped when Sidebar’s address space was exhausted and the clock Gadget couldn’t allocate resources to update its graphic.

Next I had to determine which Gadget was causing the leak, which may or may not have been the frozen clock Gadget. The Sidebar consists of two processes, one Sidebar.exe process that hosts the Windows Gadgets and a child Sidebar.exe process for third-party Gadgets. At this point I knew that a third-party Gadget had leaked memory or caused the Sidebar to leak, but I had several third-party Gadgets running and I didn’t know which one to blame. Unfortunately, the Sidebar offers no way to track memory usage by Gadget (or any other resource usage for that matter), so I had to apply manual steps to isolate the leak.

After restarting the Sidebar, I removed the third-party Gadgets and added them back one at a time, leaving each to run for a minute or two while I monitored Sidebar’s Private Bytes usage. I added the Private Bytes Delta column to Process Explorer’s display to make it easy to spot increases, and after adding one of the Gadgets I started to see periodic positive Private Bytes Delta values, implicating it as the leaker:

Now that I knew the guilty Gadget, I could have simply uninstalled it and considered the case closed. But I was curious to know how the Gadget had managed to cause a leak in the Sidebar – a leak that persisted even after I removed the Gadget.

I navigated to the Gadget’s install directory and opened its HTML file to see what it was doing. The Gadget consists of around 3-dozen lines of pretty simple Javascript and I didn’t spot anything amiss. To narrow in on the problematic code, I began commenting out pieces and re-adding the Gadget to the Sidebar until the leak disappeared. The code I was left with was a function the Gadget configured to execute every 10 seconds to update its graphics. It called the Sidebar background object’s RemoveObjects method and then added back graphics and text by calling the background’s AddImageObject method. Here’s a simplified version of the relevant code:

The fact that it was using these APIs correctly meant that the leak was in the Sidebar’s code, but a quick Internet search didn’t turn up any mentions of a leak in the background object. If Sidebar APIs had a memory leak, why wasn’t it well known? I scanned the source code to the other Gadget’s on my system and discovered that none of them used the APIs, which explained why the leak isn't commonly encountered. However, comments in the Windows Gadget Gallery for the Gadget that inadvertently caused the leak revealed that other users had noticed it.

Having tracked the original unresponsive Gadget problem down to a leaky Sidebar API, I filed a bug in the Windows bug database and closed the case.

The Case of the Failed File Copy

The other day a friend of mine called me to tell me that he was having a problem copying pictures to a USB flash drive. He’d been able to copy over two hundred files when he got this error dialog, after which he couldn’t copy any more without getting the same message:

Unfortunately, the message, “The directory or file cannot be created”, provides no clue as to the underlying cause and the dialog explains that the error is unexpected and does not suggest where you can find the “additional help” to which it refers.  My friend was sophisticated enough to make sure the drive had plenty of free space and he ran Chkdsk to check for corruption, but the scan didn’t find any problem and the error persisted on subsequent attempts to copy more files to the drive. At a loss, he turned to me.

I immediately asked him to capture a trace with Process Monitor, a real-time file system and registry monitoring tool, which would offer a look underneath the dialogs to reveal actual operating system errors returned by the file system.  He sent me the resulting Process Monitor PML file, which I opened on my own system. After setting a filter for the volume in question to narrow the output to just the operations related to the file copy, I went to the end of the trace to look back for errors. I didn’t have to look far, because the last line appeared to be the operation with the error causing the dialog:

To save screen space, Process Monitor strips the “STATUS” prefix from the errors it displays, so the actual operating system error is STATUS_CANNOT_MAKE.  I’d never seen or even heard of this error message. In fact, the version of Process Monitor at the time showed a raw error code, 0xc00002ea, instead of the error’s display name, and so I had to look in the Windows Device Driver Kit’s Ntstatus.h header file to find the display name and add it to the Process Monitor function that converts error codes to text.

At that point I could have cheated and searched the Windows source code for the error, but I decided to see how someone without source access would troubleshoot the problem. A Web search took me to this old thread in a newsgroup for Windows file system developers:

Sure enough, the volume was formatted with the FAT file system and the number of files on the drive, including those with long file names, could certainly have accounted for the use of all available 512 root-directory entries.

I had solved the mystery. I told my friend he had two options: he could create a subdirectory off the volume’s root and copy the remaining files into there, or he could reformat the volume with the FAT32 file system, which removes the limitation on entries in the root directory.

One question remained, however. Why was the volume formatted as FAT instead of FAT32? The answer lies with both the USB drive makers and Windows format dialog. I’m not sure what convention the makers follow, but my guess is that many format their drives with FAT simply because it’s the file system guaranteed to work on virtually any operating system, including those that don’t support FAT32, like DOS 6 and Windows 95.

As for Windows, I would have expected it to always default to FAT32, but a quick look at the Format dialog’s pick for one of my USB drives showed I was wrong:

I couldn’t find the guidelines used by the dialog anywhere on the Web, so I looked at the source and found that Windows defaults to FAT for non-CD-ROM removable volumes that are smaller than 4GB in size.

I’d consider this case closed, but I have two loose ends to follow up on: see if I can get the error message fixed so that it’s more descriptive, and lobby to get the default format changed to FAT32.  Wish me luck.

Vista Multimedia Playback and Network Throughput

A few weeks ago a poster with the handle dloneranger reported in the 2CPU forums that he experienced reduced network throughput on his Vista system when he played audio or video. Other posters chimed in with similar results, and in the last week attention has been drawn to the behavior by other sites, including Slashdot and Zdnet blogger Adrian Kingsley-Hughes.

Many people have correctly surmised that the degradation in network performance during multimedia playback is directly connected with mechanisms employed by the Multimedia Class Scheduler Service (MMCSS), a feature new to Windows Vista that I covered in my three-part TechNet Magazine article series on Windows Vista kernel changes. Multimedia playback requires a constant rate of media streaming, and playback will glitch or sputter if its requirements aren’t met. The MMCSS service runs in the generic service hosting process Svchost.exe, where it automatically prioritizes the playback of video and audio in order to prevent other tasks from interfering with the CPU usage of the playback software:

When a multimedia application begins playback, the multimedia APIs it uses call the MMCSS service to boost the priority of the playback thread into the realtime range, which covers priorities 16-31, for up to 8ms of every 10ms interval of the time, depending on how much CPU the playback thread requires. Because other threads run at priorities in the dynamic priority range below 15, even very CPU intensive applications won’t interfere with the playback.

You can see the boost by playing an audio or video clip in Windows Media Player (WMP), running the Reliability and Performance Monitor (Start->Run->Perfmon), selecting the Performance Monitor item, and adding the Priority Current value for all the Wmplayer threads in the Thread object. Set the graph scale to 31 (the highest priority value on Windows) and you’ll easily spot the boosted thread, shown here running at priority 21:

Besides activity by other threads, media playback can also be affected by network activity. When a network packet arrives at system, it triggers a CPU interrupt, which causes the device driver for the device at which the packet arrived to execute an Interrupt Service Routine (ISR). Other device interrupts are blocked while ISRs run, so ISRs typically do some device book-keeping and then perform the more lengthy transfer of data to or from their device in a Deferred Procedure Call (DPC) that runs with device interrupts enabled. While DPCs execute with interrupts enabled, they take precedence over all thread execution, regardless of priority, on the processor on which they run, and can therefore impede media playback threads.

Network DPC receive processing is among the most expensive, because it includes handing packets to the TCP/IP driver, which can result in lengthy computation. The TCP/IP driver verifies each packet, determines the packet’s protocol, updates the connection state, finds the receiving application, and copies the received data into the application’s buffers. This Process Explorer screenshot shows how CPU usage for DPCs rose dramatically when I copied a large file from another system:

Tests of MMCSS during Vista development showed that, even with thread-priority boosting, heavy network traffic can cause enough long-running DPCs to prevent playback threads from keeping up with their media streaming requirements, resulting in glitching. MMCSS’ glitch-resistant mechanisms were therefore extended to include throttling of network activity. It does so by issuing a command to the NDIS device driver, which is the driver that gives packets received by network adapter drivers to the TCP/IP driver, that causes NDIS to “indicate”, or pass along, at most 10 packets per millisecond (10,000 packets per second).

Because the standard Ethernet frame size is about 1500 bytes, a limit of 10,000 packets per second equals a maximum throughput of roughly 15MB/s. 100Mb networks can handle at most 12MB/s, so if your system is on a 100Mb network, you typically won’t see any slowdown. However, if you have a 1Gb network infrastructure and both the sending system and your Vista receiving system have 1Gb network adapters, you’ll see throughput drop to roughly 15%.

Further, there’s an unfortunate bug in the NDIS throttling code that magnifies throttling if you have multiple NICs. If you have a system with both wireless and wired adapters, for instance, NDIS will process at most 8000 packets per second, and with three adapters it will process a maximum of 6000 packets per second. 6000 packets per second equals 9MB/s, a limit that’s visible even on 100Mb networks.

I caused throttling to be visible on my laptop, which has three adapters, by copying a large file to it from another system and then starting WMP and playing a song. The Task Manager screenshot below shows how the copy achieves a throughput of about 20%, but drops to around 6% on my 1Gb network after I start playing a song:

You can monitor the number of receive packets NDIS processes by adding the “packets received per second” counter in the Network object to the Performance Monitor view. Below, you can see the packet receive rate change as I ran the experiment. The number of packets NDIS processed didn’t realize the theoretical throttling maximum of 6,000, probably due to handshaking with the remote system.

Despite even this level of throttling, Internet traffic, even on the best broadband connection, won’t be affected. That’s because the multiplicity of intermediate connections between your system and another one on the Internet fragments packets and slows down packet travel, and therefore reduces the rate at which systems transfer data.

The throttling rate Vista uses was derived from experiments that reliably achieved glitch-resistant playback on systems with one CPU on 100Mb networks with high packet receive rates. The hard-coded limit was short-sighted with respect to today’s systems that have faster CPUs, multiple cores and Gigabit networks, and in addition to fixing the bug that affects throttling on multi-adapter systems, the networking team is actively working with the MMCSS team on a fix that allows for not so dramatically penalizing network traffic, while still delivering a glitch-resistant experience.

Stay tuned to my blog for more information.

The Case of the Failed File Compression

The other day Bryce tried to use Explorer’s Send To Compressed (zipped) Folder feature, seen below, to package up his latest Process Monitor source code updates to send me.

Instead of presenting compression progress dialog followed by an opportunity to edit the name of resulting compressed file, Explorer aborted the compression with this error:

Bryce was perplexed. The error didn’t seem to make any sense because he obviously had read permission to the files in the selection, which he’d just finished editing, and compressing files shouldn’t involve some kind of search that could result in a file not being found.  He retried the compression operation, but got the same error, this time after a different number of files had finished compressing.

I happened to walk into his office at this point and he showed me the behavior by trying a few more times, all with the same outcome.  Now both of us were perplexed. It was time to investigate, and the tool we called on for the job was, somewhat ironically, Process Monitor.

We launched Process Monitor, reproduced the failure, stopped the capture, and scanned through the thousands of operations in the trace looking for errors. We saw a slew of NOT FOUND errors near the start of the log, which are the generally innocuous result of an application checking for the pre-existence of a file. In fact, there were literally hundreds of them near the beginning of the log, all of which were queries for the file into which the compressed files would be placed:

That was disturbing, but not directly related to our troubleshooting effort, so I filed it away to look at later.

Several hundred events into the trace later we came across a SHARING VIOLATION error that bore a closer look:

When a process opens a file it can specify if and how it wants to share the file with other processes while it has the file opened. The three types of sharing are read, write and delete, and each is represented with a flag that a process passes to the CreateFile API. In the operation that failed, Explorer didn’t pass any of the flags, indicating that it didn’t want to share the file, as seen in the ShareMode field:

For an open to succeed, the sharing mode of the opener must be compatible with the sharing allowed by a process that already has the file opened, so the explanation for the error was that another process already had the file opened.

Looking back at the trace, the open operation immediately preceding the one with the error is an open of the same file by a process named Inort.exe. Inort’s close of the file isn’t visible in the screenshot because it comes long after Explorer’s failed attempt to open the file. That confirmed that Explorer’s unwillingness to share the file conflicted with Inort having the file open, despite the fact that Inort specified read, write and delete sharing in its open of the file.

Process Monitor had closed another case: Inort holding the file open when Explorer tried to open it was the cause of the sharing violation and almost certainly the reason for the misleading error message. Next we had to identify Inort so that we could come up with a fix or workaround. Process Monitor also answered that question with its image tooltip:

eTrust, Computer Associates’ Antivirus scanner, was apparently opening the file to scan it for viruses, but interfering with the operation of Explorer. Antivirus should be invisible to the system, so the error revealed a bug in eTrust. The workaround was for Bryce to set a directory filter that excludes his source directories from real-time scanning. 

I couldn’t reproduce the error when I went back to my office, so I suspected that I had a different version of Inoculan on my system than Bryce. Process Monitor’s process page on the event properties dialog for an Inort.exe event showed that Bryce had version 7.01.0192.0001 and I had the more recent 7.01.0501.000:

Why we have different versions isn’t clear since we’re both using images deployed and managed by Microsoft IT, but it appears that Computer Associates has fixed the bug in newer releases.

Now I turned my attention back to the inefficiencies of Explorer’s compression feature. I captured a Process Monitor trace of the compression of a single file and counted the associated operations. Just for this simple case, Explorer opened the target ZIP file 14 times, 12 of those before it had actually created the file and therefore with NOT FOUND results, and performed directory look ups of the target 19 times. It was also redundant with the source file, opening it 28 times and querying the file’s basic properties 17 times. It’s not like Explorer doesn’t give eTrust plenty of opportunities to cause sharing problems.

In order to verify that Explorer itself was at fault, and not some third-party extension, I looked at the stacks of various events by selecting the event and typing Ctrl+K to open the event properties dialog to the stack page:

Zipfldr.dll, the Explorer file compression DLL, was in most of the stack traces, meaning that the compression engine itself was ultimately responsible for the waste. Further, the number of repetitious operations explodes when you compress multiple files. There are clearly easy ways to improve the algorithm, so hopefully we’ll see a more efficient compression engine in Windows 7.

Update: I've learned that the compression engine has been updated in Vista SP1 to perform fewer file operations.

On a closing note, if you’d like to catch me at my next public speaking engagement, come to Wintellect’s Devscovery conference in Redmond, August 14-16, where I’m delivering a keynote on Vista kernel changes.

The Case of the Unexpected PsList Error

Not long after I deployed Windows Vista on my main desktop system I noticed that a process became unresponsive and appeared to be consuming excessive amounts of CPU. I had a command prompt handy, so I ran PsList to dump detailed information about the process as one of my troubleshooting steps. Instead of reporting apage full of statistics like I expected, however, PsList printed its banner, an error message, and exited:

 

PsList obtains information from the system performance counters, which an application accesses using standard Registry functions directed at the virtual HKEY_PERFORMANCE_DATA key, so the message indicated that PsList was unable to query the virtual performance keys. When you point PsList at a remote system and don’t have administrative rights on that system, or the system isn’t running the Remote Registry service, then PsList reports the same error, but I had never seen the message when using PsList to look at a local system. Something was different about Windows Vista and I set out to learn what.

Putting my original troubleshooting mission on hold, I launched Process Monitor and repeated the PsList command with Process Monitor looking on. I didn’t have a firm expectation that it would reveal the cause of the problem, but experience has taught me that Process Monitor (and its predecessors Filemon and Regmon) often solves seemingly unexplainable problems like this. I scanned the trace looking for anomalous error codes, because when they’re present they almost always point to the source of a problem, and found an access denied error:

For some reason, PsList, running as a standard user because I hadn’t elevated the command prompt from which I ran it, was unable to open the PerfLib registry key for read access. I was perplexed because on Windows XP I had been able to run PsList as a standard user. I launched Regedit, navigated to the key, and viewed its permissions. As I suspected, standard users aren’t members of any of the groups the permissions grant access (note the presence of two new performance-related groups, Performance Log Users and Performance Monitor Users):

 

I quickly confirmed that to be the reason for PsList’s failure by granting the Interactive Users group read access to the key and verifying that PsList subsequently worked. Now I was left with the question of what permissions Windows XP assigns the key. I switched to a Windows XP test system and viewed the key’s permissions. Sure enough, Interactive Users have read access, explaining why PsList works as a standard user on Windows XP systems:

I then pondered the reason for the change. I suspected the new permissions close an information disclosure hole, but after some thought I concluded that they aren’t closing any hole.  The PerfLib key is where performance providers register their counters and DLLs, so when a tool like PsList queries a counter the performance API loads the associated DLL into the querying process and calls functions in the DLL that return the desired data. Because the DLLs execute in the context of the process into which they load, they can’t implement security that can’t be easily circumvented by the process. It’s therefore the responsibility of a performance data source, which might be the kernel or an application like Internet Information Server (IIS), to prevent unauthorized access to its performance data.

Preventing read access to the PerfLib key is therefore the equivalent of having a performance DLL implement security. While locking down the key prevents the performance API from determining what counters are available and what DLLs provide performance data, with the exception of add-on applications, the core registrations are constant from system to system. That means that a process can circumvent any protection the locked-down key is attempting to provide by directly loading performance DLLs and calling their data functions.

To make a long story short, I filed a bug against Windows Vista Service Pack 1 (SP1) and Windows Server 2008 to have Interactive Users added back to PerfLib’s permissions. The reliability and diagnostics team reported back that the permissions changed inadvertently during the release of Windows Server 2003, but I convinced them it didn’t make sense, so in SP1 and Windows Server 2008 you won’t need to edit PerfLib’s permissions to be able to run tools like PsList as a standard user. 

Another case closed by Process Monitor!

The Case of the Insecure Security Software

A little over a year ago I set out to determine exactly why, prior to Window Vista, the Power Users security group was considered by most to be the equivalent of the Administrators group. I knew the answer lay in the fact that default Windows permissions allow the group to modify specific Registry keys and files  that  enable members of the group to elevate their privileges to that of the Local System or Administrators group, but I didn’t know of any concrete examples. I could have manually investigated the security on every file, directory and Registry key, but instead decided to write a utility, AccessChk, that would answer questions like this automatically. AccessChk quickly showed me directories, files, keys, and even Windows services written by third parties, that Power Users could modify to cause an elevation of privilege. I posted my findings in my blog post The Power in Power Users.

Since the posting, AccessChk has grown in popularity as a system security auditing tool that helps identify weak permissions problems. I’ve recently received requests from groups within Microsoft and elsewhere to extend its coverage of securable objects analyzed to include the Object Manager namespace (which stores named mutexes, semaphores and memory-mapped files), the Service Control Manager, and named pipes.

When I revisited the tool to add this support, I reran some of the same queries I had performed when I wrote the blog post, like seeing what system-global objects the Everyone and Users groups can modify.  The ability to change those objects almost always indicates the ability for unprivileged users to compromise other accounts, elevate to system or administrative privilege, or prevent services or programs run by the system or other users from functioning. For example, if an unprivileged user can change an executable in the %programfiles% directory they might be able to cause another user to execute their code. Some applications include Windows services, so if a user could change the service executable they could obtain system privileges.

These local elevation-of-privilege and denial-of-service holes are unimportant on single-user systems where the user is an administrator, but on systems where a user expects to be secure when running as a standard user (like Windows Vista), and on shared computers like a family PCs that have unprivileged accounts, Terminal Server systems, and kiosk computers, they break down the security boundaries that Windows provides to separate unprivileged users from each other and from the system. 

In my testing I executed AccessChk commands to look for potential security issues in each of the namespaces it supports. In the commands below, the -s option has AccessChk recurse a namespace, -w has it list only the objects for which the specified group – Everyone in the examples – has write access, and -u directs AccessChk to not report errors when it can’t query objects for which your account lacks permissions. The other switches indicate what namespace to examine, where the default is the file system.

File system:           accesschk everyone -wsu %programfiles%
File system:           accesschk everyone -wsu %systemroot%
Registry:                accesschk everyone -kwsu hklm
Processes:     &nbs