Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
Besides Aero Glass, one of the most visible features of Windows Vista is the Sidebar with its set of default Gadgets, like the clock, RSS feed, and photo viewer. The convenience of having frequently accessed information on the desktop and the ease of developing Gadgets have led to the availability of literally thousands of third-party Gadgets through sites like the Windows Vista Gadget Gallery. I’ve downloaded and installed a few out of curiosity, in some cases keeping them in my Sidebar’s standard configuration, and had never experienced a problem. A few days after installing a batch of new Gadgets, however, I noticed that a third-party clock Gadget had stopped updating, so I set out to investigate.
My system was otherwise functioning normally, so my first step was to see if something was amiss with the Sidebar’s configuration. I right-clicked on the Sidebar screen area and selected the Properties menu item, but instead of displaying the Sidebar configuration dialog, the Sidebar crashed:
Gadgets run inside of shared Sidebar processes, so my first thought was that memory corruption in the Sidebar process had caused both the clock to stop and the subsequent crash. Verifying that theory required analyzing the crash. The Windows Error Reporting (WER) service creates a crash-dump file, which is the saved state of a faulting process, in case you agree to send information to Microsoft about a problem. I clicked open the View Details area to see where Windows had saved the dump:
The last path displayed by the dialog, WERD8EE.tmp.mdmp, is a dump file, so I launched Windbg, from the Microsoft Debugging Tools for Windows, and opened the file. When you open a dump file, Windbg automatically shows you the instruction that ultimately led to the crash. In this case, it was a memory copy operation in Msvcrt, the Microsoft C Runtime:
The right side of the line showing the instruction indicates that the target address of the copy is 0. When a memory resource is exhausted, memory-allocation functions typically return address 0, also known as a NULL pointer, because that’s an illegal address by default for a Windows process (an application can manually create read/write memory at address zero, but in general it’s not done). The fact that Sidebar referenced address 0 didn’t conclusively mean the crash was due to low memory rather than corruption, but it appeared likely.
I next looked at the code that led to the crash, which would tell me if it was a Gadget or the Sidebar itself that had passed a NULL pointer to the C Runtime. To do so, I opened Windbg’s stack dialog:
I had previously configured Windbg’s symbol path to point at the Microsoft symbol server so that Windbg reports names of internal functions in Windows images, because knowing function names can often make understanding a dump file easier. The functions listed in the stack trace implied that Sidebar was querying the version of a “package” when it crashed. I’m not sure what the Sidebar calls a package, but the trace did seem to show that Sidebar was the culprit, not a Gadget.
So had Sidebar run out of memory? There are several types of resource exhaustion that can cause a memory allocation to fail. For example, the system could have run out of committable memory, the process could have consumed all the memory in its own address space, or an internal heap could have reached its maximum size.
I started by checking the committed memory, since that was quick. Total committable memory, also known as the commit limit, is the sum of the paging file(s) and most of physical memory. When committable memory runs low, Windows Vista’s low-resource watchdog warns you by presenting a list of processes consuming the most memory and gives you the option of terminating them to relieve the memory pressure. I hadn’t seen a warning, so I doubted that this was the cause, but opened Process Explorer’s System Information dialog to check anyway:
As I suspected, there was plenty of available committable memory. I next looked at Sidebar’s virtual memory usage. Memory leaks are caused when a process allocates virtual memory, stores some data in it, uses the data, but doesn’t free the memory when it’s done with the data. Virtual memory that processes allocate to store their own data is called Private Bytes, so I opened Process Explorer and added the Private Bytes column:
On a 32-bit Windows system, processes have 2 GB of address space available to them by default, so the highest possible Private Bytes value is close to 2 GB, which is exactly what the Sidebar process with process ID 4680 had consumed. That confirmed it: a memory leak in Sidebar caused it to run out of address space, which in turn caused a memory allocation to fail, which finally caused a NULL-pointer reference and a crash. I suspect that the clock stopped when Sidebar’s address space was exhausted and the clock Gadget couldn’t allocate resources to update its graphic.
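The address-space reasoning above can be sketched as a small check. This is only an illustration, not anything Process Explorer or the Sidebar does: the function names, the 95% threshold, and the sample Private Bytes figures are assumptions made up for the example. Only the 2 GB default limit comes from the text.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Default user address space for a process on 32-bit Windows, per the text above.
USER_ADDRESS_SPACE_32BIT = 2 * GIB


def address_space_headroom(private_bytes: int,
                           limit: int = USER_ADDRESS_SPACE_32BIT) -> int:
    """Return how many bytes of address space remain, never below zero."""
    return max(limit - private_bytes, 0)


def near_exhaustion(private_bytes: int,
                    limit: int = USER_ADDRESS_SPACE_32BIT,
                    threshold: float = 0.95) -> bool:
    """Return True when Private Bytes has reached an (assumed) fraction of the limit."""
    return private_bytes >= threshold * limit


if __name__ == "__main__":
    # Hypothetical readings: one healthy process, one close to the 2 GB ceiling.
    for label, value in [("healthy", 200 * MIB), ("leaking", int(1.98 * GIB))]:
        print(label,
              "headroom MiB:", address_space_headroom(value) // MIB,
              "near exhaustion:", near_exhaustion(value))
```

Private Bytes is only a lower bound on address-space consumption, since mapped images and shared memory also occupy address space, so a check like this can miss exhaustion but should not report it falsely.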
Next I had to determine which Gadget was causing the leak, which may or may not have been the frozen clock Gadget. The Sidebar consists of two processes, one Sidebar.exe process that hosts the Windows Gadgets and a child Sidebar.exe process for third-party Gadgets. At this point I knew that a third-party Gadget had leaked memory or caused the Sidebar to leak, but I had several third-party Gadgets running and I didn’t know which one to blame. Unfortunately, the Sidebar offers no way to track memory usage by Gadget (or any other resource usage for that matter), so I had to apply manual steps to isolate the leak.
After restarting the Sidebar, I removed the third-party Gadgets and added them back one at a time, leaving each to run for a minute or two while I monitored Sidebar’s Private Bytes usage. I added the Private Bytes Delta column to Process Explorer’s display to make it easy to spot increases, and after adding one of the Gadgets I started to see periodic positive Private Bytes Delta values, implicating it as the leaker:
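The isolation procedure above amounts to sampling Private Bytes, computing the change between samples, and watching for growth that never reverses. Here is a minimal sketch of that logic, assuming the samples have already been collected by hand. The Gadget names, sample values, and the "at least three growth steps, no shrinkage" rule are hypothetical choices for illustration, not how Process Explorer computes its Private Bytes Delta column.

```python
from typing import Dict, List, Optional


def private_bytes_deltas(samples: List[int]) -> List[int]:
    """Return the change between each pair of consecutive Private Bytes samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def looks_like_leak(samples: List[int], min_growth_steps: int = 3) -> bool:
    """Flag a series that grows at least min_growth_steps times and never shrinks."""
    deltas = private_bytes_deltas(samples)
    growth_steps = sum(1 for delta in deltas if delta > 0)
    shrink_steps = sum(1 for delta in deltas if delta < 0)
    return growth_steps >= min_growth_steps and shrink_steps == 0


def first_leaking_gadget(observations: Dict[str, List[int]]) -> Optional[str]:
    """Return the first Gadget, in the order added, whose samples look like a leak."""
    for gadget, samples in observations.items():
        if looks_like_leak(samples):
            return gadget
    return None


if __name__ == "__main__":
    # Hypothetical Private Bytes readings (in MB) taken after adding each Gadget.
    observations = {
        "weather": [50, 50, 50, 50, 50],
        "clock": [50, 52, 52, 54, 56],
        "stocks": [60, 60, 61, 60, 60],
    }
    print(first_leaking_gadget(observations))
```

Stopping at the first match matters when a leak persists after the offending Gadget is removed, because every Gadget added afterwards would then appear to grow as well.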
Now that I knew the guilty Gadget, I could have simply uninstalled it and considered the case closed. But I was curious to know how the Gadget had managed to cause a leak in the Sidebar – a leak that persisted even after I removed the Gadget.
The fact that it was using these APIs correctly meant that the leak was in the Sidebar’s code, but a quick Internet search didn’t turn up any mentions of a leak in the background object. If Sidebar APIs had a memory leak, why wasn’t it well known? I scanned the source code of the other Gadgets on my system and discovered that none of them used the APIs, which explained why the leak isn't commonly encountered. However, comments in the Windows Gadget Gallery for the Gadget that inadvertently caused the leak revealed that other users had noticed it.
Having tracked the original unresponsive Gadget problem down to a leaky Sidebar API, I filed a bug in the Windows bug database and closed the case.
I have seen it on my machine too, and the symbols led me to the same leak scenario. Quite interesting stuff.
By the way, is there a public release of the 18.104.22.168.1 Windbg?
Awesome investigation, as usual.
However, I think you are doing a major disservice to us in the IT community by not revealing the name of the Gadget. A rogue Gadget like this can cripple an entire Vista system; on PCs with 1 GB of RAM and relatively little virtual memory, people will keep running into slowness and not know why.
Was the author of the Gadget notified, at least? That would give him or her a chance to come up with a workaround for the time being.
I wish that the debug symbols were more accessible to non-developers. It seems like you have to have an expensive MSDN membership just to have access to these tools, yet a lot of employers won't justify that for a person in a test or systems engineering role who could find such tools useful for troubleshooting, even though they don't necessarily sling code as their primary job.
The Windbg debugger and debug symbols are available to anybody. See the link to the symbol server in the blog post.
1. Somebody isn't checking for NULL after calling malloc(). sidebar.exe should never crash.
This is false. If you get a NULL pointer after a malloc, what can you do?
Trying to prevent that only makes the program crash elsewhere, and it's really tough to track back the error in that case because the stack trace is irrelevant. I suggest only adding debug asserts.
May I add that powerful memory-leak detection tools exist for both native and managed code. A trivial leak like this might have been easily found.
It's not only a bug.
It also looks like bad design.
These canvas APIs should not need to perform a file version check, should they?
Or am I missing something?
Amazing post! really gets in depth to the "digging to the bug" process in a very educative and entertaining way.
Quick question, please: I'm trying to create a folder named "con" on the Windows Vista desktop, and I'm receiving the error message "The specified device name is invalid"...
Any explanation? :(
Con is a reserved name in Win32 that represents serial ports.
Thanks for the feedback.
1. If you get NULL after allocating memory, then you should inform the user, not crash!
2. If "con" is a reserved device name, then say so instead of giving the uninformative message "the device name is invalid"!
The second issue has existed for years in Windows and is still not fixed. Also,
3. When you overwrite or delete a file in Windows Explorer that is in use, Windows still does not tell you which application is using the file so that you can close that application before deleting or overwriting the file.
This issue has also existed for years.
By the way, "con" represents the console (keyboard) I think.
Inform the user how? Bear in mind that anything you do can't involve allocating memory...
And if you _do_ inform the user, what then? It's not like you can do anything about it, so you'll have to close anyway.
Closing your application and informing the user is, not coincidentally, what happens when an application crashes, with the added advantage that a bug report is generated that might lead to the issue being fixed.
I've submitted those automatic crash reports for years, without seeing any change after many updates. I've concluded that if I'm very lucky and the crash is resolved, it will be in the next version of Windows, not this one.
Mark, have you seen the bug in the Calendar Gadget?
http://www.fayerwayer.com/up/2007/10/vista-calendario.png (13th August appears twice)