Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
This case was shared with me by a friend of mine, Andrew Richards, a Microsoft Exchange Server Escalation Engineer. It’s a really interesting case because it highlights the use of a Sysinternals tool I specifically wrote for use by Microsoft support services and it’s actually two cases in one.
The case unfolds with a systems administrator at a corporation contacting Microsoft support to report that users across their network were complaining of Outlook hangs lasting up to 15-minutes. The fact that multiple users were experiencing the problem pointed at an Exchange issue, so the call was routed to Exchange Server support services.
The Exchange team has developed a Performance Monitor data collector set that includes several hundred counters that have proven useful for troubleshooting Exchange issues, including LDAP, RPC and SMTP message activity, Exchange connection counts, memory usage and processor usage. Exchange support had the administrator collect a log of the server’s activity with 12 hour log cycles, the first from 9pm until 9am the next morning. When Exchange support engineers viewed the log, two patterns were clear despite the heavy density of the plots: first and as expected, the Exchange server’s load increased during the morning when users came into work and started using Outlook; and second, the counter graphs showed a difference in behavior between about 8:05 and 8:20am, a duration that corresponded exactly to the long delays users were reporting:
The support engineers zoomed in and puzzled over the counters in the timeframe and could see Exchange’s CPU usage drop, the active connection count go down, and outbound response latency drastically increase, but they were unable to identify a cause:
They escalated the case to the next level of support and it was assigned to Andrew. Andrew studied the logs and concluded that he needed additional information about what Exchange was doing during an outage. Specifically, he wanted a process memory dump of Exchange when it was in the unresponsive state. This would contain the contents of the process address space, including its data and code, as well as the register state of the process’s threads. Dump files of the Exchange process would allow Andrew to look at Exchange’s threads to see what was causing them to stall.
One way to obtain a dump is to “attach” to the process with a debugger like Windbg from the Debugging Tools for Windows package (included with the Windows Software Development Kit) and execute the .dump command, but downloading and installing the tools, launching the debugger, attaching to the right process, and saving dumps is an involved procedure. Instead, Andrew directed the administrator to download the Sysinternals Procdump utility (a single utility that you can run without installing any software on the server). Procdump makes it easy to obtain dumps of a process and includes options that create multiple dumps at a specified interval. Andrew asked the administrator to run Procdump the next time the server’s CPU usage dropped so that it would generate five dumps of the Exchange Server engine process, Store.exe, spaced three seconds apart:
procdump –n 5 –s 3 store.exe c:\dumps\store_mini.dmp
procdump –n 5 –s 3 store.exe c:\dumps\store_mini.dmp
The next day the problem reproduced and the administrator sent Andrew the dump files Procdump had generated. When a process temporarily hangs it’s often because one thread in the process acquires a lock protecting data that other threads need to access, and holds the lock while performing some long-running operation. Andrew’s first step was therefore to check for held locks. The most commonly used intra-process synchronization lock is a critical section and the !locks debugger command lists the critical sections in a dump that are locked, the thread ID of the thread owning the lock, and the number of threads waiting to acquire it. Andrew used a similar command, !critlist from the Sieext.dll debugger extension (the public version of which, Sieextpub.dll, is downloadable from here). The output showed that multiple threads were piled up waiting for thread 223 to release a critical section:
His next step was to see what the owning thread was doing, which might point at the code responsible for the long delays. He switched to the owning thread’s register context using the ~ command and then dumped the thread’s stack with the k command:
As we’ve seen in previous Case of the Unexplained cases, the debugger was unsure how to interpret the stack when it came across a stack frame pointing into Savfmsevsapi, an image for which it couldn’t obtain symbols. Most Windows images have their symbols posted on the Microsoft symbol server so this was likely a third-party DLL loaded into the Store.exe process and was therefore a suspect in the hangs. The list modules (“lm”) command dumps version information for loaded images and the path of the image made it obvious that Savfmsevsapi was part of Symantec’s mail security product:
Andrew checked the other dumps and they all had similar stack traces. With the anecdotal evidence seeming to point at a Symantec issue, Andrew forwarded the dumps and his analysis, with the administrator’s permission, to Symantec technical support. Several hours later they reported that the dumps indeed revealed a problem with the mail application’s latest antivirus signature distribution and forwarded a patch to the administrator that would fix the bug. He applied it and continued to monitor the server to verify the fix. Sure enough, the server’s performance established fairly regular activity levels and the long delays disappeared.
However, over the subsequent days the administrator started to receive, albeit at a lower rate, complaints from several users that Outlook was sporadically hanging for up to a minute. Andrew asked the administrator to send a correlating 12-hour Performance Monitor capture with the Exchange data collection set, but this time there was no obvious anomaly:
Wondering if the hangs would be visible in the CPU usage history of Store.exe, he removed all the counters except for Store’s processor usage counter. When he zoomed in on the morning hours when users began to login and load on the server increased, he noticed three spikes around 8:30am:
Because the server has eight cores, the processor usage counter for an individual process has a possible range between 0 and 800, so the spikes were far from taxing the system, but definitely higher than Exchange’s typical range on that system. Zooming in further and setting the graph’s vertical scale to make the spikes more distinct, he observed that average CPU usage was always below about 75% of a single core and the spikes were 15-30 seconds long:
What was Exchange doing during the spikes? They were too short-lived and random for the administrator to run Procdump like he had before and reliably capture dumps when they occurred. Fortunately, I designed Procdump with this precise scenario in mind. It supports several trigger conditions that when met, cause it to generate a dump. For example, you can configure Procdump to generate a dump of a process when the process terminates, when its private memory usage exceeds a certain value, or even based on the value of a performance counter you specify. Its most basic trigger, though, is the CPU usage of the process exceeding a specified threshold for a specified length of time.
The Performance Monitor log gave Andrew the information he needed to craft a Procdump command line that would capture dumps for future CPU spikes:
procdump.exe -n 20 -s 10 -c 75 -u store.exe c:\dumps\store_75pc_10sec.dmp
procdump.exe -n 20 -s 10 -c 75 -u store.exe c:\dumps\store_75pc_10sec.dmp
The arguments configure Procdump to generate a dump of the Store.exe process when Store’s CPU usage exceeds 75% (-c 75) relative to a single core (-u) for 10 seconds (-s 10), to generate up to 20 dumps (-n 20) and then exit, and to save the dumps in the C:\Dumps directory with names that begin with “store_75pc_10sec”. The administrator executed the command before leaving work and when he checked on its progress the next morning it had finished creating 20 dump files. He emailed them to Andrew, who proceeded to study them in the Windbg debugger one by one.
When Procdump generates a dump because the CPU usage trigger is met, it sets the thread context in the dump file to the thread that was consuming the most CPU at the time of the dump. Since the debugger’s stack-dumping commands are relative to the current thread context, simply entering the stack dumping command shows the stack of the thread most likely to have caused a CPU spike. Over half the dumps were inconclusive, apparently captured after the spike that triggered the dump had already ended, or with threads that were executing code that obviously wasn’t directly related to a spike. However, several of the dumps had stack traces similar to this one:
The stack frame that stuck out listed Store’s EcFindRow function, which implied that the spikes were caused by lengthy database queries, the kind that execute when Outlook accesses a mailbox folder with thousands of entries. With this clue in hand, Andrew suggested the administrator create an inventory of large mailboxes and pointed him at an article the Exchange support team had written that describes how to do this for each version of Exchange:
Sure enough, the script identified several users with folders containing tens of thousands of items. The administrator asked the users to reduce their item count to well below 5000 (the Exchange 2003 recommendation – this has been increased in each version with a recommendation of 100,000 in Exchange 2010) by archiving the items, deleting them, or organizing them into subfolders. Within a couple of days they had reorganized the problematic folders and user complaints ceased entirely. Ongoing monitoring of the Exchange server over the following week confirmed that the problem was gone.
With the help of Procdump, the compound case of the Outlook hangs was successfully closed.
I must completely agree: SysInternals rocks!
This is a fasinating discourse into the complexity of diagnosing complex client/server performance issues. Many thanks for sharing.
Superbly fascinating as always. Looking forward to the next one...
Half-reading this article i was suspecting for Symantec to be involved. Bingo!:) I had myself experienced server freezes because of Application Control driver of Symantec Endpoint Protection. Though i didn't use Procdump, just examined dumps with WinDbg. And we also do not install Network Threat Protection module on servers as it's likely to cause problems, especially on Hyper-V enabled servers. Everything from Symantec (server-side and client-side) seems to be evil.
SysInternals tools indeed rocks. I have inspected and solved lots of problems with them, including finding malware with Autoruns, examining hangs with Process Monitor.
Great article. I have a few questions: 1) What's the advantage of using procdump over adplus.vbs, as in adplus.vbs -hang -pn store.exe? I've had MS support advise me to run that command before. I know that it pauses the store.exe process while it's taking the dump, which means Exchange isn't available during that time. 2.) Does procdump also pause the store.exe process when used as shown in this article?
Fantastic as usual, Mark. Thanks for another great post.
Fantastic! Thanks for sharing this. I just love the science of debugging. And SysInternals does indeed rock :-)
Is this Exchange 2010? Too bad that the symbols are not available publicly for 2007 or 2010
-- SysAdmin-E --
Procdump is now the preferred dump tool used by Exchange Support because it :-
1) has better triggers,
2) is a tiny download & is xcopy deployable - the Debugging Tools for Windows are also xcopy'ble, but require an installation on another client prio (via the Windows SDK).
In this case, mini-dumps were taken (no -ma switch on procdump), so the hang (pause) lasted for <5sec, (i.e. no outage) - this is why we could take so many (-n 20)...
In the end, ADPlus.vbs and Procdump call the same APIs to do the dumping; Procdump is just easier to install, configure, run, remove, and debug (because of the added comment and default thread context).
-- ExpectedJoke --
This was Exchange 2003, hence the low (<5,000) folder item limit mentioned. Exchange 2007 and Exchange 2010 handle these expensive searches much better/faster.
PS: You can tell this is Exchange 2003 because it uses x86 stacks; all other versions of Exchange are x64. :-)
Thanks Mark, once again, great post. I have a couple of questions.
"Over half the dumps were inconclusive, apparently captured after the spike that triggered the dump had already ended, or with threads that were executing code that obviously wasn’t directly related to a spike."
1. How did you deduce that the 'inconclusive dumps' were not related to the spikes?
2. How did the EcFindRow frame 'stick out' from the other frames?
Debugging can be tackled two ways: use experience or use some generic techniques. As an Exchange Escalation Engineer I mainly use experience. I look at lots of user dumps (of Exchange) and kernel dumps (of Windows) every day, and just know what is or isn't a concern (it took less than 30min to review these 20 dumps). This of course isn't possible for the vast majority of people. As such, if you are not familiar with an application's thread stacks, you can apply some generic debugging techniques, and in most cases, review a dump with the public (or no) symbols, and only use a minidump (no -ma switch). Read on...
As Mark writes, the first step is to get a good set of dumps (at least a few hits out of the 20). The spike threshold used was very close to the operational (maximum) norm - close enough that a pegged single thread (+100%) would be easily picked up in the midst of 100-200 threads doing a few msec's of work each (per second) in a low usage period of the day. In this case, the collective total of all these relatively dormant threads is equivalent to a single thread using 20-75% of a single CPU. As mentioned, the first step is to gather performance counters so you can account for the background hum of your process; doing this correctly yields lots of positive dumps, and only a few false-positives. In this case, the biggest spike generated a maximum CPU usage of 147%, which means that the background hum (the other 200+ threads) only used 47% (147-100=47) at this time.
Usually, procdump scales CPU from 0..100 (no -u switch), so the 75% threshold has to be divided by 8 (the number of cores) to get the correct value for the switch. By using the -u switch, the -c value is much easier to determine when you have performance counters as the "% Processor" counter and "-u -c NN" value are scaled the same: 0..100% x Cores. Plus, using -u has the added benefit that you don't have to find out how many cores the server has so as to do the division!
"1. How did you deduce that the 'inconclusive dumps' were not related to the spikes?"
"2. How did the EcFindRow frame 'stick out' from the other frames?"
The Answer to #1 is in the 4 of 20 section.
The Answer to #2 is in the 6 of 20 section.
10 of 20: Because the dumps were taken over night, backup and online maintenane ran on the server... These operations involve only a single thread and generate a significant amount of CPU usage for 45-60sec. As such, they generated 10 of 20 dumps around 11pm and 2pm... The background usage of the Outlook clients was still there at around 20% during this period. In other words, these disk bound threads only needed to use 55% (75-20=55) at this time to trigger the dumps. These were positive dumps as the dump met the criteria, they just were not what we were looking for/concerned about.
4 of 20: When Procdump makes a CPU triggered dump, the default thread context is set to the thread that used the most amount of CPU in the monitoring period. When you open the dump in WinDBG, you'll get a cursor homed to this thread (e.g. 0:132>) - Note, this can sometimes be thread #0 (0:000>). In these 4 cases, the busy thread was actually doing nothing - how can this be? Well, in these cases, the worker thread WAS busy, but finished it's 10 seconds of usage JUST before the dump was taken. For example, it could have used 60% of this last second, and the other 200+ threads contributed 55% to the CPU usage - thus the total for the second was 115%, and the dump triggered - albeit too late for the culprit thread to be seen. Since, Exchange has worker threads, and the CPU spike thread was dormant. I classified these as false-positive dumps.
6 of 20: Six of the dumps had the EcFindRow operation. EcFindRow is the folder search operation of Exchange and happens every time you open a folder for viewing or searching. The bigger the folder, the longer it takes to run. Of interest, twice the triggerings were adjacent. That is, the spike lasted over 20sec (2x10sec) and two dumps were taken for the same EcFindRow operation. You can correlate two mini dumps by using the stacks. If you find a matching thread ID ("~.") and matching stack data ("kb"), you can assume that the operation is continuing in the second dump.
- Andrew Richards (MSFT)
A fantastic post - as usual. I always enjoy reading your posts. Thank you for taking the time to write it.
It's nice when you can get some graphs or data to look at but the issue I've been having is one that doesn't register in any sysinternal tool I've tried and it occurs at most daily taking few to dozen minutes before going away.
Thinking it now I do get some ideas. Since it rears its head as dropped frames in UI rendering, one should be able to create app that notices this drop in frame rate but since the rendering glitches last just milliseconds during which it's possible that something has the exclusive use of the system, it may be really problematic try to do a dump during the glitch if you can't even get a dump code running during one. Then again I don't recall the glitches affecting audio so maybe the glitches only affect some threads.
Wow that was an excellent entry. It's actually inspiring for us who works on performance problems. I'm working on Java applications, but the process is the same. We don't have the procdump facility to monitor performance counter spikes. But it is still possible to script them, either on Unix or Windows.