Talking about curious case to troubleshoot, this was a very good one. First because this was a new deployment, a well planned hardware for the deployment and the amount of traffic hitting TMG was not that huge. Everything seems to be okay, other than the fact that at least twice a week TMG was stopping responding and it was necessary to restart the whole server. To make things even more bizarre, such issue was happening in random days but always within the range of 8AM and 11AM.
2. Gathering Data
On this scenario TMG administrator couldn’t really get a dump of the wspsrv.exe process because he couldn’t even log in locally. It was necessary to prepare the machine for a kernel dump capture, for that I used the procedures from KB969028.
3. Analyzing the Data
On a scenario where the machine completely stops responding one of the most useful commands to type while analyze a kernel dump is the !locks command. By definition (windbg help), the !locks command will:
“Display all locks held on resources by threads. A lock can be shared or exclusive, which means no other threads can gain access to that resource. This information is useful when a deadlock occurs on a system. A deadlock is caused by one non-executing thread holding an exclusive lock on a resource that the executing thread needs.”
In this case here it is the result of this command:
0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks...........................................................................................................................................
Resource @ 0xfffffa800dbad748 Shared 1 owning threads
Contention Count = 493
Threads: fffffa800e09db63-01<*> *** Actual Thread fffffa800e09db60
KD: Scanning for held locks......
Resource @ 0xfffffa800bc2a7b8 Exclusively owned
KD: Scanning for held locks..........................................................................................
7461 total locks, 2 locks currently held
Next step is to verify which thread is that:
0: kd> !thread fffffa800e09db60
THREAD fffffa800e09db60 Cid 0004.0680 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (UserRequest) KernelMode Non-Alertable
fffffa8010231bd0: (0006,0430) Flags: 00060901 Mdl: fffffa8010287000
Owning Process fffffa8009a3a040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 26737768 Ticks: 3 (0:00:00:00.046)
Context Switch Count 940461
Win32 Start Address vhdmp!VhdmpiAsyncOpThread (0xfffff880076bcdb0)
Stack Init fffff88008267db0 Current fffff88008267440
Base fffff88008268000 Limit fffff88008262000 Call 0
Priority 9 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff880`08267480 fffff800`01ac9992 : 00000000`0000000c fffffa80`0e09db60 fffffa80`00000000 fffffa80`10231bd0 : nt!KiSwapContext+0x7a
fffff880`082675c0 fffff800`01acbcff : 00000000`0000139c fffffa80`0cfc4060 00000000`00000000 fffffa80`0bf1a9a0 : nt!KiCommitThreadWait+0x1d2
fffff880`08267650 fffff800`01dbd1d2 : 00000000`00000000 ffffffff`00000006 00000000`00000000 00000000`00000000 : nt!KeWaitForSingleObject+0x19f
fffff880`082676f0 fffff800`01ac1993 : fffffa80`0e09db60 00000000`00200200 00000000`00000000 fffffa80`0cfc4060 : nt!NtWaitForSingleObject+0xb2
fffff880`08267760 fffff800`01abdf30 : fffff880`07694640 fffffa80`0cde80e0 fffffa80`0bc2a000 ffffffff`8000139c : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`08267760)
fffff880`082678f8 fffff880`07694640 : fffffa80`0cde80e0 fffffa80`0bc2a000 ffffffff`8000139c 00000000`000001d1 : nt!KiServiceLinkage
fffff880`08267900 fffff880`076b7ee2 : 00000000`00000000 fffff880`076b7e5c fffffa80`0cde80e0 fffffa80`0cde80e0 : vhdmp!VhdmpiReadVhdFileAtPassiveLevel+0x50
fffff880`08267970 fffff880`076bd9e0 : 00000000`00000000 00000000`000001d1 fffff8a0`0c2f2000 00000000`00001d10 : vhdmp!VhdmpiCopyRawSectorsSync+0x72
fffff880`082679c0 fffff880`076bdf11 : fffff8a0`01f26000 00000000`00022dbe fffff880`08267b90 fffff880`08267ac8 : vhdmp!VhdmpiRegisterWithRootEnumerator+0x220
fffff880`08267a60 fffff880`076c05e4 : 00000000`00000001 00000000`00000000 00000000`00022dbe fffffa80`0bc2a000 : vhdmp!VhdmpiShutdownNestingLevels+0x3b1
fffff880`08267b60 fffff880`076c083a : 00000000`00000000 00000000`00000001 fffffa80`0e573c00 fffffa80`0e573c00 : vhdmp!VhdmpiDoCompaction+0x124
fffff880`08267ca0 fffff880`076bce02 : ffffffff`ff676980 fffffa80`0e09db60 00000000`00000080 fffffa80`0cde80e0 : vhdmp!VhdmpiCompactThread+0x13a
fffff880`08267d00 fffff800`01d66c06 : fffffa80`0e09db60 00000000`00000080 fffffa80`09a3a040 00000000`00000001 : vhdmp!VhdmpiAsyncOpThread+0x52
fffff880`08267d40 fffff800`01aa0c26 : fffff880`009bf180 fffffa80`0e09db60 fffffa80`09a3ab60 fffff880`01452534 : nt!PspSystemThreadStartup+0x5a
fffff880`08267d80 00000000`00000000 : fffff880`08268000 fffff880`08262000 fffff880`08267450 00000000`00000000 : nt!KxStartSystemThread+0x16
Notice that this thread has an I/O request packet (IRP) and for scenarios like this (server freezes) it is very important to review the IRP for that thread that seems to be causing the issue. To do that you need to use the command !irp, here it is the result for this case:
0: kd> !irp fffffa8010231bd0
Irp is active with 12 stacks 9 is current (= 0xfffffa8010231ee0)
Mdl=fffffa8010287000: No System Buffer: Thread fffffa800e09db60: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
>[ 3,34] 10 e1 fffffa800a03f060 00000000 fffff88000e34180-fffffa800a04de20 Success Error Cancel pending
Args: 00200200 00000000 3bd4e26200 00000000
[ 3, 0] 0 e0 fffffa800a04dcd0 00000000 fffff88001792ee0-fffffa800de75450 Success Error Cancel
Args: 00200200 00000000 3bd4d26200 00000000
[ 3, 0] 0 e1 fffffa800a047040 00000000 fffff8800145ef70-fffffa800dabd550 Success Error Cancel pending
[ 3, 0] 0 1 fffffa800a069030 fffffa800df75880 00000000-00000000 pending
Args: 00200200 00000000 21ed477200 00000000
The major function 3,34 has the pending status set, which appears to be stuck on \Drover\Disk (Volume Managr – volmgr). Let’s take a closer look on the value fffffa800a03f060, which we can see by using the command !devobj as shown below:
0: kd> !devobj fffffa800a03f060
Device object (fffffa800a03f060) is for:
DR2 \Driver\Disk DriverObject fffffa8009e53e70
Current Irp 00000000 RefCount 0 Type 00000007 Flags 01002050
Vpb fffffa8009fd50b0 Dacl fffff9a10043d7f0 DevExt fffffa800a03f1b0 DevObjExt fffffa800a03f858 Dope fffffa8009feb450
Unknown flags 0x00000800
AttachedDevice (Upper) fffffa800a03fb90 \Driver\partmgr
AttachedTo (Lower) fffffa8009fe9b70 \Driver\USBSTOR
This is a USB driver, which is something that I was not expecting to have a this TMG box at all, so I decided to before even move forward ask to the TMG administrator what a USB driver is doing on TMG. TMG Administrator then clarified the scenario as follow: “We have a USB external driver attached to the system to store the full backup of the system that we perform every night.”
The dump is quiet clear, in other words: server is waiting for the disk. At this point I can conclude that the server was hanging waiting for this USB driver, which for some reason was not responding, as a consequence the whole box stop responding (including TMG, but clearly not a TMG issue).
After start reviewing the logs from the backup software we could conclude that: the backup job was supposed to finish at 6AM, however some days was extending beyond that, when the backup was running ahead of 8AM and the production started on TMG the system presented this behavior. Backup software was performing a full backup of the system, including TMG cache folder (which by itself was 4GB). We do not recommend making a backup of TMG cache folder, this folder needs to be excluded from backup. Making backup of the cache folder can cause other issue of this nature, such as crashing Firewall Service as I mentioned in this post. To resolve this problem we excluded the TMG cache folder from the backup and re-schedule the backup job to start at 11PM and let it run until 7AM in an attempt to avoid backup job to go beyond commercial business hours.
it looks like some cases I´ve seen at a customer...An USB driver to store the backup??? Wow!!!
Good Job Yuri..