We just release a new update for Forefront TMG 2010 called Update 1, this update one includes:
• SafeSearch Enforcement. Forefront TMG can enforce blocking adult text, images and videos from search results by popular search engines. SafeSearch can be enforced on specific groups or to the entire organization. • Including non-primary URL filtering categorizations. Forefront TMG uses an algorithm to select a URL’s “primary” category from among up to four categorizations provided by Microsoft Reputation Services (MRS). In Update 1 you can control access to sites that match any of the non-primary categorizations provided by MRS. For example, a URL with a primary categorization of News can now match a rule by any of its non-primary categorizations (such as Web Mail). • Support for Exchange 2010 SP1 • Bug fixes and various other improvements. Details will be soon available at http://go.microsoft.com/fwlink/?LinkId=201151.
Go get it here.
1. Introduction
Talking about curious case to troubleshoot, this was a very good one. First because this was a new deployment, a well planned hardware for the deployment and the amount of traffic hitting TMG was not that huge. Everything seems to be okay, other than the fact that at least twice a week TMG was stopping responding and it was necessary to restart the whole server. To make things even more bizarre, such issue was happening in random days but always within the range of 8AM and 11AM.
2. Gathering Data
On this scenario TMG administrator couldn’t really get a dump of the wspsrv.exe process because he couldn’t even log in locally. It was necessary to prepare the machine for a kernel dump capture, for that I used the procedures from KB969028.
3. Analyzing the Data
On a scenario where the machine completely stops responding one of the most useful commands to type while analyze a kernel dump is the !locks command. By definition (windbg help), the !locks command will:
“Display all locks held on resources by threads. A lock can be shared or exclusive, which means no other threads can gain access to that resource. This information is useful when a deadlock occurs on a system. A deadlock is caused by one non-executing thread holding an exclusive lock on a resource that the executing thread needs.”
In this case here it is the result of this command:
0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks...........................................................................................................................................
Resource @ 0xfffffa800dbad748 Shared 1 owning threads
Contention Count = 493
Threads: fffffa800e09db63-01<*> *** Actual Thread fffffa800e09db60
KD: Scanning for held locks......
Resource @ 0xfffffa800bc2a7b8 Exclusively owned
Threads: fffffa800e09db60-01<*>
KD: Scanning for held locks..........................................................................................
7461 total locks, 2 locks currently held
Next step is to verify which thread is that:
0: kd> !thread fffffa800e09db60
THREAD fffffa800e09db60 Cid 0004.0680 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (UserRequest) KernelMode Non-Alertable
fffffa800cfc4060 SynchronizationEvent
IRP List:
fffffa8010231bd0: (0006,0430) Flags: 00060901 Mdl: fffffa8010287000
Not impersonating
DeviceMap fffff8a000008b30
Owning Process fffffa8009a3a040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 26737768 Ticks: 3 (0:00:00:00.046)
Context Switch Count 940461
UserTime 00:00:00.000
KernelTime 00:13:42.562
Win32 Start Address vhdmp!VhdmpiAsyncOpThread (0xfffff880076bcdb0)
Stack Init fffff88008267db0 Current fffff88008267440
Base fffff88008268000 Limit fffff88008262000 Call 0
Priority 9 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff880`08267480 fffff800`01ac9992 : 00000000`0000000c fffffa80`0e09db60 fffffa80`00000000 fffffa80`10231bd0 : nt!KiSwapContext+0x7a
fffff880`082675c0 fffff800`01acbcff : 00000000`0000139c fffffa80`0cfc4060 00000000`00000000 fffffa80`0bf1a9a0 : nt!KiCommitThreadWait+0x1d2
fffff880`08267650 fffff800`01dbd1d2 : 00000000`00000000 ffffffff`00000006 00000000`00000000 00000000`00000000 : nt!KeWaitForSingleObject+0x19f
fffff880`082676f0 fffff800`01ac1993 : fffffa80`0e09db60 00000000`00200200 00000000`00000000 fffffa80`0cfc4060 : nt!NtWaitForSingleObject+0xb2
fffff880`08267760 fffff800`01abdf30 : fffff880`07694640 fffffa80`0cde80e0 fffffa80`0bc2a000 ffffffff`8000139c : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`08267760)
fffff880`082678f8 fffff880`07694640 : fffffa80`0cde80e0 fffffa80`0bc2a000 ffffffff`8000139c 00000000`000001d1 : nt!KiServiceLinkage
fffff880`08267900 fffff880`076b7ee2 : 00000000`00000000 fffff880`076b7e5c fffffa80`0cde80e0 fffffa80`0cde80e0 : vhdmp!VhdmpiReadVhdFileAtPassiveLevel+0x50
fffff880`08267970 fffff880`076bd9e0 : 00000000`00000000 00000000`000001d1 fffff8a0`0c2f2000 00000000`00001d10 : vhdmp!VhdmpiCopyRawSectorsSync+0x72
fffff880`082679c0 fffff880`076bdf11 : fffff8a0`01f26000 00000000`00022dbe fffff880`08267b90 fffff880`08267ac8 : vhdmp!VhdmpiRegisterWithRootEnumerator+0x220
fffff880`08267a60 fffff880`076c05e4 : 00000000`00000001 00000000`00000000 00000000`00022dbe fffffa80`0bc2a000 : vhdmp!VhdmpiShutdownNestingLevels+0x3b1
fffff880`08267b60 fffff880`076c083a : 00000000`00000000 00000000`00000001 fffffa80`0e573c00 fffffa80`0e573c00 : vhdmp!VhdmpiDoCompaction+0x124
fffff880`08267ca0 fffff880`076bce02 : ffffffff`ff676980 fffffa80`0e09db60 00000000`00000080 fffffa80`0cde80e0 : vhdmp!VhdmpiCompactThread+0x13a
fffff880`08267d00 fffff800`01d66c06 : fffffa80`0e09db60 00000000`00000080 fffffa80`09a3a040 00000000`00000001 : vhdmp!VhdmpiAsyncOpThread+0x52
fffff880`08267d40 fffff800`01aa0c26 : fffff880`009bf180 fffffa80`0e09db60 fffffa80`09a3ab60 fffff880`01452534 : nt!PspSystemThreadStartup+0x5a
fffff880`08267d80 00000000`00000000 : fffff880`08268000 fffff880`08262000 fffff880`08267450 00000000`00000000 : nt!KxStartSystemThread+0x16
Notice that this thread has an I/O request packet (IRP) and for scenarios like this (server freezes) it is very important to review the IRP for that thread that seems to be causing the issue. To do that you need to use the command !irp, here it is the result for this case:
0: kd> !irp fffffa8010231bd0
Irp is active with 12 stacks 9 is current (= 0xfffffa8010231ee0)
Mdl=fffffa8010287000: No System Buffer: Thread fffffa800e09db60: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ 3,34] 10 e1 fffffa800a03f060 00000000 fffff88000e34180-fffffa800a04de20 Success Error Cancel pending
\Driver\Disk volmgr!VmpReadWriteCompletionRoutine
Args: 00200200 00000000 3bd4e26200 00000000
[ 3, 0] 0 e0 fffffa800a04dcd0 00000000 fffff88001792ee0-fffffa800de75450 Success Error Cancel
\Driver\volmgr volsnap!VspReadCompletionRoutine
Args: 00200200 00000000 3bd4d26200 00000000
[ 3, 0] 0 e1 fffffa800a047040 00000000 fffff8800145ef70-fffffa800dabd550 Success Error Cancel pending
\Driver\volsnap Ntfs!NtfsMasterIrpAsyncCompletionRoutine
[ 3, 0] 0 1 fffffa800a069030 fffffa800df75880 00000000-00000000 pending
\FileSystem\Ntfs
Args: 00200200 00000000 21ed477200 00000000
The major function 3,34 has the pending status set, which appears to be stuck on \Drover\Disk (Volume Managr – volmgr). Let’s take a closer look on the value fffffa800a03f060, which we can see by using the command !devobj as shown below:
0: kd> !devobj fffffa800a03f060
Device object (fffffa800a03f060) is for:
DR2 \Driver\Disk DriverObject fffffa8009e53e70
Current Irp 00000000 RefCount 0 Type 00000007 Flags 01002050
Vpb fffffa8009fd50b0 Dacl fffff9a10043d7f0 DevExt fffffa800a03f1b0 DevObjExt fffffa800a03f858 Dope fffffa8009feb450
ExtensionFlags (0x00000800)
Unknown flags 0x00000800
AttachedDevice (Upper) fffffa800a03fb90 \Driver\partmgr
AttachedTo (Lower) fffffa8009fe9b70 \Driver\USBSTOR
This is a USB driver, which is something that I was not expecting to have a this TMG box at all, so I decided to before even move forward ask to the TMG administrator what a USB driver is doing on TMG. TMG Administrator then clarified the scenario as follow: “We have a USB external driver attached to the system to store the full backup of the system that we perform every night.”
The dump is quiet clear, in other words: server is waiting for the disk. At this point I can conclude that the server was hanging waiting for this USB driver, which for some reason was not responding, as a consequence the whole box stop responding (including TMG, but clearly not a TMG issue).
4. Conclusion
After start reviewing the logs from the backup software we could conclude that: the backup job was supposed to finish at 6AM, however some days was extending beyond that, when the backup was running ahead of 8AM and the production started on TMG the system presented this behavior. Backup software was performing a full backup of the system, including TMG cache folder (which by itself was 4GB). We do not recommend making a backup of TMG cache folder, this folder needs to be excluded from backup. Making backup of the cache folder can cause other issue of this nature, such as crashing Firewall Service as I mentioned in this post. To resolve this problem we excluded the TMG cache folder from the backup and re-schedule the backup job to start at 11PM and let it run until 7AM in an attempt to avoid backup job to go beyond commercial business hours.
I would like to take some time today to bring to y’all attention some new books that are out there on Forefront UAG and TMG technologies. My buddies Ben Ari and Ran Dolev (both from CSS UAG Team) are finishing writing the new Microsoft Forefront UAG 2010 Administrator's Handbook, since those guys are very sharp on UAG you surely can expect a great content. Another new book out there about TMG is from the Forefront MVP Tamás Gál, more info about the book (in Hungarian) is available here. Now that me and Tomas pretty much finished writing the Forefront books series (see previous blog post) I can take some time to finish another project that I’m working since December last year, which is the Security+ Certification Guide book (in Portuguese only). This one I should be done writing by next month (October) and the book will be released in Brazil by April 2011, more info will be available at www.securityplusbr.org.
Just a quick update on the Forefront eBooks,
Me and Tom just finished writing and reviewing all three books, today we wrote the About the Authors and Acknowledgements. The content is now on the final stage with MSPress and the books are available for pre-order at Amazon.com, check it out at:
Enjoy it !!
Recently I assisted an IT Pro on the Forum thread http://social.technet.microsoft.com/Forums/en-US/Forefrontedgesetup/thread/c7606ead-5957-4ef8-a4e9-e5aa85493581 where the problem was related to this behavior described in the title of this post. The problem happens very precisely, after applying TMG SP1 the TMG Managed Control Service takes too long to start, although it does start. As I describe in the Forum, if you are facing this issue and wants to use the private hotfix for that you will need to open a support case with Microsoft. If the hotfix resolves the problem the case won’t be charged, however if the hotfix does not fix the issue this means that the problem is not the same although the symptom could be similar. In other words, if further troubleshooting is necessary the case will be normally charged. To open a support case use this link http://support.microsoft.com/select/Default.aspx?target=assistance , select Forefront Threat Management Gateway (under security) and follow the wizard.
Whatta busy week here with this problem. As you might be aware, Exchange 2010 SP1 breaks the integration that TMG has with Exchange Edge when using E-Mail Protection feature, the issue causes the TMG Managed Control Service to fail to start. This issue is documented on our team blog http://blogs.technet.com/b/isablog/archive/2010/09/01/problems-when-installing-exchange-2010-service-pack-1-on-a-tmg-configured-for-mail-protection.aspx and on Exchange Team blog at http://msexchangeteam.com/archive/2010/09/01/456094.aspx. We are working on a hotfix for this issue and more news will be posted on the TMG Team Blog as soon as we test the fix in our lab and we confirm that works.
One thing that it is important to keep in mind is the following: don’t make changes directly on Exchange or FPE installed on the TMG Server when this issue happens. What I mean with that? Here it is the issue flow so you can better understand:
The reason behind that is because if you make changes, there is a possibility that once we fix the original issue (with the hotfix that will come up soon), TMG Managed Control Service will still failing to start with the error message below:
…and on Event Viewer you might see the error below:
Log Name: Application
Source: Microsoft Forefront TMG Control
Event ID: 31308
Task Category: None
Level: Error
Keywords: Classic
Description:
The Forefront TMG Managed Control service failed to initialize. Error information: Value does not fall within the expected range.
This error has nothing to do with the issue introduced by Exchange 2010 SP1, this is another problem that can be caused by changes that are made directly on those products and when TMG Managed Control Service tries to synchronize the config, it fails. It is important to mention that this failure doesn’t always happen, the normal behavior is to TMG detect the changes and undo it. When TMG is able to do that you will see the following alert:
Solution? Undo the manual changes that were done via those products' console and make sure to always use TMG to manage the configuration among those products (at least this is the behavior now TMG RTM and SP1). In this particular case that TMG Managed Control Service is down, you can’t make changes via TMG console because it will not sync with Exchange Edge and FEP, therefore you should hold any change to be done only after the hotfix for this issue (Exchange 2010 SP1 problem) is released and you had applied on the system.