1. Introduction
I had this article almost ready back when I was on the Forefront TMG team but never had time to finish it. It covers an issue where the wspsrv.exe process consumed high CPU at random moments of the day, and the only workaround to bring CPU usage back down was to restart the Firewall Service. The behavior may sound familiar, but the final resolution was never documented on this blog.
2. Gathering Data
Using Process Monitor, it was possible to see that there were lots of ETW Trace threads running, as shown below, which was kind of interesting to me:
To move forward in this investigation, the usual perfmon data and a dump of the wspsrv.exe process were collected while the issue was happening.
3. Analyzing the Data
Using the same approach that I documented in the Troubleshooting Forefront TMG 2010 Performance issues Cheat Sheet, it was possible to notice a pattern in the threads that were stuck in a critical section: all of them had a similar stack, as shown below:
At that point it was clear to me that the component involved in this behavior was NIS, because it is NIS that uses the GAPA engine (read the NIS white paper for more information). As a test we disabled NIS and restarted the Firewall Service, and as a result the issue stopped occurring.
4. Conclusion
Of course this was not the solution, as we don’t want to permanently disable this feature, but it at least confirmed that NIS was the component causing the issue. We enabled NIS again and the issue came back. Another set of dumps and Process Monitor analysis led the investigation to confirm that verbose tracing was enabled, causing NIS to make the wspsrv.exe process consume more CPU. The traces are:
The possible values are 0, 1, 2, 3 and 4, corresponding to Error, Warning, Info, Function and Noise, respectively. In this case it was 4, which indeed caused a lot of noise. The resolution was to change it back to zero and restart the Firewall Service. It is important to clarify that this behavior will not always happen when the lower level trace is high; in other words, don’t think you can always repro this issue just by increasing this value. The issue was a combination of factors: in this particular scenario the server was very busy, and having the lower level trace set so high increased CPU utilization. The overall recommendation is to increase this value only for troubleshooting purposes and decrease it after collecting data.
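The mapping between the numeric values and their meaning is easy to mix up, so here is a minimal Python sketch of it. It only encodes the five levels listed above; the names `TraceLevel` and `describe_trace_level` are made up for illustration and are not part of TMG, and the sketch does not read or write any real TMG setting.

```python
from enum import IntEnum


class TraceLevel(IntEnum):
    """Trace verbosity values as listed in the post (0 = least verbose)."""
    ERROR = 0
    WARNING = 1
    INFO = 2
    FUNCTION = 3
    NOISE = 4


def describe_trace_level(value: int) -> str:
    """Return a short, human-readable verdict for a configured trace level."""
    level = TraceLevel(value)  # raises ValueError for anything outside 0-4
    if level is TraceLevel.ERROR:
        return f"{value} ({level.name.title()}): baseline, nothing to change"
    return f"{value} ({level.name.title()}): verbose, reset to 0 after collecting data"


if __name__ == "__main__":
    # Hypothetical values: the baseline and the one found in this case.
    for configured in (0, 4):
        print(describe_trace_level(configured))
```

A check like this could sit in whatever script you use to review a server after a troubleshooting session, as a reminder that anything above 0 was meant to be temporary.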
After Tom Shinder successfully ran the contest quiz on his blog and gave some prizes to the winner (Jason Jones) last month during the MVP Summit (I was there and saw how happy Jason was), I started thinking that I should follow my friend Tom in this cool initiative and do something similar. So here is how this contest will work:
Are you ready to play? Next Monday (March 28th) the first round of questions will come. Stay tuned!
After working on this article for a couple of weeks in my spare time, this Troubleshooting Survival Guide is now ready for you. Check it out at:
http://social.technet.microsoft.com/wiki/contents/articles/forefront-threat-management-gateway-tmg-2010-troubleshooting-survival-guide.aspx
As you know, this is a Wiki article and you can contribute to it, but first take a look and see if you like it.
Hi folks, we will start the contest today with five Forefront TMG questions. As I mentioned in the previous post, you will need to send me the answers via Twitter using the Direct Message feature. In addition to that, it is also important to clarify the following points:
The leaderboard will look like this one:
Now let’s move on to our first quiz – Forefront TMG 2010:
Question 1) Consider the following scenario:
A Forefront TMG administrator is making some adjustments to his infrastructure to ensure that all workstations are using TMG as a proxy. He deployed via GPO a policy that configures Internet Explorer to use proxy.contoso.com as the proxy server, and also configured another policy to prevent users from changing this option. All users are using Internet Explorer 7. The TMG admin also confirmed that his TMG Standard Edition is correctly configured with a rule that allows Internet access only for the Active Directory Internet Users group.
The TMG admin’s goal is to ensure that all workstations can use Kerberos as the authentication protocol while browsing the Internet through TMG. After making all those changes in the environment, he started monitoring the behavior to validate whether the authentication protocol in use was Kerberos. To his surprise, the authentication protocol in use was NTLM. Why is this happening? (choose the best answer)
a. You need at least Internet Explorer 8 to make Kerberos work for Proxy authentication.
b. TMG admin must run a script on TMG to allow Kerberos authentication to work properly.
c. TMG has a wrong SPN on Active Directory.
d. The option “Require All Users to Authenticate” is selected in the Internal network.
Question 2) __________ is the only built-in Forefront TMG driver that runs in kernel mode.
a. wspsrv.exe
b. fweng.sys
c. tcpip.sys
d. isastg.exe
Question 3) After enabling HTTPS Inspection on Forefront TMG, some users are experiencing problems on random sites, where they receive the following error:
Error Code 502 (Proxy Error) – the certification authority that issued the SSL Server certificate supplied by a destination server is not trusted by the local computer.
A list of twenty websites that are experiencing this problem was identified. By policy you can’t disable HTTPS Inspection, but you also need to make sure that users are able to access those sites. What would be the fastest workaround for this situation, assuming that the affected websites are authentic and trustworthy? (choose the best answer)
a. Add those websites to the exemption list and choose the option “No validation” in TMG.
b. Research the CA that issued the certificate for each site, obtain the CA root certificate for each one and install it on TMG.
c. Install TMG Client on each workstation.
d. Disable HTTPS Inspection feature.
Question 4) A user called the helpdesk saying that one hour ago he was browsing the Internet through TMG and received the message below on his workstation:
He said that he closed this balloon and now the message no longer shows up when he browses login.live.com. He is confused about traffic inspection and wants to know why he received the notification just once. The support personnel explained to the user that the traffic is still being inspected, but since this notification already appeared once for this site it will stay in cache for some time and will not show up again until the cache expires or the computer is restarted. Which option below describes the default cache time for those TMG Client notifications?
a. 6 hours
b. 2 hours
c. 10 hours
d. 12 hours
Question 5) Forefront TMG is installed on a server with two disks using the following distribution:
Users are complaining that downloading files from the Internet is too slow and sometimes even fails. After some tests you determined that the issue does not happen if the Malware Inspection feature is disabled on TMG. What are the possible reasons for this? (choose the best two answers)
a. There is not enough space on disk F.
b. TMG is running out of RAM.
c. There is another process locking files on the malware temp folder.
d. Cache is corrupted.
Good luck and the answers will be posted this Friday (April 1st)!
Hi folks, I just want to give a quick update on my new book (in Portuguese) about the Security+ certification. I wrote this book in partnership with my friend Daniel Mauser, and we cover the foundations of the Security+ exam, some practice examples and some direct dialogues in which we describe situations that we experienced while dealing with these security subjects. The book has approximately 400 pages and is the result of 17 months of writing (since November 2009) in my spare time (although I’m not sure if I ever had any). Here is the book cover:
Pre-sales for this book should start later this week in the main Brazilian bookstores. Portuguese speakers can get more information at www.securityplusbr.org
The first round of questions of this contest is now closed. We had a total of 16 participants and only two got all answers correct. It was fun to interact with you guys via Twitter and also review your answers; it took me back to the days when I was a university professor teaching Computer Networks. There is an interesting pattern that I’ve been noticing since those days: sometimes you miss a question not because you don’t know the answer, but because you didn’t pay full attention to it. Four people didn't realize that the last question asked for two options and selected only one (next time pay more attention to those details). Another interesting pattern that I noticed here: everybody got question two correct, which means we have a good foundation in identifying TMG’s kernel mode driver. I like that.
As I previously said, this Friday (April 1st) I will be posting the answers to the quiz and will #FF the folks with the most points on my Twitter. Thanks for playing, and start preparing for the next round (UAG), which will happen next Monday (April 4th).
Last week I presented a session at the MVP Summit in Redmond about troubleshooting TMG performance issues. During that presentation I told the MVPs that I would be writing a cheat sheet with some WinDBG commands that can be used while troubleshooting TMG performance issues. I thought about this type of document and concluded that this content can have a base framework, but it should be expanded and enhanced by the community. Having said that, I decided to write this article in two places:
Enjoy it!
Last week at the MVP Summit was indeed quite busy, but the results were great. Here is an interview with David Tesar for the TechNet Edge site where I talk about the value of building community-based content using TechNet Wiki:
You can also download it in WMV format.
As I announced in this post, yesterday I was on the Talk TechNet show with Keith Combs and Matt Hester. I had a great time talking about TMG and answering questions about the product. If you missed the show you can still listen to the conversation by downloading the MP3 from here. I also want to tell the winner of the Forefront TMG Administrator’s Companion book that I shipped the book today and it should be with you on Saturday.
BTW, don’t miss Tom Shinder’s interview on Talk TechNet tomorrow (Friday 11th), registration is still open here.
Disaster Recovery Plan, also known as DRP, is a discipline mainly concerned with “Availability”, which is one of the main pillars of the Security Triad (Confidentiality, Integrity and Availability). Security principles (and common sense) determine that first and foremost we all need to make sure everyone is safe (human life is the top priority in any DRP). In an extreme situation, like the one our friends in Japan are living through at this moment, there is more than just availability to be concerned about: integrity and confidentiality might be gone for some businesses. In order to give businesses in Japan some guidelines on what to do to get back in business, the article below was created:
http://social.technet.microsoft.com/wiki/contents/articles/windows-server-emergency-management-resources.aspx
Here are some important points to notice in this article:
…and also the tags that we currently have:
There is much more to add, so make sure that you take some time to add valuable information to this article. This can be very useful for those who are desperate to put their business back on track.
Consider the following scenario: Remote access VPN client users are unable to browse the Internet when connected to TMG and the web browser is configured to “automatically detect settings”. When connected, the WPAD record appears to be resolving to the IP address of the RRAS interface and not the interface of the TMG firewall.
This problem can happen because the RRAS interface is higher than the internal interface in the binding order of the OS. One quick fix is to change the binding order so that the internal interface is on top. Another approach is to follow the steps below:
1. Download the CarpNameSystem.js
2. Open a command prompt with elevated privileges and run the command:
cscript carpnamesystem.js /set: DNS
3. Restart the Firewall Service and run the command below on the workstation that is connecting remotely:
del \wpad*.dat /s
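To confirm the symptom from the VPN client before and after applying the fix, one option is to check which address the WPAD name resolves to. Here is a minimal Python sketch, assuming the client looks up the short name `wpad` and that you know the address of the TMG internal interface. The function name `wpad_resolves_to` and the address 10.0.0.1 are placeholders for illustration only.

```python
import socket
from typing import Callable


def wpad_resolves_to(expected_ip: str,
                     name: str = "wpad",
                     resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True if `name` resolves to `expected_ip` on this machine."""
    try:
        actual_ip = resolve(name)
    except OSError:
        # Name did not resolve at all (socket.gaierror is a subclass of OSError).
        return False
    return actual_ip == expected_ip


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder; replace it with the TMG internal interface address.
    internal_ip = "10.0.0.1"
    if wpad_resolves_to(internal_ip):
        print("wpad resolves to the expected internal interface")
    else:
        print("wpad resolves elsewhere (or not at all)")
```

The `resolve` parameter is there only so the check can be exercised without a live network. On a real client the default uses the operating system resolver.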