I had this article almost ready way back when I was on the Forefront TMG Team but never had time to finish it. This is about an issue where the wspsrv.exe process was consuming high CPU at random moments of the day, and the only workaround to bring its CPU usage back down was to restart the Firewall service. Maybe the behavior sounds familiar, but the final resolution was never documented here in this blog.
2. Gathering Data
Using Process Monitor, it was possible to see that there were lots of ETW trace threads running, as shown below, which I found interesting:
To move forward with this investigation, the usual perfmon logs and dumps of the wspsrv.exe process were collected while the issue was happening.
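The article doesn't say which tool was used to capture the dumps. One common way to grab full memory dumps of a busy process while a CPU spike is in progress is Sysinternals ProcDump, paired with a perfmon counter log via logman; the commands below are a generic sketch of that approach, not the exact commands from this case:

```shell
:: Capture up to 3 full memory dumps of wspsrv.exe when its CPU usage
:: stays above 80% for at least 10 seconds.
:: -ma = full dump, -c = CPU threshold, -s = sustain seconds, -n = dump count
procdump.exe -ma -c 80 -s 10 -n 3 wspsrv.exe C:\dumps

:: Log the relevant perfmon counters every 5 seconds while reproducing the issue.
logman create counter TMG-Perf -c "\Process(wspsrv)\% Processor Time" "\Processor(_Total)\% Processor Time" -si 00:00:05 -o C:\perflogs\tmg
logman start TMG-Perf
```

The paths (C:\dumps, C:\perflogs\tmg) and the collector name TMG-Perf are placeholders; adjust them for your environment.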
3. Analyzing the Data
Using the same approach that I documented in the Troubleshooting Forefront TMG 2010 Performance Issues Cheat Sheet, it was possible to notice a pattern in the threads that were stuck on a critical section; all of them had a similar stack, as shown below:
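For readers who don't have the cheat sheet at hand, this kind of critical-section contention pattern is typically spotted by loading one of the collected dumps in the Debugging Tools for Windows and running a few standard commands. This is a generic sketch of such a session, not the actual output from this case:

```shell
:: Open the dump in the command-line debugger (WinDbg works the same way).
cdb -z wspsrv_full.dmp

:: Then, at the debugger prompt:
::   !locks    - list contended critical sections and their owning threads
::   ~* kb     - dump the stack of every thread to look for a repeated pattern
::   !runaway  - show per-thread CPU time to identify the busiest threads
```

Threads that share the same stack and are all waiting on the same critical section are a strong hint that one component is serializing the others.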
At that point it was clear to me that the component involved in this behavior was NIS, because NIS is the component that uses the GAPA engine (read the NIS white paper for more information). As a test, we disabled NIS and restarted the Firewall service, and as a result the issue stopped occurring.
Of course this was not the solution, as we don't want to permanently disable this feature, but it at least confirmed that NIS was the component causing the issue. We enabled NIS again and the issue came back. Another set of dumps and Process Monitor analysis led the investigation to confirm that verbose tracing was enabled, causing NIS to drive up CPU usage in the wspsrv.exe process. The traces are:
The possible values are 0, 1, 2, 3 and 4, corresponding to Error, Warning, Info, Function and Noise, respectively. In this case it was set to 4, which indeed caused a lot of noise. The resolution was to change it back to zero and restart the Firewall service. It is important to clarify that this behavior will not always occur when the lower-level trace is set high; in other words, don't assume you can reproduce this issue just by increasing this value. The issue was a combination of factors: in this particular scenario the server was very busy, and having the lower-level trace set so high drove the CPU utilization up. The overall recommendation is to increase this value only for troubleshooting purposes and decrease it again after collecting data.
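As a sketch of the fix from an elevated command prompt: set the trace level back to 0 and bounce the Firewall service. The registry key and value name come from the trace list above, so I'm leaving them as placeholders here rather than guessing them, and I'm assuming the Microsoft Firewall service short name fwsrv:

```shell
:: <TracePath> and <TraceValueName> are placeholders for the actual trace
:: registry location listed above; substitute the real key and value name.
reg add "HKLM\<TracePath>" /v <TraceValueName> /t REG_DWORD /d 0 /f

:: Restart the Microsoft Firewall service so the new trace level takes effect.
net stop fwsrv
net start fwsrv
```

Remember that restarting the Firewall service briefly interrupts traffic, so plan this for a maintenance window on a production TMG server.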
Awesome! But why was this registry value 4 in the first place?
That's the million-dollar question that I can't answer, as I was not the admin of this system. The admin couldn't answer either, as he also had no idea who changed it. :)
Too bad that the change tracking feature doesn't track other changes on the system ;)
Nice article Yuri! Keep up the good work!
You're right, Anders; it would be very nice if the change tracking feature could do that.
Thanks for visiting this blog!!
I have exactly the same issue you are describing in this article. I tried your solution with the registry keys, but they don't exist on my server. Do you have any idea?
Another approach you can try is to isolate per signature, to see if one particular signature is causing the issue. For example:
- Organize NIS by Protocols.
- Instead of disabling the full NIS capability, start changing some signatures to detect-only mode.
Also, check the Alerts tab for any NIS signature that is consistently getting hit. If you find one, you might want to start with that signature.
Other than that, the most accurate way to understand what's going on is by collecting perfmon logs and dumps. It might be necessary in this case to open an incident with MS CSS.