Introduction
The goal of this post is to show how DebugDiag 1.2 can help you identify a potential bottleneck in a scenario where the TMG user-mode process (wspsrv.exe) is consuming a high amount of CPU.
Data Gathering
The first step is to collect a user-mode dump while the issue is happening. To do that, use the approach that I explain in the following post:
http://blogs.technet.com/b/yuridiogenes/archive/2010/05/01/how-to-capture-a-manual-dump-of-the-wspsrv-exe-process-on-tmg-2010.aspx
Data Analysis
Once you have the data you can use DebugDiag to analyze the dump. Follow the steps below in order to perform this analysis:
1. After installing DebugDiag (64-bit edition in this case), launch it and cancel the first window.
2. Click the Advanced Analysis tab.
3. Click the Add Data Files button and choose the dump file that was previously collected.
4. Choose the scenario that applies to this issue in the top pane. In this case the scenario is Crash/Hang Analyzers.
5. Click Start Analysis.
6. Wait until the report is generated.
Reviewing the Report
Don’t go too far into the report before reviewing its first part, the Analysis Summary. Here is the example for this scenario:
In this case the warning message says:
Detected a possible critical section related problem in wspsrv.dmp
Lock at 0x015e7c70 is Unlocked
Impact analysis
0.67% of threads blocked (Threads 78)
The following functions are involved in the root cause
GapaEngine_1cc44e8_bace5e90+10e22
The thread number is a hyperlink; when you click it you will see the stack that it refers to:
ntdll!ZwWaitForSingleObject+a
ntdll!RtlpWaitOnCriticalSection+e8
ntdll!RtlEnterCriticalSection+d1
GapaEngine_1cc44e8_bace5e90+10e22
0x454b64d8
0x0300e000
0x015ccbe8
0x4b80e418
0x015ccbe8
GapaEngine_1cc44e8_bace5e90+ff44
0x00004441`014dd475
0x00000010
The recommendation that DebugDiag gives is:
The following vendors were identified for follow up based on root cause analysis
Unknown vendor for module C:\Program Files\Microsoft Forefront Threat Management Gateway\IPS\GapaEngine_1cc44e8_bace5e90.dll
Please follow up with the vendors identified above
In other words, it is telling me to investigate this module further. Now what? Well, now you have an initial path to follow: you know that GAPA Engine is involved, which means that you can start doing some tests, such as:
It is important to remember that troubleshooting a performance issue can be a long process, and DebugDiag can help you find the root cause. However, sometimes finding the culprit doesn’t fix the issue; it just shows who is causing the problem. In that case further investigation is needed to find out how to really fix it.
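If you have many threads to review, a small script can do the first pass over a stack like the one shown in the report above. The Python sketch below is only an illustration and is not part of DebugDiag: the function names and the list of modules treated as operating system code are my own assumptions. It pulls the module name out of each frame, skips raw addresses, and reports the first module that is not on the operating system list.

```python
import re

# Frame shapes seen in the report: "module!function+offset" or "module+offset".
# Raw addresses such as 0x454b64d8 carry no module name and are skipped.
FRAME_RE = re.compile(
    r"^(?P<module>[A-Za-z_][\w.]*)(?:!(?P<func>[^+\s]+))?\+(?P<offset>[0-9a-fA-F]+)$"
)

# Assumption: modules treated as operating system code for this triage.
OS_MODULES = {"ntdll", "kernel32", "kernelbase"}


def modules_in_stack(stack_text):
    """Return module names in stack order (top frame first)."""
    modules = []
    for token in stack_text.split():
        match = FRAME_RE.match(token)
        if match:
            modules.append(match.group("module"))
    return modules


def first_non_os_module(stack_text):
    """Return the first module on the stack that is not in OS_MODULES, or None."""
    for module in modules_in_stack(stack_text):
        if module.lower() not in OS_MODULES:
            return module
    return None
```

Because the text is split on whitespace, the same function works whether the stack is pasted as one frame per line or as a single line. A module flagged this way is only a starting point for the investigation, not proof of the root cause.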
This post is about a problem where Outlook was working fine through a TMG publishing rule; however, when the TMG admin tried to access OAB and OOF through Outlook, he got an error. To bypass Outlook he tried to access https://mail.contoso.com/ews/exchange.asmx and got a 403. The 403 was coming from the Exchange vdir /EWS/. Here is an example of the header:
10.20.20.11 10.20.20.1 HTTP HTTP:Response, HTTP/1.1, Status Code = 403, URL: /ews/
- Http: Response, HTTP/1.1, Status Code = 403, URL: /ews/
  ProtocolVersion: HTTP/1.1
  StatusCode: 403, Forbidden
  Reason: Forbidden
  Server: Microsoft-IIS/7.5
  Set-Cookie: exchangecookie=599fc2a7540e4e66b1169d9d5c358aa5; expires=Sat, 17-Jul-2011 21:39:05 GMT; path=/; HttpOnly
  XPoweredBy: ASP.NET
  Date: Fri, 29 Jan 2010 21:39:05 GMT
  ContentLength: 0
  HeaderEnd: CRLF
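When you have several captures to go through, the two fields that matter for this kind of triage are the status code and the Server header, since together they suggest which server answered. The Python sketch below is a rough illustration only: the function names are hypothetical, and it assumes the capture text uses the field labels shown in the header above (StatusCode and Server) and that nothing in between rewrote the Server header.

```python
import re


def summarize_response(capture_text):
    """Extract the status code and the Server header from parsed capture text."""
    status = re.search(r"StatusCode:\s*(\d{3})", capture_text)
    server = re.search(r"Server:\s*(\S+)", capture_text)
    return {
        "status": int(status.group(1)) if status else None,
        "server": server.group(1) if server else None,
    }


def answered_by_iis(capture_text):
    """True when the Server header indicates that IIS generated the response."""
    server = summarize_response(capture_text)["server"]
    return bool(server) and server.startswith("Microsoft-IIS")
```

For the header above this gives status 403 with server Microsoft-IIS/7.5, which matches the conclusion that the 403 came from the Exchange side and not from TMG.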
Resolution: after some investigation we noticed that /EWS had anonymous authentication enabled (the /EWS vdir on Exchange 2007 doesn't have anonymous enabled by default). After disabling anonymous and leaving only Basic (to match the delegation), it worked.
Important points before adopting this resolution:
While I was working on this issue with the Exchange folks, they warned me about this action (disabling anonymous for /EWS on Exchange 2010) and told me that:
“There are some issues if you disable anonymous on /EWS/ vdir for Exchange 2010. Anonymous is enabled on the virtual directory because EWS uses ws-security for federating calendars and free/busy across organizations for the new calendar sharing feature. Federation occurs via the ws-security protocol, which authenticates via SOAP <wssecurity> header rather than an HTTP authentication header. IIS must let such requests go through, after which WCF (upon which EWS is built) will properly authenticate them - in other words the "anonymous" IIS setting does not allow anonymous requests to get through to EWS. Turning off anonymous has some side effects, namely that cross-organization (federated) calendar sharing breaks as does federated mailbox migration.”
With those considerations in mind, what you can do in TMG to overcome this without disabling anonymous is:
Last month I was traveling to deliver some presentations about Migration to the Cloud and On-Premise Security. While traveling and talking to IT pros, I realized that the majority of the companies I came across in those conversations are not investing in making sure that their employees are well trained when the subject is security. In the past, security awareness training was something that only large enterprises implemented as part of the mandatory annual training calendar for all employees. This can’t be the case today; as a matter of fact, small and medium businesses must also develop a plan to spread the word about security to their employees.
As companies move to the cloud, the Internet becomes even more crucial to their business, which means that users will be even more exposed to online resources. More businesses are using social networks to get closer to their customers, and employees are using social networks for both personal and professional purposes. There are many risks involved with social networks, but the growing one is called “Social Engineering”. This trend is exposed in Microsoft’s Security Intelligence Report - Volume 10. The slide below summarizes that:
For more information about the slide above, watch the video below with a brief discussion about MS SIR Volume 10:
After watching this video you will also see that social engineering attacks take place in the online world via social networks, phishing e-mails and other venues. These types of social engineering attacks are getting high exposure in the news. Recently I read an article that says:
“Defendants targeted university's databases of faculty, staff, alumni, and student information, and financial accounts with a social engineering scheme that used poisoned USBs, phishing emails”
From: http://www.darkreading.com/database-security/167901020/security/application-security/231000376/former-college-kid-s-guilty-plea-to-hacking-highlights-low-tech-db-theft.html
I recommend reading this article to really see the social engineering approach used in this case and to start thinking about this subject. What if this had happened to one of your employees? Are your employees trained to understand the security risks when dealing with a similar situation? I guess that at this point we can easily answer the question in the title of this post.
What should I do?
A great way to start your security awareness program is by leveraging what is already available to you (for FREE). Microsoft has a security awareness program toolkit and guide that can help you kick off your security awareness initiative. You can download the content from the link below:
http://download.microsoft.com/download/1/9/9/1990AA19-2C4F-42D0-9A22-1E158EF0ABBC/Security%20Awareness%20Content.zip
When you extract this content you will see the following structure:
The “how to guide” has the guidance that you need in order to use this material. This package includes training materials for risk management, security controls and incident response. It also includes templates for:
In addition to that you can also download the Internet Safety for Enterprise & Organizations toolkit to help your employees learn the skills they need to work more safely on the Internet and better defend company, customer, and their own personal information.
Conclusion
In summary: while it is important to invest in technology to protect your assets, it is also important to invest in education for your employees. A well-trained employee can save you a lot of time and money. Keep that in mind!
If you have been following this blog for a long time, you probably know about my previous posts related to ISA or TMG crashing and about the fact that 95% of the time it is not an issue caused by ISA/TMG. Well, this is just another crash where the first blame goes to ISA/TMG; in this particular case, ISA. The first argument is: it is ISA that is triggering the error in Event Viewer. True statement, as we see below:
Event Type: Error
Event Source: Microsoft ISA Server 2006
Event Category: None
Event ID: 1000
Time: 17:31:40
User: N/A
Description: Faulting application wspsrv.exe, version 5.0.5723.516, stamp 4a880d39, faulting module unknown, version 0.0.0.0, stamp 00000000, debug? 0, fault address 0x1078b242.
That still doesn't mean that it’s an ISA issue, but I’m okay with people asking the ISA folks for help first; it’s normal. If there is a crash we should also have a dump, and if we don’t, use DebugDiag (the newer version that I showed yesterday) to attach to the process and get the dump. Let’s see the dump for this particular scenario:
FAULTING_IP:
AkrFiltr+b992
1203b992 ?? ???
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 1203b992 (<Unloaded_AkrFiltr.dll>+0x0000b992)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 1203b992
Attempt to read from address 1203b992
PROCESS_NAME: wspsrv.exe
FAULTING_MODULE: 7c800000 ntdll
DEBUG_FLR_IMAGE_TIMESTAMP: 48ebaac7
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 1203b992
READ_ADDRESS: 1203b992
FOLLOWUP_IP:
AkrFiltr+b992
1203b992 ?? ???
FAULTING_THREAD: 00000fb0
BUGCHECK_STR: APPLICATION_FAULT_BAD_INSTRUCTION_PTR_INVALID_POINTER_READ_WRONG_SYMBOLS
PRIMARY_PROBLEM_CLASS: BAD_INSTRUCTION_PTR
DEFAULT_BUCKET_ID: BAD_INSTRUCTION_PTR
LAST_CONTROL_TRANSFER: from 00000000 to 1203b992
STACK_TEXT:
123ffc9c 00000000 1204109a 123ffd04 120d2110 <Unloaded_AkrFiltr.dll>+0xb992
FAILED_INSTRUCTION_ADDRESS:
AkrFiltr+b992
1203b992 ?? ???
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: AkrFiltr+b992
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: AkrFiltr
IMAGE_NAME: AkrFiltr.dll
STACK_COMMAND: ~40s; .ecxr ; kb
BUCKET_ID: WRONG_SYMBOLS
FAILURE_BUCKET_ID: BAD_INSTRUCTION_PTR_c0000005_AkrFiltr.dll!Unloaded
Followup: MachineOwner
This is a pretty straightforward stack and, as a matter of fact, a pretty straightforward dump. This module was causing the service to crash due to an access violation (c0000005), and as a result the whole process was going down. The solution (an update) was provided by the third-party vendor that owns this module.
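When you review many dumps, it helps to reduce the debugger analysis output to the handful of fields that usually point to the owner of the crash. The Python sketch below is a hedged illustration and not a debugger feature: the function name, the dictionary keys, and the choice of fields are my own assumptions, based only on the labels that appear in the output above.

```python
import re

# Field labels taken from the analysis output shown above.
# The dictionary keys on the left are hypothetical names chosen for this sketch.
FIELDS = {
    "process": r"\bPROCESS_NAME:\s*(\S+)",
    "module": r"\bMODULE_NAME:\s*(\S+)",
    "image": r"\bIMAGE_NAME:\s*(\S+)",
    "exception_code": r"\bExceptionCode:\s*([0-9a-fA-F]{8})",
    "failure_bucket": r"\bFAILURE_BUCKET_ID:\s*(\S+)",
}


def summarize_analysis(text):
    """Collect the key fields from the analysis text; missing fields become None."""
    summary = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        summary[name] = match.group(1) if match else None
    # An "<Unloaded_...>" frame means the faulting address belongs to a module
    # that was no longer loaded when the exception was raised.
    summary["module_unloaded"] = "<Unloaded_" in text
    return summary
```

For the dump above this returns wspsrv.exe as the process, AkrFiltr as the module, c0000005 as the exception code, and flags the module as unloaded. That is the same reading as the manual one: the process that crashed is ISA, but the module to follow up on is the third-party filter.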
For more references about crashes on ISA Server also see:
I have written many posts in this blog about troubleshooting crash and hang issues. In some of those posts (here is one example) I used a tool called DebugDiag to either capture the dump or perform the initial analysis. Today the team that develops DebugDiag announced a new version of this tool: DebugDiag 1.2. Below is the list of the new features introduced in this version:
Analysis Automation
Data Collection
Deployment Options
Note: It is important to mention that you must uninstall all previous DebugDiag versions before you install DebugDiag 1.2.
Go get DebugDiag 1.2 at http://www.microsoft.com/download/en/details.aspx?id=26798
This week I’m attending TechReady in Seattle, where I will present two sessions about On-Premise Security while Migrating to the Cloud (as a matter of fact, I already presented one today). Over the last two days I talked with some folks from the field and heard common comments about their talking points with customers when the subject is migrating to the cloud. The common question usually is: where is my data when it moves to the cloud? This is a great question, but instead of writing about it, why not watch a video that explains in detail the Microsoft Datacenter for Online Services? Sure, no problem... enjoy the video below:
Last May I went to a security conference here in Dallas called TakeDownCon, organized by EC-Council. It was a great conference, with great speakers and amazing technical content. I personally recommend that you participate in the next stop of TakeDownCon, which will be in LA next December. I’m here today just to share one of the presentations that the folks from TakeDownCon made available for public consumption this week:
Process And Memory Forensic Techniques by Kevin Cardwell
More presentations from TakeDownCon Dallas 2011 can be found at http://www.youtube.com/TAKEDOWNCON2011
Enjoy it!