Friends, for now I would say there are a few prerequisites for understanding the following material: you should have a basic understanding of how to set up UAG DA and of its concepts. The target audience is admins and peers who already work on UAG DA and may run into the following scenario. I am planning to add links about the basics of UAG DA here later, for everybody's benefit.
Scenario discussion
I was setting up a lab for a UAG DA scenario and found that the DA client was not connecting. So I checked whether the tunnels were up by running the following command:
netsh advf monitor show mmsa
and found that neither of the two tunnels was up, as I did not see any main mode SAs (security associations) established. Since main mode did not establish, quick mode cannot either. That meant we had to start by focusing on the IPsec part of UAG DA, which as we know uses multiple technologies to implement DA, including IPsec, IPv6, Windows Firewall with Advanced Security and others.
So before digging deeper I started network captures on the client machine and ran the following two commands in the command prompt:
net stop iphlpsvc (to stop IP helper service)
net start iphlpsvc (to start IP helper service)
The purpose of this is to restart the IP Helper service: when it starts, it tries to bring up DA connectivity by establishing the infrastructure tunnel and then the corp tunnel. This also means that the IPsec main mode SAs are negotiated again, and that is where we wanted to focus first.
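If you repeat this restart often while testing, the two commands can be wrapped in a small script. The following is only a minimal Python sketch, assuming it runs from an elevated prompt on the DA client; the function names `restart_commands` and `restart_service` are made up for illustration, and only the `net stop` / `net start` commands come from the steps above.

```python
import subprocess

SERVICE = "iphlpsvc"  # IP Helper service name, as used in the commands above


def restart_commands(service=SERVICE):
    """Return the stop and start command lines as argument lists."""
    return [["net", "stop", service], ["net", "start", service]]


def restart_service(service=SERVICE, run=subprocess.run):
    """Stop, then start the service.

    `run` is injectable (hypothetical design choice) so the sketch can be
    dry-run on a machine where the commands should not really execute.
    """
    for cmd in restart_commands(service):
        run(cmd, check=True)  # check=True raises if a command fails
```

Calling `restart_service()` with no arguments runs both commands for real, so start the network capture before calling it.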
In the network captures I found the following error:
ERROR_IPSEC_IKE_AUTH_FAIL
I had filtered the trace on the AuthIP protocol (protocol.authip) in Network Monitor.
This was not very descriptive, so I took scenario tracing as below.
Run the following two commands in the command prompt
Then restart the IP Helper service, as explained above, to initiate the DA connectivity again.
Then I stopped the traces by running the following two commands in the command prompt
Then I opened the client.etl file in Network Monitor and found the following in it:
WFP:IPsec: Main Mode Failure - Error: ERROR_IPSEC_IKE_AUTH_FAIL
WFP:IPsec: Main Mode Failure - Error: CERT_E_WRONG_USAGE
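The second error name is an HRESULT, and splitting it into its fields shows that it comes from the certificate facility rather than from IPsec itself. Below is a small Python sketch of that decoding. The numeric values (0x800B0110 for CERT_E_WRONG_USAGE, 13801 for ERROR_IPSEC_IKE_AUTH_FAIL, facility 11 for certificates) are quoted from memory of winerror.h, so treat them as assumptions and verify them against your SDK headers; the function name `decode_hresult` is made up for this example.

```python
# Values as I recall them from winerror.h; assumptions, verify before relying on them.
CERT_E_WRONG_USAGE = 0x800B0110      # "The certificate is not valid for the requested usage."
ERROR_IPSEC_IKE_AUTH_FAIL = 13801    # Win32 error code, not an HRESULT
FACILITY_CERT = 11                   # certificate facility


def decode_hresult(hr):
    """Split a 32-bit HRESULT into (severity, facility, code)."""
    severity = (hr >> 31) & 0x1      # 1 means failure
    facility = (hr >> 16) & 0x7FF    # 11-bit facility field
    code = hr & 0xFFFF               # low 16 bits
    return severity, facility, code


severity, facility, code = decode_hresult(CERT_E_WRONG_USAGE)
print(severity, facility == FACILITY_CERT, hex(code))
```

So the generic AuthIP failure in the capture is backed by a more specific complaint about how the certificate may be used, which is why the next step is to look at the machine certificate.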
From the ipconfig details of the client: since the client was on the public internet it was using the 6to4 adapter, so my client, with IPv6 address 2002:2828:2803::2828:2803, was sending that error about machine certificate usage to the UAG DA server.
**********************************************************************
Tunnel adapter 6TO4 Adapter:

   Connection-specific DNS Suffix  . :
   Description . . . . . . . . . . . : Microsoft 6to4 Adapter
   Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
   DHCP Enabled. . . . . . . . . . . : No
   Autoconfiguration Enabled . . . . : Yes
   IPv6 Address. . . . . . . . . . . : 2002:2828:2803::2828:2803(Preferred)
   Default Gateway . . . . . . . . . : 2002:2828:280b::2828:280b
   DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                       fec0:0:0:ffff::2%1
                                       fec0:0:0:ffff::3%1
   NetBIOS over Tcpip. . . . . . . . : Disabled
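A 6to4 address carries the public IPv4 address in the 32 bits right after the 2002: prefix, which is handy when matching the IPv6 addresses in a capture to the lab's IPv4 addressing. A minimal Python sketch using the standard `ipaddress` module follows; the helper name `embedded_ipv4` is made up for this example.

```python
import ipaddress


def embedded_ipv4(addr):
    """Return the IPv4 address embedded in a 6to4 (2002::/16) address, or None."""
    return ipaddress.IPv6Address(addr).sixtofour


# Client address and default gateway from the ipconfig output above
print(embedded_ipv4("2002:2828:2803::2828:2803"))  # 40.40.40.3
print(embedded_ipv4("2002:2828:280b::2828:280b"))  # 40.40.40.11
```

Here 0x28 is 40, 0x03 is 3 and 0x0b is 11, so the client sits at 40.40.40.3 and the default gateway at 40.40.40.11.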
That made me check the machine certificate on the UAG DA server, which looked right. I ran the following command in the command prompt on the UAG DA server:
certutil -store my
and it showed me that the certificate was OK.
Just to be on the safe side, I removed this machine certificate from the computer certificate store of the UAG DA server.
I then requested the machine certificate again from the certificate authority and restarted the IP Helper service, as explained above, to restart DA connectivity.
This time DA connected fine. The network captures were clean, and the AuthIP traffic in them showed that authentication was successful.
I tested DA by connecting to internal resources and it worked.
Hi folks, I am uploading a presentation for UAG admins and my peers, to provide more information around this topic. I am pretty sure there are documents about it already, but I have tried to give my perspective on it.
I will add audio/video to it in later revisions. I am also discussing here the scenario I covered in the presentation.
This post is once again about an issue I worked on a few days back. Before I discuss the issue and how I resolved it, I would like to mention that the objective of this post is to make TMG admins aware of the issue and of what can be done to resolve it. Some of the steps used to determine the root cause, e.g. user mode dump analysis, can't be done without the symbols (which are private), so the idea is not to help you perform the dump analysis. Instead, I want to share the details of the dump analysis to show why TMG services can hang at boot and fail to start. If you are familiar with terms like process, thread and stack, you can read it yourself; if you are not, I will explain the observations from it.
Issue:
A TMG server admin was rebooting the server, and at reboot the TMG services hung and did not start. A similar issue was reported before TMG SP2 but was fixed post SP2; in this scenario TMG was already updated to the latest build, i.e. TMG SP2 RU2.
Troubleshooting:
Some background: a lot of work had already been done before I started working on this issue. In such scenarios you first understand the issue, check the steps that have already been taken to resolve it, and move on from there; for example, the steps in http://support.microsoft.com/kb/2659700 had already been performed. We were getting the following event ID:
_____________________________________________________________________________________________
Log Name: System
Source: Service Control Manager
Date: 09/11/2012 17:42:30
Event ID: 7022
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: server1
Description: The Microsoft Forefront TMG Firewall service hung on starting.
Event Xml: <Event
xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}" EventSourceName="Service Control Manager" />
<EventID Qualifiers="49152">7022</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8080000000000000</Keywords>
<TimeCreated SystemTime="2012-11-09T17:42:30.378163900Z" />
<EventRecordID>344470</EventRecordID>
<Correlation />
<Execution ProcessID="716" ThreadID="720" />
<Channel>System</Channel>
<Computer>server1</Computer>
<Security />
</System>
<EventData>
<Data Name="param1">Microsoft Forefront TMG Firewall</Data>
</EventData>
</Event>
Data collection
During troubleshooting we collected a user mode dump while trying to restart the services in automatic startup mode, when they hung again.
User mode dump collection reference: http://msdn.microsoft.com/en-us/library/ff420662.aspx
Data analysis
The approach taken in this post is very similar to the guidelines for debugging a deadlock given in the following link, as we were in a scenario similar to a deadlock: http://msdn.microsoft.com/en-us/library/windows/hardware/ff540592(v=vs.85).aspx
In the dump I found that the following critical section was locked
For more about critical sections and locked critical sections, refer to: http://msdn.microsoft.com/en-us/library/windows/hardware/ff541979(v=vs.85).aspx
Then I located the owning thread of this locked critical section. In the following snapshot we can see the stack of this thread; a stack is read from the bottom up. From this call stack it appears that wspsrv (the firewall service) is trying to load a filter called XSISAPI and has deferred the startup of its filters until this filter is loaded.
Then I checked the module for this filter, XSISAPI, and found the following. That is, it is a filter called Afaria, from Sybase.
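For readers who are not familiar with what a locked critical section with an owning thread looks like in practice, here is a generic Python illustration. It is not TMG code and says nothing about wspsrv internals: the lock stands in for the critical section, the owner thread stands in for the thread that is busy loading a filter, and every name in it is hypothetical.

```python
import threading


def waiter_gets_lock(owner_is_stuck, timeout=0.2):
    """Return True if a second thread can take the lock held by the owner."""
    lock = threading.Lock()              # stands in for the critical section
    owner_has_lock = threading.Event()
    filter_loaded = threading.Event()    # set once the (hypothetical) filter load finishes

    def owner():
        with lock:                       # owner thread takes the critical section
            owner_has_lock.set()
            filter_loaded.wait()         # and blocks while still holding it

    if not owner_is_stuck:
        filter_loaded.set()              # the load completes immediately

    t = threading.Thread(target=owner, daemon=True)
    t.start()
    owner_has_lock.wait()                # make sure the owner really holds the lock

    got = lock.acquire(timeout=timeout)  # what every other thread experiences
    if got:
        lock.release()

    filter_loaded.set()                  # let the owner finish so the thread can exit
    t.join()
    return got


print(waiter_gets_lock(owner_is_stuck=True))   # False: everyone waits on the owner
print(waiter_gets_lock(owner_is_stuck=False))  # True: the lock is released promptly
```

In a dump, the equivalent of the first case is what you look for: one thread owns the critical section and is waiting on something else, and the threads that need that section cannot make progress.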
Solution:
We configured the XSISAPI filter service for delayed start, and after that the TMG services started normally after a reboot.
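Delayed start can be set in the Services console, or from the command line with `sc config`. The sketch below only builds that command line in Python, under assumptions: the service name `ExampleFilterService` is a placeholder because the post does not give the real service name, and the helper `delayed_start_command` is made up for this example. Look up the actual name in services.msc or with `sc query` before running anything.

```python
import subprocess


def delayed_start_command(service_name):
    """Build the sc.exe arguments that set a service to Automatic (Delayed Start).

    sc.exe expects "start=" and its value as separate tokens, hence two list items.
    """
    return ["sc.exe", "config", service_name, "start=", "delayed-auto"]


# "ExampleFilterService" is a placeholder, not the real Afaria service name.
cmd = delayed_start_command("ExampleFilterService")
print(" ".join(cmd))

# On the server itself, from an elevated prompt, the command could then be run with:
# subprocess.run(cmd, check=True)
```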