Suraj Singh's Information Security Blog

For people who work on Information Security.

December, 2012

  • TMG services hang at startup due to a third-party service.

    This post is once again about an issue I worked on a few days back. Before I discuss the issue and how I resolved it, I want to state the objective of this post: to make TMG admins aware of the issue and what can be done to resolve it. The steps performed to determine the root cause, such as user-mode dump analysis, cannot be reproduced without the symbols (which are private), so the idea is not to help in performing the dump analysis; instead, I want to share the details of the analysis to show why, at boot time, TMG services can hang and fail to start. If you are familiar with terms like process, thread, and stack, you can follow it yourself; if not, I will explain the observations.

    Issue:

    The TMG server admin was rebooting the server, and during reboot the TMG services were hanging and would not start. A similar issue was reported before TMG SP2 and was fixed in SP2; in this scenario, however, TMG was already updated to the latest build, TMG SP2 RU2.

    Troubleshooting:

    Some background: a lot of work had already happened before I started on this issue. In such scenarios you understand the issue, check the steps that have already been taken to resolve it, and move on from there; for example, the steps in http://support.microsoft.com/kb/2659700 had already been performed. We were getting the following event:

    _____________________________________________________________________________________________

    Log Name: System

    Source: Service Control Manager

    Date: 09/11/2012 17:42:30

    Event ID: 7022

    Task Category: None

    Level: Error

    Keywords: Classic

    User: N/A

    Computer: server1

    Description: The Microsoft Forefront TMG Firewall service hung on starting.

    Event Xml: <Event

    xmlns="http://schemas.microsoft.com/win/2004/08/events/event">

    <System>

    <Provider Name="Service Control Manager" Guid="{555908d1-a6d7-4695-8e1e-26931d2012f4}"
    EventSourceName="Service Control Manager" />

    <EventID Qualifiers="49152">7022</EventID>

    <Version>0</Version>

    <Level>2</Level>

    <Task>0</Task>

    <Opcode>0</Opcode>

    <Keywords>0x8080000000000000</Keywords>

    <TimeCreated SystemTime="2012-11-09T17:42:30.378163900Z" />

    <EventRecordID>344470</EventRecordID>

    <Correlation />

    <Execution ProcessID="716" ThreadID="720" />

    <Channel>System</Channel>

    <Computer>server1</Computer>

    <Security />

    </System>

    <EventData>

    <Data Name="param1">Microsoft Forefront TMG
    Firewall</Data>

    </EventData>

    </Event>

    _____________________________________________________________________________________________

     Data collection

    During the course of troubleshooting, we collected a user-mode dump of the hung process while trying to restart the services in automatic startup mode, when the hang occurred again.

    User mode dumps collection reference: http://msdn.microsoft.com/en-us/library/ff420662.aspx
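    As an illustration (the post does not say which tool was used here), a common way to capture a full user-mode dump of the hung Firewall service process with Sysinternals ProcDump would be (the output path is a placeholder):

    procdump.exe -ma wspsrv.exe C:\dumps\wspsrv_hang.dmp          (-ma writes a dump with full process memory)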

    Data analysis

    The approach taken in this post closely follows the guidelines in the following link about debugging a deadlock, since we were in a deadlock-like scenario: http://msdn.microsoft.com/en-us/library/windows/hardware/ff540592(v=vs.85).aspx

     

    In the dump, I found that a critical section was locked.

    For more about critical sections and locked critical sections, refer to: http://msdn.microsoft.com/en-us/library/windows/hardware/ff541979(v=vs.85).aspx

    Then I located the owning thread of this locked critical section and examined its stack (read from the bottom up). From the call stack it appeared that wspsrv.exe (the Firewall service) was trying to load a filter called XSISAPI and had deferred its filter startup until this filter was loaded.
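    For reference, this kind of analysis is done in WinDbg with commands roughly like the following (a sketch; the thread ID is a placeholder, and resolving the stack requires the right symbols):

    !cs -l -o                 (list locked critical sections along with their owning threads)
    ~~[<tid>]s                (switch to the owning thread by its thread ID)
    kb                        (dump that thread's call stack)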

    Then I checked the module information for this XSISAPI filter and found that it is a filter belonging to Afaria, from Sybase.

    Solution:

    We configured the Afaria (XSISAPI) filter service for delayed start, and after that the TMG services started normally after a reboot.
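    For example, a service can also be switched to delayed start from the command line (<ServiceName> is a placeholder, as the actual service name of the Afaria filter service will vary; note the space after start= is required):

    sc config <ServiceName> start= delayed-auto
    sc qc <ServiceName>                          (verify the new start type)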

  • Presentation on UAG authentication and authorization, with a scenario discussion.

    Hi folks, I am uploading a presentation for UAG admins and my peers, to provide more information around this topic. I'm pretty sure there are documents around it already, but I have tried to give my own perspective.

    I will add audio/video to it in later revisions. I am also discussing here the scenario covered in the presentation.

    Issue: if the BaseDN of the UAG authentication repository is pointed at the domain root, portal access becomes very slow; you cannot get logged in, and the portal is mostly unresponsive.
     
    The UAG admin had configured user-group-based access for the applications published on the UAG server.
     
    Troubleshooting
    A lot of work had already happened when this case came to me. I suggested that the UAG admin collect network captures for 3-5 minutes on the internal NIC of the UAG server, after pointing the BaseDN to the root of the domain and while the issue was occurring.
     
    Data Analysis
     
    • In the network captures, in the traffic between UAG and the domain controllers, I saw that the TCP window size advertised by the UAG server gradually dropped to zero for almost all of the Global Catalog sessions. Please refer to the snapshot below.
    • Another observation was the huge amount of LDAP/Global Catalog traffic. It appeared that a lot of data was being transferred from the domain controller to UAG, which can happen if UAG forces the domain controller to do so with a certain way of querying.
    • So the next thing I wanted to see was the repository configuration, specifically the BaseDN and the "include subfolders" option, as these can force UAG to issue such queries.
    • I found that "include subfolders" was set to 5.
     
     
     
    Action plan/workaround/solution
     
    • After seeing this, I changed the include-subfolders level to 2 and asked the admin to observe, but I still saw the same behavior in the network captures. So I reduced it to 0, then unchecked the option entirely (for testing) and observed again. That reduced the impact, and the problem did not occur for a week. When it came back, I looked at the traces again and we still had a similar pattern.
    • Then we did a detailed analysis of the way the admin had set up his Active Directory.
    • After detailed discussions with the admin, we redesigned the Active Directory infrastructure a little: we created a specific OU for UAG users and pointed the repository at the OU that contained most of the user groups. We also found that these user groups were not nested: users in these groups were members of other groups, but the groups themselves were not nested in one another. So we also cleared the nested-level checkbox, so that searches for groups within groups are not performed, as those also generate a huge number of queries. After that, the issue did not recur.
    • The learning from this case: by reducing the search scope, we reduced the number of queries made to the domain controller, and with it the amount of returned data that was causing UAG to choke, as we saw in the network traces.
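    To compare the result-set sizes yourself, a scoped LDAP export can be run against both BaseDNs; a sketch using ldifde (the OU and domain names below are placeholders):

    ldifde -f uag_ou.ldf -d "OU=UAGUsers,DC=contoso,DC=com" -p subtree -r "(objectClass=user)"      (export only the UAG OU)
    ldifde -f domain.ldf -d "DC=contoso,DC=com" -p subtree -r "(objectClass=user)"                  (export from the domain root; compare the file sizes)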
     
    Few ideas
     
    • We can have more than one repository in UAG.
    • We can use more than one repository on a trunk.
    • We can use different repositories, and the user groups in them, to authorize users on applications on a trunk.
    • Using these ideas, if we run into a performance issue where the admin has a huge AD infrastructure and sees performance problems (due to large queries and query responses from the domain controller) when the BaseDN points at the root of the domain, we can suggest reducing the scope of the queries by pointing the BaseDN at an OU related to UAG access.

     

  • UAG DA client cannot connect, error: ERROR_IPSEC_IKE_AUTH_FAIL in the network captures.

    Friends, there are a few prerequisites to understanding the following material: you should have a basic understanding of how to set up UAG DirectAccess (DA) and its concepts. The target audience is admins and peers who already work with UAG DA and may run into the following scenario. I am planning to add links about the basics of UAG DA here later, for everybody's understanding.

    Scenario discussion

    I was setting up a lab for a UAG DA scenario and found that the DA client was not connecting. So I checked whether the tunnels were up by running the following command:

    netsh advfirewall monitor show mmsa

    and found that neither of the two tunnels was up, as I did not see any main mode SAs (security associations) established; and since main mode did not establish, quick mode cannot. That meant we had to start focusing on the IPsec part of UAG DA which, as we know, uses multiple technologies to implement DA: IPsec, IPv6, Windows Firewall with Advanced Security, and others.
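    For completeness, both stages of the IPsec negotiation can be checked; quick mode SAs will only appear once main mode has succeeded:

    netsh advfirewall monitor show mmsa          (main mode security associations)
    netsh advfirewall monitor show qmsa          (quick mode security associations)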

    So before digging deeper, I took network captures on the client machine and ran the following two commands at the command prompt:

    net stop iphlpsvc                           (to stop IP helper service)

    net start iphlpsvc                          (to start IP helper service)

    The purpose of this is to restart the IP Helper service: on restart it tries to bring up DA connectivity, establishing the infrastructure tunnel and then the corp tunnel. This also means the main mode IPsec SAs try to establish again, which is where we wanted to start focusing.

    In the network captures I got the following error, after filtering the trace on protocol.authip (i.e. the AuthIP protocol) in Network Monitor:

    ERROR_IPSEC_IKE_AUTH_FAIL

    This was not very descriptive, so I collected scenario tracing as below.

    I ran the following two commands at the command prompt:

    •  Netsh trace start scenario=directaccess capture=yes report=yes tracefile=C:\client.etl
    •  Netsh wfp capture start

    Then 

    • net stop iphlpsvc                           (to stop IP helper service)
    • net start iphlpsvc                          (to start IP helper service) 

     to initiate the DA connectivity again.

    Then I stopped the traces by running the following two commands at the command prompt:

     

    •  Netsh wfp capture stop
    •  Netsh trace stop

    Then I opened the client.etl file with Network Monitor and found the following:

     

    i.e.  

    WFP:IPsec: Main Mode Failure - Error: ERROR_IPSEC_IKE_AUTH_FAIL
    WFP:IPsec: Main Mode Failure - Error: CERT_E_WRONG_USAGE

    From the ipconfig details of the client: since the client was on the public internet, it was using the 6to4 adapter. So my client, with IPv6 address 2002:2828:2803::2828:2803 (a 6to4 address, in which the hex digits after the 2002: prefix, 2828:2803, encode the client's public IPv4 address 40.40.40.3), was the one sending that machine-certificate usage error to the UAG DA server.

    **********************************************************************
    Tunnel adapter 6TO4 Adapter:
    
       Connection-specific DNS Suffix  . : 
       Description . . . . . . . . . . . : Microsoft 6to4 Adapter
       Physical Address. . . . . . . . . : 00-00-00-00-00-00-00-E0
       DHCP Enabled. . . . . . . . . . . : No
       Autoconfiguration Enabled . . . . : Yes
       IPv6 Address. . . . . . . . . . . : 2002:2828:2803::2828:2803(Preferred) 
       Default Gateway . . . . . . . . . : 2002:2828:280b::2828:280b
       DNS Servers . . . . . . . . . . . : fec0:0:0:ffff::1%1
                                           fec0:0:0:ffff::2%1
                                           fec0:0:0:ffff::3%1
       NetBIOS over Tcpip. . . . . . . . : Disabled
    
    **********************************************************************
    That made me check the machine certificate on the UAG DA server, which looked right; I ran the following command at the command prompt on the UAG DA server:
    certutil -store my
    and it showed that the certificate was OK.
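    Since the client-side trace reported CERT_E_WRONG_USAGE, it is also worth dumping the certificate verbosely and checking its Enhanced Key Usage field; for example:

    certutil -v -store my          (verbose output; look for Client Authentication / Server Authentication under Enhanced Key Usage)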
     
    Just to be on the safe side, I removed this machine certificate from the computer certificate store of the UAG DA server, requested the machine certificate again from the certification authority, and then restarted the IP Helper service as explained above to restart the DA connectivity.
    This time DA connected fine; the network captures were clean, and the AuthIP traffic showed that authentication was successful.
    I tested DA by connecting to internal resources, and it worked.