• TMG performance issues-another one in that series-maxconcurrentapi

    This post is an addition to the series of posts that I have written on TMG/ISA performance issues. Every time I get a case on TMG performance issue, there is always some new twist to the scenario. But it always circles round few known root causes e.g. name resolution, authentication issues with authenticating servers, sometimes CPU spikes due to various reasons e.g. excessing logging, at times Drive latency issues etc.

    So whenever i get  case where we have performance issue, these root causes are i keep looking for . This time around , this issue hit authentication piece. So the TMG server would stop responding intermittently. So i sent action plan as per my blog post http://blogs.technet.com/b/isablog/archive/2011/01/13/random-authentication-prompts-while-accessing-internet-through-isa-server-followed-by-isa-server-becoming-unresponsive.aspx for data collection .

    In the perfmon logs I found backlogged packets counter spiking and as we know(please refer to my previous posts on this topic for more info) this can happen either because of authentication or name resolution issues. Then I looked at the netlogon logs collected, pasted one snippet below

    ******************************************************************************************************

    [CRITICAL] contoso: NlAllocateClientApi timed out: 0 253

    [CRITICAL] contoso: NlpUserValidateHigher: Can't allocate Client API slot.

    [LOGON] SamLogon: Network logon of contoso\jsmith from x Entered

    [CRITICAL] contoso: NlAllocateClientApi timed out: 0 253

    [CRITICAL] contoso: NlpUserValidateHigher: Can't allocate Client API slot.

    [CRITICAL] contoso: NlAllocateClientApi timed out: 0 253

    [CRITICAL] contoso: NlpUserValidateHigher: Can't allocate Client API slot.

    [LOGON] SamLogon: Network logon of contoso\testuser from x Returns 0xC000005E

    ******************************************************************************************************

     

     

    There are a lot of entries with the status 0xc000005e  In the err.exe did lookup for this hex code  and found following :

     

    *******************************************************************************************************

    err 0xc000005e

    # for hex 0xc000005e
    / decimal -1073741730 :

    STATUS_NO_LOGON_SERVERS ntstatus.h

    # There are currently no logon servers available to service

    # the logon request.

    # 1 matches found for "0xc000005e"

    ******************************************************************************************************* 

    We were also getting  "NlAllocateClientApi"

     

      Article  http://support.microsoft.com/kb/2688798 explains this behavior

    "

    When you discover authentication time-outs or delays (also known as MaxConcurrentApi bottlenecks) in an environment, the typical way to resolve the problem is to raise the maximum allowed worker threads that service that authentication. You do this by altering the MaxConcurrentApi registry value and then restarting the Net Logon service on the servers

    "

     

    we can clearly see the timeouts in our netlogon logs so to resolve this we planned to increase the value of maxconcurrentapi registry key on the problem TMG server and increased it to 10 after that issue did not occur. I also suggested the admin that if problem recurs then we can calculate new value from the perfmon logs and if needed use perfon logs to understand how domain controllers are behaving. I found a nice post that explains similar problem that was due to domain controller http://blogs.msdn.com/b/spatdsg/archive/2006/01/05/507299.aspx

     

  • Delay in loading page of a website published through UAG

    When we publish a website(including exchange services and share point services), UAG does lot of processing ,in fact parsing of the content and links in the pages ,when a user accesses  these pages through UAG.

    I came across a case, where we experienced delay after user logs in to the website and the page loads on the browser.

    To understand where delay was happening, I used a tool called HTTPwatch on the client side and for deeper fact finding on a possible cause on server side, took UAG trace, with all filters. You can refer to Ben ari's blog post about UAG tracing http://blogs.technet.com/b/ben/archive/2010/09/03/uag-tracing-made-simple.aspx.

    In the HTTPWatch I found delay in the URL, which we can see in following snapshot.

     

    Then in the UAG tracing I found delay when this page was getting parsed

    ****************************************************************************************

    Info:MatchServer(a.contoso.com, 80, 0): took 5566 milliseconds

    Info:MatchServer(b.fabrikam.com,80, 0): took 20 milliseconds

    Info:SearchAndReplaceURLLocation(/examplepath/):took 5676 milliseconds

    Info:SearchAndReplaceURLLocation(/examplepath/): took 4565 milliseconds

    Info:MatchServer(x.fabrikam.com,80, 0): took 5253 milliseconds

    Info:ProcessBuffer(/examplepath/):took 8556 milliseconds

    Info:took 8556milliseconds

    Info:SearchAndReplaceURLLocation(/examplepath/examplefile):took 25 milliseconds

    ****************************************************************************************

    Note: simplifying the numbers,  5676 milliseconds is actually 5.676 seconds.

    Here while parsing UAG is trying to resolve names of servers/hostnames in the links it encounters in the page its parsing, here our problem page.

    I confirmed this delay in the network traces taken at the time of the issue, putting one of the name resolution conversation from network traces below, we can see delay of 11 seconds and finally response from DNS server as server failure i.e. it could not resolve name.

    *******************************************************************************

    20:47:29 PM 1/2/2013  xxxx      DNS               DNS:QueryId = 0xA888, QUERY (Standard query), Query  for a.contoso.com of type Host Addr on class Internet           {DNS:xx, UDP:xx, IPv4:x}
    20:47:40 PM 1/2/2013  xxxx      DNS               DNS:QueryId = 0xA888, QUERY (Standard query), Response - Server failure                {DNS:xx, UDP:xx, IPv4:x}
    *******************************************************************************

     So provided following suggestions to the Admin

    1. for testing and as work around, skip body parsing for this URL(seen in httpwatch, where we see delay).

    2. Get the name resolution fixed on the DNS server or use host file for the names which are not getting resolved, due to which delay in loading of page is happening.

    Our testing of skip body parsing for this URL worked and we got rid of the delay, then Admin worked on the DNS name resolution for permanent fix of the problem.

     

  • ISA/TMG memory Pool Depletion and performance issues.

    I worked on a case recently where ISA server's service was stopping and hanging intermittently and Administrator had to reboot it to get it to work again.

    As usual with such performance issues, We put the performance counters on the ISA server , Please refer to my blog post, for TMG, http://blogs.technet.com/b/sooraj-sec/archive/2013/01/07/tmg-performance-counters-template.aspx and for ISA server http://blogs.technet.com/b/sooraj-sec/archive/2010/09/12/isa-server-stops-responding-the-user-requests-and-its-required-to-reboot-the-isa-server-or-restart-the-firewall-service-scenario1.aspx.

    Ideally we should let performance counters to run for at least a day to understand the trends, in scenario like the one i m discussing, we should have the perfmon data collected when issue was experienced.

    I looked at data and found memory pool for SSL counter to be dropping from 100 percent to 0 during high load as we can see, counter highlighted in black , in following snapshot.

     

     

    This is a memory pool that handles SSL connections, We have a KB article(http://support.microsoft.com/kb/842438) that explains the problem that can happen because of it.

    Following section  in above KB provides detail about it

    "

    For the Secure Sockets Layer (SSL) connection   request pool, no estimation formula is available now. However, you can obtain   the current pool size from the event data and pick a reasonably larger number.   For example, doubling the pool size is a reasonable estimate at the first   time.

    "

    so in this scenario, we doubled its value and then monitored the server, After putting the regkey for this counter we did not experience this issue.

    I have also written another blog post for memory pool for http http://blogs.technet.com/b/sooraj-sec/archive/2011/01/10/eventid-31212-proxyvmemalloc3psize-registry-value-calculation.aspx .