Over the past few years we’ve learned more about “NTLM and MaxConcurrentApi Concerns” and we’ve even come up with some new ways of addressing them. The starting point for learning more is the Knowledge Base article
You are intermittently prompted for credentials or experience time-outs when you connect to Authenticated Services.
Although it is not one of the highest volume issues our customers call about, there is one complex scenario that would be a winner if we handed out prizes for the problems that take longest to resolve.
That scenario is NTLM client to server authentication in a distributed forest environment. Let’s talk about this in more detail.
One of the reasons Kerberos was such a great leap forward compared to NTLM is that we can impersonate a user when trusted to do so, thereby eliminating the need to go trekking all the way back to an authority (a domain controller) in order to authenticate access to a resource. This doesn’t sound like much of a gain at first look, but let me paint a little picture for you. We’re going to use Microsoft Internet Security and Acceleration Server (ISA) in conjunction with Internet Explorer 6 as the example here, but this behavior can occur with any product which uses NTLM authentication in a similar manner.
So let’s consider that you are using ISA as a web proxy for your web clients. In other words, your users’ internet web traffic must pass through the ISA server. A user requests a web page, the request is proxied through the ISA server, and so is the response. I’ll be the first to confess that ISA is not my strong suit. Regardless, I’m about to plunge in and explain this scenario, so please be a little forgiving if I misstate an ISA specific aspect somewhat.
Joe the user attempts to connect to a web page on the internet. Let’s be optimistic and assume that this is a work related web page that is hosted “in the cloud”. The sequence of events that happens from the authentication aspect are roughly the following:
1. Joe opens his web browser and types in the destination URL.
2. Joe’s computer sends HTTP traffic which goes to the proxy server. This traffic includes Joe’s integrated authentication information (in other words, Joe’s credentials), using NTLM authentication as the method.
3. The proxy server in turn needs to verify Joe’s credentials. It does this by sending traffic to the domain controller it has a secure channel to, requesting verification of the user’s credentials.
4. That domain controller responds to the ISA server with the information on Joe’s credentials.
5. The ISA server retrieves the requested web page from the internet and serves it back to Joe’s computer (assuming Joe is allowed access to the web).
Well, that sounds straightforward enough, doesn’t it? What’s the problem?
The problem lies in that each web connection from a client will follow this same process in the NTLM scenario, repeating step 3 over and over again. This makes for a high volume of individual authentication “transactions”.
The Netlogon service, among other things, takes care of passing NTLM authentication requests to a domain controller that can handle them, and of receiving them on that domain controller. Since the Netlogon service runs within the Lsass.exe process, the authentication is ultimately handled by Lsass.exe. Within that process are threads, which can be thought of as the little workers that run the code. For NTLM authentication specifically there is a fixed number of threads which will handle the request and the answer. The setting which governs how many threads can be used is called MaxConcurrentApi. The default is typically 1 or 2, meaning that only one or two threads hand off, receive and process these requests.
A MaxConcurrentApi thread can basically only deal with one authentication at a time, though each one is very, very quick. So this high volume of authentication transactions must be handled by one or two threads (by default). With a sufficiently high volume a bottleneck appears, causing some of these transactions to wait longer than a remote client can tolerate.
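To make the bottleneck a bit more concrete, here is a minimal Python sketch of a queue served by a fixed number of slots. It is only a toy model, not how Netlogon is actually implemented: the function name `simulate_waits`, the 5 ms service time and the burst of 200 requests are all made-up values for illustration.

```python
import heapq

def simulate_waits(arrival_ms, service_ms, slots):
    """Return how long (in ms) each request waits before a slot picks it up.

    Toy model: `slots` stands in for the MaxConcurrentApi value and every
    request holds its slot for exactly `service_ms` milliseconds.
    """
    # free_at holds the time at which each slot next becomes available.
    free_at = [0] * slots
    heapq.heapify(free_at)
    waits = []
    for arrival in sorted(arrival_ms):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_ms)
    return waits

# Hypothetical burst: 200 requests arriving 1 ms apart, each holding a slot for 5 ms.
arrivals = list(range(200))
for slots in (1, 2, 5):
    worst = max(simulate_waits(arrivals, 5, slots))
    print(f"slots={slots}: worst wait {worst} ms")
```

With these invented numbers the last request in the burst waits far longer with one or two slots than with five, which is the same shape of problem Joe runs into: the work itself is quick, but the line in front of it is not.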
The net result of that bottleneck in our ISA scenario would be that Joe’s browser would pop up a credential prompt rather than the external web page he was hoping for. Cue Joe’s immediate help desk call.
This scenario is commonly treated by increasing the number of threads by altering the MaxConcurrentApi setting per the steps in this article. The maximum recommended number is 5, though it can be set to 10. The setting helps most when it is changed on both the application server (ISA in this case) and the domain controllers involved.
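As a rough illustration of what that change looks like, the sketch below builds the `reg add` command line for the setting. It assumes the value is the MaxConcurrentApi DWORD under the Netlogon service's Parameters key, and it uses the ceiling of 10 mentioned above; the helper name `max_concurrent_api_command` is made up. Follow the Knowledge Base article for the authoritative steps, including restarting the Netlogon service afterwards.

```python
# Assumed registry location of the Netlogon parameters.
NETLOGON_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"

def max_concurrent_api_command(value, ceiling=10):
    """Build the `reg add` command that would set MaxConcurrentApi to `value`.

    The command is only returned as text; nothing is written to the registry.
    """
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("MaxConcurrentApi must be a whole number")
    if not 1 <= value <= ceiling:
        raise ValueError(f"MaxConcurrentApi must be between 1 and {ceiling}")
    return (
        f'reg add "{NETLOGON_PARAMS}" '
        f"/v MaxConcurrentApi /t REG_DWORD /d {value} /f"
    )

print(max_concurrent_api_command(5))
```

Generating the command as text, rather than writing the registry directly, makes it easy to review the change before running it on each ISA server and domain controller.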
Now let’s add a twist to the above sequence of events and stipulate that, though MaxConcurrentApi has been increased on the ISA servers and domain controllers involved in this scenario, the end users are still getting credential prompts intermittently.
Additionally, what if some of the high volume of users who are web proxying through the ISA server are actually users from a domain other than the one the ISA server is in (assuming there is a working trust between those domains)? For users who are in the same domain as the ISA server we expect the authentication traffic to go to the domain controller which the ISA server has a secure channel to. For users who are from another domain, the authentication request is sent to that same domain controller, but that domain controller cannot fulfill the request since it doesn’t know of this user.
The domain controller is aware of a trust where this user may reside (since the user’s name has appeared in the format of domain\username) and so forwards this request on to a domain controller from the trusted domain. This DC must in turn fulfill this request and send the response back to the originating domain’s DC.
Besides being a complicated story, what’s the big concern here? Well, while the authentication request is being processed by the trusted domain’s DC, one of the MaxConcurrentApi threads is waiting for the response from the domain controller in that user’s remote domain and cannot service other authentication requests while waiting. Once the ISA server’s DC receives the response from the remote domain’s DC, it can respond to the ISA server and that thread is free for additional requests. This can exacerbate the bottleneck and lead to more of those credential prompts appearing to end users as they try to get to web pages on the internet via the proxy server.
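A bit of back-of-the-envelope arithmetic shows why that waiting matters. The sketch below estimates how many authentications per second a given number of threads can sustain when some fraction of requests has to make the extra trip to a trusted domain. The function name `sustainable_rate` and the timings (10 ms local, 100 ms across the trust) are invented purely for illustration; real numbers depend entirely on your environment.

```python
def sustainable_rate(slots, local_ms, remote_ms, remote_fraction):
    """Requests per second that `slots` threads can sustain for a given user mix.

    `remote_fraction` is the share of requests that must be forwarded to a
    DC in a trusted domain and therefore hold a thread for `remote_ms`.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    mean_hold_ms = (1 - remote_fraction) * local_ms + remote_fraction * remote_ms
    return slots * 1000 / mean_hold_ms

# Hypothetical timings: 10 ms for a local user, 100 ms across the trust.
for fraction in (0.0, 0.25, 0.5):
    rate = sustainable_rate(2, 10, 100, fraction)
    print(f"{fraction:.0%} remote users: about {rate:.0f} authentications/sec")
```

Under these assumptions, having a quarter of the users come from the trusted domain cuts the sustainable rate to well under half of what the same two threads manage for local users only.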
Even with the complicated scenario above we in support infrequently get calls on this sort of problem. The product is simply very robust.
No, what usually tips the scale to make the above scenario bad enough to become a significant and noticeable pain is when the above is happening (and not necessarily with ISA as the product using NTLM) and the domains are being “run down” by legacy API calls. Run down is a term we sometimes use to describe performance degradation on a server. What we have seen in the past is that the above scenario really becomes a concern if there are scripts or other custom applications in the environment using legacy code. Legacy code in this context means code or scripting that was originally designed for Windows NT; any WinNT: provider code would likely fit this bill. Generally speaking, this legacy code does not perform as quickly on a server, and for authentication that code will invariably use NTLM.
You can visualize the NTLM authentication requests as cars backed up in a traffic jam, where the delay results in things backing up all the way to the web client, resulting in a credential prompt. This isn't a perfect analogy but gets the point across. The role of the administrator or support person is to figure out what the holdup is which is causing the traffic jam and to resolve it so that traffic moves quickly once more. But identifying this issue can be pretty difficult.
Here’s what you can do.
First, events related to this authentication will appear in the Netlogon log of the ISA server and any domain controller involved in servicing the authentication requests. The steps on how to enable that logging are here, and the relevant debug entries in the netlogon.log will look like this when the problem is occurring:
Time [LOGON] SamLogon: Network logon of DomainName\UserName from WorkstationName Returns 0xC000005E
The success entry will be similar to the above; however, the last portion will read “Returns 0x0”. The failure code 0xC000005E is STATUS_NO_LOGON_SERVERS.
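If the log is large, a few lines of script can do the counting for you. The following is a minimal sketch that tallies failed SamLogon entries by the user's domain, assuming the entries follow the format shown above. The sample lines, domain names and the function name `failures_by_domain` are made up, and user names containing spaces are not handled.

```python
import re
from collections import Counter

# Matches entries of the form shown above:
#   <time> [LOGON] SamLogon: Network logon of DOMAIN\user from WORKSTATION Returns 0x...
SAMLOGON = re.compile(
    r"\[LOGON\].*?SamLogon: .*?logon of (?P<domain>[^\\\s]+)\\(?P<user>\S+)"
    r" from (?P<workstation>\S+) Returns (?P<status>0x[0-9A-Fa-f]+)"
)

def failures_by_domain(lines, status="0xC000005E"):
    """Count SamLogon entries that returned `status`, grouped by user domain."""
    counts = Counter()
    for line in lines:
        match = SAMLOGON.search(line)
        if match and match.group("status").lower() == status.lower():
            counts[match.group("domain").upper()] += 1
    return counts

# Hypothetical log lines in the format described above.
sample = [
    r"12/10 11:10:07 [LOGON] SamLogon: Network logon of CONTOSO\joe from WKS01 Returns 0x0",
    r"12/10 11:10:09 [LOGON] SamLogon: Network logon of FABRIKAM\ann from WKS02 Returns 0xC000005E",
    r"12/10 11:10:09 [LOGON] SamLogon: Network logon of FABRIKAM\bob from WKS03 Returns 0xC000005E",
]
print(failures_by_domain(sample))
```

Grouping by domain is a deliberate choice here: if most failures belong to users from a trusted domain, that points toward the cross-domain delay described earlier.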
At its core this issue is one of performance degradation, or run down, on the domain controllers, and we have some good techniques to use there. Server Performance Advisor’s (SPA) AD Data Collector should be run while the issue is occurring (users see authentication issues) on the domain controllers which the ISA servers have secure channels to. It is highly likely that the SPA report will show that the DC was not busy. This issue may not show up as a performance problem in the traditional sense, since it is more like a delayed response. However, SPA may show you that the highest caller was some function that starts with “SAM”. That’s a clue, as is which computers are sending those requests to it.
Next you can consider installing a hotfix on the domain controllers involved. This hotfix adds performance counters for the Netlogon service which can be used to better understand how many NTLM authentications are occurring, if any are failing, and give an idea of what domains the users who are being authenticated are in. The domains information helps us understand how many of the auth requests will cause that possible delay while waiting for the remote domain’s DC to reply.
Network captures are also a good tool for this; however, unless you already have an idea of what to look for, you will have difficulty gleaning data from them.
For example, the ISA scenario above occurred where our customer had a legacy script running just about everywhere, all the time. We got a hint that there was a legacy script, and what it was doing, in our SPA test. Using the name of the function call we spent many hours tying network traffic from when the issue was occurring (taken from the DCs involved) to the script functions seen in the SPA report. Once we had that tied together we could then focus on the workstations, application servers, terminal servers and all where this script was running since we had the source IP of the computers running it.
There may be application side considerations that help alleviate this concern as well. Kerberos usage, rather than NTLM, will generally make this issue go away or never appear in the first place. In the ISA case there is an excellent Technet article here with more detailed and application specific information.
Larger domain environments which use NTLM authenticating applications are more likely to see this sort of scenario, but it is not a widespread concern by any stretch of the imagination. For those who do see this issue, it can be time consuming and frustrating until you get a handle on the problem, but it’s ultimately resolvable. And you can always call us at Microsoft support to help; that’s why we’re here.
Another good blog Tim, what I like about your blog is that you cover things that I don't see a lot in other places.
More symptoms for this post from the netlogon.log:
12/10 11:10:09 [CRITICAL] <DOMAIN>: NlAllocateClientApi timed out: 0 258
12/10 11:10:09 [CRITICAL] <DOMAIN>: NlpUserValidateHigher: Can't allocate Client API slot.
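A quick way to see how often these entries show up is to bucket them by minute. The sketch below is a minimal example that assumes the timestamp and message layout of the two lines above; the names `slot_events_per_minute`, `timeout` and `no_slot` are made up for illustration.

```python
import re
from collections import Counter

# Timestamp prefix of a [CRITICAL] entry; the capture group keeps "MM/DD HH:MM".
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}):\d{2} \[CRITICAL\]")

# The two messages shown above, under made-up short labels.
EVENTS = {
    "timeout": re.compile(r"NlAllocateClientApi timed out"),
    "no_slot": re.compile(r"Can['’]t allocate Client API slot"),
}

def slot_events_per_minute(lines):
    """Count the two slot-exhaustion messages per (minute, label)."""
    counts = Counter()
    for line in lines:
        stamp = STAMP.match(line)
        if not stamp:
            continue
        for label, pattern in EVENTS.items():
            if pattern.search(line):
                counts[(stamp.group(1), label)] += 1
    return counts

# The two entries quoted above, with <DOMAIN> left as a placeholder.
sample_critical = [
    "12/10 11:10:09 [CRITICAL] <DOMAIN>: NlAllocateClientApi timed out: 0 258",
    "12/10 11:10:09 [CRITICAL] <DOMAIN>: NlpUserValidateHigher: Can't allocate Client API slot.",
]
for key, count in sorted(slot_events_per_minute(sample_critical).items()):
    print(key, count)
```

Lining these per-minute counts up against the times users reported credential prompts is one way to confirm that the prompts and the slot exhaustion are the same event.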