A while back I got involved in an issue where a company had written an application for their own internal use. This application was intended to connect to a web page served out by IIS and then proceed with some other action. They wisely chose to make sure that the initial connections it used were secure on the network. A method they wanted to use for that was secure socket layer (SSL) communication.
This company’s code ran within the IIS application pool and simply called into the underlying Windows SSPI in order to kick things off. We’ll go over a little of this in case you’ve never had the opportunity to delve into the details of how this works. SSL and transport layer security (TLS) typically use public key interchange (PKI) based certificates in Microsoft’s and most other implementations to provide the private and secure pieces of data to essentially guarantee the communicator’s authenticity and security. In other words, the certificates that are present are important.
What our customer saw-the reason they were calling us-was a consistent approximate 20 second delay on initial connection to the web page. This would only happen the first time a user connected to the page; subsequent connections in the same user logon session of that workstation saw no delay. Were they to log off and back on however they would see the problem recur.
Network traces were the first thing we wanted to look at. Filtering for authentication based traffic can be a tricky thing since authentication can be used for a variety of other network protocols. Not all network traffic analyzers will filter for the authentication used and thus catch all traffic for it regardless of the protocol which is needing to be authenticated. The common thing we see this in is filtering for Kerberos authentication, but it applies here as well.
In this scenario, we filtered for SSL.
The traffic we saw showed a distinct 20 seconds in delay after a server packet arrived to the client, before the client responded. A little background understanding…
Since SSL derives this base security from certificates, there is an exchange that happens as client provides the basics on what certificates it has to the server, and server to the client, so that they can decide on what certificates can be used to establish this secure connection. Keep in mind that not all certificates have the same capabilities, so some may not work well. In addition, they often need to be checked for revocation. Also keep in mind that ideally, an SSL connection would like to use both a client and a server certificate to for this security, but if necessary only server side will be used if client doesn’t “qualify” or exist.
Naturally our first thought was that perhaps we were checking for the certificate revocation list for installed certificates on the client (since the delay was so clearly seen on the client side in the network traffic). For those installed certificates, however, there was no outbound traffic for the defined CRL in the certificates. So, that wasn’t it.
At this point we did a lot of head scratching.
Our next thought was to use Filemon.exe and Regmon.exe on that client as it did this. It was a little tough to filter but we saw that it was trying to access the local certificate store on the computer. This is a file location of %userprofile%\Application Data\Microsoft\Crypto for user certificates, and \All Users\Application Data\Microsoft\Crypto for the computer enrolled certificates. These tools really didn’t give us a thorough enough picture to conclude anything. By the way, here’s what regmon showed us:
7943 7:51:31 PM w3wp.exe:4080 OPEN C:\Documents and Settings\All Users\Application
-43cf-bb20-39161a0a0108 ACCESS DENIED NT AUTHORITY\NETWORK SERVICE
7944 7:51:31 PM w3wp.exe:4080 OPEN C:\Documents and Settings\All Users\Application
-43cf-bb20-39161a0a0108 NOT FOUND Options: Open Sequential Access: All
Our next venture was to use DebugDiag, an IIS specific troubleshooting tool, so we could see what is happening under the hood. I don’t profess to be an expert in using this tool, or even to be someone who uses it often, but it helped us here. Here’s a link on how to get this tool, and what to do with it:
This tool helped us on the client side by capturing lsass.exe. When we looked at those little mini dumps we saw that we would consistently see a thread looking for objects in the \All Users\Application Data\Microsoft\Crypto path. This was still not a smoking gun, though, since as you all know, a dump is just a capture of what is happening at that nanosecond. So, like a snapshot picture taken too late at a race finish, we could’ve missed the good shot.
But seeing this made me think of a hypothesis: what if the delay is a timeout period in our code when there is a problem accessing the local certificate store for some (unknown) reason? First culprit to check would be permissions if that was it.
So, we dumped the permissions of that directory and subdirectories using SUBINACL.EXE. You can get that tool from the link below, and I believe it is in the resource kits as well.
The application was running in the IIS app pool (w3wp.exe process). Things running in the app pool apparently will use the context of NT Authority\Network Service to do the SSL session setup. Particularly in later versions and service packs. In the permissions export, you will want to look for the application pool context (I suggest starting with NT Authority\Network Service) and see if those permissions are present. To prevent blinding headaches I suggest dumping the same permissions on a default install client and see what they show for comparison purposes.
Our conclusion was that SSL session setup does a check of the local certificate store for the computer and that the 20 second delay was the amount of time which the CryptoAPI will take before timing out in trying to access the local computer's machine certificate store and doing a server based SSL session rather than client-server SSL.
The customer had removed the default permission on this directory; they removed the Network Service’s permissions altogether. Although I was never clear on why that was done, there was no glaringly obvious reason not to do that. At least, until we figured this issue out.
As you’ve probably concluded by now, we put these permissions back to default and immediately saw the difference.
For those out there who are trying to match this to their own real world scenario I’d like to point out that there can be variations on this theme. The certificate can be a user based one and it could be trying to reach the cached CRL URL objects, or any of the associated objects in these child directories. The key to start with on this type of thing is:
-Do a trace to see whether the delay is client side or server side.
-Once determined, do filemon.exe as you see the difficulty.
Maybe the trace will send you to another route, like a just-in-time reply from a CRL. Who knows? That’s the cool thing about what we do-each issue is different. But you can apply these techniques, tools, steps to whatever similar issue comes across your desk and maybe that’ll help get things moving smoothly for you.
Happy New Year!
This article reminds me of a problem with a custom Visual Basic to SQL server issue we have had from the last 6 years.
The first time you do a lookup there is a 20 second delay. After that similar lookups are almost instantaneous. If you go into an area of the application that uses different tables (i.e. you go from Sales Orders to Accounts Payable) you will see the delay again on the first lookup in that area and then things are quick. A long period of being idle can also cause the 20 second delay to happen.
It used to be a 60 second delay but several hardware changes have dropped it down. Changes like dedicated servers for domain controller, file server, dns, gigabit lan, faster SQL Server, faster switches etc. but the initial problem still exists. We have also configured clients to only use tcp/ip to the sql server.
The Company that wrote the application have not been alot of help even though they have re-written alot of code, stored procedures and built different indexes to try and help.
Throwing hardware at it has helped some but it really hasn't solved the problem.
I think you're first step is to get an unfiltered network trace of the problem occuring.
Once you have that you can filter for SSL if using Ethereal or Wireshark. There will be a time index in the filtered packets that will show when the gap of time is happening. It should also give an idea of which computer is giving the gap (like, is the client waiting for that period for responding?).
Be sure to remove the filter and look at that traffic again after though-it may reveal some out- or in-bound traffic which may help give you an idea. Like, is the client checking something in that interval?
That's a start. Let us know what you see...
A while back I posted about troubleshooting a problem where a customer had seen a home grown application