Real Case: IIS and Asp.Net hanging or poor performance (intermediary)
This week I'm working on this case related to IIS6 server hosting an Asp.Net application (it's a C# app) whose the performance is poor even when a very low number of users is hitting it. The architecture is the classical web layer accessing the backend DB servers and my goal is to isolate the problem, i.e pinpoint where the problem resides and try to find out why.
The first step after understanding the environment and being aware of all relevant servers and infrastructure devices, is to actually see the problem happening to have a better understanding of what is going on. We had one single user accessing the application and there are indeed some actions the user takes on the pages that take a long time to return, so the next step is to establish a baseline and agree on a SLA for the performance the application is supposed to have.
Another very important relevant point to this scenario is that since the environment counts on a NLB composed by multiple IIS servers, we need first to isolate which IIS server the client will connect to when we start trying to repro the problem. So after checking their NLB configuration and making sure they have the affinity set to single we can just open the IIS logs of the servers and search the client IP at the c-ip field. Since the affinity is single the server which has been serving the client will continue to be the same.
Once we know which server we will work with during the tests, my favorite tool to create the baseline is the System Monitor (perfrmon.exe) and we can also make some basic analysis of the IIS logs. So I set the perfmon with some specific counters that will help me to establish a baseline on how their app is currently behaving and also made sure they have the IIS log fields I need to work with when establishing the baseline.
The counters I used in the perfmon were the following:
|
ASP.NET v1.1.4322\Applications Running ASP.NET v1.1.4322\Request Execution Time ASP.NET v1.1.4322\Request Wait Time ASP.NET v1.1.4322\Requests Current ASP.NET v1.1.4322\Requests Queued ASP.NET v1.1.4322\State Server Sessions Active ASP.NET v1.1.4322\Worker Process Restarts ASP.NET v1.1.4322\Worker Processes Running
|
The fields I needed to make sure are being included in the IIS logs are the following:
|
c-bytes
s-bytes
cs-uri-stem
cs-uri-query
time-taken
|
After that we reproduced the problem, which by the way was really happening with only one user accessing it which already eliminates any hypothesis of the problem being related to the amount of load hitting the server, we started working on the baseline.
The counters in perfmon pretty much confirm what we see in the IIS logs: Their main application page takes an average time longer than 40 seconds to execute. The perfmon counter “ASP.NET v.1.1.4322\Request Execution Time” will report that as well as the IIS logs if you use the LogParser tool and run the following query against them:
First we need to consolidate all the IIS logs in a single CSV file:
|
LogParser "select date,time,cs-uri-stem,cs-uri-query,sc-status,sc-win32-status,time-taken,sc-bytes,cs-bytes from ex061024.log, ex061025.log to merged.csv" -o:CSV
|
Now we run the following query to get the average time-taken for the URLs in the logs and send the output to a new CSV file named AvgTimeTaken.txt
|
logparser "select cs-uri-stem,count(cs-uri-stem), avg(time-taken) from merged.csv to AvgTimeTaken.txt where time-taken>=10000 GROUP BY cs-uri-stem ORDER BY avg(time-taken) DESC" -o:CSV -i:CSV |
The result is as following:
|
cs-uri-stem |
COUNT(ALL cs-uri-stem) |
AVG(ALL time-taken) |
|
/App/MainPage.aspx |
15 |
48353 |
|
|
/App/MainPage.aspx |
3 |
37738 |
|
Ok, now that we see the problem happening and have our baseline it’s time to start working on isolating problem.
We don’t know what exactly the user does in order to have the mainpage.aspx taking so long to execute, so what we do is try to reproduce the problem again and keep following the perfmon counter “ASP.NET v1.1.4322\Requests Current”. This method of tracking the “Requests Current” counter will not work for all scenarios, since when you have the server under some load you will likely not be able to tell when a specific request is taking long or not because there will be many requests at the same time and they will keep changing the value of “Requests Current”. For this cases we would need to have the collaboration of the user (which would tell us when the page is taking long to return) or look at other counters like the “Request Queued”. However for this specific scenario where we have the variables under control and there is a single user testing the system for us, it will work just fine.
When the “Requests Current” shows 1 request and don’t get back to zero after, let’s say, 10 seconds, that’s the point where we know this request is taking too long to execute. Remember that our baseline is around 40 seconds… Now it’s right time to generate a hung dump of the worker process. To do that we first run the iisapp.vbs script in order to identify the PID of the worker process hosting our application pool and after that we can use any tool to simply generate multiple hang dumps (like Debugging Tools For Windows or Debug Diagnostics Tool (DebugDiag)) some seconds apart from each other. In this case I used the adplus.vbs from Debugging Tools For Windows with the following sintaxe:
|
Adplus.vbs –hang –p 2364 –quiet –o c:\dumps
|
This will generate a memory dump of the w3wp.exe whose the PID is 2364 and store it at the c:\dumps folder.
The dumps analysis will require just some basic .Net debugging steps (I used the Windbg from Debugging Tools For Windows to analyze them) as below:
This is a managed application (it’s aspx) so we want to make sure we have, besides the proper symbols, the sos.dll extension properly loaded:
|
0:000> .chain
Extension DLL chain:
aspvbscriptext.dll: API 1.0.0, built Fri Feb 06 18:20:44 2004
[path: C:\debuggers\aspvbscriptext.dll]
mdacexts.dll: image 2004.05.11.1, API 1.0.0, built Tue May 11 17:46:12 2004
[path: C:\debuggers\mdacexts.dll]
webdbg: API 3.5.5, built Thu Nov 15 18:26:28 2001
[path: C:\debuggers\webdbg.dll]
sieext: API 1.0.0, built Fri Nov 05 16:07:40 2004
[path: C:\debuggers\winext\sieext.dll]
dbghelp: image 6.6.0005.0, API 6.0.6, built Fri Apr 14 17:03:13 2006
[path: C:\debuggers\dbghelp.dll]
ext: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:35 2006
[path: C:\debuggers\winext\ext.dll]
exts: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:30 2006
[path: C:\debuggers\WINXP\exts.dll]
uext: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:37 2006
[path: C:\debuggers\winext\uext.dll]
ntsdexts: image 6.0.5357.0, API 1.0.0, built Fri Apr 14 17:21:31 2006
[path: C:\debuggers\WINXP\ntsdexts.dll]
clr10\sos: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:47 2006
[path: C:\debuggers\clr10\sos.dll]
|
Ok, so once we know we have the right extension loaded, let’s start finding which threads are being used to process the Asp.Net mainpage.aspx. We expect to see only one thread here since our user is simply accessing this single page. By looking at the first dump we see the following:
|
0:000> !aspxpages
Loaded Son of Strike data table version 5 from "C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\mscorsvr.dll"
Going to print all threads with ASP.NET Pages running.
Thread 11: /App/Mainpage.aspx
|
Ok, as we suspected there is only one thread processing the Asp.Net request and it’s the thread 11. Now let’s look on what this thread is currently doing:
|
0:011> !clrstack Thread 11 ESP EIP 0x0604f4dc 0x7ffe0304 [FRAME: NDirectMethodFrameStandalone] [DEFAULT] UI2 System.Data.Common.UnsafeNativeMethods/Dbnetlib.ConnectionRead(ValueClass System.Runtime.InteropServices.HandleRef,SZArray UI1,UI2,UI2,UI2,ByRef I) 0x0604f4f8 0x069cf5e4 [DEFAULT] [hasThis] Void System.Data.SqlClient.TdsParser.ReadNetlib(I4) ß it seems we have some SQL stuff on the stack… 0x0604f558 0x069cf531 [DEFAULT] [hasThis] Void System.Data.SqlClient.TdsParser.ReadBuffer() 0x0604f560 0x069cf463 [DEFAULT] [hasThis] UI1 System.Data.SqlClient.TdsParser.ReadByte() 0x0604f568 0x069ced7a [DEFAULT] [hasThis] Boolean System.Data.SqlClient.TdsParser.Run(ValueClass System.Data.SqlClient.RunBehavior,Class System.Data.SqlClient.SqlCommand,Class System.Data.SqlClient.SqlDataReader) 0x0604f5b0 0x06aa3c79 [DEFAULT] [hasThis] Void System.Data.SqlClient.SqlDataReader.ConsumeMetaData() 0x0604f5c4 0x06aa3c0e [DEFAULT] [hasThis] SZArray Class System.Data.SqlClient._SqlMetaData System.Data.SqlClient.SqlDataReader.get_MetaData() 0x0604f5cc 0x06aa1be7 [DEFAULT] [hasThis] Class System.Data.SqlClient.SqlDataReader System.Data.SqlClient.SqlCommand.ExecuteReader(ValueClass System.Data.CommandBehavior,ValueClass
<cut for brevity>
|
So, the managed stack for the thread 11 seems to be waiting on SQL connection. The native stack would be at a ntdll!NtWaitingForSingleObject() function which basically means this thread is doing nothing but waiting. So let’s examine a little further to see if we can identify which results are we waiting from the SQL Server:
|
0:011> !dso Thread 11 ESP/REG Object Name eax 0x10073b34 System.Collections.Hashtable ecx 0x10073b34 System.Collections.Hashtable 0x604f4c8 0x1801dad4 System.Byte[] 0x604f4d0 0x28061840 System.Data.SqlClient.TdsParser 0x604f4d4 0x1801dad4 System.Byte[] 0x604f4f8 0x28061840 System.Data.SqlClient.TdsParser 0x604f500 0x28090704 System.Data.SqlClient.SqlDataReader 0x604f504 0x28061840 System.Data.SqlClient.TdsParser 0x604f51c 0x28061840 System.Data.SqlClient.TdsParser 0x604f54c 0x2808fd28 System.Data.SqlClient.SqlCommand ß this is what we’re looking for
<cut for brevity>
|
Let’s taker a closer look at this specific object:
|
0:011> !do 0x2808fd28 Name: System.Data.SqlClient.SqlCommand MethodTable 0x06854dd8 EEClass 0x0657fb60 Size 64(0x40) bytes GC Generation: 0 mdToken: 0x02000194 (c:\windows\assembly\gac\system.data\1.0.5000.0__b77a5c561934e089\system.data.dll) FieldDesc*: 0x068545c4 MT Field Offset Type Attr Value Name 0x79b80d44 0x40000b6 0x4 CLASS instance 0x00000000 __identity 0x0649824c 0x4000502 0x8 CLASS instance 0x00000000 site 0x0649824c 0x4000503 0xc CLASS instance 0x00000000 events 0x0649824c 0x4000501 0 CLASS shared static EventDisposed >> Domain:Value 0x00099230:NotInit 0x000dfad8:0x100144d8 << 0x06854dd8 0x4000eac 0x10 CLASS instance 0x10032b90 cmdText ß this string seems promising 0x06854dd8 0x4000ead 0x24 System.Int32 instance 4 cmdType 0x06854dd8 0x4000eae 0x28 System.Int32 instance 3600 _timeout 0x06854dd8 0x4000eaf 0x2c System.Int32 instance 3 _updatedRowSource 0x06854dd8 0x4000eb0 0x38 System.Boolean instance 0 _inPrepare 0x06854dd8 0x4000eb1 0x30 System.Int32 instance -1 _handle 0x06854dd8 0x4000eb2 0x14 CLASS instance 0x28090450 _parameters 0x06854dd8 0x4000eb3 0x18 CLASS instance 0x2808f624 _activeConnection 0x06854dd8 0x4000eb4 0x39 System.Boolean instance 0 _dirty 0x06854dd8 0x4000eb5 0x3a System.Byte instance 0 _execType 0x06854dd8 0x4000eb6 0x1c CLASS instance 0x00000000 _cachedMetaData 0x06854dd8 0x4000eb7 0x34 System.Int32 instance -1 _rowsAffected 0x06854dd8 0x4000eb8 0x3b System.Boolean instance 0 designTimeVisible 0x06854dd8 0x4000eb9 0x20 CLASS instance 0x00000000 _transaction
|
Looking at such an instance:
|
0:011> !do 0x10032b90 Name: System.String MethodTable 0x79b7daf8 EEClass 0x79b7de44 Size 80(0x50) bytes GC Generation: 2 mdToken: 0x0200000f (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll) String: dbo.[spGenerateOptions] ß Bingo!!! This is the stored procedure we’ve been waiting on to return FieldDesc*: 0x79b7dea8 MT Field Offset Type Attr Value Name 0x79b7daf8 0x4000013 0x4 System.Int32 instance 31 m_arrayLength 0x79b7daf8 0x4000014 0x8 System.Int32 instance 30 m_stringLength 0x79b7daf8 0x4000015 0xc System.Char instance 0x64 m_firstChar 0x79b7daf8 0x4000016 0 CLASS shared static Empty >> Domain:Value 0x00099230:0x28000224 0x000dfad8:0x28000224 << 0x79b7daf8 0x4000017 0x4 CLASS shared static WhitespaceChars >> Domain:Value 0x00099230:0x28000238 0x000dfad8:0x2000364c <<
|
So we know that at the point this dump was taken we were just waiting at the stored procedure spGenerateOptions to complete it’s execution at a SQL Server.
We are going now to look at the subsequent dump (taken 30 seconds later) to verify whether we have any progress or not on that thread, however let’s take a loot at one last thing on this dump before starting with the other one (this can save us a lot of time here):
|
0:011> !runaway
User Mode Time
Thread Time
13:d5c 0 days 0:00:37.500
14:c00 0 days 0:00:24.718
16:9c8 0 days 0:00:16.406
15:9e8 0 days 0:00:13.562
11:bb8 0 days 0:00:12.015 ß This is the CPU time spent by the thread 11 at this point
|
So we know that the thread 11 have had the CPU for the aprox. 12.015 seconds when this dump was generated. So after opening the second dump, before jumping on any deeper analysis, lets just run the same command and verify if the thread 11 made any progress.
|
0:011> !runaway
User Mode Time
Thread Time
13:d5c 0 days 0:00:37.500
14:c00 0 days 0:00:24.718
16:9c8 0 days 0:00:16.406
15:9e8 0 days 0:00:13.562
11:bb8 0 days 0:00:12.015 ß The time is exactly the same 30 seconds later
|
Ok, so there is not a single additional millisecond being spent by the thread 11 on the second dump. At this point we don’t even need to take a look at the thread stack to know it’s exactly the same. In other words this thread has been doing nothing but waiting on that stored procedure to execute at the SQL Server.
The problem has been isolated as a database performance issue (mission accomplished!!! J). The next step is work on the database server (running in another box) to understand why this sp is taking too long to execute and, of course, properly addressing the issue to provide a definitive solution. IIS ended up just being the server where the symptom is faced and not the one where the root cause of the problem is really happening.