Debugging slow response times in SQL Server 2008

Introduction

The other day I came across a nice article written by Mario Broodbakker called "SQL Server Wait Events: Taking the Guesswork out of Performance Profiling" (see references) on how to troubleshoot slow response times in SQL Server 2000 and 2005 using the YAPP Method. According to the author, one of the biggest challenges to apply this method in SQL Server is that fact that the wait statistics are collected and exposed at the server level (server wide):

... "This is simply not enough, unless you have the luxury of testing your application and SQL statements on an isolated server" ..." But, of course, it's often hard or impossible to isolate and replicate production problems on an isolated test server." ...

I decided to write this because in SQL Server 2008 the wait statistics are available at virtually any level, and (at the time of writing) I couldn't find any articles on the new sqlos.wait_info and sqlos.wait_info_external extended events.

YAPP Method

The YAPP Method: "Yet Another Performance Profiling Method" is based on a paper written by Anjo Kolk (see references). The basic idea of the method can be expressed with the following formula:

(1)    Response time = service time + wait time

This method for performance troubleshooting has been around for some time and it is a fairly standard practice when debugging slow response times.

You can get the response time and the service time of a query by looking at sys.dm_exec_sessions, columns total_elapsed_time and cpu_time. Alternatively, you can use SQL trace or the more lightweight extended events in SQL Server 2008. Knowing response time and service time and applying formula (1) , gives you wait time. So, we know how much we waited. However, in order to get the individual wait types we need the new extended events.

Per session wait statistics in SQL Server 2008

In SQL Server 2008 you can get wait statistics for a given session or statement by leveraging the new extended events. Optionally, you can limit the collection to a small set of wait types or collect only waits exceeding certain duration. If your event session is written carefully, you can use this technique on production systems with very little or no performance impact. In fact, SQL Server 2008 ships with a built-in event session, which is collecting wait statistics, and is running by default out of the box.

Here is how to create an event session to collect wait types for session 51. The events will go to the specified trace file:

create event session session_waits on server

add event sqlos.wait_info

(WHERE sqlserver.session_id=51 and duration>0)

, add event sqlos.wait_info_external

(WHERE sqlserver.session_id=51 and duration>0)

add target package0.asynchronous_file_target

      (SET filename=N'c:\wait_stats.xel', metadatafile=N'c:\wait_stats.xem');

 

Here is how to start the event session, run your workload, and then stop the event session:

alter event session session_waits on server state = start;

exec sp_workload;

alter event session session_waits on server state = stop;

 

Here is how to read the trace file:

select * from from sys.fn_xe_file_target_read_file

      ('c:\wait_stats*.xel', 'c:\wait_stats*.xem', null, null)

 

The result is a list of wait events, representing individual waits, having wait type and duration column. In order to make better sense out of this we need to group the events by wait type and calculate the total duration per wait type. The following code can do this:

create view dbo.read_xe_file as

select object_name as event, CONVERT(xml, event_data) as data

from sys.fn_xe_file_target_read_file

('c:\wait_stats*.xel', 'c:\wait_stats*.xem', null, null)

go

create view dbo.xe_file_table as

select

      event

      , data.value('(/event/data/text)[1]','nvarchar(50)') as 'wait_type'

      , data.value('(/event/data/value)[3]','int') as 'duration'

      , data.value('(/event/data/value)[6]','int') as 'signal_duration'

from dbo.read_xe_file

go

 

select

wait_type

, sum(duration) as 'total_duration'

, sum(signal_duration) as 'total_signal_duration'

from dbo.xe_file_table

group by wait_type

order by sum(duration) desc

go

 

Example

Hopefully this example will demonstrate the power of sqlos.wait_info and sqlos.wait_info_external events. I did some measurements by executing the following workload from two connections:

create proc #sp_workload

      as

begin

      SET nocount ON

      CREATE TABLE #t1000k (c2 char(1020));

      BEGIN TRANSACTION

      DECLARE @rows INT = 50000;

      DECLARE @row INT = 0;

      DECLARE @count INT = 0;

      WHILE @row < @rows

      BEGIN

            INSERT INTO #t1000k (c2) VALUES (REPLICATE('a',1000));

            SELECT @row = @row + 1

            IF @count > 10

            BEGIN

                  COMMIT

                  BEGIN TRANSACTION

                  SELECT @count=0

            END

           

            SELECT @count=@count+1

      END

     

      COMMIT

END

go

While running the workload I had IOMeter (see references) so that I can add some disk pressure.

Results for session 51: Apparently, most of the time was spent waiting for space in the log buffer to store log records - wait type LOGBUFFER. For complete list of all wait types with their descriptions search for sys.dm_os_wait_stats in MSDN or Books Online (see references).

Response time Service time Wait time
32619 1828 30791
wait_type total_duration total_signal_duration
LOGBUFFER 9662 66
PREEMPTIVE_OS_FILEOPS 6442 276
PREEMPTIVE_OS_FLUSHFILEBUFFERS 3833 150
PREEMPTIVE_OS_WRITEFILEGATHER 3015 66
SOS_SCHEDULER_YIELD 2481 2480
WRITE_COMPLETION 1971 0
IO_COMPLETION 1183 0
PAGEIOLATCH_UP 551 0
WRITELOG 417 0
PAGELATCH_UP 30 27
PAGELATCH_SH 10 10
PREEMPTIVE_XE_CALLBACKEXECUTE 4 200

 

Results for session 52: The result for this session is very similar. We can tell that most of the time was spent waiting. After accumulating the durations from all waits we get 30726, which is pretty close to the total wait time calculated using sys.dm_exec_sessions (31484).

Response time Service time Wait time
33890 2406 31484
wait_type total_duration total_signal_duration
LATCH_EX 16555 32
LOGBUFFER 10796 28
SOS_SCHEDULER_YIELD 2739 2729
WRITELOG 605 0
PAGELATCH_UP 26 25
PAGELATCH_SH 5 5

 

For the sake of completeness I also collected server-wide wait statistics, which are also available in previous version of SQL Server. The server-wide statistics clearly shows that we have IO bottleneck however you don't know exactly which session or query was affected and by how much:

 

wait_type wait_time_ms signal_wait_time_ms
LOGBUFFER 21666 127
WRITELOG 17118 4
LATCH_EX 16561 34
PAGEIOLATCH_UP 7891 0
SOS_SCHEDULER_YIELD 7058 7038
PREEMPTIVE_OS_FILEOPS 6443 0
WRITE_COMPLETION 5896 4
SLEEP_BPOOL_FLUSH 3916 278
PREEMPTIVE_OS_FLUSHFILEBUFFERS 3833 0
PREEMPTIVE_OS_WRITEFILEGATHER 3016 0
IO_COMPLETION 1680 0
PREEMPTIVE_OS_REPORTEVENT 141 0
PAGELATCH_UP 139 128
PREEMPTIVE_OS_PIPEOPS 128 0
PREEMPTIVE_XE_TARGETINIT 85 0
XE_BUFFERMGR_ALLPROCESSED_EVENT 70 0
PAGELATCH_SH 23 23
PREEMPTIVE_OS_QUERYREGISTRY 16 0
PREEMPTIVE_XE_CALLBACKEXECUTE 8 0
PAGELATCH_EX 7 7
PREEMPTIVE_XE_SESSIONCOMMIT 5 0
LATCH_SH 1 0
PREEMPTIVE_OS_GETDISKFREESPACE 1 0

 

Complete test code

Here is the complete source code of the test:

use tempdb

go

--

-- Enable xp_cmdshell to delete trace files

--

exec sp_configure 'show advanced options', 1

reconfigure

exec sp_configure 'xp_cmdshell', 1

reconfigure

go

BEGIN TRY

      drop proc #sp_workload

END TRY BEGIN CATCH END CATCH

go

create proc #sp_workload

      as

begin

      SET nocount ON

      CREATE TABLE #t1000k (c2 char(1020));

      BEGIN TRANSACTION

      DECLARE @rows INT = 50000;

      DECLARE @row INT = 0;

      DECLARE @count INT = 0;

      WHILE @row < @rows

      BEGIN

            INSERT INTO #t1000k (c2) VALUES (REPLICATE('a',1000));

            SELECT @row = @row + 1

            IF @count > 10

            BEGIN

                  COMMIT

                  BEGIN TRANSACTION

                  SELECT @count=0

            END

           

            SELECT @count=@count+1

      END

     

      COMMIT

END

go

BEGIN TRY

      drop event session session_waits on server

END TRY BEGIN CATCH END CATCH

go

exec xp_cmdshell 'del c:\wait_stats*.xel', no_output;

exec xp_cmdshell 'del c:\wait_stats*.xem', no_output;

go

declare @cmd nvarchar(max);

set @cmd = N'create event session session_waits on server ' +

      N'add event sqlos.wait_info (WHERE sqlserver.session_id=' + convert(nvarchar(10), @@spid) + N' and duration>0) ' +

      N', add event sqlos.wait_info_external (WHERE sqlserver.session_id=' + convert(nvarchar(10), @@spid) + N' and duration>0) ' +

      N'add target package0.asynchronous_file_target (SET filename=N''c:\wait_stats.xel'', metadatafile=N''c:\wait_stats.xem'')';

exec(@cmd);

go

BEGIN TRY

      drop table #snapshot

END TRY BEGIN CATCH END CATCH

go

select total_elapsed_time, cpu_time into #snapshot from sys.dm_exec_sessions where session_id=@@SPID;

go

alter event session session_waits on server state = start;

exec #sp_workload;

alter event session session_waits on server state = stop;

go

select

      (es.total_elapsed_time - ss.total_elapsed_time) as 'Response time',

      (es.cpu_time - ss.cpu_time) as 'Service time',

      ((es.total_elapsed_time - ss.total_elapsed_time) - (es.cpu_time - ss.cpu_time)) as 'Wait time'

      from sys.dm_exec_sessions es, #snapshot ss where session_id=@@SPID;

go

drop event session session_waits on server

go

BEGIN TRY

      drop view dbo.read_xe_file

END TRY BEGIN CATCH END CATCH

go

create view dbo.read_xe_file as

select object_name as event, CONVERT(xml, event_data) as data

from sys.fn_xe_file_target_read_file('c:\wait_stats*.xel', 'c:\wait_stats*.xem', null, null)

go

BEGIN TRY

      drop view dbo.xe_file_table

END TRY BEGIN CATCH END CATCH

go

create view dbo.xe_file_table as

select

      event

      , data.value('(/event/data/text)[1]','nvarchar(50)') as 'wait_type'

      , data.value('(/event/data/value)[3]','int') as 'duration'

      , data.value('(/event/data/value)[6]','int') as 'signal_duration'

from dbo.read_xe_file

go

select

      wait_type

      , sum(duration) as 'total_duration'

      , sum(signal_duration) as 'total_signal_duration'

from dbo.xe_file_table

group by wait_type

order by sum(duration) desc

go

--

-- CLEANUP

--

exec xp_cmdshell 'del c:\wait_stats*.xel', no_output;

exec xp_cmdshell 'del c:\wait_stats*.xem', no_output;

drop view dbo.xe_file_table;

drop view dbo.read_xe_file;

drop proc #sp_workload;

 

References

Sys.dm_os_wait_stats

The following MSDN article shows all wait types with their description: https://msdn.microsoft.com/en-us/library/ms179984.aspx

IOMeter

IOMeter is an IO stress and measurement tool developed by Intel. It is now an Open Source tool, and freely downloadable from:

https://www.iometer.org

SQL Server Wait Events: Taking the Guesswork out of Performance Profiling

A very nice article by Mario Broodbakker on how to troubleshoot slow response times in SQL Server 2000 and SQL Server 2005:

https://www.simple-talk.com/sql/performance/sql-server-wait-events-taking-the-guesswork-out-of-performance-profiling/

Troubleshooting Performance Problems in SQL Server 2005

Here also has a very good presentation on the subject and describes a lot of SQL Server 2005 wait events here:

https://www.microsoft.com/technet/prodtechnol/sql/2005/tsprfprb.mspx