Microsoft.com Operations

We are the operations team that runs the Microsoft.com sites.

What-Ya-Got-Too-Much-of Stew


Many stories by the popular author Patrick McManus contain references to a mysterious concoction, “Whatchagot Stew”. The recipe for this stew is summarized as being whatever people have on hand, tossed together and boiled for some period of time. The name is derived from how you start cooking it: someone asks “What’s for dinner?” and the reply is “I don’t know, what-cha-got?”. It is an unfortunate fact that many websites in the world (including Microsoft.com) were and are produced in a similar manner. Of course, this was never the goal at the start. No, people always have lofty goals of pristine sites that stay perfectly up to date because it is so easy to move customers on to the cool new stuff. The reality is that people don’t move, and content with a plan for end-of-life is an impossible dream for most web sites. While the approach of “letting sites grow by only adding and never removing” simplifies many of the design and implementation requirements, it also tends to produce a system that is nigh impossible to keep running. These systems usually prove to be difficult to understand, debug, and improve from an “abilities” point of view (reliability, availability, performance-ability... ok, that is a bit of a stretch, but you get the idea).

 

Though system administrators everywhere should be pushing for engineering excellence in the content exposed to customers (i.e., end-of-life plans for all content), we need to realize that sometimes you just have to eat what is in front of you. Since the website stew as a whole is nearly impossible to handle, the real question is: how do you break the content up into easily digestible (debug-able) pieces? One of the rules of thumb to rely on is “Worst is First”. That means you determine what has the worst impact on your system, fix that, and then find the new worst thing (note that sometimes the same thing is still the worst and needs to be worked on again).

  

So, what to do first? 

 

One of the easiest ways to start an investigation on the impact of pages to the server is to use Event Tracing for Windows (ETW).  If you don’t know much about it, then you should go watch the great presentation Chris St. Amand did during the Debug Technet Week.  He goes into much more depth about how to use ETW, as well as giving other uses.  

 

For the use related to the topic of performance analysis, there are just a few easy steps and happily they were covered in Chris’ presentation (click here for the .zip file).  We start by creating a file to contain the definitions for what we want to trace.  Chris named it iistrace.guid and the content of the file is:

 

     {1fbecc45-c060-4e7c-8a0e-0dbd6116181b} 0xFFFFFFFF 5 IIS: SSL Filter

     {3a2a4e84-4c21-4981-ae10-3fda0d9b0f83} 0xFFFFFFFE 5 IIS: WWW Server

     {06b94d9a-b15e-456e-a4ef-37c984a2cb4b} 0xFFFFFFFF 5 IIS: Active Server Pages (ASP)

     {dd5ef90a-6398-47a4-ad34-4dcecdef795f} 0xFFFFFFFF 5 HTTP Service Trace

     {a1c2040e-8840-4c31-ba11-9871031a19ea} 0xFFFFFFFF 5 IIS: WWW ISAPI Extension

     {AFF081FE-0247-4275-9C4E-021F3DC1DA35} 0xFFFFFFFF 5 ASP.NET Events

We then use startlogiis.bat which is a simple wrapper around logman – the system tool that manages tracing as well as automated performance counter collection.  The contents of startlogiis.bat are:

logman start "NT Kernel Logger" -p "Windows Kernel Trace" (process,thread,disk) -ct perf -o krnl.etl -ets

logman start "IIS Trace" -pf iistrace.guid -ct perf -o iis.etl -ets

 

While we are only investigating IIS-related pages in this case, we enable the kernel tracing as well so that we can compare system resource usage, which is stored in the kernel trace.

 

After letting the server take load for a reasonable amount of time (a few minutes to a few hours – depending on the size of files you want to deal with), we need to stop tracing.  Again Chris gave us a nice package for doing so in stoplogiis.bat:

 

logman stop "IIS Trace" -ets

logman stop "NT Kernel Logger" -ets

 

Now that we have produced the .etl files, we can go ahead and produce a pretty view of the current server performance using the tracerpt command:

 

            tracerpt iis.etl krnl.etl -o output.csv -report report.htm -summary -f html

 

This is the first place where I didn’t just use the good stuff provided by Chris. In his example, he used the text output format, which I find a little harder to read than the pretty HTML format. The preference is really up to you. If you want to see the text format, simply leave off the last parameter (-f html) and give a different file name (or no file name) to -report. Note that workload.bat does this for you. If you want to automate this process, then the XML output might be what you are looking for.
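For the automation case, a minimal Python sketch might assemble the same tracerpt command line with an XML report instead of HTML. The function name and the report.xml file name are made up for illustration, and the sketch assumes that the tracerpt on your server accepts "-f xml"; check "tracerpt /?" before relying on it.

```python
import subprocess


def build_tracerpt_command(etl_files, report_path="report.xml", report_format="xml"):
    """Return the tracerpt argument list from this post with a chosen report format."""
    return (
        ["tracerpt"]
        + list(etl_files)
        + ["-o", "output.csv", "-report", report_path, "-summary", "-f", report_format]
    )


if __name__ == "__main__":
    command = build_tracerpt_command(["iis.etl", "krnl.etl"])
    # Show the command line that would be run.
    print(" ".join(command))
    # On the IIS server itself, uncomment the next line to produce the report:
    # subprocess.run(command, check=True)
```

Keeping the command as a list rather than one string avoids quoting problems if a trace file name ever contains spaces.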

 

It should also be noted that we have been working only with tools found in the box. If you want to make analysis even easier, you can get Server Performance Advisor (SPA), which will turn this into button clicks.

 

Stir the Pot

 

Now that we have an HTML file with our performance data, it is time to stir the pot and get the stew bubbling. By this I mean we need to see which pages are more costly. It is actually pretty simple: just scroll down until you see the section titled “URLs with the Most CPU Usage”. You should see the full URL and be able to go talk to the owner and ask for an improvement.

 

You may also want to take special note of URLs that come up high in the Most CPU Usage section but don’t appear too high in the “Most Requested URLs” section. These are good candidates for some quick performance fixes that will impact your site. Also of interest would be the Slowest URLs and the Most Bytes Sent, though these fall more into the spice of the Performance Stew (the end user's perception of performance), which will be the topic of our next performance blog entry.
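The comparison between the two sections can be sketched in a few lines of Python. This is only an illustration: the URLs below are hypothetical and are assumed to have been copied by hand from the two report sections, and the function name is made up. Paths are compared case-insensitively, since IIS normally treats them that way.

```python
def cpu_heavy_but_rarely_requested(top_cpu_urls, most_requested_urls):
    """Return URLs from the CPU list that are absent from the most-requested list.

    The order of the CPU list is preserved, so the costliest URL comes first.
    """
    requested = {url.lower() for url in most_requested_urls}
    return [url for url in top_cpu_urls if url.lower() not in requested]


# Hypothetical URLs, not taken from any real report.
top_cpu = ["/search/results.aspx", "/default.aspx", "/reports/yearly.asp"]
most_requested = ["/default.aspx", "/images/logo.gif", "/search/results.aspx"]

print(cpu_heavy_but_rarely_requested(top_cpu, most_requested))
```

With the sample lists above, only /reports/yearly.asp is left over: it costs CPU without being popular, which makes it a candidate for a quick fix.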

  

Check the Ingredients

 

One of the dangers of using ETW analysis on the live site is that some of the pages that hurt performance the most might not be hit during the trace. Or they might not be used in the way that makes them consume the most resources. One way to get better data is to use ETW tracing in the testing environment while under known load. That way you can focus on different URLs or applications rather than on the site as a whole. It does mean setting up stress and test work, but it could be well worth your while if you can squeeze a few percentage points off of your CPU usage number. As an example, we recently noticed that we had some heavily used pages that were wasting about 6% of our CPU handling something they shouldn’t be handling. One configuration change later (not even a code change) and we saw the average CPU usage drop by 5-6 points. Very nice to have that CPU back for the next wave of applications.

 

Also, if you use the approach I outlined above, you only get the top ten offenders for each of the different categories. For example, you only get the top ten CPU users. However, if you use SPA, you can get a much larger data set (up to 100, I believe). This is important, since what you want to do is look for pages that use a lot of CPU but aren’t necessarily on the most-hit list.
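Once you have a larger data set, one possible way to surface such pages is to rank URLs by CPU cost per request. The Python sketch below assumes you have somehow gathered rows of URL, CPU percentage, and request count; that row layout, the function name, and the numbers are all hypothetical and do not reflect any particular SPA export format.

```python
def rank_by_cpu_per_request(rows):
    """Sort (url, cpu_percent, request_count) rows by CPU percent per request.

    The most expensive URL per request comes first. Rows with no requests are
    skipped to avoid dividing by zero.
    """
    scored = [
        (url, cpu_percent / requests)
        for url, cpu_percent, requests in rows
        if requests > 0
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical numbers, not taken from any real trace.
rows = [
    ("/default.aspx", 12.0, 60000),
    ("/reports/yearly.asp", 6.0, 300),
    ("/search/results.aspx", 9.0, 9000),
]

for url, cost in rank_by_cpu_per_request(rows):
    print(f"{url}: {cost:.4f} CPU% per request")
```

In this made-up sample the home page uses the most CPU overall, yet /reports/yearly.asp is by far the most expensive per request, which is the kind of page a top-ten list sorted by hits would hide.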

  

Go ahead, taste some

 

Unlike true Whatchagot Stew, you shouldn’t gag trying to use these tools. In fact, the work should go down nice and smooth. And with a little practice, you should be able to make little bite-sized improvements in the performance of your site on a regular basis.

Comments
  • If you're copy/pasting, please note that there's a typo (an emdash used instead of a hyphen) in one of the commands in this article.  The correct string is:

    tracerpt iis.etl krnl.etl -o output.csv -report report.htm -summary -f html
