W3C IIS Log Analysis using Log Parser
07 January 10 09:27 PM | clinth@microsoft.com | 0 Comments   

I was recently on the PowerScripting Podcast where we talked about the PAL tool and the challenges with developing it in PowerShell v2.0 such as dealing with very large PowerShell scripts. While on the podcast some of listeners asked about analyzing IIS W3C logs.

PAL doesn’t analyze IIS W3C logs, but I have quite a few Microsoft Log Parser queries that I use when I do IIS Health Checks – an offering from Microsoft Premier Field Engineering (PFE). This seems like a good time to publish some of them. Keep in mind that many of these originally came from the Microsoft Log Parser help documentation.

The following Microsoft Log Parser queries are the ones I typically use when analyzing IIS W3C logs. Many of these queries need additional information passed into Log Parser to work properly. See the Microsoft Log Parser documentation for more information on how to use these queries.

[Chart] Top 20 hit URLs:
SELECT TOP 20 cs-uri-stem, COUNT(*) AS Hits INTO IISLOG_ANALYSIS_TOP20_HITS.GIF FROM MERGED_LOG.CSV GROUP BY cs-uri-stem ORDER BY Hits DESC

[Chart] Top 10 ASP/ASPX URLs:
SELECT TOP 10 cs-uri-stem, COUNT(*) AS Hits INTO IISLOG_ANALYSIS_TOP10_ASPX_HITS_Date_10-06-2009_14-37-56PM.GIF FROM MERGED_LOG.CSV where cs-uri-stem like '%%.asp' or cs-uri-stem like '%%.aspx' GROUP BY cs-uri-stem ORDER BY Hits DESC

[Chart] Hit Frequency each hour:
SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)), COUNT(*) AS Hit_Frequency INTO IISLOG_ANALYSIS_HIT_FREQ_Date_10-06-2009_14-37-56PM.GIF FROM MERGED_LOG.LOG GROUP BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) ORDER BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) DESC

[Chart] Bytes Per Extension:
SELECT EXTRACT_EXTENSION(cs-uri-stem) AS Extension, MUL(PROPSUM(sc-bytes),100.0) AS Bytes INTO IISLOG_ANALYSIS_BYTES_PER_EXT_Date_10-06-2009_14-37-56PM.GIF FROM MERGED_LOG.CSV GROUP BY Extension ORDER BY Bytes DESC

[Chart] Top 20 Client IP Addresses:
SELECT top 20 c-ip AS Client_IP,count(c-ip) AS Count from MERGED_LOG.CSV to IISLOG_ANALYSIS_TOP20_CLIENT_IP_Date_10-06-2009_14-37-56PM.GIF GROUP BY c-ip ORDER BY count(c-ip) DESC

[Chart] Total Unique Client IP's each hour (Users each hour). This requires two queries to create this chart:
Select TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) as Times, c-ip as ClientIP into IISLOG_ANALYSIS_DIST_CLIENT_IP.LOG from MERGED_LOG.LOG group by Times, ClientIP
Select Times, count(*) as Count from IISLOG_ANALYSIS_DIST_CLIENT_IP.LOG to IISLOG_ANALYSIS_HOURLY_UNIQUE_CIP_Date_10-06-2009_14-37-56PM.GIF group by Times order by Times DESC

Number of Errors:
SELECT cs-uri-stem, sc-status,sc-win32-status,COUNT(cs-uri-stem) from MERGED_LOG.CSV to IISLOG_ANALYSIS_ERROR_COUNT_Date_10-06-2009_14-37-56PM.CSV where sc-status>=400 GROUP BY cs-uri-stem,sc-status,sc-win32-status ORDER BY COUNT(cs-uri-stem) DESC

Error Frequency Each Hour:
SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)), COUNT(*) AS Error_Frequency FROM MERGED_LOG.LOG TO IISLOG_ANALYSIS_ERROR_FREQ_Date_10-06-2009_14-37-56PM.GIF WHERE sc-status >= 500 GROUP BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) ORDER BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) DESC

Status Code Percentages:
SELECT sc-status, COUNT(*) AS Times from MERGED_LOG.CSV to IISLOG_ANALYSIS_STATUS_CODE_Date_10-06-2009_14-37-56PM.GIF GROUP BY sc-status ORDER BY Times DESC

Top 20 Average Longest Requests:
SELECT top 20 cs-uri-stem,count(cs-uri-stem) As Count,avg(sc-bytes) as sc-bytes,avg(cs-bytes) as cs-bytes,max(time-taken) as Max,min(time-taken) as Min,avg(time-taken) as Avg from MERGED_LOG.CSV to IISLOG_ANALYSIS_TOP20_AVG_LONGEST_Date_10-06-2009_14-37-56PM.CSV GROUP BY cs-uri-stem ORDER BY avg(time-taken) DESC

Average Response Times in Milliseconds Per Hour:
SELECT TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)), avg(time-taken) INTO IISLOG_ANALYSIS_AVG_RESP_TIME.GIF FROM MERGED_LOG.LOG WHERE sc-status=200 AND (cs-uri-stem like '%%.asp' or cs-uri-stem like '%%.aspx') GROUP BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) ORDER BY TO_LOCALTIME(QUANTIZE(TO_TIMESTAMP(date, time),3600)) DESC

About the Author – Clint Huffman
03 December 09 11:32 AM | clinth@microsoft.com | 0 Comments   

 

Clint Huffman Originally from Dayton, OH. Clint joined Microsoft in 1999 supporting web technologies, Microsoft load testing tools, and later worked as a Testing Consultant helping people with load testing at the Microsoft Services Labs. There he found a passion for solving Windows performance issues. In 2006, he joined Microsoft Premier Field Engineering (PFE) team to take on BizTalk performance issues only to find that most issues could be diagnosed by analyzing operating system resources. Some may say that performance analysis of Windows is more of an art than a science. Well, Clint strives to make it as much of a science as possible and to bring it to the masses. Clint is probably best known for the Performance Analysis of Logs (PAL) tool, which simplifies the analysis of performance monitor logs. He is an author of many of the BizTalk performance guides on MSDN and recently spoke at Tech Ed 2008 on BizTalk performance analysis. He currently lives in Marysville, WA which is about 40 miles north of Seattle.

Vital Signs Workshop: Clint teaches the Vital Signs workshop (originally created by Shane Creamer) which is a 3 day workshop that covers Windows architecture and how to identify performance bottlenecks.

Products: Clint primarily supports Microsoft BizTalk Server, Microsoft Internet Information Services (IIS), and scripting (VBScript or PowerShell).

PAL Tool: Clint wrote the free and open source PAL tool which saves you time by analyzing performance counter logs. You can check out the tool at http://pal.codeplex.com

Clint’s Windows Performance Analysis in the Field Blog: Clint blogs about the Windows performance issues he encounters in the field at http://blogs.technet.com/clinth.

Facebook: http://www.facebook.com/clint.huffmanClintHuffmanGamerPic

Twitter: http://twitter.com/clinth

XBOX Live Gamer Tag: ClintHuffman

Email: clinth@microsoft.com 

Clint_Huffman_tag_email_only

Microsoft Tag: Download Clint’s contact information to your phone. Get the free app for you phone at http://gettag.mobi. Works with Windows Mobile, iPhone, Blackberry, and may other camera phones.

PodCasts: Clint is a regular on the RunAs Radio show which is a free internet radio talk show podcast for IT Professionals talking about Windows performance analysis. Use your Zune Marketplace to subscribe to the podcast for free or go to http://www.runasradio.com.

How to engage PFE: Clint is part of the Microsoft Premier Field Engineering (PFE) team at Microsoft. If you would like to use PFE resources (onsite problem solving, training, risk assessments, etc) and if you have a Microsoft Premier Support Agreement, then contact your Microsoft Technical Account Manager (TAM). If you do not have a Microsoft Premier Support Agreement, then go to http://www.microsoft.com/services/microsoftservices/srv_premier.mspx for more info.

The Case of the Mysterious Black Box
18 November 09 06:54 PM | clinth@microsoft.com | 2 Comments   

I haven’t had any performance analysis challenges lately, but there is a lot of confusion as to how to measure SAN performance. To many, a SAN is a proverbial “mysterious black box” that seems to perplex all who try to measure it’s performance with any measure of certainty. This blog entry covers how I measure the performance of SANs and tries to unlock the mysteries of the black box.

SAN hardware and configurations vary greatly between our customers and SAN vendors. Some carve out disk arrays that are stripped across only a few spindles while others stripe the entire disk array across all of the spindles.

Before I go too deep, let’s get some terminology out of the way. This is *my* interpretation of these terms and may differ from the industry.

Disk Terminology

The following terms are used frequently in the disk industry.

Spindle: A physical hard disk. The *real* disk installed in the hardware.

LUN (Line Unit Number): This is the physical disk representation to the Windows operating system – meaning Windows thinks it is a single spindle, but could be many spindles masked by the hardware.

Physical Disk: The same as LUN in regards to Windows.

Logical Disk: A drive letter or mount point mapped to a physical disk.

SAN’s Consolidate and Maximize Your Disk Investment, but…

SAN’s are popular because they consolidate spindles allowing you to maximize your investment. Imagine you have a large file server, a large database, and a few other high end servers all with high end disk arrays dedicated to them. In many cases, one or more of the large, dedicated disk arrays is under utilized and/or one of them is over utilized. The SAN consolidates all of the spindles into a single location and spreads the load across all of them to maximize utilization. While this is a great way to spread the load, it’s a gamble with performance.

SANs Gamble with Disk Performance

Since SAN’s spread the load across all of spindles in many cases, it is taking a gamble that no single spindle will be contested for any length of time. If that event occurs, then it tries to compensate with cache and other technologies. It does a great job of handling this. Unfortunately, the SANs allow the SAN administrator to hand out *many* LUNs from the same disk array. This ends up becoming more of an optimization of disk capacity versus disk performance. In many cases that I have witnessed myself, I have found so much contention that the LUN performance is reduced to the speed of a floppy drive which is about 900ms!

Disk Analysis Basics

When it comes to analyzing the performance of any disk in Windows, our best approach is the simplest one… monitor the disks for response times. Specifically, monitor “\LogicalDisk(*)\Avg Disk sec/Read” and “\LogicalDisk(*)\Avg Disk sec/Write” for values greater than 15ms (0.015) on average. 5400rpm drives are able to respond within 15ms with no cache in front of it. The spindles typically come with a specification showing their seek times. Effectively, if the response times are greater than the seek times, then the spindle is likely falling behind. With that said, there might be a more advanced reason for the high response times. For example, what if the bytes per I/O is 2MBs? That would explain the long response times.

My PAL tool checks and throws warning alerts for response times greater than 15ms and critical alerts for response time greater than 25ms.

Response times measured by “\LogicalDisk(*)\Avg Disk sec/Read” and “\LogicalDisk(*)\Avg Disk sec/Write” is our most authoritative and primary indicator of disk performance. With that said, it can’t tell us the entire story. When you see high disk response times, don’t go running to the SAN administrators(s) with guns a blazing. Work with them on their terms by providing details of your research. This shows that you are trying to help and that you understand their needs.

The Needs of the SAN Administrator

SAN administrators like to have the following details because these are metrics that they can typically do something about. Try to gather this information over a long period of time and have a chart of graph prepared when presenting your findings.

· Disk Read and Write response times which you can get from “\LogicalDisk(*)\Avg Disk sec/Read” and “\LogicalDisk(*)\Avg Disk sec/Write”. For example, Microsoft BizTalk Server database LUNs should get less than 15ms sustained response times for both read and write I/O.

· IOPS (I/O’s per second) which you can get from “\LogicalDisk(*)\Disk Transfers/sec”. This helps with determining how many dedicated spindles are needed to supply the demand for IOPS for the LUN. For example, BizTalk database LUNs should get a least 200 IOPS or more sustained – 500 IOPS or more preferred. Why? Because my 5400rpm USB drive can do 200 IOPS of random write I/O with 5ms response times on average. Why have a SAN if a USB drive can out perform it?

· Bytes per I/O which you can get from “\LogicalDisk(*)\Avg Disk Bytes/Read” and “\LogicalDisk(*)\Avg Disk Bytes/Write”. The size of the I/O’s on average will tell the SAN administrators how to format the blocksize of disk partitions. For example, BizTalk database LUNs typically do about 16K per read or write operation on average with frequent spikes of 64K. This is why we typically recommend formatting the disks to a 64K blocksize. For more information on blocksizes, see “Disk Partition Alignment Best Practices for SQL Server”

· Ratio of Reads to Write which you can get from “\LogicalDisk(*)\Disk Reads/sec” and “\LogicalDisk(*)\Disk Writes/sec”. SAN cache can typically be adjusted for a read/write ratio and this information helps to adjust that setting. For example, BizTalk database LUNs for the MsgBoxDb is about 50% reads and 50% writes while the DTADb is about 99% writes and 1% reads.

· What is using the disks which you can get using Process Monitor (a Mark Russinovich SysInternals tool owned by Microsoft), Microsoft xPerf (part of the Windows Server 2008 Performance Toolkit), or from Resource Monitor which is part of Windows 7. This helps determine if the server is really using the disks or if anti-virus or backup software is really killing them unnecessarily. In any case, it’s important to know if the I/O is mission critical versus routine or unnecessary I/O. See my earlier blog post, “The Case of the Relentless Cookie Monster”, for more information on how to use the Process Monitor tool for disk analysis.

Once you have the information above and present it to the SAN administrators, they should be able to configure the SAN to meet these needs.

Conclusion

SANs are critical assets to disk management, but it seems that many disks presented to mission critical systems just can’t perform when the SAN is over saturated or not optimized for the kind of I/O the server needs. SAN administrators don’t instinctively know what the needs of your system are, so provide them with the information they need using the counter metrics mentioned above. I hope this helps to remove the veil of the mysterious black box.

Thanks again to Shane Creamer who originally taught me the disk analysis basics.

Windows 7 and Disk Analysis

Oh, and by the way, Windows 7 and Windows Server 2008 R2 have an incredible performance tool called the Resource Monitor. Just go to Task Manager and click the Resource Monitor button. Click the Disk tab and you will see the processes most active with disk I/O and their average response times to each file they are using. Imagine a SQL Server showing average response times to specific MDF and LDF files. This was a feature I asked for and Microsoft listened.

The Case of the 2 Million Context Switches
28 October 09 09:26 PM | clinth@microsoft.com | 0 Comments   

This week I was in Denver, CO handling a high CPU performance issue with Microsoft BizTalk Server 2006 R2. The customer was seeing very high CPU utilization even when BizTalk doesn’t have any real work do to, so this was a bit of a mystery and a challenge. I like challenges. ;-)

Symptoms: High processor utilization over 90% across all 8 processors.

Relevant Configuration

  • Microsoft Windows Server 2003 (x64)
  • Microsoft BizTalk Server 2006 R2 (x64)
  • 2xQuad Core (8 processors) AMD Opteron

First, I verified that the server is in fact using a high amount of CPU by taking performance counter logs of the servers. Sure enough the BizTalk Servers were having high CPU consumption.

CPU

The black block is just removing the server names to protect privacy.

By the way, it’s worth mentioning that the above chart was created using the upcoming release of Performance Analysis of Logs (PAL) v2.0 tool. I’ve been working on it nearly every night for the past year. Look for it to be released soon at Microsoft’s open source web site at http://pal.codeplex.com. Got a performance issue, PAL it!

The next step to diagnosing high processor conditions is to determine the ratio of privileged mode to user more CPU. Threads executing on processor can only be in one of these modes and this is the fork in the road that determines our next step in troubleshooting. The counter “\Processor(?)\% Processor Time” is the sum of “\Processor(?)\% Privileged Time” (also known as kernel time) and “\Processor(?)\% User Time” where the question mark (?) indicates the same counter instance such as _Total and 0.

Using PAL v1.0 (PAL v2.0 doesn’t have % Privileged Time analysis functional yet) to generate chart and alerts on the counters, it showed that the % Privileged Time counter is taking up roughly 40% of the total CPU. Under normal conditions, privileged time shouldn’t be higher than 10% of total CPU, so this is a very high amount of time being spent in the kernel.

ProcessorPercent_Privileged_Time_0

A high amount of kernel mode time typically indicates a high amount if I/O such as network, disk, memory, threads and/or hardware drivers.

PAL did not alert on any disk or network I/O, but it did light up like a Christmas tree on “\System\Context Switches/sec”!

System_Context_Switches_sec

Again, the black box in the image above is removing the server name to protect privacy.

A normal amount of Context Switches/sec is about 1,000 or less. Today’s processors are able to handle high amounts of context switches per second and the chart above (done with PAL v2.0 I’m proud to say) shows over 2 million context switches/sec. This is a testament to my previous statement. In my entire career, I have never seen Context Switches/sec go above 100,000, so this is the most extreme case I have ever seen. So… as they say in Mythbusters… “Now, there’s your problem!”.

According to Performance Monitor, “Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user-mode and privileged (kernel) mode to use an Executive or subsystem service.”

Since I/O is a common reason for threads to context switch into kernel mode, I decided to get a Process Monitor (a SysInternals tool owned by Microsoft) trace on the server. It revealed that the BTSNTSVC64.exe process is reading from the BTSNTSVC64.exe file extremely often. Here is the view of Process Monitor showing the high disk I/O on BTSNTSVC64.exe.

clip_image009

Processor utilization in the BizTalk processes (BTSNTSVC64.exe) correlates with the high CPU and disk and overall these processes are the processes taking up the most CPU on the server.

CPUUsagePerProcess

Multi-threaded applications like BizTalk Server can use up to 100% times the number of processor on the server, so since this server has 8 processors any one process could potentially take up to 800% CPU. This is why we see “\Process(BTSNTSvc64)\% Processor Time” consuming 207% CPU.

At this point, we have proof that BizTalk Server processes are consuming the most CPU and most of that CPU consumption is in privileged mode (kernel mode) CPU. The processes are hitting the BTSNTSvc.exe file extremely often, and the most likely process doing all of the context switches/sec.

What do we do now? The next logical step is to profile the process to determine what it is doing. There are many profilers out there and different techniques. The first we tried is not a dedicated profiler, but works on native (non-.NET code). We tried Process Explorer (a SysInternals tool owned by Microsoft). It showed the BizTalk processes as consuming high CPU. A nice feature is to go to properties on the process and click the Threads tab.

HighBizTalkCPU_BaseThreadStart 

If you have the Debugging Tools for Windows installed, then you will see the threads and the functions that each of the threads are currently working on. In this case, we see all of the threads on Base ThreadStart indicating that there is seems to be a lot of thread creation or the call stack is not correct.

I mentioned the high CPU symptoms to my BizTalk colleagues Todd Uhl and Brian Gregor.  Both of them pointed me to the article below. Thanks again, guys!

FIX: A CPU usage spike occurs on all the BizTalk hosts after you install security bulletin MS07-040 on a server that is running Microsoft BizTalk Server 2006 or Microsoft BizTalk Server 2006 R2
http://support.microsoft.com/?id=943165

After applying this hotfix (943165) on the servers, CPU utilization on the BizTalk servers dramatically came down to about less then 5% compare to nearly 100% CPU consumption. Therefore, this hotfix was the *big fix* and it sounds like it should be applied to ever x64 BizTalk Server 2006 R2 server especially if you are seeing unnaturally high CPU consumption. With that out of the way, we still had one BizTalk server still using up 25% CPU across all 8 processors, so we focused our attention to it next.

We tried running the stand alone Visual Studio Profiler tool to sample the BizTalk process. Unfortunately, we were not able to get it working due to a conflict with a third-party profiler already installed on the server. :-(

Take a look at one my of older blog postings where I profiled IIS and check out my HowTo article on how to use this tool in a production environment.

Finally, the old hammer way which is still useful is to get full process dumps of the high CPU process. You do this by installing the Debugging Tools for Windows (safe to install on production servers) and using the ADPlus tool to create hang dumps. Each dump takes an image of the process and dumps all of it’s memory to disk in a DMP file. This procedure is similar to taking a picture of someone doing work, then assuming that is what they are doing with all of their time. This is why profilers are better because profilers sample the threads periodically similar to having a video camera on the person doing work versus taking pictures here and there.

We took about 5 process dumps of the BizTalk process by opening a command prompt, changing directory to the installation directory of the Debugging Tools for Windows, then running the following command:

CScript ADPlus.vbs –hang –p 6444 –o C:\DebugOutput

This pauses the process and writes all of it’s memory to disk in a DMP file under the directory specified by the “-o” parameter. 6444 is the process ID of the BizTalk process. This typically takes less than 30 seconds to dump the process out and the process is resumed where it left off like nothing happened.

We took about 5 process dumps about 30 seconds a part from each other. The idea is to compare the dumps to get an idea of what the threads are doing when at high CPU. Ideally, we want to see the same call stack many times to establish what the threads are likely doing. For example, if you took a picture of someone jumping, then you wouldn’t want to assume that they are suspended in mid-air all day.

By the way, I’m proud to say that I am good friends with the original author of ADPlus.vbs, Robert Hensing.

The procedure of taking full user dumps to diagnose proved to be very useful because the call stacks on the active threads (threads doing real work) was identical in every dump. We analyzed them by using them up in WinDBG which is part of the Debugging Tools for Windows.

In WinDBG, we set our symbol path to point to Microsoft’s public symbol server using the following path:
SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

Next, we opened up each DMP file and executed the “!runaway” command (without the double quotes) which shows us which threads have consumed the most CPU over the life of the process.

The following output is similar to what we found.
0:000> !runaway
User Mode Time
  Thread       Time
  57:1418      0 days 16:35:51.812
  56:1410      0 days 3:15:13.171
  69:14b0      0 days 0:01:32.546
  68:14ac      0 days 0:01:21.984
  44:f3c       0 days 0:01:12.406
  42:3b4       0 days 0:01:06.562
  70:14b4      0 days 0:01:02.781
  49:135c      0 days 0:00:59.953
  10:11f8      0 days 0:00:53.734
   6:f88       0 days 0:00:52.015
   5:1e4       0 days 0:00:51.500

This shows that thread 57 and 56 are the threads that have consumed the most CPU.

To change to the thread we type in “57s”. This is similar to changing into a directory at the command prompt. After that we typed in “kb” which shows us the call stack of the thread. In this case, the call stack we found looked something like this (modified for simplicity and privacy):

Custom.Method1
Custom.Method3
Custom.Method2
Custom.Method1
Custom.Method3
Custom.Method2
Custom.Method1
BTS.Streaming.MarkableForwardOnlyEventingReadStream.ReadInternal
BTS.Streaming.EventingReadStream.Read
BTS.Component.XmlDasmComp.HasData

In this call stack, we see a repeating pattern of Method1 calling Method2, Method2 calling Method3, and Method3 calling Method1. I am certainly not a very good BizTalk developer, but there is a repeating pattern here. At this point, we discussed our findings with the customer’s BizTalk developers and with luck try to optimize it.

Conclusion

The BizTalk hotfix 943165 had the most significant impact on the CPU consumption on the BizTalk Server and was the big fix in this case. With that said, most high CPU problems require digging into call stacks to identify what the threads are doing most of the time. There are quite a few third party profilers out there that make this process relatively painless, but be careful about using them in production environments and stick to sampling based profilers if you can. They provide better results in production environments.

Lesson Learned

If you are running the x64 version of Microsoft BizTalk Server 2006 R2, then you need this hotfix (943165).

I hope this posting is helpful with showing you how to look at call stacks using WinDBG and remember… all performance problems can be solved, but like city traffic, it just takes time and effort.

The Case of the Add-on Crashers
03 September 09 09:37 PM | clinth@microsoft.com | 2 Comments   

While on "vacation" in Ohio last month, a good friend of mine called me up saying that Internet Explorer is crashing and unusable, so I asked him to bring his computer over for a look.

The Problem/Symptoms

First, I wanted to confirm what he was saying, so I opened up Internet Explorer. Sure enough, it crashed immediately – meaning the process had a fatal, second chance exception and it crashed the process.

Relevant Technical Details

Operating System: Windows XP Home Edition 32-bit
Application(s): Microsoft Internet Explorer 8

Troubleshooting

Internet Explorer has the ability to load third party “add-on'” to it to enhance your experience while browsing with things such as a search engine toolbars like Bing, client-side code execution engines like Java, or specialized button toolbars like Windows Live. Since the core code of Internet Explorer is relatively stable, I suspected Internet Explorer (IE) Add-ons as the cause of the crash.

To prove or disprove IE Add-ons, I navigated through the Programs menu until I came across an icon named Internet Explorer (no add-ons). This time it opened up without crashing proving one of the Add-ons is causing the crash.

To fix this, I edited the Internet Explorer options through the Control Panel and navigated to the Internet Options icon. This brought up the Internet Properties of Internet Explorer.

IE_Options

On this dialog box, I clicked the Manage add-ons button which brought up another menu showing me all of the add-ons that are installed. At this point, it was just a matter of trial and error. I disabled each one at a time and tried to open IE. Once it finally worked without crashing, I knew which one it was. In this case, it was caused by an anti-virus search add-on.

Another way to troubleshoot this would be to get a crash dump of the process. This can be done through tools like Microsoft WinDBG or ADPLus from the Microsoft Debugging Tools for Windows a free download from Microsoft.com or the DebugDiag tool which is also a free download from Microsoft.com. Once a dump file (*.dmp) file was captured from the crashing application, it would just be a matter of walking the stack to see the faulting stack.

In conclusion, if Internet Explorer seems slow, hangs, or crashes, then it’s most likely due to a third party add-on that you installed.

I hope this helps.

The Case of the Enormous Page File
03 September 09 11:57 PM | clinth@microsoft.com | 0 Comments   

Introduction

It seems like every time I deliver the Vital Signs workshop (a Windows architecture course focused on performance analysis patterns) lately, we always have a great discussion about proper sizing of the page file. There is a lot of bad information floating around on the internet about how to size the page file on servers with large amounts of RAM, so this seemed like a great topic to talk about in my blog.

First of all, don’t take my word on this subject – go to the source – David Solomon. David Solomon is one of the people closest to the Windows source code and he specializes in operating systems. In addition he and Mark Russinovich are the authors of the Windows Internals book 5th edition which just released last month which many of us consider to be *the* book on Windows architecture. This edition of the book is focused on Windows Server 2008/Vista and Windows 7.

David Solomon has a very good webcast on Troubleshooting Windows Memory. It’s a 90 minute webcast, but it took me 8 hours to go through it because I had to pause every few seconds and think about what he just said. ;-) This was a very enlightening experience for me because it really brought everything together for me. One of the topics he discusses is proper sizing of the page file. I will try to summarize his teaching in this blog entry and supplement it with a real case.

In addition, Mark Russinovich has a great post on Pushing the Limits of Windows: Virtual Memory which goes deep into the relationship of virtual and physical memory.

Memory States

Memory can be in one of three states: Free, Reserved, or Committed. Free memory is memory that is free for reuse by the operating system. Reserved memory is memory that an application intends on using, but is not using it yet – similar to reserving a table at a restaurant. Committed memory is memory that has data in it and is being used similar to sitting at the table you reserved earlier.

Reserved and Committed Memory

Let’s say you reserved a table for 10 at your favorite restaurant for tomorrow night. Do you think that the restaurant will cordon off the table until your get there? Of course not. They will allow their customers who are there right now to use that table until your party actually arrives.

The memory manager in Windows works the same way with reserved memory where it will not actually allocate any real memory until an application actually needs it. When memory is actually being used (Committed memory) the operating system must do something with that memory, therefore it will either put it in RAM or on the page file.

What is in the Page File?

The page file only contains data that cannot be found elsewhere on disk. For example, if I open up Notepad and do nothing else, the memory being used by Notepad will never be paged to the page file because the EXE and DLL’s needed to load Notepad can be retrieved from the disk at anytime. If you typed the sentence, “The quick brown fox jumps over the lazy dog.” into Notepad, then that text is in RAM right now, but could eventually be paged to the page file because it cannot be found anywhere on the disk. This is why you will never see EXE’s or DLL’s in the page file – only data that cannot be found elsewhere on disk could be put in the page file.

The Commit Limit

The Commit Limit of your computer is the total amount of RAM and the size of all of the page files on the system. In the case of my laptop, I have 4GBs of RAM and a 4GB page file giving my system a Commit Limit of 8GBs. This means that my laptop can handle up to 8GBs of committed memory.

The Commit Charge or Committed Bytes is the total amount of memory used/committed by the operating system and all processes running on the computer.

Taskmanager

Shown here in my Windows 7 x64 Task Manager. Highlighted in red is my laptop’s Commit Charge (2074MBs) and it’s Commit Limit (8058MBs). Since the amount of RAM on my system is 4GBs, my system can handle the 2GBs of committed memory with no problem even without a page file.

If the amount of Commit Charge (also known as Committed Bytes) is more than the amount of physical RAM on the computer, then some of that memory has to be in the page file and frequent hard page faults to the page file can cause the system to be slow. The page file is really just an extension of physical RAM allowing the system to handle more committed memory if needed, but just *much* slower than RAM.

Running with No Page File

A common question is, “Can I run my server without a page file and will I gain a performance boost?”. Yes, you can run a server without a page file. Windows has supported the ability to not have a page file for a very long time. Since the operating system only pages out memory that is hasn’t been accessed in a long time, the performance impact should be minimal unless the Commit Charge is greater than the amount of RAM on the system. If the Commit Charge is less than the amount of RAM on the system, then the only performance gain that you *might* get is snappier responses from applications or services that have been idle for a long time and even then it might trivial.

Without going into too much detail about working set trimming, the operating system will page out “old” memory used by idle processes - “old” meaning the pages of memory that haven’t been used/accessed the longest. For example, if you opened up a large Word document and you are now on page 100, then page 1 has likely been removed from RAM (paged out). If you have not modified page 1, then page 1 would not go to the page file because the operating system can always just read the document file off of the disk if it needs to. This will cause a hard page fault, but not against the page file. See my blog post on, “The Case of the Phantom Hard Page Faults”. This behavior will happen regardless of the amount of RAM on the server because the operating system is optimizing the server’s RAM for memory pages that are frequently used.

If you run with no page file, then your Commit Limit (RAM + page file size) is now equal to the amount of RAM on your server. In this case, all it means is that your server cannot exceed a Commit Charge greater than the amount of RAM your server has installed. If the Commit Limit is reached on the system, then the system will be out of memory and will likely hang for long periods of time until the committed memory is released. This can sometimes be as easy as killing the process that has the most committed memory measured by the “\Process(*)\Private Bytes” counter in Performance Monitor.

The Problem/Symptoms

Okay, so now that we have the technical stuff out of the way, let’s put this into a real world case.

Last month, one of my customers had a 64-bit (x64) Windows Server 2003 Server with 32GBs of RAM. They went with the standard page file size of 1.5 times the size of RAM for their page file which gave them a 48GB page file which gave them a Commit Limit (RAM + page file size) of 80GBs! In addition, the page file was taking up an enormous amount of disk space.

Troubleshooting

I fired up Task Manager to see how much Committed memory (Commit Charge) is being used by the system and it was only using about 2GBs. In Task Manager on Windows Server 2003, the Commit Charge is shown as PF Usage which is not the proper name for it. PF Usage should read as “Committed Memory”, but at least it reads as just “Memory” now in Windows 7.

Anyway, the customer’s server was only using 2GBs (Commit Charge) out of 80GBs (Commit Limit) at peak. On Windows Server 2003’s Task Manager there is a Committed Memory Peak value which shows you how much committed memory the system used at peak from the time the server was booted up until now. If the peak ever gets close to the Commit Limit, then you need to increase the size of the page file or identify and kill the leaking (memory hog) process. In any case, my customer’s server was barely using the RAM on the server let alone needing to use the page file.

The Solution

Since the Commit Charge Peak was only at 2GBs and since the amount of RAM on the server was at 32GBs, the customer could run the server without a page file and not have any problems. In fact, they could remove some of the RAM modules from the server and save electricity costs. Ultimately, the customer stuck with the 1.5 times RAM setting just in case. This is a waste of disk space, but is a very safe choice.

With all of that said, we (Microsoft) still *generally* recommends a page file that is at least greater than the amount of RAM installed + 1MB (for the header). This is because by default Windows Servers are set to “Complete Memory Dump” which means to dump all of the physical memory and the page file to a memory.dmp file when a kernel dump (blue screen) occurs. Since a page file will be enormous based on this settings requirements, you can certainly choose a smaller kernel dump setting such as “Kernel memory dump” or “Small memory dump (64K)”. The smaller dump sizes still contain valuable data at a more reasonable disk space cost.

According to the “Windows Internals 4th edition” book, “The disadvantage of minidumps is that to analyze them, you must have access to the exact images used on the system that generated the dump at the time you analyze the dump”. This would certainly be difficult for a Microsoft Support Professional to assist you unless they are physical onsite with you. For a full explanation of these settings and their impact on the system, read the “Crash Dump Analysis” chapter in the Windows Internals book.

In a nutshell, use the smaller “Kernel memory dump” or “Small memory dump” settings if your server does not have a history of crashes (blue screens).

Sizing of the Page File

The reason page files have been difficult to size is because to truly size them properly, the server must be put into production and the Commit Charge peak must be closely monitored. Once you have a baseline of what the system’s commit charge peak is, then you simply need to make the Commit Limit larger than it with a bit of room to grow. Ironically, IT professionals need to set the page file to something *before* going into production, so this is a bit of the chicken and the egg problem. Since we have no idea of how much committed memory your server will use, it made sense to use a 1.5 times RAM as a *very general* guideline.

The Checklist

To *properly* size your server’s page file, follow this checklist:

  1. Kernel Memory Dump Setting: First check your kernel dump setting to see if your server is set to “Complete Memory Dump” (default setting on Windows Servers). If so, your page file must be greater than RAM plus 1MB. This kind of memory dump is extremely large and sometimes wasteful, so consider a smaller memory dump size unless your server has a history of kernel crashes (blue screens). Changing this setting will allow you to set the page file size to a size smaller than the amount of RAM installed which can free up some disk space.
  2. Monitor the Commit Charge Peak: Monitor your server’s Commit Charge peak to determine how much committed memory the system is using at peak. This can be monitored from the Performance tab in Task Manager or by monitoring the “\Memory\Committed Bytes” counter in Performance Monitor.
  3. Adjust the Commit Limit: After monitoring the Commit Charge peak, set the system’s Commit Limit to be larger than the highest monitored Commit Charge peak plus a little extra for room to grow just in case. I recommend it to be at least 10% greater than the maximum observed Commit Charge peak. Remember, the Commit Limit is the total of RAM + the total size of all of the page files on the system, therefore you can adjust it by adding/removing RAM or changing the sizes of the page file(s). The Commit Limit on the server can be monitored either from the Performance tab in Task Manager or by monitoring the “\Memory\Commit Limit” counter in Performance monitor.

Tools Used in this Case

Task Manager: This tool is native to the Windows operating system and can be opened with the key combination of Ctrl+Shift+Esc.

Performance Monitor: This tool is native to the Windows operating system and can be opened by clicking, Start, Run, type “perfmon”, then Enter or click OK.

Process Explorer: While I didn’t use Process Explorer in this case, it is a great tool for monitoring server memory. Process Explorer is a free download from Microsoft Technet Windows SysInternals at http://technet.microsoft.com/en-us/sysinternals/default.aspx or you can run it direct from http://live.sysinterals.com/procexp.exe

Conclusion

In conclusion, the page file is difficult to size because you must monitor the server’s Commit Charge peak over a long period of time while in production. Always make sure that your Commit Limit (RAM + page file size) is at least 10% greater than the Commit Charge peak. And please check out David Solomon’s Troubleshooting Windows Memory webcast. It’s great and free if you have a Windows Live account! Also, check out Mark Russinovich’s blog on Pushing the Limits of Windows: Virtual Memory.

The Case of the Relentless Cookie Monster
23 August 09 06:20 PM | clinth@microsoft.com | 3 Comments   

Introduction

Okay, I'm finally back from my 3 weeks of vacation. We went back to Dayton, Ohio to visit family and friends and had a great time. I wasn't onsite with any customers, but inevitably I had plenty of computer problems to solve. Some people avoid helping friends and family with computer problems, but actually enjoy it. I'll do a few separate blog posts to talk about the adventures I had. In the meantime, here is a good one about my wife’s computer…

My wife wanted to have her genealogy data with her while visiting in Ohio, so I decided reload my spare laptop for her to use. I've been trying to get her to go to Windows 7 for awhile now, so this was a great opportunity to get her to try it out. In any case, getting my wife to go to Windows 7 is not what this blog post is about.

Relevant Technical Data

Operating System: Windows XP Professional 32-bit
RAM: 2GBs
Hard Drives: 1 drive (C:)

The Problem/Symptoms

Before I started the migration Wizard on my wife's desktop computer (Windows XP) to copy her files and settings to her new laptop, I noticed that her desktop system was running very slow. "Slow" as in web sites are taking about 30 seconds or more to come up and overall system performance seemed sluggish such as the Start menu taking about 10 seconds to come up.

Troubleshooting

I fired up Task Manager (Ctrl+Shift+Esc) to see if there is any CPU or memory related issues. In this case, CPU utilization was relatively low (less than 10%) and memory availability looks good (less than 10% committed memory usage). Finally, using Microsoft Performance Monitor I looked at disk I/O. Bingo! Disk response times are very high and the average disk queue length is high.

Perfmon

It is a bit difficult to see in the screenshot above, but Avg. Disk Queue Length is above 1 for a long time. When there is more than 2 I/O’s waiting on a disk spindle for a long period of time, then that is an initial indicator of a disk problem. Be careful about using the _Total instance because it will average out this value across all of the disks. In this case, my wife only has one hard drive on her computer, so I can safely say that the disk resource is the busiest resource.

As a general rule of thumb, you can use Avg. Disk Queue Length as an initial indicator when there is more I/O on average than 2 plus then number of spindles. This means that if this was a RAID0 with 5 spindles (spindles = physical disks behind the RAID), then an Avg. Disk Queue Length greater than 7 on average would be considered bad. In any case, the best performance counters to use as initial disk performance indicators is Avg. Disk Sec/Read and Avg. Disk Sec/Write. They should always be less than 15ms regardless of the disk architecture.

Great! I found disk is the problem won't my wife be proud? ;-) Well, just like my customers, my wife doesn't care if I find a busy resource - she just wants it fixed. :-) So, we can’t stop here.

To identify why the disk resource is so busy, I used one of my favorite tools called Process Monitor by Mark Russinovich. Process Monitor is a SysInternals tool which is owned by Microsoft. It can be ran directly from the internet by browsing to http://live.sysinternals.com/procmon.exe.

I ran Process Monitor and used the Process Activity Summary to tell me which process is causing the most disk I/O. To my surprise, Internet Explorer was causing the most disk I/O.  I practically never see IE causing disk I/O, so this was an interesting result.

ProcmonView

Here is a screenshot of Process Monitor showing iexplorer.exe hitting a cookie text file very hard.

To confirm that this is the problem, Process Monitor has the ability to aggregate the results into a File Summary view or a Process Activity Summary view.

ProcmonFiles

Above is the File Summary view which shows the files that had the most disk I/O. This view shows that Internet Explorer cookies are hitting the disk the most.

The Solution

The solution? I simply closed the browser tab that had the web site up. Now, you are probably thinking, "That’s too simple. Why even blog about it?". Well, the point of my blog entries like this one is to highlight the troubleshooting steps and tools that I use to solve these issues. If this was a high end SQL Server, then my troubleshooting steps would be relatively the same.

Conclusion

Microsoft Performance Monitor and Process Monitor are excellent tools to help identify performance problems on computers. Process Monitor has many other features that I have not documented in my blog yet such as network I/O, registry key access, and my favorite… call stack aggregation, but I’ll try to talk about them in the future.

Please don’t think that cookies have bad performance problems. I can honestly say that this particular issue is extremely rare and I was very surprised to find this. Cookies are normally very low, if any, overhead.

Note: I could have used Microsoft xPerf to solve this issue as well.

I hope you enjoyed this blog entry.

Thank you!

The Case of the Phantom Hard Page Faults
16 July 09 04:33 PM | clinth@microsoft.com | 3 Comments   

I am teaching this week, so I figured I would talk about a case I had a few months ago. I have *plenty* of war stories to share, so I can certainly keep this weekly blog going for a long time. ;-)

I’m a big fan of Mark Russinovich and David Solomon, so you may have noticed that I title my blog entries similar to Mark Russinovich’s blog. Mark’s blog (http://blogs.technet.com/markrussinovich/) is full of great information that I use everyday to troubleshoot customer issues. In my case, I am blogging about the customer issues that I encounter each week.

A few months back, I was taking a look at a performance monitor log of a web server. My PAL tool (http://pal.codeplex.com) alerted me that “\Memory\Pages/sec” is breaking it’s threshold (established by our Vital Signs workshop) of more than 2500 pages per second on average. We consider sustained pages/sec of more than 500 to be a Warning and pages/sec of more than 2500 to be Critical.

MemoryPages_sec_0

As you can see here, the pages/sec counter is very high for a long period of time. This *has* to mean that the server is running out of RAM right?

To confirm that theory, I took a look at Available MBytes (the amount of free physical RAM) expecting to see it to be below 5% of total physical memory.

MemoryAvailable_MBytes_0

To my surprise, it is always above 1GB free, so why are we seeing such a high number of pages/sec?

I sent an email to my fellow Microsoft colleagues and one of them told me that it is likely caused by memory mapped files.

Here is the definition of Pages/sec according to performance monitor: Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) non-cached mapped memory files.

According to Wikipedia, memory-mapped files are a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk, but can also be a device, shared memory object, or other resource that the operating system can reference through a file descriptor.

In other words, applications like Microsoft Word and Microsoft PowerPoint will not load entire documents into RAM. Instead, they memory map the file, so that when you navigate through the document, it loads portions of the document as needed. The act of loading portions of the document from disk to RAM as a memory mapped file causes a hard page fault which is counted in the pages/sec counter.

When I spoke to the customer about this, they immediately realized that their backup software was running when we observed the high amount of pages/sec. The backup software must have been doing memory mapped files as it went.

Solution: The system was not out of RAM memory when the pages/sec counter was high. It was high because the backup software read in files as memory mapped files which are counted as hard page faults.

Per David Solomon (co-author of the Windows Internals book series), If you want to know for sure if your computer is doing hard page faults due to memory mapped files or to the page file, then run Process Monitor (http://live.sysinternals.com/procmon.exe) with “Enable Advanced Output” checked.

When using Process Monitor filter it to only track disk I/O when pages/sec is high and with “Enable Advanced Output” checked. When done capturing, click Tools, File Summary. If the majority of the disk I/O is to the pagefile.sys, then you might have a low memory issue. Otherwise, go to Tools, Process Activity Summary, to see which process is the most active disk I/O consumer and that process is likely the one doing the memory mapped files.

I learned this by watching his web cast at the following address. This is one of the *best* presentations I have ever seen on how memory really is handled.

David Solomon – Understanding and Troubleshooting Memory Problems
http://www.microsoft.com/emea/spotlight/sessionh.aspx?videoid=64

The Case of the High CPU Web Server
09 July 09 01:39 PM | clinth@microsoft.com | 1 Comments   

This week I was off to Phoenix, AZ again, but for a different customer this time. Phoenix is very hot this time of year – I think it got to about 110 degrees Fahrenheit hot enough to literally cook an egg on the sidewalk and likely burn it. ;-)

When I got the dispatch for this ROSS (Remote On-Site Service), the IIS6 Windows Server 2003 32-bit web server was performing very slowly and was leaking memory very quickly. I was asked to assist with the memory leak and its performance.

Before my flight to Phoenix, I picked up a few perfmon logs from the customer and analyzed them using my PAL tool (http://pal.codeplex.com) to see if we are really dealing with a memory leak. Confirmed! The “\Process(w3wp#4)\Private Bytes” counter showed a steep increasing trend. Here is a chart showing the increasing trend of Private Bytes in one of the IIS worker processes labeled as w3wp_4.

ProcessPrivate_Bytes_3

A steady increasing trend in Private Bytes of a process is an indicator of a memory leak because this is the amount of committed memory that is not shared with any other process. It is normal for processes to consume memory, then release it when it is under less load and it is common for a process to consume memory when it initially starts This is why we look for an increasing trend over a long period of time where the process does not eventually release that memory. In any case, I confirmed the memory leak and the customer also had entries in the Application Event Log of out of memory exceptions (OOM) which means the process eventually ran out of it’s virtual address space - 2GBs in this case because it is a 32-bit server. “\Process(*)\Virtual Bytes” would tell us if the process is close to running out of virtual address space.

Since this is an ASP.NET v2.0 application, I also checked “\.NET CLR Memory(w3wp#4)\# Bytes in all Heaps” to see if it is on an increasing trend as well. Here is the chart.

NET_CLR_Memory#_Bytes_in_all_Heaps_0

As you can see, there might be .NET leaks in other processes, but not the w3wp#4 process that we are seeing the most Private Bytes consumption. This tells me that the leak is not in ASP.NET, but likely due to a native (non-.NET) DLL loaded in memory of this process.

The best tool for memory leak troubleshooting is Microsoft DebugDiag, so I was mentally prepared to use it to troubleshoot this issue. DebugDiag was written by Microsoft Support Professionals to assist customers with memory leaks. It is freely available from Microsoft.com.

Once I arrived onsite and watched the server’s behavior when under load, I found something completely unexpected. The web server went to 100% CPU as soon as significant load hit it. The memory leak still existed, but the web server response times increased to greater than 60 seconds before the worker process ran out of memory. This told me that while the memory leak is certainly a problem to deal with, but it is *not* the cause of the server’s poor performance right now - it’s a CPU issue now. Here is a chart showing the high CPU consumption.

ProcessorPercent_Processor_Time_0

The processors can only be in one of two modes: privileged mode (kernel) or user mode when doing work. High privileged mode CPU issues (more than 30% of CPU time in privileged mode) means the system is spending too much time doing operating system sub routines such as disk I/O, network I/O, thread context switching, bad drivers, etc.

ProcessorPercent_Privileged_Time_0

The chart above shows that there is significant Privileged mode CPU time, but it is less than 30% on average – meaning 70% of the time is in user mode (code processing), so our best bet is to see what process(es) are consuming the CPU and find out what they are running.

image

The chart above shows all of the possible CPU (400% in this case since there is 4 processors) is consumed by w3wp#4. This means we need to profile this process to see what the threads are doing.

Back in the day (and still today), we would get several process dumps of the process and look at the call stacks of the threads. This is similar to taking a picture of an employee working. If you take a picture of the employee at work, then you can get a good idea of what the employee is doing most of the day, right? Suppose the employee was working *very* hard up all night working, and you took their picture when they finally got a chance to sit down – did you properly capture the image of the person’s work ethic? <Brian Regan> I submit you did not!</Brian Regan> Therefore, you take several pictures of the employee throughout the work day hoping to get a good idea what they do all day. This is why we try to capture several process dumps to try to get a relatively accurate idea of what the process is spending all of it’s time on. This method is still valid and used today, but it is invasive (pauses the process when it dumps) and creates large files (DMP files the same size as the private bytes of the process).

By the way, I enjoyed this customer visit a lot because my contact and I were cracking a lot of Brian Regan (comedian) jokes most of the day – I just hope no one was basing our work ethic by our jokes. ;-)

Process profilers are more effective than process dumps in this specific case because they sample the call stacks of the process extremely often. This would be the equivalent of recording the employee with a camcorder all day versus taking pictures of him/her here and there. In this case, we used the Visual Studio 2005 command line Profiler as our tool of choice. We chose the command line version of the profiler because we don’t need or want the full Visual Studio IDE installed on the web server. The command line profiler is able to gather the data, allowing us to analyze it on that server or another computer with the full Visual Studio suite installed for analysis.

We used nearly the exact steps found in the following How To article I wrote:

How To Identify Functions causing a High User-mode CPU Bottleneck for Server Applications in a Production Environment
http://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA&referringTitle=How%20Tos

The one thing that was critical for us, but not covered in the above HowTo guide is that we needed to use the /PACKSYMBOLS parameter, so that the symbols of the application are added to the VSP output file. Symbols are needed to resolve memory offsets to function calls. Maybe I’ll explain symbols in one of my later blog entries.

Here is an example of how to use the PackSymbols parameter to add the symbols to the VSP output file:

vsperfreport /SUMMARY:ALL /packsymbols w3wp.vsp

Originally, the HowTo article (above) was supposed to be in the Microsoft Patterns & Practices Web Performance Testing book, but it was cut because it deals more with performance analysis than testing methodology. I’ve been trying to encourage them to write a book on performance analysis with me, but we’ll see. Maybe all of my blog entries will eventually be compiled into a book.

Due to customer privacy, I cannot and will not post the function calls that we found. With that said, we generally found a lot of time spent in .NET reflection (a method of querying a .NET assembly) and in the creation of dynamic .NET assemblies. Also, please don’t think that I do this alone. I had a lot of help from my PFE colleagues on this one such as Rob Maushardt, Frank “Tag” Taglianetti, Norman Drews, and Matt Perez.

To help further isolate the processor utilization in IIS we also used a tool called the Microsoft Server Performance Advisor (SPA) which is a free download from Microsoft.com for Windows Server 2003 and part of the operating system in Windows 2008 and later. The SPA tool uses performance counter log and Event Tracing for Windows (ETW) to provide detailed information about IIS’s inner workings. Here is a clipping from the SPA report showing a very high amount of CPU consumption by the Default.aspx page.

image

For more information on how to use the Microsoft Server Performance Advisor (SPA) tool, see my How To article:

How to Identify a Disk Performance Bottleneck Using the Microsoft Server Performance Advisor (SPA) Tool
http://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA1&referringTitle=How%20Tos

Once we had the functions found on the stack the most and cross referenced them with the code in Default.aspx, the customer was able to focus their efforts on specific areas of their code. Unfortunately, there was no quick fix for this performance issue and the customer has a lot of work ahead of them. At the very least, they are on the road to recovery and they know how to identify and alleviate these issues on their own now.

References

Microsoft Server Performance Advisor (SPA)
http://www.microsoft.com/downloads/details.aspx?FamilyID=09115420-8c9d-46b9-a9a5-9bffcd237da2&DisplayLang=en

How to Identify a Disk Performance Bottleneck Using the Microsoft Server Performance Advisor (SPA) Tool
http://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA1&referringTitle=How%20Tos

How To: Identify Functions causing a High User-mode CPU Bottleneck for Server Applications in a Production Environment
http://www.codeplex.com/PerfTesting/Wiki/View.aspx?title=How%20To%3a%20Identify%20a%20Disk%20Performance%20Bottleneck%20Using%20SPA&referringTitle=Techniques%20for%20Performance%20Testing

The Case of the Out of Memory BizTalk Server
07 July 09 01:30 PM | clinth@microsoft.com | 1 Comments   

About a month ago, I was conducting a BizTalk Health Check for a customer (yes, Microsoft BizTalk Server is one of my other specialties along with several others) where the BizTalk server has been running slower than expected. They noticed Out of Memory exceptions in the event logs, but since BizTalk doesn’t loose messages and restarts on failure, this wasn’t a big deal.

Now, when a process runs out of memory, it means that it ran out of virtual memory – meaning it ran out of memory in it’s virtual world. Now, if you are thinking, “What the heck is Clint talking about?”, then this blog entry is for you because I will explain it. Otherwise, you can skip the next few paragraphs.

Begin virtual memory history lesson…

How many people remember DOS? When I ask this question when teaching the Vital Signs workshop, I typically get about half the class raising their hands. How much memory can DOS address? 640K of course. If you wanted to run a game (such was my life back in those days), you had to free up enough memory to run it. This meant editing the Autoexec.bat and config.sys files – wow, I still remember those file names :-). Anyway, this behavior was the same in Windows 3.1 because it was based on DOS. When you tried to open a large Word document (or game), it would fail because it didn’t have enough memory. Users just wanted it to work no matter how slow it was, so virtual memory was invented.

In 32-bit Windows NT4, the operating system works on the architecture of virtual memory. Windows NT v3.5 supported it as well, but that was before my time. In any case, virtual memory is a way to fake the process into thinking it has 2GBs of memory by placing the process in a “virtual” world similar to “The Matrix” movie. Now, back then 2GBs was a *huge* amount of memory and *each* process had it’s own 2GBs of virtual address space. So an NT4 server with 8MBs (yes, I am talking megabytes) of memory could open up a bunch of 10MB Word documents with no problem other than being… very slow. But who cares, it worked right? Well, the kernel still has to put this memory somewhere. Since it couldn’t fit the 10MB documents into RAM, it now uses the page file on the disk as additional physical memory which is why it would be very slow to open.

Technically speaking, I lied to you about the size of the virtual address space. Each process actually gets 4GBs of virtual memory – it just has to share that memory with the kernel. The kernel gets 2GBs and the process gets 2GBs by default. When I first heard of this I was thinking that there is a kernel for each process because that is what it sounds like. Think of it as a hub and spoke model where the kernel is the hub with 2GBs of memory and each process is a 2GB spoke. The 4GB virtual address space is simply the amount of memory that a 32-bit processor can address.

Now, let’s upgrade to Windows 2003 32-bit. Today, we have massive servers with 16GBs or more of RAM. Guess what? It’s still a 32-bit world. Each process still lives in the same 4GB virtual address space with the kernel (2GBs to each process and 2GBs to the kernel). The problem is that our servers now have far more RAM than virtual address space. This means that products like SQL Server are limited to the 2GBs of user-mode virtual memory regardless of how much RAM is installed on the server. Your server can still use all of that RAM, but since each process is limited to 2GBs of virtual memory, it just takes a *lot* of processes to use all of that RAM.

Quiz Time… If a 32-bit application is out of memory, is it out of virtual memory or out of physical memory? Answer: It is out of virtual memory. Remember that each process is living in a virtual world and has no idea what the real world (physical memory) looks like.

If you want a single process to be able to use all of the RAM on the server, then the process needs to have more virtual address space than the amount of RAM installed. Therefore, the answer is really just go to 64-bit (implemented as x64) which gives each process 16TBs of virtual address space per process - 8TB for the process and 8TB for the kernel.

End Virtual Memory History Lesson

Okay, let’s get back to the mystery… we have a BizTalk Server service (technically called a BizTalk host instance) that is running out of memory. Did the server run out of physical memory or did the service/process run out of virtual address space? If you read my history lesson above, then you know that BizTalk ran out of virtual address space. Since the server is a 32-bit server, this means that that BizTalk is not able to address more than 2GBs. Effectively, the BizTalk server had an aggressive memory leak, soaked up the 2GBs of virtual memory very quickly, then once I ran out of virtual memory, it crashed.

Question: What could have happened if a process is only using about 1.2GBs of virtual address space (about 60% of it’s virtual memory) yet ran out of memory?

Answer: An issue called heap fragmentation.

Heap memory is just memory that a process uses to store and retrieve data. If a process uses a 64K block of memory, then frees 60K of that memory in the middle of the block, you now have two 2K blocks of memory with a 60K gap in the middle. The .NET Framework allocates memory in 64K blocks of memory. This memory *must* be contiguous, therefore, the 64K memory allocation cannot fit between the two 2K blocks of memory, so the memory allocation must go higher in the memory addresses to find a contiguous block. If this pattern continued indefinitely, then you would end up with about 1.2GBs out of 2GBs of used virtual memory yet be out of virtual memory because the .NET Framework is unable to create a new 64K block of contiguous memory.

Luckily, the .NET Framework is well aware of heap fragmentation and it has a garbage collector (literally called the Garbage Collector or often called the “GC”) the defragments memory (consolidates the memory to make it contiguous) as it goes. This works great when you are using all .NET assemblies, but if you have COM object (native code – non-.NET), then that COM object cannot be collected/defragged by the GC and could potentially fragment the heap memory aka “heap fragmentation”.

In this case, the BizTalk server ran out of virtual memory that was likely caused by a non-.NET COM object. Once I mentioned this behavior to the BizTalk developer, he immediately had an idea of which object was likely causing the memory leak.

The reason we know that we have a memory leak is because we looked at the “\Process(*)\Private Bytes” counter. If this counter is showing an increasing trend over a long period of time and eventually results in an out of [virtual] memory exception which results in a crash in the process.

To get more information about what is leaking memory, then use a tool like DebugDiag which is a free download from Microsoft.com. Use it to attach to the leaky process shortly after it has ran for awhile. DebugDiag will track the memory allocations, then eventually report which components are leaking memory – not releasing memory that it should have released. DebugDiag was written by some guys I know in Microsoft Support in Charlotte, NC such as Jeremy Phelps.

Another tactic is to use ADPlus.vbs from the Microsoft Debugging Tools for Windows package to get a memory dump (*.dmp) of the leaky process before and after the processes has consumed a large amount of memory. Next, use WinDBG to debug the dmp files. The Microsoft Debugging Tools for Windows is a free download from Microsoft.com.

ADPlus.vbs was originally written by a good friend of mine, Robert Hensing who is now a Window security demi-god. Check out his blog postings at: http://blogs.technet.com/robert_hensing

Yet another tactic is to use a tool called VAView. This tool reads a process dump file (*.dmp) and visually shows memory allocations very similar to the disk layout when you do a disk defrag analysis. All you do is look for the “speckling” or plaid pattern – processes really don’t look good in plaid. VAView was written by a colleague of mine named John Allen. I don’t think he has it published anywhere on the internet, so if you would like this tool, then let me know.

Getting back to the case at hand, I confirmed that they had a memory leak in BizTalk by watching the “\Process(*)\Private Bytes” performance counter trend upwards, then eventually hit the 2GB virtual memory limitation, and crash. The “\Memory\Availible MBytes” counter showed that the server still had plenty of RAM available, so I came up with the following recommendations:


  1. Separate the BizTalk artifacts/components into other BizTalk Hosts: If there is more than one artifact in the BizTalk host (process), then move the artifacts to different BizTalk hosts. This distribution of the artifacts into different processes (each with 2GBs of virtual memory) might be enough to handle the memory consumptions.
  2. Migrate to 64-bit BizTalk Server: The 64-bit virtual address space is exponentially larger than the 32-bit virtual address space. Moving to 64-bit will effectively resolve this issue because the server still has about 1GB of free RAM on the server. Unfortunately, the BizTalk Servers were on 32-bit hardware, so this would mean new hardware.
  3. Track the Memory Consumption: We can use memory leak tools to troubleshoot this issue. A good one to use is DebugDiag which is a free tool from http://download.microsoft.com.

The following knowledge base article has a lot of great information on possible causes of memory leaks in Microsoft BizTalk Server.

    How to troubleshoot a memory leak or an out-of-memory exception in the BizTalk Server process
    http://support.microsoft.com/kb/918643

This was a really large blog posting, but I had to cover a lot of background concepts first. I hope these blog posts help you with your own mysteries.

Another Case of the Missing PTEs
18 June 09 11:02 AM | clinth@microsoft.com | 2 Comments   

I did an IIS Health Check (IISHC) this week and found another 32-bit Windows 2003 web server that was out of System PTEs. This blog entry talks about how we resolved this issue.

I detected this issue by looking at the boot.ini file which is one of the validations we do in the IISHC. The boot.ini file had the /3GB switch, the /PAE switch, and a few others. My concern is that the /USERVA switch was not in the boot.ini file. I immediately brought up Perfmon and looked at the “\Memory\Free System Page Table Entries” performance counter. It had a value of about 2300 which puts the server in a critical condition. When a server has less than 5,000 free PTEs, then it is considered critical because the server will begin failing to allocate memory to processes.

Per the Vital Signs Workshop (created by Shane Creamer) offered by my organization:
“\Memory\Free System Page Table Entries”
Warning: If less than 10,000
Critical: If less than 5,000

Without going into too much detail, the /3GB switch steals 1GB of virtual memory from the kernel and gives it to processes. This allows processes that are compiled with LARGE_ADDRESS_AWARE to use 3GBs of memory versus the standard 2GBs. While this sounds like a good thing, the kernel needs memory too and you can cut into it too much very easily and the first casualty is System Page Table Entries (PTEs) which are needed to map virtual memory to physical memory.

In a nutshell, if you use the /3GB switch in the boot.ini, then you should use /USERVA=3030. The /3GB switch puts the server at 3072MBs (3GBs), so the /USERVA switch moves that line between the process memory and the kernel memory at a new location. For example, a value of 3030 will give back 42MBs to the kernel going directly toward System PTE’s.

For detailed troubleshooting on System PTEs, see my (and other contributors) blog entry at: http://blogs.technet.com/clint_huffman/archive/2008/04/07/free-system-page-table-entries-ptes.aspx

In any case, this web server never needed the /3GB switch in the first place. I determined this by looking at the “\Process(*)\Virtual Bytes” performance counters. All of the processes were using less then 1.1 GBs of virtual memory out of 3GBs. Since the default (non-/3GB system) is 2GBs of virtual address space per process, there is no need to have /3GB on this server.

For good measure, I fired up WinDBg (a kernel debugger) and did a “!vm” to see what the virtual memory of the kernel looked look. It had an alert saying there had been about 1 million failed PTE allocations, so this confirmed the critical condition.

After discussing this with the customer, we agreed to remove the /3GB switch. After the reboot, the virtual memory of the server looked *much* better. We showed about 153,000 free PTEs and the Pool Paged and Pool Non-Paged kernel memory maximum sizes doubled (good thing), but of course, this reverted the maximum addressable memory for processes back to 2GBs.

In conclusion, if you use the /3GB switch, then also use the /USERVA switch. Ultimately, all of this would not be problem on 64-bit Windows on the same hardware because 64-bit Windows can address up to 16TBs per process (8TB for each process and 8TB for the kernel) by default.

A quick note about power savings (Green IT) through performance analysis

This web server had 16GBs of RAM installed. The “\Memory\Available MBytes” performance counter shows that the server always has over 14GBs of RAM available. This is great for performance, but wasteful on power consumption.

Bruce Worthington mentioned to me a few months ago that one of the biggest power consumers in PC hardware is the RAM modules. Therefore, if your server constantly has free RAM available even during peak load, then you can save some money on power by removing some of the unused RAM from the server. You can measure this using power meter in the AC outlet measuring the kilowatts used before and after the RAM has been removed. Bring these numbers back to your manager on how your company can save money.

Alternatively, you could run virtual computers on the same hardware to make use of that extra RAM completely saving on the AC power needed to power another physical computer.

I hope you find this blog entry useful. Thank you!

The Case of the Missing PTEs
15 June 09 11:53 PM | clinth@microsoft.com | 1 Comments   

It was a dark and stormy night… actually it was sunny and clear, but just thought I’d start this blog series like a classic novel. :-)

This is the first blog entry of my new blog related to my job. I made a mistake when creating my first one and ended up using my full name versus my email name ‘clinth’. My old blog is at http://blogs.technet.com/clint_huffman and my new blog is at http://blogs.technet.com/clinth. This new blog will feature my troubleshooting challenges each week.

I am a Microsoft Premier Field Engineer (PFE) and I have been at Microsoft in various support roles for the past 10 years. My job as a PFE is to go onsite with our customers and assist them with difficult problems, identify new problems through health checks, and/or deliver training. This week was an urgent, reactive, issue. So, without further a do, let’s talk about it.

I received a call from one of our ROSS (Remote On-site Services) coordinators at 2:30am. I was completely asleep when I got the call and I had to remember who I was and why I had a bright light [phone] next to my face. Once I realized what was going on, she politely asked me if I could look at a case and see if I have the right skill set to handle a case.

The case was a Microsoft SQL Reporting Services 2005 case where the customer received a DTD XML error and received a timeout expired error with it. I am not a SQL guy, but they are asking for an IIS person to assist – one of my specialties. I chatted via instant messenger (IM) with the support professional who owned the case and he explained that they believes this is an IIS permissions related issue. With that in mind, and nothing else on my plate, I figured what the heck.

I live in Seattle and took the first flight out to Phoenix, AZ where the customer was located. It took me most of the day to get there and I arrived around 5:30pm. I joined the conference call that was already in progress with two other support professionals. I learned that this case has been open for 3 weeks and they are just about out of options trying to figure out why it is not working.

During our conversation, the customer mentioned that the SQL Reporting Services (SRSS) works shortly after a reboot, this told me a *lot*. First that it is likely not security or configuration related simply because it *works* for awhile after the reboot. I teach the Vital Signs workshop which is a Windows Architecture workshop focused on performance analysis, so my first thought was to look at the performance counters of the server to see what the memory consumption looks like. Sure enough, I found that the “\Memory\System Free Page Table Entries” was down to 500. This counter is considered critical if below 5,000, and this is the lowest value I have ever seen in my career.

System Page Table Entries (PTEs) is what the kernel uses to keep track of virtual memory to physical memory mappings. If the kernel was an accountant, then it would be similar to the accountant running out of paper. Effectively, no new memory allocations are permitted until PTE memory is freed up. Since the condition on this server was so severe, we needed to solve this problem first. In retrospect, Obi Wan might say that the symptoms did match the root cause when you look at it from a different “perspective”. In any case, let’s jump in to the details.

Software:

  • Microsoft Windows Server 2003 Service Pack 1 (32-bit) (Win2K3) with 16GBs of RAM
  • Microsoft SQL Server 2005 Reporting Services (SRSS)
  • Microsoft Internet Information Services 6 (IIS6)

Symptoms:

  • SQL Reporting Services failed with the error, “For security reasons DTD is prohibited in this XML document.”
  • The SQL Reporting Services worked properly after a reboot, then fail over time.
  • Odd things are happening on the server such as perf counters not displaying all of the instances, new applications are unable to start due to a lack of system resources.
  • Event ID 333’s are in the event log
  • “\Memory\System Page Table Entries” is at 500 – Critical when below 5,000.

Normally, when a Windows 2003 server is out of PTEs we simply use the /USERVA boot.ini switch to give more memory back to the kernel, but in this case the /3GB switch was not being used. This was the first time that I have seen a server out of PTEs when /3GB is not used because /3GB effectively steals 1GB of virtual memory from the kernel to allow applications like SQL Server to address up to 3GBs of virtual memory versus 2GB. Since /3GB was not configured on this server, I will explain /3GB in future blog postings. In the meantime, you can look at my first blog posting at http://blogs.technet.com/cotw (Counter of the week).

I fired up a kernel debugger called WinDBG and did a “!vm”. This displayed the virtual memory allocations of the kernel. Oddly enough, I didn’t find anything unusual other than the severe lack of PTEs which I already know about. At the very least this confirmed my suspicions about the lack of System PTEs.

I suspected a PTE leak over time, so I asked the customer to reboot the server. One of the other Microsoft Support Professionals asked for a full kernel dump before we rebooted. I agreed because after the reboot, we might not get back to this broken state again.

After getting the full kernel dump using the “Not My Fault” SysInternals application, we rebooted the server. To my surprise, the server was at 3,000 free PTEs which is still in the critical zone of being below 5,000.

At a loss as to what to do next, I called up a colleague of mine, Ben Christenbury. His first thoughts are to disable Hot Add memory. Hot Add memory is a feature of Windows 2003 Server that allows the addition of more physical RAM while the server is still running. Therefore, if a server’s motherboard has the potential of 256GBs or RAM, then PTEs and other memory resources such as Paged Pool memory are reserved for this potential event. The customer’s server had 16GBs of RAM installed, but likely could go up to 256GBs. Therefore, the system is likely reserving large amounts of PTE and other kernel memory for  the hot add memory feature. He recommended setting the DynamicMemory registry key to 1 which tells the Hot Add memory feature to prepare the system for either 1GB of RAM or the amount of RAM already installed whichever is larger. In this case, the server has 16GBs, so setting DynamicMemory to 1 (1GB) effectively disables the Hot Add Memory feature. Keep in mind that the hot add memory feature can also be disabled at the BIOS of the server. If the customer wanted to potentially add 16GB more RAM (32GBs total RAM), then I would have advised to set DynamicMemory to 32 (32GBs). The following knowledge base article explains this registry key in more detail:

How to Configure the Paged Address Pool and System Page Table Entry Memory Areas
http://support.microsoft.com/kb/247904

The DynamicMemory registry key is located under the following registry key:

HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management

When we went to the Memory Management registry key we found other issues of concern. The SystemPages key was set to 798720 and the PagedPoolSize key was set to 1107296255 (1GB). Both of these keys should be set to 0 (0 is the default setting) versus static sizes. We set SystemPages to 0 and set DynamicMemory to 1, then rebooted. Unfortunately, the system was still at about 3,000 free PTEs. That’s when I remembered that Ben Christenbury had told me that the PagedPoolSize setting will take precedence – meaning it’s very likely that the paged pool registry key is reserving 1GB of kernel virtual memory even though Paged Pool was really only taking up 366MBs – likely putting a 634MB gap of unused memory in the kernel. Paged Pool cannot be set larger than 650MB, but I have never seen it larger than 366MBs. In any case, we set PagedPoolSize to 0 which allows the system to set the real size of it appropriately at boot time, then rebooted again. This time we got about 130,000 free PTEs!

I monitored the system the rest of the day and it remained functional the entire time. The customer agreed that a lack of PTEs was the most likely cause of the problem and we closed the case.

Root Cause: The most likely root cause of this problem was the PagedPoolSize registry key being set to 1107296255 (1GB). The customer does not know how or why the key was inappropriately set. We speculate that one of the video card drivers set it incorrectly, but we do not have proof of that and I want to remain objective in my findings by only reporting the facts.

Solution: In this case, we had three changes that could have had a cumulative effect. We changed SystemPages from 798720 to 0, PagedPoolSize from 1107296255 [1GB] to 0, and created the DynamicMemory key and set it to 1 [1GB].

In conclusion, if you are running 32-bit, then keep a close eye on virtual memory resources and migrate to 64-bit as soon as you can. It will save you an enormous amount of time and energy when troubleshooting problems. For more information on PTE troubleshooting, check out our blog entry at: http://blogs.technet.com/cotw/archive/2008/04/07/symptoms-lack-of-free-system-page-table-entries-ptes-and-system-wide-delays-i-o-request-failures.aspx

The Performance Analysis of Logs (PAL) tool (a log analysis tool I wrote in collaboration with other Microsoft employees) advises when System PTEs are low. Download it for free at:
http://pal.codeplex.com.

If you would like to learn more about Windows architecture and performance analysis, then consider the Vital Signs workshop offered by my organization, Microsoft Premier Field Engineering (PFE).

I hope you enjoyed this blog entry. Tune in again for another Windows troubleshooting adventure!

If you would like to use Microsoft Premier Field Engineering (PFE) resources (onsite problem solving, training, risk assessments, etc) and if you have a Microsoft Premier Support Agreement, then contact your Microsoft Technical Account Manager (TAM). If you do not have a Microsoft Premier Support Agreement, then go to http://www.microsoft.com/services/microsoftservices/srv_premier.mspx for more info.

Search

This Blog

Syndication

Page view tracker