It was a dark and stormy night… actually it was sunny and clear, but I just thought I’d start this blog series like a classic novel. :-)
This is the first blog entry of my new blog related to my job. I made a mistake when creating my first one and ended up using my full name versus my email name ‘clinth’. My old blog is at http://blogs.technet.com/clint_huffman and my new blog is at http://blogs.technet.com/clinth. This new blog will feature my troubleshooting challenges each week.
I am a Microsoft Premier Field Engineer (PFE) and I have been at Microsoft in various support roles for the past 10 years. My job as a PFE is to go onsite with our customers and assist them with difficult problems, identify new problems through health checks, and/or deliver training. This week’s engagement was an urgent, reactive issue. So, without further ado, let’s talk about it.
I received a call from one of our ROSS (Remote On-site Services) coordinators at 2:30am. I was completely asleep when I got the call and I had to remember who I was and why I had a bright light [phone] next to my face. Once I realized what was going on, she politely asked me if I could look at a case and see if I had the right skill set to handle it.
The case was a Microsoft SQL Reporting Services 2005 case where the customer received a DTD XML error along with a timeout expired error. I am not a SQL guy, but they were asking for an IIS person to assist, which is one of my specialties. I chatted via instant messenger (IM) with the support professional who owned the case and he explained that they believed this was an IIS permissions related issue. With that in mind, and nothing else on my plate, I figured what the heck.
I live in Seattle and took the first flight out to Phoenix, AZ where the customer was located. It took me most of the day to get there and I arrived around 5:30pm. I joined the conference call that was already in progress with two other support professionals. I learned that this case had been open for 3 weeks and they were just about out of options trying to figure out why it was not working.
During our conversation, the customer mentioned that SQL Reporting Services (SSRS) works shortly after a reboot. This told me a *lot*. First, it is likely not security or configuration related, simply because it *works* for a while after the reboot. I teach the Vital Signs workshop, which is a Windows Architecture workshop focused on performance analysis, so my first thought was to look at the performance counters of the server to see what the memory consumption looked like. Sure enough, I found that “\Memory\Free System Page Table Entries” was down to 500. This counter is considered critical if below 5,000, and this is the lowest value I have ever seen in my career.
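To make the threshold concrete, here is a minimal Python sketch of the check I was doing in my head. The function name and the sample values other than 500 are hypothetical; the only number taken from this case is the 5,000 critical threshold.

```python
# Threshold from the text: fewer than 5,000 free system PTEs is critical.
CRITICAL_FREE_PTES = 5_000


def classify_free_ptes(free_ptes: int) -> str:
    """Classify one sample of the free system PTE counter."""
    if free_ptes < 0:
        raise ValueError("a counter sample cannot be negative")
    return "critical" if free_ptes < CRITICAL_FREE_PTES else "ok"


if __name__ == "__main__":
    # 500 is the value seen on the customer's server; the others are
    # hypothetical samples used only to exercise the function.
    for sample in (500, 4_999, 5_000, 20_000):
        print(f"{sample:>7,} free PTEs -> {classify_free_ptes(sample)}")
```

This only encodes the single threshold mentioned here; a real monitoring rule would likely look at the trend over time as well.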
System Page Table Entries (PTEs) are what the kernel uses to keep track of virtual memory to physical memory mappings. If the kernel were an accountant, then running out of PTEs would be like the accountant running out of paper. Effectively, no new memory allocations are permitted until PTE memory is freed up. Since the condition on this server was so severe, we needed to solve this problem first. In retrospect, Obi Wan might say that the symptoms did match the root cause when you look at it from a different “perspective”. In any case, let’s jump into the details.
Normally, when a Windows 2003 server is out of PTEs, we simply use the /USERVA boot.ini switch to give more memory back to the kernel, but /USERVA only fine-tunes /3GB, and in this case the /3GB switch was not being used. This was the first time that I had seen a server out of PTEs when /3GB was not used. That is unusual because /3GB effectively steals 1GB of virtual memory from the kernel to allow applications like SQL Server to address up to 3GB of virtual memory versus 2GB, so it is normally the reason the kernel runs short. Since /3GB was not configured on this server, I will explain /3GB in future blog postings. In the meantime, you can look at my first blog posting at http://blogs.technet.com/cotw (Counter of the week).
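The arithmetic behind those switches can be sketched in a few lines of Python. This is only an illustration of the address-space split on 32-bit Windows, not a tool: the function name is made up, the 2048 to 3072 range for /USERVA is my understanding of the documented limits, and 2900 is just an example value.

```python
TOTAL_VA_MB = 4096  # a 32-bit address space covers 4GB


def kernel_va_mb(use_3gb=False, userva_mb=None):
    """Return the virtual address space (in MB) left for the kernel."""
    if userva_mb is not None:
        if not use_3gb:
            raise ValueError("/USERVA only applies together with /3GB")
        # Assumed valid range for /USERVA; treat as an assumption.
        if not 2048 <= userva_mb <= 3072:
            raise ValueError("/USERVA is expected to be between 2048 and 3072")
        user_mb = userva_mb
    else:
        user_mb = 3072 if use_3gb else 2048
    return TOTAL_VA_MB - user_mb


if __name__ == "__main__":
    print("default boot.ini   :", kernel_va_mb(), "MB for the kernel")
    print("/3GB               :", kernel_va_mb(use_3gb=True), "MB for the kernel")
    print("/3GB /USERVA=2900  :", kernel_va_mb(use_3gb=True, userva_mb=2900), "MB for the kernel")
```

The point is simply that every megabyte handed to applications comes out of the kernel's share, which is where System PTEs live.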
I fired up a kernel debugger called WinDBG and did a “!vm”. This displayed the virtual memory allocations of the kernel. Oddly enough, I didn’t find anything unusual other than the severe lack of PTEs, which I already knew about. At the very least, this confirmed my suspicions about the lack of System PTEs.
I suspected a PTE leak over time, so I asked the customer to reboot the server. One of the other Microsoft Support Professionals asked for a full kernel dump before we rebooted. I agreed because after the reboot, we might not get back to this broken state again.
After getting the full kernel dump using the “NotMyFault” SysInternals application, we rebooted the server. To my surprise, the server came back up with only 3,000 free PTEs, which is still in the critical zone of being below 5,000.
At a loss as to what to do next, I called up a colleague of mine, Ben Christenbury. His first thought was to disable Hot Add memory. Hot Add memory is a feature of Windows 2003 Server that allows the addition of more physical RAM while the server is still running. Therefore, if a server’s motherboard has the potential for 256GB of RAM, then PTEs and other memory resources such as Paged Pool memory are reserved for this potential event. The customer’s server had 16GB of RAM installed, but likely could go up to 256GB. Therefore, the system was likely reserving large amounts of PTE and other kernel memory for the hot add memory feature. He recommended setting the DynamicMemory registry key to 1, which tells the Hot Add memory feature to prepare the system for either 1GB of RAM or the amount of RAM already installed, whichever is larger. In this case, the server had 16GB, so setting DynamicMemory to 1 (1GB) effectively disables the Hot Add Memory feature. Keep in mind that the hot add memory feature can also be disabled in the BIOS of the server. If the customer had wanted to potentially add 16GB more RAM (32GB total RAM), then I would have advised setting DynamicMemory to 32 (32GB). The following knowledge base article explains this registry key in more detail:
How to Configure the Paged Address Pool and System Page Table Entry Memory Areas http://support.microsoft.com/kb/247904
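The “whichever is larger” rule for DynamicMemory is easy to misread, so here is a tiny Python sketch of it as I described it above. The function name is hypothetical and this models only that one rule, not how Windows actually sizes its kernel memory areas.

```python
def hot_add_planning_gb(installed_ram_gb: int, dynamic_memory_gb: int) -> int:
    """Amount of RAM (in GB) the system prepares for, per the rule above:
    the DynamicMemory value or the installed RAM, whichever is larger."""
    if installed_ram_gb <= 0 or dynamic_memory_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(installed_ram_gb, dynamic_memory_gb)


if __name__ == "__main__":
    # 16GB installed, DynamicMemory = 1: nothing extra is planned for,
    # which is why this effectively disables Hot Add memory.
    print(hot_add_planning_gb(16, 1))
    # 16GB installed, DynamicMemory = 32: room is kept for growth to 32GB.
    print(hot_add_planning_gb(16, 32))
```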
The DynamicMemory registry key is located under the following registry key:
HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management
When we went to the Memory Management registry key, we found other issues of concern. The SystemPages key was set to 798720 and the PagedPoolSize key was set to 1107296255 (1GB). Both of these keys should be set to 0 (0 is the default setting) versus static sizes. We set SystemPages to 0 and set DynamicMemory to 1, then rebooted. Unfortunately, the system was still at about 3,000 free PTEs. That’s when I remembered that Ben Christenbury had told me that the PagedPoolSize setting will take precedence. That meant it was very likely that the paged pool registry key was reserving 1GB of kernel virtual memory even though Paged Pool was really only taking up 366MB, likely leaving a 634MB gap of unused memory in the kernel. Paged Pool cannot be set larger than 650MB, and I have never seen it larger than 366MB. In any case, we set PagedPoolSize to 0, which allows the system to set the real size of it appropriately at boot time, then rebooted again. This time we got about 130,000 free PTEs!
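If you want to check your own servers for the same problem, something like the following Python sketch could flag statically sized values. It is an assumption-laden illustration, not a supported tool: the function names are made up, `read_memory_management_values` only works on Windows (it uses the standard `winreg` module), and the audit simply treats any non-zero SystemPages or PagedPoolSize as worth a look.

```python
MEMORY_MANAGEMENT_PATH = (
    r"System\CurrentControlSet\Control\Session Manager\Memory Management"
)
# Values that, per this post, should normally be 0 (system-managed).
SYSTEM_MANAGED_VALUES = ("SystemPages", "PagedPoolSize")


def audit_memory_management(values: dict) -> list:
    """Return a finding for each value that is set to a static size."""
    findings = []
    for name in SYSTEM_MANAGED_VALUES:
        value = values.get(name, 0)  # a missing value is treated as default
        if value != 0:
            findings.append(f"{name} is {value}; expected 0 (system-managed)")
    return findings


def read_memory_management_values(names=SYSTEM_MANAGED_VALUES) -> dict:
    """Read the named values from the local registry (Windows only)."""
    import winreg  # imported here so the audit above runs on any platform

    result = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, MEMORY_MANAGEMENT_PATH) as key:
        for name in names:
            try:
                result[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not present; the audit treats it as default
    return result


if __name__ == "__main__":
    # The values found on the customer's server in this case.
    customer = {"SystemPages": 798720, "PagedPoolSize": 1107296255}
    for finding in audit_memory_management(customer):
        print(finding)
    # PagedPoolSize is expressed in bytes; convert it to MB for readability.
    print(f"PagedPoolSize is about {customer['PagedPoolSize'] / 2**20:.0f} MB")
```

On a Windows machine you could pass the result of `read_memory_management_values()` straight into `audit_memory_management()`. I kept the registry read separate so the audit logic can be tried against values copied from a customer's server.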
I monitored the system the rest of the day and it remained functional the entire time. The customer agreed that a lack of PTEs was the most likely cause of the problem and we closed the case.
Root Cause: The most likely root cause of this problem was the PagedPoolSize registry key being set to 1107296255 (1GB). The customer does not know how or why the key was inappropriately set. We speculate that one of the video card drivers set it incorrectly, but we do not have proof of that and I want to remain objective in my findings by only reporting the facts.
Solution: In this case, we had three changes that could have had a cumulative effect. We changed SystemPages from 798720 to 0, PagedPoolSize from 1107296255 [1GB] to 0, and created the DynamicMemory key and set it to 1 [1GB].
In conclusion, if you are running 32-bit, then keep a close eye on virtual memory resources and migrate to 64-bit as soon as you can. It will save you an enormous amount of time and energy when troubleshooting problems. For more information on PTE troubleshooting, check out our blog entry at: http://blogs.technet.com/cotw/archive/2008/04/07/symptoms-lack-of-free-system-page-table-entries-ptes-and-system-wide-delays-i-o-request-failures.aspx
The Performance Analysis of Logs (PAL) tool (a log analysis tool I wrote in collaboration with other Microsoft employees) advises when System PTEs are low. Download it for free at: http://pal.codeplex.com.
If you would like to learn more about Windows architecture and performance analysis, then consider the Vital Signs workshop offered by my organization, Microsoft Premier Field Engineering (PFE).
I hope you enjoyed this blog entry. Tune in again for another Windows troubleshooting adventure!
If you would like to use Microsoft Premier Field Engineering (PFE) resources (onsite problem solving, training, risk assessments, etc) and if you have a Microsoft Premier Support Agreement, then contact your Microsoft Technical Account Manager (TAM). If you do not have a Microsoft Premier Support Agreement, then go to http://www.microsoft.com/services/microsoftservices/srv_premier.mspx for more info.