The Case of the Disappearing Memory

This article has been contributed by Vasco Preto, a Premier Field Engineer with Microsoft Portugal. In this very interesting case study, he shows us how the file system cache can sometimes ‘steal’ vast amounts of physical RAM. Read on!


I was recently involved in troubleshooting a production problem reported by a customer, which revolved around the strange disappearance of available memory from production servers. Here’s a quick recap of the exact symptoms, followed by the analysis and diagnosis that resulted from the troubleshooting effort.

Scenario

The client presented a specific case where a production server (running 64-bit Windows Server 2008 R2) was simply running out of available memory without any apparent reason. After an initial assessment it became obvious that:

  1. The server’s available memory was in fact being consumed for unknown reasons.
  2. This strange memory consumption, which was effectively exhausting all available memory, was happening at what seemed to be a regularly scheduled timeframe.
  3. Whenever this issue appeared, the server’s 16GB of physical memory was totally consumed by an unknown process, leaving the server almost depleted of any available memory.

Analysis

Going through the available performance counters, namely Memory\Available MBytes, one could see a drastic drop in this key indicator (as shown in the next picture). As a result, several application logs were reporting errors caused by the inability to allocate more memory on the server.

Perfmon - Available MB trend

Interestingly, when we looked at the Process\Private Bytes counter for all running processes, we could not see where the memory was being used. It simply seemed to have disappeared.

However, when we turned our focus to the Memory\System Cache Resident Bytes counter, we came up with a very interesting graph (quite similar to the previous one, but inverted).

Perfmon - System Cache Resident Bytes trend

As a side note, and from the available performance counter description, the counter Memory\System Cache Resident Bytes refers to “… the size, in bytes, of the portion of the system file cache which is currently resident and active in physical memory. The System Cache Resident Bytes and Memory\Cache Bytes counters are equivalent. This counter displays the last observed value only; it is not an average.”

This meant that almost all memory was actually being consumed by something that was driving up system cache usage, which is not reported in the Process\Private Bytes counter of any running process.
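The mirror-image relationship between the two counters can also be checked numerically rather than by eye. The following is a minimal sketch, assuming the two counters have been exported as equal-length series; the sample values are purely illustrative (they are not the customer's data) and the helper name `pearson` is our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Illustrative samples only, in MB, taken at the same instants.
available_mb = [9000, 7500, 5200, 2800, 900, 300]
cache_resident_mb = [2000, 3400, 5700, 8100, 10000, 10700]

r = pearson(available_mb, cache_resident_mb)
print(f"correlation: {r:.3f}")
```

A coefficient close to -1 would support the reading that the memory leaving Available MBytes is the same memory showing up in the system cache.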

Analyzing the scenario with the RAMMap utility, we get the following breakdown of memory usage by process (which only confirms what was previously stated, as no process, or set of processes, comes even remotely close to the elusive 16GB of memory):

RAMMap - display by process

However, using RAMMap to drill down on memory usage by type/purpose/trait, we get the following very interesting and distinct picture:

RAMMap - display by usage

Through this picture we can clearly see that there are approximately 11GB of runaway memory being reported as Metafile and only approximately 3GB of memory being reported as Process Private. And as expected, Unused memory was extremely low (around 180MB).
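To put those approximate figures in proportion, here is a small sketch of the arithmetic. It uses only the rounded numbers quoted above (16GB total, about 11GB Metafile, about 3GB Process Private, about 180MB Unused); the function name `summarize` is our own, and the remainder simply stands for the RAMMap categories not quoted in this article.

```python
MB_PER_GB = 1024


def summarize(total_mb, categories_mb):
    """Return (percent of RAM per category, MB not covered by any listed category)."""
    shares = {name: 100.0 * mb / total_mb for name, mb in categories_mb.items()}
    remainder_mb = total_mb - sum(categories_mb.values())
    return shares, remainder_mb


total_mb = 16 * MB_PER_GB
# Rounded figures quoted in the article, converted to MB.
rammap_mb = {
    "Metafile": 11 * MB_PER_GB,
    "Process Private": 3 * MB_PER_GB,
    "Unused": 180,
}

shares, other_mb = summarize(total_mb, rammap_mb)
for name, pct in shares.items():
    print(f"{name:16s} {pct:5.1f}%")
print(f"Other categories: {other_mb} MB")
```

With these rounded inputs, Metafile alone accounts for roughly two thirds of physical RAM, which is why the per-process view could never add up to the missing memory.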

Considering that we were facing a consumption of memory that was following a specific scheduled pattern, and that memory was being used through System Cache, we decided to look into scheduled tasks to see if something suspicious could be tracked to that specific timeframe.

Interestingly there was an anti-virus scan that was scheduled to run at precisely these problematic time periods. So the only question left was whether this action was actually using so much System Cache memory. By reviewing this scheduled task’s configuration, we determined that the anti-virus process was configured to execute a scan over a specific storage area that hosted mailboxes.
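One way to confirm that the drops really follow a schedule is to bucket the low-memory samples by hour of day and see whether they cluster. This is a minimal sketch under our own assumptions: the timestamps, values and the 500MB threshold are hypothetical, and `low_memory_hours` is an illustrative helper, not part of any tool used in this case.

```python
from collections import Counter
from datetime import datetime


def low_memory_hours(samples, threshold_mb):
    """Count, per hour of day, the samples whose Available MBytes fell below threshold_mb."""
    hours = Counter()
    for timestamp, available_mb in samples:
        if available_mb < threshold_mb:
            hours[timestamp.hour] += 1
    return hours


# Hypothetical (timestamp, Available MBytes) samples for illustration only.
samples = [
    (datetime(2012, 1, 9, 1, 55), 9100),
    (datetime(2012, 1, 9, 2, 10), 450),
    (datetime(2012, 1, 9, 2, 40), 210),
    (datetime(2012, 1, 9, 14, 0), 8800),
    (datetime(2012, 1, 10, 2, 15), 380),
    (datetime(2012, 1, 10, 9, 30), 9300),
]

hours = low_memory_hours(samples, threshold_mb=500)
print(hours.most_common(1))
```

If one hour dominates the count across several days, that hour is the natural place to start when comparing against the list of scheduled tasks.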

After a few experimental steps, we discovered that simply by taking this storage offline, we would see an immediate release of the memory allocated as System Cache.

Faced with these facts, we considered whether we were dealing with some sort of bug in the driver layer that was reading data from the storage area, one that would allow data to be placed in System Cache without any sort of limit (for as long as memory was available) and without any cache clean-up procedure in place. Would this make any sense?

After some research we found some interesting documentation on 64-bit systems: in these systems there isn’t a parameter that allows an administrator to configure the maximum amount of data that is placed in System Cache, although one can create a workaround by using the API SetSystemFileCacheSize.

The links here and here describe this precise situation (which is quite different from what happens in 32-bit systems, where a configuration parameter does exist). Within the information contained in them, one can even find a link to a service (including source code and binaries) that allows you to limit the memory used by System Cache by using the previously mentioned SetSystemFileCacheSize API call.
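For readers who want to see what such a call looks like, here is a minimal sketch using Python's ctypes. It is not the service mentioned above, only an illustration under stated assumptions: the helper names (`mib`, `get_file_cache_limits`, `set_file_cache_limits`) are our own, the flag value is taken from the Windows SDK headers, and the code only works on Windows. The sketch also does not enable the SeIncreaseQuotaPrivilege privilege that SetSystemFileCacheSize requires, so the set call is expected to fail unless the calling process has already enabled it.

```python
import ctypes
import sys

# Flag value from the Windows SDK headers: enforce the maximum as a hard limit.
FILE_CACHE_MAX_HARD_ENABLE = 0x1


def mib(n):
    """Convert mebibytes to bytes."""
    return n * 1024 * 1024


def _kernel32():
    """Load kernel32.dll, refusing to run on anything other than Windows."""
    if sys.platform != "win32":
        raise OSError("the system file cache APIs are only available on Windows")
    return ctypes.WinDLL("kernel32", use_last_error=True)


def get_file_cache_limits():
    """Return (min_bytes, max_bytes, flags) as reported by GetSystemFileCacheSize."""
    fn = _kernel32().GetSystemFileCacheSize
    fn.argtypes = [
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_uint32),
    ]
    fn.restype = ctypes.c_int  # Win32 BOOL
    min_size = ctypes.c_size_t()
    max_size = ctypes.c_size_t()
    flags = ctypes.c_uint32()
    if not fn(ctypes.byref(min_size), ctypes.byref(max_size), ctypes.byref(flags)):
        raise ctypes.WinError(ctypes.get_last_error())
    return min_size.value, max_size.value, flags.value


def set_file_cache_limits(min_bytes, max_bytes):
    """Ask Windows to cap the system file cache working set at max_bytes."""
    fn = _kernel32().SetSystemFileCacheSize
    fn.argtypes = [ctypes.c_size_t, ctypes.c_size_t, ctypes.c_uint32]
    fn.restype = ctypes.c_int  # Win32 BOOL
    if not fn(min_bytes, max_bytes, FILE_CACHE_MAX_HARD_ENABLE):
        raise ctypes.WinError(ctypes.get_last_error())
```

A call such as `set_file_cache_limits(mib(100), mib(4096))` would request a 4GB hard ceiling; those sizes are hypothetical and any real values should be chosen after testing against the server's actual workload.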

Conclusion

So the conclusion is quite interesting: the anti-virus was forcing a massive read over the storage area, which was placing large amounts of data in System Cache. In 64-bit systems there isn’t a configuration parameter to limit it directly, probably due to the assumption that there is no reason for the operating system to be that restrictive as these systems usually have a lot more memory than the constrained world of 32-bit systems. And this led to the disappearance of virtually all available memory!

We hope you enjoyed reading this as much as we enjoyed bringing it to you. Please do leave a comment, and take a moment to rate this post!


Posted by Arvind Shyamsundar, MSPFE Editor.

Comments
  • Perfect write up - Thanks!

  • Thanks for the article.  Yep we got bit by this one at work and the solution thus far has been the service that calls the SetSystemFileCacheSize API.  A kludgy fix but ok for now.  Hopefully this will be remedied in Windows soon.