kennymaita

PFE Experiences and more

Top Customer Misconceptions about Software Problems


Roberto Alexis Farah is an exceptional Senior Premier Engineer at Microsoft with a lot of experience troubleshooting applications in the field. He has written a very realistic post about the misconceptions that customers around the globe have about their software problems.

I strongly recommend following his blog at http://blogs.msdn.com/b/debuggingtoolbox/ because it is a valuable learning resource.

Here are some of the misconceptions he describes, and I agree with him, about how customers believe a software problem should be handled:

 

1- For reactive incidents: “Bring the engineer onsite because it is going to be easier to isolate the problem.”

Tst… tst… tst… this is the most common misconception I’ve heard. Consider this: a customer’s application is having problems, and after our scoping call I have a clear idea of the actions to be taken, including collecting dump files. OK, so in the extreme case I fly to the other end of the country, with a 3-hour time zone difference. I arrive late on the day before my visit, and the following morning I go to see my customer. I am tired, as usually happens on trips like this, and no, I won’t be more productive than if I were working remotely.

Let me explain: most complex problems require deep debugging sessions. Collecting the necessary information is the easy part and can be done remotely or by the customer. However, most of my time, several hours or even days, is going to be dedicated to debugging the dump files. Being onsite means I do this in less-than-ideal conditions: without access to my books, sometimes without access to our private symbols, and being constantly interrupted to give status or participate in meetings. On top of that, I’m usually tired.

Now, if I’m working remotely it is going to be cheaper for the customer, more productive for me and we can start working earlier because I don’t need to spend time travelling.

I do recognize that it is sometimes more effective to be onsite: for political reasons, to act as the eyes and ears of a remote engineer, or to get a better understanding of weird problems that are hard to grasp by e-mail or phone. However, these situations are exceptions, not the rule.

Also, some proactive incidents require onsite work, but their nature is totally different from that of reactive incidents. I would say that proactive work benefits more from onsite than from remote delivery.

2- “We need a Code Review because our application has performance problems.”

Sometimes I receive requests for a Code Review when what the customer needs, in reality, is Problem Isolation. The Code Review is what we call a proactive offering. The goal is to review the source code and point out the parts that don’t follow Best Practices, that represent security holes, or that can be optimized for speed.

So after the Code Review the customer receives a nice report which has explanations about the potential problems and actions to solve them.

Therefore, if the customer’s application is having problems a Code Review is unlikely to be the answer.

Let me explain: imagine a scenario where an ASP.NET application is suffering from poor performance. If I code review the application I may find methods that can be optimized for speed. However, if the application is slow because, for example, there is a bottleneck on the database side or on the network, the performance gain from the code review is not going to solve the problem. Worse than that, it may not even be noticeable!

The Code Review is great when you want to make sure your application doesn’t have potential problems that could be avoided by implementing Best Practices or if you think the application can be further optimized to gain more speed. However, you’ll only be able to measure the speed gain if your application was running fine in the first place, without the effect of external bottlenecks. Moreover, usually the performance gain is not as significant as removing the bottlenecks.
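To put rough numbers on this, here is a minimal sketch, assuming a request whose time splits into application code and an external database wait. The figures (200 ms of code, 1800 ms of database wait) and the function name are made up for illustration; they are not measurements from any real case.

```python
def total_after_speedup(code_ms: float, external_ms: float, code_speedup: float) -> float:
    """Request time after making only the application code `code_speedup` times faster."""
    return code_ms / code_speedup + external_ms


# Hypothetical request: 200 ms in application code, 1800 ms waiting on the database.
before = total_after_speedup(200.0, 1800.0, 1.0)
after_review = total_after_speedup(200.0, 1800.0, 2.0)  # code made twice as fast
after_db_fix = total_after_speedup(200.0, 900.0, 1.0)   # database wait halved instead

print(f"before:            {before:.0f} ms")
print(f"after code review: {after_review:.0f} ms")
print(f"after DB fix:      {after_db_fix:.0f} ms")
```

With these assumed numbers, doubling the speed of the code takes the request from 2000 ms to 1900 ms, while halving the database wait takes it to 1100 ms. That is the sense in which the gain from a Code Review can go unnoticed while an external bottleneck is still in place.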

3- “So after fixing this problem I suppose the performance/memory issue is going to be normalized, right?”

I wish most of my cases were like this, with only one problem causing the symptom. The reality is different, though: usually several problems contribute to symptoms like slow performance, hangs or memory issues. For crashes the relationship is usually one to one, but not always.

What does this mean? It means that after solving the most significant and visible problem we need to monitor the application because other minor problems could be causing the same symptom and after fixing the main bottleneck these other problems will become visible and easier to isolate. It’s a cyclical process. In case you would like to know more about it read this old post here.

4- “We’re using .NET so I don’t need to worry about memory management.”

Again, this is the real world, folks! :-) If you have a pure .NET application, I agree. However, most commercial applications have some kind of interaction with the Native World, like C DLLs, COM objects or API calls.

The CLR is great at managing memory… for pure .NET applications. If your application interacts with native code, it is the developer’s responsibility to make sure that resources are released/closed.

Just as a curiosity, memory issues are, in my opinion, the most common problem for commercial .NET applications because of the interaction with native code.

5- After asking the customer to collect 3 dump files from a specific web application: “I just collected the dump files you asked for, plus dump files from all other web applications, SQL Profiler logs, Performance Monitor logs and Event Viewer logs. I guess this is enough information to solve the issue.”

This is the classic “too much information” problem. More often than you might think, the additional information is not going to help us isolate the problem. In other words, don’t assume that more is always better.

What is most important for us is to get the right information. One dump file from your problematic application collected when experiencing the symptom is very valuable. Five dump files from the problematic application collected when it was running fine are useless to us.

Got it? :-)

6- “We need an Architecture Review because our application has performance problems.”

This is similar to item 2 above. An Architecture Review is not the best approach to solving immediate problems, and it may not even be the right approach for most application problems, because these problems are usually too granular. The customer’s application may be sound from the architectural point of view and still have problems that are unrelated to the way the architecture was designed.

Examples? Ok, imagine that you haven’t installed an important update for the .NET Framework which is impacting your application. Or that your SharePoint application is not releasing the internal SharePoint objects it’s using. An Architecture Review is not going to uncover problems like that.

7- “We need an IIS Engineer because my W3WP.EXE is consuming too much memory. It may be an IIS bug.”

Classic problem: Web Application/SharePoint with slow performance.

This is just an example of how people from different specialties react to the problem; don’t take it too seriously:

End user: I think the browser has a problem, the application is slow.

IIS Administrator: I think the problem is the ASP.NET application.

Developers: The ASP.NET application is running fine; the problem is probably on the database side.

DBA: The SQL Server is running fine; I think the bottleneck is network related.

Network Administrator: The network doesn’t have problems. <next person to blame>

My goal is to isolate the problem, so while people are speculating about the potential cause it is kind of normal to see dialogs like the ones above.

Jokes aside, sometimes customers call us because of problems with web applications and they think the problem is IIS. As far as I know, IIS is the second most blamed product, just behind Windows itself.

Most of the time (I’m talking about 99% of the time) the problem is on the web application side, so it is not an IIS problem; IIS is just the host for the web application.

Windows is blamed for similar reasons; I’ll explain more below.

8- With SharePoint applications and crashes, performance problems, hangs or memory issues: “You’re not a SharePoint engineer, we need a SharePoint engineer to help us.”

It doesn’t matter. If your application requires debugging you don’t need an engineer who knows how to use/install the product. You need an engineer that knows the internals of the applications and how to debug them. The good news is that this knowledge is not application dependent. SharePoint is, after all, like any other application from the debugging perspective.

Think about it, a Microsoft Engineer can debug your application even if he/she has never seen your application before. The same applies to our own products.

If at some point we isolate the problem to one of our products then we involve the engineer from that particular team because he/she has intimate knowledge of the problems and bugs from the product he/she supports.

9- “I just ran !clrstack and most threads running for a long time are trying to retrieve data from the database. The bottleneck is probably on the database side.”

Let me tell you something I used to say to our new engineers and to those who want to learn more about .NET Debugging: if you want to excel at .NET Debugging you must learn Native code debugging, which implies some knowledge of C/C++ programming too.

Don’t believe me? Ask your favorite bloggers who write about .NET Debugging whether .NET debugging is all they know.

That being said, !clrstack is the favorite command of people learning .NET Debugging. It’s cool; you can see the managed side of the call stack, which is usually higher level than the native side. However, sometimes you still need to see the native side, or to debug more on the native side than on the managed side.

Here is a typical example where the customer thinks there is a potential bottleneck on the database side. The conclusion is based on several call stacks like this:

Managed call stack:

OS Thread Id: 0x610 (132)

ESP EIP

4cdde5b4 7c82860c [InlinedCallFrame: 4cdde5b4] <Module>.SNIReadSync(SNI_Conn*, SNI_Packet**, Int32)

4cdde5b0 0cebcd25 SNINativeMethodWrapper.SNIReadSync(System.Runtime.InteropServices.SafeHandle, IntPtr ByRef, Int32)

4cdde620 0cebc70a System.Data.SqlClient.TdsParserStateObject.ReadSni(System.Data.Common.DbAsyncResult, System.Data.SqlClient.TdsParserStateObject)

4cdde658 0cebc5ab System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket()

4cdde668 0db5d058 System.Data.SqlClient.TdsParserStateObject.ReadBuffer()

4cdde674 0db5cfd6 System.Data.SqlClient.TdsParserStateObject.ReadByte()

System.Data.SqlClient.TdsParser.Run(System.Data.SqlClient.RunBehavior, System.Data.SqlClient.SqlCommand, System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.BulkCopySimpleResultSet, System.Data.SqlClient.TdsParserStateObject)

4cdde6ec 0de62f5a System.Data.SqlClient.SqlDataReader.ConsumeMetaData()

4cdde700 0de62e69 System.Data.SqlClient.SqlDataReader.get_MetaData()

System.Data.SqlClient.SqlCommand.FinishExecuteReader(System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.RunBehavior, System.String)

System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, Boolean)

4cdde7c8 0d26c7db System.Data.SqlClient.SqlCommand.RunExecuteReader(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, System.String, System.Data.Common.DbAsyncResult)

4cdde80c 0d26cc65 System.Data.SqlClient.SqlCommand.RunExecuteReader(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, System.String)

4cdde81c 0d26bb69 System.Data.dll!Unknown

4cdde85c 0d26afff System.Data.SqlClient.SqlCommand.ExecuteReader(System.Data.CommandBehavior, System.String)

4cdde8a0 0d26af5b System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(System.Data.CommandBehavior)

4cdde8a4 0d26af3b System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(System.Data.CommandBehavior)

4cdde8ac 0c77645b

. . .

. . .

. . .

At first glance the call stack looks like the thread is waiting for the database… but let’s see what the native side tells us.

Native call stack:

ChildEBP RetAddr

4cdde39c 7c827d29 ntdll!KiFastSystemCallRet

4cdde3a0 77e61d1e ntdll!NtWaitForSingleObject+0xc

4cdde410 79e8c639 kernel32!WaitForSingleObjectEx+0xac

4cdde454 79e8c56f mscorwks!CLREventWaitHelper+0x2f

4cdde4a4 79e8c58e mscorwks!CLREvent::WaitEx+0x117

4cdde4b8 79f74abb mscorwks!CLREvent::Wait+0x17

4cdde4cc 79f73e3a mscorwks!SVR::GCHeap::WaitUntilGCComplete+0x34

4cdde508 79ef6250 mscorwks!Thread::RareDisablePreemptiveGC+0x1b4

4cdde5a4 0cebcd3b mscorwks!JIT_RareDisableHelper+0x16

4cdde614 0cebc70a System_Data!SNINativeMethodWrapper::SNIReadSync+0xab

4cdde64c 0cebc5ab System_Data!System::Data::SqlClient::TdsParserStateObject::ReadSni+0x72

4cdde660 0db5d058 System_Data!System::Data::SqlClient::TdsParserStateObject::ReadNetworkPacket+0x5b

4cdde66c 0db5cfd6 System_Data!System::Data::SqlClient::TdsParserStateObject::ReadBuffer+0x28

4cdde678 0db5c5fd System_Data!System::Data::SqlClient::TdsParserStateObject::ReadByte+0x16

4cdde6d4 0de62f5a System_Data!System::Data::SqlClient::TdsParser::Run+0x6d

4cdde6f8 0de62e69 System_Data!System::Data::SqlClient::SqlDataReader::ConsumeMetaData+0x22

4cdde724 0de6280f System_Data!System::Data::SqlClient::SqlDataReader::get_MetaData+0x51

4cdde754 0d26cbe4 System_Data!System::Data::SqlClient::SqlCommand::FinishExecuteReader+0xcf

4cdde7b4 0d26c7db System_Data!System::Data::SqlClient::SqlCommand::RunExecuteReaderTds+0x3a4

4cdde7f4 0d26cc65 System_Data!System::Data::SqlClient::SqlCommand::RunExecuteReader+0xf3

4cdde850 0d26bb69 System_Data!System::Data::SqlClient::SqlCommand::RunExecuteReader+0x15

4cdde850 0d26afff 0xd26bb69

4cdde894 0d26af5b System_Data!System::Data::SqlClient::SqlCommand::ExecuteReader+0x8f

4cdde8a4 0d26af3b System_Data!System::Data::SqlClient::SqlCommand::ExecuteDbDataReader+0xb

4cdde8a4 0c77645b System_Data!System::Data::Common::DbCommand::System::Data::IDbCommand::ExecuteReader+0xb

4cdde8d0 0c7763c8

. . .

. . .

. . .

From the native side we can see the thread waiting for the Garbage Collector to finish. In fact, this specific application had Out of Memory exceptions and the memory was highly fragmented, so the Garbage Collector was spending more time than normal to compact the memory and finish the GC process.

In other words, the database side is not a potential problem here and we know that based on the native call stack. ;-)

Now compare the call stacks above with the call stacks below from another dump file which has, in fact, threads trying to retrieve data from the database, indicating a potential bottleneck on the database side.

Managed call stack:

Thread Id: 0x30e8 (50)

Child SP IP Call Site

1d71d60c 7779f871 [InlinedCallFrame: 1d71d60c]

1d71d608 6dee5026 DomainNeutralILStubClass.IL_STUB_PInvoke(SNI_Conn*, SNI_Packet**, Int32)

*** WARNING: Unable to verify checksum for System.Data.ni.dll

1d71d60c 6ded261f [InlinedCallFrame: 1d71d60c] <Module>.SNIReadSync(SNI_Conn*, SNI_Packet**, Int32)

1d71d650 6ded261f SNINativeMethodWrapper.SNIReadSync(System.Runtime.InteropServices.SafeHandle, IntPtr ByRef, Int32)

1d71d690 6ded23b3 System.Data.SqlClient.TdsParserStateObject.ReadSni(System.Data.Common.DbAsyncResult, System.Data.SqlClient.TdsParserStateObject)

1d71d6c8 6ded22d4 System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket()

1d71d6d8 6ded431f System.Data.SqlClient.TdsParserStateObject.ReadBuffer()

1d71d6e4 6ded42f6 System.Data.SqlClient.TdsParserStateObject.ReadByte()

1d71d6f0 6ded3a17 System.Data.SqlClient.TdsParser.Run(System.Data.SqlClient.RunBehavior, System.Data.SqlClient.SqlCommand, System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.BulkCopySimpleResultSet, System.Data.SqlClient.TdsParserStateObject)

1d71d74c 6decc8a2 System.Data.SqlClient.SqlDataReader.ConsumeMetaData()

1d71d760 6decc367 System.Data.SqlClient.SqlDataReader.get_MetaData()

1d71d78c 6decaaf1 System.Data.SqlClient.SqlCommand.FinishExecuteReader(System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.RunBehavior, System.String)

1d71d7c4 6deca721 System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, Boolean)

1d71d810 6deca513 System.Data.SqlClient.SqlCommand.RunExecuteReader(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, System.String, System.Data.Common.DbAsyncResult)

1d71d854 6deca451 System.Data.SqlClient.SqlCommand.RunExecuteReader(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, System.String)

1d71d874 6deca21e System.Data.SqlClient.SqlCommand.ExecuteReader(System.Data.CommandBehavior, System.String)

1d71d8b4 6deca03d System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(System.Data.CommandBehavior)

1d71d8b8 6decb3fb System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(System.Data.CommandBehavior)

1d71d8c0 6dedb079 System.Data.Common.DbDataAdapter.FillInternal(System.Data.DataSet, System.Data.DataTable[], Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)

1d71d914 6dedafa0 System.Data.Common.DbDataAdapter.Fill(System.Data.DataSet, Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)

1d71d958 6e257a1d System.Data.Common.DbDataAdapter.Fill(System.Data.DataSet, System.String)

1d71d98c 0157daf4

. . .

. . .

. . .

Native call stack:

ChildEBP RetAddr

1d71d3e0 756d0816 ntdll!ZwWaitForSingleObject+0x15

1d71d44c 75ca1184 KERNELBASE!WaitForSingleObjectEx+0x98

1d71d464 75ca1138 kernel32!WaitForSingleObjectExImplementation+0x75

1d71d478 6dce9ec2 kernel32!WaitForSingleObject+0x12

1d71d5d4 6dcd8db3 System_Data!Tcp::ReadSync+0x187

1d71d5f4 6dee5026 System_Data!SNIReadSync+0x57

1d71d644 6ded261f System_Data_ni!DomainNeutralILStubClass.IL_STUB_PInvoke(SNI_Conn*, SNI_Packet**, Int32)+0x46

1d71d684 6ded23b3 System_Data_ni!SNINativeMethodWrapper.SNIReadSync(System.Runtime.InteropServices.SafeHandle, IntPtr ByRef, Int32)+0x4f

1d71d6bc 6ded22d4 System_Data_ni!System.Data.SqlClient.TdsParserStateObject.ReadSni(System.Data.Common.DbAsyncResult, System.Data.SqlClient.TdsParserStateObject)+0xa3

1d71d6d0 6ded431f System_Data_ni!System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket()+0x24

1d71d6dc 6ded42f6 System_Data_ni!System.Data.SqlClient.TdsParserStateObject.ReadBuffer()+0x1f

1d71d6e8 6ded3a17 System_Data_ni!System.Data.SqlClient.TdsParserStateObject.ReadByte()+0x46

1d71d734 6decc8a2 System_Data_ni!System.Data.SqlClient.TdsParser.Run(System.Data.SqlClient.RunBehavior, System.Data.SqlClient.SqlCommand, System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.BulkCopySimpleResultSet, System.Data.SqlClient.TdsParserStateObject)+0x67

1d71d758 6decc367 System_Data_ni!System.Data.SqlClient.SqlDataReader.ConsumeMetaData()+0x22

1d71d784 6decaaf1 System_Data_ni!System.Data.SqlClient.SqlDataReader.get_MetaData()+0x57

1d71d7b4 6deca721 System_Data_ni!System.Data.SqlClient.SqlCommand.FinishExecuteReader(System.Data.SqlClient.SqlDataReader, System.Data.SqlClient.RunBehavior, System.String)+0xe1

1d71d7fc 6deca513 System_Data_ni!System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, Boolean)+0x151

1d71d83c 6deca451 System_Data_ni!System.Data.SqlClient.SqlCommand.RunExecuteReader(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, System.String, System.Data.Common.DbAsyncResult)+0xa3

1d71d860 6deca21e System_Data_ni!System.Data.SqlClient.SqlCommand.RunExecuteReader(System.Data.CommandBehavior, System.Data.SqlClient.RunBehavior, Boolean, System.String)+0x21

1d71d8a8 6deca03d System_Data_ni!System.Data.SqlClient.SqlCommand.ExecuteReader(System.Data.CommandBehavior, System.String)+0x8e

1d71d8b8 6decb3fb System_Data_ni!System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(System.Data.CommandBehavior)+0xd

1d71d8b8 6dedb079 System_Data_ni!System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(System.Data.CommandBehavior)+0xb

1d71d8f4 6dedafa0 System_Data_ni!System.Data.Common.DbDataAdapter.FillInternal(System.Data.DataSet, System.Data.DataTable[], Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)+0x91

1d71d93c 6e257a1d System_Data_ni!System.Data.Common.DbDataAdapter.Fill(System.Data.DataSet, Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)+0x140

1d71d980 0157daf4 System_Data_ni!System.Data.Common.DbDataAdapter.Fill(System.Data.DataSet, System.String)+0x5d

Conclusion: If you analyze only the managed side of the stack you may reach the wrong conclusion. If you want to improve your .NET Debugging skills, learn more about Native debugging.
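The comparison between the two native stacks can be sketched as a small text check. This is only an illustration, assuming the native stack has already been saved as text: the function name and the two marker strings (`WaitUntilGCComplete` and `Tcp::ReadSync`, both taken from the stacks shown above) are my own choice and not a complete rule set, so treat the result as a hint for where to look and not as a diagnosis.

```python
def classify_wait(native_stack: str) -> str:
    """Guess what a blocked thread is waiting on from its native stack text."""
    # Frames are listed innermost first, so the first marker found is the
    # one closest to the actual wait.
    for line in native_stack.splitlines():
        if "WaitUntilGCComplete" in line:
            return "garbage collector"
        if "Tcp::ReadSync" in line:
            return "network read (database reply)"
    return "unknown"


# Abbreviated copies of the two native stacks from this item.
gc_stack = """\
4cdde410 79e8c639 kernel32!WaitForSingleObjectEx+0xac
4cdde4cc 79f73e3a mscorwks!SVR::GCHeap::WaitUntilGCComplete+0x34
4cdde614 0cebc70a System_Data!SNINativeMethodWrapper::SNIReadSync+0xab
"""
db_stack = """\
1d71d478 6dce9ec2 kernel32!WaitForSingleObject+0x12
1d71d5d4 6dcd8db3 System_Data!Tcp::ReadSync+0x187
1d71d5f4 6dee5026 System_Data!SNIReadSync+0x57
"""

print(classify_wait(gc_stack))  # garbage collector
print(classify_wait(db_stack))  # network read (database reply)
```

Both threads have SNIReadSync on the stack, which is why the managed view looks the same in the two dumps. Only the frames above it tell the two situations apart.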

Here is a list of books about .NET Debugging, User Mode Debugging and Kernel Debugging.

10- “My two servers are identical but the issue happens just on server XYZ.”

When troubleshooting scenarios like that never assume the servers are identical. Instead, prove it!

When I work on cases like this I like to run the MPSReport/SPSReport tool to collect all information from each server and compare them.

In the few support cases where the servers really were identical, it turned out that the application was accessing just one of them, so that server was being overloaded.

Lesson: Trust… but verify. :-)
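Once each server’s report has been reduced to name/value pairs, the “prove it” step can be a plain diff. Below is a minimal sketch under that assumption. Parsing the MPSReport/SPSReport output is not shown, and the function name, keys and values are invented for illustration.

```python
def diff_inventories(a: dict, b: dict) -> dict:
    """Return every key whose value differs, or is missing, between two inventories."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Hypothetical inventories; real ones would come from each server's report.
server_a = {"framework_build": "build-2", "data_driver": "v5", "max_pool_size": "100"}
server_xyz = {"framework_build": "build-1", "data_driver": "v5"}

differences = diff_inventories(server_a, server_xyz)
for key, (left, right) in differences.items():
    print(f"{key}: {left!r} vs {right!r}")
```

With the made-up data above, the diff reports `framework_build` (different values) and `max_pool_size` (missing on server XYZ), and stays silent about `data_driver`, which matches on both servers.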

11- “Remember that application crash I told you about? I’m uploading a Kernel Dump so you can analyze it.”

This is very similar to the “Too much information” problem discussed in item 5 above.

If your application is crashing you want to collect a dump file when the application crashes. If you collect a dump file any other time it won’t have information from the exception. If you force a huge Kernel Dump file to be collected you will end up with a huge dump file from all your machine’s processes but, again, that dump file won’t have information about the exception crashing your application.

Too much is not always better but having the right information collected during the right time is priceless. :-)

12- “From the Event Log I can see the exception that crashed my application and the call stack is pointing to Windows. I think this is a Windows bug.”

This is related to item 7 above and is a common misconception. Sometimes call stacks from 2nd chance exceptions (exceptions not handled by your application, thus crashing the app) have DLLs from Windows as the top frames. This is normal and it doesn’t mean that Windows is causing the crash.

Example:

 ChildEBP RetAddr  

0013bcd0 7c90de7a ntdll!KiFastSystemCall+0x2

0013bcd4 7c81cda6 ntdll!NtTerminateProcess+0xc <<< The frames from the top of the stack down to msvcrt!exit belong to NTDLL/Kernel32/MSCorWks/C Runtime. Don’t let these calls fool you.

0013bdd0 7c81cdfe kernel32!_ExitProcess+0x62

0013bde4 79f944b0 kernel32!ExitProcess+0x14

0013c00c 79f2c09a mscorwks!SafeExitProcess+0x11b

0013c018 79eff585 mscorwks!DisableRuntime+0xd1

0013c0a8 79011628 mscorwks!CorExitProcess+0x242

0013c0b8 77c39d3c mscoree!CorExitProcess+0x46

0013c0c4 77c39e78 msvcrt!__crtExitProcess+0x29

0013c0d4 77c39e90 msvcrt!_cinit+0xee

0013c0e8 0e68d21e msvcrt!exit+0x12

0013c580 0e256834 testapp!FuTestInterface::init+0x34 <<< The application call where you should start the investigation.

0013c5a4 0e1d8c01 testapp!WBNARiskReportInterface::getResults+0x442a

0013c5b0 304972dc testapp!XLOPERStrLen+0x7297

Therefore, don’t assume that ntdll or kernel32 caused the problem. The APIs from these Operating System DLLs are being called as a consequence of the exception, which was likely caused by the application. Identify the most recent application method call and use it as your initial investigation point. In our example above this is testapp!FuTestInterface::init. Analyze it and, if necessary, analyze the previous frame and so on.
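The rule of thumb above, skip the OS and runtime frames and start at the first application frame, can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the module list contains just the modules that appear in the example stack, and it assumes frames in the `ChildEBP RetAddr module!symbol` text layout shown above.

```python
# Modules treated as OS/runtime for this sketch (taken from the example stack).
SYSTEM_MODULES = {"ntdll", "kernel32", "mscorwks", "mscoree", "msvcrt"}


def first_application_frame(stack_lines):
    """Return the first 'module!symbol' frame whose module is not an OS/runtime one."""
    for line in stack_lines:
        parts = line.split()
        if len(parts) < 3 or "!" not in parts[2]:
            continue  # header row, or a frame without a resolved symbol
        module = parts[2].split("!", 1)[0].lower()
        if module not in SYSTEM_MODULES:
            return parts[2]
    return None


# Abbreviated copy of the example stack from this item.
stack = [
    "ChildEBP RetAddr",
    "0013bcd0 7c90de7a ntdll!KiFastSystemCall+0x2",
    "0013bde4 79f944b0 kernel32!ExitProcess+0x14",
    "0013c0a8 79011628 mscorwks!CorExitProcess+0x242",
    "0013c0e8 0e68d21e msvcrt!exit+0x12",
    "0013c580 0e256834 testapp!FuTestInterface::init+0x34",
    "0013c5a4 0e1d8c01 testapp!WBNARiskReportInterface::getResults+0x442a",
]

print(first_application_frame(stack))  # testapp!FuTestInterface::init+0x34
```

A real stack may contain third-party DLLs as well, so the module list would need to be adjusted per case. The point is only that the investigation starts below the system frames.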

13- “We collected dump files from that C++ application which is crashing. We think it is a heap corruption so the call stack should indicate the culprit, right?”

Wrong! Unless Page Heap was enabled.

This is less common nowadays because more and more applications are written in .NET; however, back in the days of COM objects and C DLLs, heap corruption was a typical problem.

Basically, a heap corruption happens when an application (think C/C++) allocates a memory buffer and at some point writes before the beginning of the buffer (not common) or past its end (very common).

In a scenario like this, the write past the end of the buffer usually doesn’t cause any immediately visible problem. Better said, it is causing damage but not immediately raising an exception.

When this happens, the stray write may overwrite information in the heap header of a neighboring block, which is used by the Heap Manager. That means the crash happens later, when the application accesses or releases the affected blocks of memory. However, the call stack you get is not the call stack of the code that actually overwrote the heap but the call stack of the code trying to access/release memory from the corrupted heap!

The trick to get call stacks from the method which actually corrupted the heap is to enable Page Heap, restart the application so it can use the new Heap Manager settings and collect a dump file. With this approach you can easily isolate the heap corruption problem.

Page Heap can be enabled using different tools like PageHeap.exe, GFlags.exe, Application Verifier and others. Some Page Heap settings, like Full Page Heap, place an inaccessible guard page right after each memory allocation, so whenever your application writes past the end of the buffer it hits the guard page and you get an Access Violation at the offending write.
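The difference between the two behaviours can be shown with a toy model. To be clear about the assumptions: this is not how the Windows Heap Manager is implemented, and `ToyHeap`, `HEADER`, `copy_name` and `cleanup` are names invented for this sketch. It only imitates the two outcomes described above: in the normal mode the damage is noticed when the neighbouring block is freed, and in the guarded mode the overrun fails at the write itself.

```python
HEADER = b"HDR!"  # stand-in for the heap manager's per-block bookkeeping


class ToyHeap:
    """Tiny model of a heap: blocks sit back to back, each preceded by a header."""

    def __init__(self, guard: bool = False):
        self.memory = bytearray()
        self.blocks = {}     # block start offset -> block size
        self.guard = guard   # True imitates a guard page after each block

    def alloc(self, size: int) -> int:
        self.memory += HEADER + bytes(size)
        start = len(self.memory) - size
        self.blocks[start] = size
        return start

    def write(self, start: int, data: bytes, who: str) -> None:
        if self.guard and len(data) > self.blocks[start]:
            # Guarded mode: the overrun faults at the offending write.
            raise MemoryError(f"access violation in {who}")
        self.memory[start:start + len(data)] = data  # unchecked, like memcpy

    def free(self, start: int, who: str) -> None:
        if self.memory[start - len(HEADER):start] != HEADER:
            # Normal mode: damage is noticed only when the header is used.
            raise MemoryError(f"corrupted header detected in {who}")
        del self.blocks[start]


def run(guard: bool) -> str:
    heap = ToyHeap(guard)
    a, b = heap.alloc(8), heap.alloc(8)
    try:
        heap.write(a, b"X" * 12, who="copy_name")  # the bug: 12 bytes into an 8-byte block
        heap.free(b, who="cleanup")                # innocent code touching the neighbour
    except MemoryError as exc:
        return str(exc)
    return "no error"


print(run(guard=False))  # corrupted header detected in cleanup
print(run(guard=True))   # access violation in copy_name
```

Without the guard, the error names `cleanup`, which did nothing wrong. With the guard, it names `copy_name`, the code that actually wrote past the buffer. That is the reason for enabling Page Heap before collecting the dump.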

Here is a didactic explanation about using GFlags to isolate Heap Corruption problems.

Note: Windows Vista/Windows Server 2008 and Windows 7 have more mechanisms to easily detect heap corruptions and minimize them.

Note 2: In some situations even PageHeap won’t crash with the culprit on the call stack but those cases are rare. (Thanks MarioH :-) )

 

 

 

 

 
