Applies to Exchange 2003
I had a case a couple of weeks ago that I thought I'd write about. What was happening is that the Version Store would run out of memory and a 623 error would be thrown. Version Store buckets allocated would climb from 4 to over 2000 in less than 5 minutes. The store would then roll back its transactions for a bit, recover, run for 10-15 minutes, and repeat the whole cycle.
This is atypical 623 behavior to say the least.
What we ended up doing to fix it was capture adplus dumps, three of them, triggered when Version Store buckets allocated crossed 1600. We captured the dumps at one-minute intervals.
The first dump caught the problem transaction; the last two both captured rollbacks, so this was a quick ramp-up.
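For reference, a hang-mode adplus capture along these lines looks something like the following. The process name is real (store.exe is the Information Store), but the output path is a placeholder for your environment:

```shell
:: Capture a hang-mode dump of the Information Store while the
:: Version Store buckets allocated counter is climbing.
:: The output directory C:\dumps is a placeholder.
cscript adplus.vbs -hang -pn store.exe -o C:\dumps
```

Run it again at one-minute intervals to get a series of dumps like the three described above.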
Turns out the problem was being caused by a bad meeting request being processed over and over again. We tried all kinds of ways to delete the message, all of which caused Version Store buckets allocated to climb. An MFCMapi hard delete ended up doing the trick.
Dan, I, and some other engineers wrote up a blog post, which you can find here, on how to recover from a smashed schema scenario on your Exchange servers.
It's pretty succinct, so I don't have anything to add to it; it's an interesting read though.
Applies to: Exchange 2000/2003
This may seem like a basic thing to some people, but for those who don't know, here goes. This can be easily done by running Exmon, available here.
So download Exmon and fire it up on your Exchange Server.
Go to the By Clientmon tab, where you'll see a column named "Cached Mode Sessions". If you have something other than 0 in that field, your user is connecting in Cached mode.
Hope that helps, I've had the question a few times before.
Ok, so if you have, say, Citrix, or a standard image with Office pre-installed, then someone had to pick an Exchange server to point to for the Outlook profile creation wizard.

Sometimes, in large organizations, teams don't necessarily speak to one another before they make small decisions like which server to point to. The person creating the Office install might pick, say, his home mail server. So when that mail server gets decommissioned years later, this can suddenly cause problems.

How do you fix this? Simple! Glad you asked. Two things need to be done.

1. Establish IP connectivity to the old server name. Easy enough: go into DNS and create a new A record for the old/missing Exchange server, with the IP of the server you'd like this task to point to.

2. Go into ADSIEdit, find the computer object for the target server, right-click, and hit Properties. Scroll down to servicePrincipalName and edit it. Add the following type of record:

exchangeRFR/servername

Give that a little time to replicate around and voila, everything goes back to normal.

Why is step 2 necessary? Kerberos security rearing its ugly head. The target server needs to know it's acting as the old server or it will refuse connections.
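The two steps above can be sketched from the command line as well. The zone, server names, and IP below are placeholders; dnscmd needs the DNS admin tools installed, and setspn can stand in for the ADSIEdit edit if you prefer the command line:

```shell
:: Step 1: create an A record for the old server name, pointing at the
:: IP of the surviving server (zone, names, and IP are placeholders).
dnscmd DNSSERVER /RecordAdd contoso.com OLDSERVER A 10.0.0.25

:: Step 2: register the exchangeRFR SPN for the old name on the
:: target server's computer account (equivalent to the ADSIEdit step).
setspn -A exchangeRFR/OLDSERVER TARGETSERVER
```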
Note that this is a possible work around and may cause corrupt MAPI profiles on your clients. The real fix here is to address the install, or clients configured to a server that no longer exists.
Applies to Exchange 2000, Exchange 2003, Exchange 2007.
So in a Version Store 623 error, the Version Store gets 'clogged', if you will, and fails to process transactions.
624 errors, on the other hand, are caused by a lack of available virtual memory on the server. Sometimes this has no impact and the server corrects itself, but in a memory leak condition this can be a sign that your Exchange server is no longer accepting client connections and is in need of some assistance.
In the particular instance where I have seen this occur, the 624 event comes after a series of errors:
First we throw a MSExchangeDSAccess 2104 event.
Event ID : 2104
Raw Event ID : 2104
Record Nr. : 4802384
Category : None
Source : MSExchangeDSAccess
Type : Error
Generated : 9/7/2008 12:27:27 PM
Written : 9/7/2008 12:27:27 PM
Machine : JAHUMBALABAH
Message : Process STORE.EXE (PID=636). All the DS Servers in domain are not responding.
Shortly thereafter you'll see a MSExchangeDSAccess 2102.
Event ID : 2102
Raw Event ID : 2102
Record Nr. : 4802387
Category : None
Source : MSExchangeDSAccess
Type : Error
Generated : 9/7/2008 12:28:15 PM
Written : 9/7/2008 12:28:15 PM
Machine : JAHUMBALABAH
Message : Process MAD.EXE (PID=2588). All Domain Controller Servers in use are not responding:
JAHUMBALABAH-DC
Then we will see a MSExchangeSA 9152.
Event ID : 9152
Raw Event ID : 9152
Record Nr. : 4802391
Category : None
Source : MSExchangeSA
Type : Error
Generated : 9/7/2008 12:31:15 PM
Written : 9/7/2008 12:31:15 PM
Machine : JAHUMBALABAH
Message : Microsoft Exchange System Attendant reported an error '0x8007000e' in its DS Monitoring thread.
This particular error is an out of memory error. Uh oh.
Then DSAccess has another problem.... a 9154.
Event ID : 9154
Raw Event ID : 9154
Record Nr. : 4802392
Category : None
Source : MSExchangeSA
Type : Error
Generated : 9/7/2008 12:31:20 PM
Written : 9/7/2008 12:31:20 PM
Machine : JAHUMBALABAH
Message : DSACCESS returned an error '0x80004005' on DS notification. Microsoft Exchange System Attendant will re-set DS notification later.
This means a call failed, due to lack of memory...
Then the error you've all been waiting for, a 624 gets thrown by ESE.
Event ID : 624
Raw Event ID : 624
Record Nr. : 4802473
Category : None
Source : ESE
Type : Error
Generated : 9/7/2008 12:32:58 PM
Written : 9/7/2008 12:32:58 PM
Machine : JAHUMBALABAH
Message : Information Store (636) Storage Group 1 (First Storage Group): The version store for this instance (1) cannot grow because it is receiving Out-Of-Memory errors from the OS. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.
Current version store size for this instance: 1Mb
Maximum version store size for this instance: 249Mb
Global memory pre-reserved for all version stores: 1Mb
Possible long-running transaction:
SessionId: 0xBD345AC0
Session-context: 0x00000000
Session-context ThreadId: 0x000015AC
Cleanup: 1
So what can cause this? Check your Task Manager. Do you see any handle leaks or processes with out-of-control handle counts? In the instance I saw, it was a mixture of stale messages stuck in the SMTP temp tables and a third-party AV scanner that had an apparent memory leak. Both Inetinfo and Store were over 2 GB and had 32k handles each. Once we resolved the issue, Store was around 6k handles and Inetinfo around 3k.
What is happening is that a memory leak is consuming all the virtual memory space in Store and Inetinfo, at least in our case here. Yours may differ in what is causing the leak, but more than likely it's going to be something that ties into Store, such as anti-virus, or something gumming up IIS and then Epoxy, or something along those lines.
Because you run out of memory, DSAccess starts to fail, then you see the string of errors above.
If you see this, what should you do first and foremost? Give PSS a call so we can help you debug it.
More information on this can be found here:
http://technet.microsoft.com/en-us/library/bb218083(EXCHG.80).aspx
So one of the things that can go wrong with Exchange is that it can run out of something called the Version Store. The Version Store is an in-memory list of changes made to the database. Nagesh Mahadev has an awesome post about the Version Store on our msexchangeteam.com blog, posted here. To borrow his summary: in simple terms, the Version Store is where transactions are held in memory until they can be written to disk.
The Version Store running out of memory can be caused by one of two things. The first is a long-running transaction. This is pretty self-explanatory: say your anti-virus product wants to scan something through VSAPI, locks it, and then goes to lunch. Your Version Store will consume more and more memory until it runs out, because it's trying to work around this long-running transaction, keeping track of all the rollbacks and whatnot.
The other cause is I/O. Since we're holding transactions in memory until they can be written to disk, if something prevents us from writing to disk, we can hit Version Store problems. Sometimes this type of problem is preceded by 9791 entries in the application event log. If this happens, get ready to do some adplus Store dumps when version buckets allocated hits 70%.
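One way to keep an eye on that counter is with typeperf. The perf object and instance names below ("Database"/"Information Store") are an assumption; check your version's counter names in perfmon first:

```shell
:: Sample the ESE version store counter every 10 seconds and log it
:: to a CSV file, so you can see how quickly the buckets ramp up.
typeperf "\Database(Information Store)\Version buckets allocated" -si 10 -o versionstore.csv
```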
What to do to prevent or mitigate this risk? Keep an eye on the Version buckets allocated counter, watch the application log for the events described above, make sure your anti-virus product isn't holding long locks through VSAPI, and confirm your disk subsystem is keeping up with writes. If you're already in trouble, capture dumps and get PSS involved.
Hope this helps with your environment.
Applies to: Windows 2000 Server/Advanced Server, Windows 2003 32bit Server, Exchange 2000/2003
PTEs
Ok, so one of the most overlooked resources we run into with performance and availability problems is the availability (or lack thereof) of Free System Page Table Entries. What is a PTE? It's basically the operating system's map from virtual addresses to physical memory, if you will. Wikipedia has an awesome link with 8x10 color glossy photos, with circles and arrows and a paragraph on the back explaining what each one is, so I'll point you there. Cliff Huffman also has an excellent post on PTEs here that specifically talks about Windows.
So anyway, running out of Free System Page Table Entries is bad, because it causes system hangs, sporadic lockups, general unresponsiveness, etc. These symptoms present themselves in Exchange as generally slow performance or service unavailability.
You manage your available PTEs in Windows with boot.ini switches and the SystemPages registry key. Generally speaking, on a properly configured Exchange server you'll see your Free System PTE values somewhere between 8000 and 16000. A much larger number (50k or so) may be a hint that you're not using the /3GB switch on your server. A lower value generally means there is a problem.
This problem can either be a configuration issue, or if the PTE value is falling, a memory leak.
If you are dealing with a static low value and you've examined all the configuration settings and they all seem fine, but the value is still low (flagging in ExBPA, for example), then add /basevideo to your boot.ini. The newer AGP/PCI-E video drivers consume a lot of PTEs, and who needs super-duper video card drivers on an Exchange box anyway?
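For illustration, here is what the relevant boot.ini section might look like with the /3GB tuning and /basevideo applied. The partition path and description are placeholders for your server, and the /USERVA value shown is only an example of the kind of tuning used alongside /3GB:

```ini
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect /3GB /USERVA=3030 /basevideo
; The SystemPages value mentioned above lives in the registry under
; HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
```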
If you are dealing with a leak, update your drivers for everything, NIC, HBA, Video, SCSI controller, you name it, update it. If you've done all that and still haven't gotten the leak addressed, contact PSS to get one of us involved with your case.
Handles
Another resource people don't usually pay much attention to is handle count. Excessive handle consumption can cause all kinds of non-paged kernel pool problems, because the kernel objects behind those handles reside within that memory space.
If you have the symptoms of a memory leak but don't see what is causing it, check out the handle count in Task Manager. You can do this by going to the Processes tab, selecting View > Select Columns, and checking Handles. Handle usage varies by application and by what it's doing at the time, but if you have an application with 100k handles open and your machine's performance isn't the greatest, you may be dealing with a handle leak. If you are, your non-paged kernel pool memory may also be high without poolmon showing anything eating it up. This is because handles don't appear to be taken into account by poolmon in some cases, so high handle consumption by a process doesn't end up under its pool tag.
If you have a process with a high handle count, contact the vendor.
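If you'd rather log handle counts over time than eyeball Task Manager, the Process object exposes a Handle Count counter per process. The process instance names below (store, inetinfo) match the examples in this post; adjust them for whatever you're chasing:

```shell
:: Log handle counts for Store and Inetinfo every 30 seconds,
:: so a slow handle leak shows up as a steadily climbing line.
typeperf "\Process(store)\Handle Count" "\Process(inetinfo)\Handle Count" -si 30
```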
Documents on PTEs:
The effects of 4GT tuning on system Page Table Entries
How to Configure the Paged Address Pool and System Page Table Entry Memory Areas
Documents on Handles:
Well, here you can see the impact of high handle count:
Microsoft KB