Dude where's my PFE?

I was a Premier Field Engineer (PFE) for Microsoft.

  • Repeating 623 version store error.

    Applies to Exchange 2003

    I had a case a couple of weeks ago that I thought I'd write about.  The Version Store would run out of memory and a 623 error would be thrown.  Version Store buckets allocated would climb from 4 to over 2000 in less than 5 minutes.  The store would then roll back its transactions for a bit, recover, run for 10-15 minutes, and repeat the whole cycle.

    This is atypical 623 behavior to say the least.

    What we ended up doing to fix it was capture adplus dumps, 3 actually, triggered when Version Store buckets allocated crossed 1600.  We captured the 3 dumps at 1-minute intervals.

    The 1st dump caught the problem transaction; the last 2 both captured rollbacks, so this was a quick ramp-up.

    Turns out the problem was being caused by a bad meeting request being processed over and over again.  We tried all kinds of ways to delete the message, all of which caused Version Store buckets allocated to climb.  An MFCMapi hard delete ended up doing the trick.
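    For reference, here's roughly how dumps like those get captured.  This is a sketch only: the debugger install path is an assumption, and the trigger on Version Store buckets allocated crossing 1600 was a Performance Monitor alert set up separately to fire this command.

    ```shell
    REM Sketch: hang-mode dump of store.exe with adplus from the Debugging
    REM Tools for Windows.  Wired to a perfmon alert on "Version buckets
    REM allocated" > 1600; the install path below is an assumption.
    cd /d "C:\Program Files\Debugging Tools for Windows"
    cscript adplus.vbs -hang -pn store.exe -o C:\dumps -quiet
    ```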

  • How to fix smashed schema in Exchange 2003

    Dan and I and some other engineers wrote up a blog post, which you can find here, on how to recover from a smashed schema scenario on your Exchange Servers.

    It's pretty succinct so I don't have anything to add to it; it's an interesting read though.

  • How to determine who is connecting to your server in cached mode.

    Applies to:  Exchange 2000/2003

    This may seem like a basic thing to some people, but for those who don't know, here goes.  This can be easily done by running Exmon, available here.

    So download Exmon and fire it up on your Exchange Server.

    Go to the By Clientmon tab, and in there you'll see a column named "Cached Mode Sessions".  If you have something other than 0 in that field, then your user is connecting in cached mode.

    Hope that helps; I've had the question a few times before.

  • Spoof your old dead Exchange Server

    Ok, so if you have, say, Citrix, or a standard image with Office pre-installed, then someone had to pick an Exchange server to point to for the Outlook profile creation wizard.

    So sometimes, in large organizations, teams don't necessarily speak to one another before they make small decisions like which server to point to.  The person creating the Office install might pick, say, his home mail server.

    So when that mail server, years later, gets decommissioned, this can suddenly cause problems.

    How do you fix this?

    Simple!  Glad you asked.

    2 things need to be done.

    1.  Establish IP connectivity to the old server name.  Easy enough: go into DNS and create a new A record for the old/missing Exchange server, with the IP of the server you'd like this traffic to point to.

    2.  Go into ADSIEdit, find the computer object for the target server, right click and hit properties.  Scroll down to ServicePrincipalName and edit.  Add the following type of record:

    exchangeRFR/servername

    Give that a little time to replicate around and voila, everything goes back to normal.

    Why is step 2 necessary?  Kerberos security rearing its ugly head.  The target server needs to know it's acting as the old server or it will refuse connections.
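    If you'd rather script the two steps than click through DNS and ADSIEdit, something like this should do it.  All the names and the IP here are hypothetical, and setspn comes from the Windows Support Tools; it writes the same ServicePrincipalName values the ADSIEdit steps above edit by hand.

    ```shell
    REM Hypothetical names: OLDEXCH = dead Exchange server, NEWEXCH = the
    REM server taking over, contoso.com = your AD DNS zone, 10.0.0.25 =
    REM NEWEXCH's IP.
    REM Step 1: point the old name at the new box.
    dnscmd /recordadd contoso.com OLDEXCH A 10.0.0.25
    REM Step 2: add the exchangeRFR SPNs to NEWEXCH's computer object.
    setspn -A exchangeRFR/OLDEXCH NEWEXCH
    setspn -A exchangeRFR/OLDEXCH.contoso.com NEWEXCH
    ```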

     

    Note that this is a possible workaround and may cause corrupt MAPI profiles on your clients.  The real fix here is to address the install, or the clients configured to point to a server that no longer exists.

  • Version Store 624 events

    Applies to Exchange 2000, Exchange 2003, Exchange 2007.

    With Version Store 623 errors, the Version Store gets 'clogged', if you will, and fails to process transactions.

    624 errors, on the other hand, are caused by a lack of available virtual memory on the server.  Sometimes this has no impact and the server corrects itself, but in a memory-leak condition, this can be a sign that your Exchange server is no longer accepting client connections and is in need of some assistance.

    In the particular instance where I have seen this occur, the 624 event comes after a series of errors:

     

    First we throw an MSExchangeDSAccess 2104 event.

    Event ID     : 2104
    Raw Event ID : 2104
    Record Nr.   : 4802384
    Category     : None
    Source       : MSExchangeDSAccess
    Type         : Error
    Generated    : 9/7/2008 12:27:27 PM
    Written      : 9/7/2008 12:27:27 PM
    Machine      : JAHUMBALABAH
    Message      : Process STORE.EXE (PID=636). All the DS Servers in domain are not responding.

    Shortly thereafter you'll see an MSExchangeDSAccess 2102.

    Event ID     : 2102
    Raw Event ID : 2102
    Record Nr.   : 4802387
    Category     : None
    Source       : MSExchangeDSAccess
    Type         : Error
    Generated    : 9/7/2008 12:28:15 PM
    Written      : 9/7/2008 12:28:15 PM
    Machine      : JAHUMBALABAH
    Message      : Process MAD.EXE (PID=2588). All Domain Controller Servers in use are not responding:

    JAHUMBALABAH-DC

    Then we will see an MSExchangeSA 9152.

    Event ID     : 9152
    Raw Event ID : 9152
    Record Nr.   : 4802391
    Category     : None
    Source       : MSExchangeSA
    Type         : Error
    Generated    : 9/7/2008 12:31:15 PM
    Written      : 9/7/2008 12:31:15 PM
    Machine      : JAHUMBALABAH
    Message      : Microsoft Exchange System Attendant reported an error '0x8007000e' in its DS Monitoring thread.

    This particular error is an out of memory error.  Uh oh.

    Then DSAccess has another problem... a 9154.

    Event ID     : 9154
    Raw Event ID : 9154
    Record Nr.   : 4802392
    Category     : None
    Source       : MSExchangeSA
    Type         : Error
    Generated    : 9/7/2008 12:31:20 PM
    Written      : 9/7/2008 12:31:20 PM
    Machine      : JAHUMBALABAH
    Message      : DSACCESS returned an error '0x80004005' on DS notification. Microsoft Exchange System Attendant will re-set DS notification later.

    This means a call failed, due to lack of memory...

    Then the error you've all been waiting for, a 624 gets thrown by ESE.

    Event ID     : 624
    Raw Event ID : 624
    Record Nr.   : 4802473
    Category     : None
    Source       : ESE
    Type         : Error
    Generated    : 9/7/2008 12:32:58 PM
    Written      : 9/7/2008 12:32:58 PM
    Machine      : JAHUMBALABAH
    Message      : Information Store (636) Storage Group 1 (First Storage Group): The version store for this instance (1) cannot grow because it is receiving Out-Of-Memory errors from the OS. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.

    Current version store size for this instance: 1Mb

    Maximum version store size for this instance: 249Mb

    Global memory pre-reserved for all version stores: 1Mb

    Possible long-running transaction:

       SessionId: 0xBD345AC0

       Session-context: 0x00000000

       Session-context ThreadId: 0x000015AC

       Cleanup: 1

     

    So what can cause this?  Check your task manager.  Do you see any handle leaks or processes with out of control handles?  In the instance I saw for this, it was a mixture of stale messages stuck in the SMTP temp tables and a third-party AV scanner that had an apparent memory leak.  Both Inetinfo and Store were over 2 gig and had 32k handles each.  Once we resolved the issue Store was around 6k handles and Inetinfo around 3k.

    What is happening is a memory leak consuming all the virtual memory space in Store and Inetinfo, at least in our case here.  Yours may differ in what is causing the leak, but I'd bet more than likely it's going to be something that ties into Store, such as anti-virus, something gumming up IIS and then Epoxy, or something along those lines.

    Because you run out of memory, DSAccess starts to fail, then you see the string of errors above.

    If you see this, what should you do first and foremost?  Give PSS a call so we can help you debug it.

    More information on this can be found here:

    http://technet.microsoft.com/en-us/library/bb218083(EXCHG.80).aspx

     

  • Avoiding Version Store problems in the enterprise environment

    Applies to Exchange 2003 

    So one of the things that can go wrong with Exchange is that it can run out of something called Version Store.  Version store is an in-memory list of changes made to the database.  Nagesh Mahadev has an awesome post about Version Store on our msexchangeteam.com blog, posted here.  To borrow his summary:  In simple terms, the Version Store is where transactions are held in memory until they can be written to disk.

    Version store running out of memory can be caused by one of two things.  The first is a long-running transaction.  This is pretty self-explanatory.  Say your anti-virus product wants to scan something in VSAPI, locks it, and then goes to lunch.  Your version store will consume more and more memory until it runs out, because it's trying to work around this long-running transaction, keeping track of all the rollbacks and whatnot.

    The other cause is I/O.  Since we're holding transactions in memory until they can be written to disk, if something prevents us from writing to disk, we can hit version store problems.  Sometimes this type of problem is preceded by 9791 event log entries in the application event log.  If this happens, get ready to do some adplus store dumps when version buckets allocated hits 70%.

    What to do to prevent or mitigate this risk?

    1. Consider increasing transaction log buffers, especially if you are seeing transaction log stalls in your environment.  The logic here is that if store can't commit transactions to the log files fast enough, it can cause version store to back up.  By default the number of buffers is 500; you can increase this to 9000.  This will prevent a single database from needing to write a bunch of TLs at once and backing up version store.  I highly recommend using the EXBPA for guidance on this; details on the rule for setting this, etc., can be found here.
    2. Watch your PTE resources and treat accordingly.  I've seen customers run low on free PTEs and run into version store problems because they don't have the capacity to perform IO operations as fast as the database would like.
    3. Make sure your online maintenance is completing frequently, at least once a week on each database.  Part of online maintenance is defragmenting your database.  On a highly fragmented database, version store has to keep track of unoptimized links and tables and deal with records that are not on the fewest number of pages possible, in essence bloating version store size with each transaction.  For in-depth information on Exchange Store Maintenance, go here.
    4. Keep your message size limits down.  Going hand in hand with this is preventing older Outlook clients from accessing your server.  Old clients (older than Outlook 2003 SP2 in cached mode; any version, Outlook 2003 and higher included, in online mode) ignore your message size limits when submitting messages, so an older client could attach a 100 meg file and submit it, and store would have to deal with it even though it's over the size limit.  This should give you the gist of what I'm talking about here.
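    As a sketch of item 1: the log buffer count lives in the msExchESEParamLogBuffers attribute on the Storage Group object, and you can set it with ldifde instead of clicking through ADSIEdit.  The DN below is hypothetical; check the EXBPA rule before you change anything.

    ```shell
    REM logbuffers.ldf would contain (DN shortened, find yours in ADSIEdit):
    REM   dn: CN=First Storage Group,CN=InformationStore,CN=SERVER1,...
    REM   changetype: modify
    REM   replace: msExchESEParamLogBuffers
    REM   msExchESEParamLogBuffers: 9000
    REM   -
    ldifde -i -f logbuffers.ldf
    ```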

    Hope this helps with your environment.

  • PTE depletion, handle leaks and You

    Applies to:  Windows 2000 Server/Advanced Server, Windows 2003 32bit Server, Exchange 2000/2003

    PTEs 

    Ok, so one of the most overlooked resources we run into with performance and availability problems is the availability (or lack thereof) of Free Page Table Entries.  What is a PTE?  It's basically an entry in the table Windows uses to map virtual addresses to physical memory, if you will.  Wikipedia has an awesome link with 8x10 color glossy photos, with circles and arrows and a paragraph on the back explaining what each one is, so I'll point you there.  Clint Huffman also has an excellent post on PTEs here that specifically talks about Windows.

    So anyway, running out of Free Page Table Entries is bad, because it causes system hangs, sporadic lockups, general unresponsiveness, etc.  These symptoms present themselves in Exchange as general slow performance or service unavailability.

    You manage your available PTEs in Windows with the boot.ini and also the SystemPages registry key.  Generally speaking for an Exchange Server that is properly configured, you'll see your PTE values somewhere between 8000-16000.  A large number of PTEs (50k or so) may be a hint that you're not using the /3GB switch on your server.  A lower value generally means there is a problem.
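    For the curious, here's what those two knobs typically look like on an Exchange 2003 box.  The /USERVA value is the commonly recommended one for Exchange generally, not something specific to any one case; treat it as a starting point and test before deploying.

    ```shell
    REM boot.ini entry with /3GB plus /USERVA=3030, which hands back roughly
    REM 40 MB of kernel address space for PTEs:
    REM   multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003"
    REM       /fastdetect /3GB /USERVA=3030
    REM With /3GB set, SystemPages should be 0 so the OS sizes the PTE area:
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v SystemPages /t REG_DWORD /d 0 /f
    ```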

    This problem can either be a configuration issue, or if the PTE value is falling, a memory leak.

    If you are dealing with a static low value and you've examined all the configuration settings and they all seem fine, but the value is still low (flagging in the EXBPA, for example), then add /basevideo to your boot.ini.  The new AGP/PCI-E video drivers consume a lot of PTEs, and who needs super-duper video card drivers on an Exchange box anyway?

    If you are dealing with a leak, update your drivers for everything, NIC, HBA, Video, SCSI controller, you name it, update it.  If you've done all that and still haven't gotten the leak addressed, contact PSS to get one of us involved with your case.

    Handles

    Another resource people don't usually pay much attention to is handle count.  Excessive handle consumption can cause all kinds of non-paged kernel pool problems because they reside within that memory space.

    If you have the symptoms of a memory leak but don't see what is causing it, check out the handle count in task manager.  You can do this by going to the Processes tab and selecting View/Select Columns and checking Handles.  Handle usage varies by application and what it's doing at the time, but if you have an application with 100k handles open and your machine performance isn't the greatest, you may be dealing with a handle leak.  If you are, your non-paged pool kernel memory may also be high without anything showing up as eating it in poolmon.  This is because the handles don't appear to be taken into account by the poolmon monitor in some cases, so high consumption of handles by a resource doesn't end up under the process tag.
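    If you'd rather log handle counts over time than watch Task Manager, typeperf can dump them to a CSV.  The two process names here are just examples, the same pair from the 624 case above.

    ```shell
    REM Sample Store and Inetinfo handle counts once a minute into a CSV:
    typeperf "\Process(store)\Handle Count" "\Process(inetinfo)\Handle Count" -si 60 -f CSV -o handles.csv
    ```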

    If you have a process with a high handle count, contact the vendor.

    Documents on PTEs:

    The effects of 4GT tuning on system Page Table Entries

    How to Configure the Paged Address Pool and System Page Table Entry Memory Areas

    Documents on Handles:

    Well, here you can see the impact of high handle count:

    Microsoft KB