• Improved PAL analysis for Exchange 2007

    I wanted to write a post about the considerable work that has gone into updating the Exchange Server 2007 PAL XML threshold files to make them more relevant and to report more accurately on Exchange performance problems. This update couldn’t have been done without the help of Chris Antonakis, who was one of the major contributors to all of these updates. Kudos to Chris on this one.

    There are some major updates that you need to be aware of when running the PAL tool to analyze Exchange performance problems, and the Mailbox role saw the biggest change in how to look at things.

    Shown below is the selection for the Mailbox Role threshold file, which includes a few new questions. These questions will help break down performance problems specific to database drives, log file drives, and pagefile drives in the resulting report. Previously, this was an all-encompassing generic analysis that didn’t really give you the full picture of actual bottlenecks, as there are latency differences between the database and log file drives.

    image

    Adding database drive letters is quite easy, and the data for this input can be gathered from various places such as an ExBPA report or the BLG file itself. These drive letters could also include volume mount points.

    If you know the drive letters already, then that is great. Let’s say your database drives were Drive E:, Drive F:, and Drive G:; you would need to enter them separated by semicolons, such as E:;F:;G: as shown in the screenshot above. You would also need to do this for the log file drives and the page file drives for a more accurate analysis.

    Using an ExBPA report of the server with the Tree Report view would be the best way to get the drive letter and volume mount point information, but sometimes a BLG file may provide enough information about volume mount points based on the naming convention that was used (keep in mind that although a volume mount point is named “<Drive Letter:>\Logs”, it may actually contain database files or no files at all). The screenshot below shows the LogicalDisk counter displaying the volume mount point names. Unfortunately, we don’t have a scripted way to pull this data out of the BLG file at this time, so this is a manual process.

    image

    For the above information, assuming all the _DATA volume mount points contained Exchange databases, you would enter data in the question as follows:

    S:\SG01_SG04_DATA;S:\SG05_SG08_DATA;S:\SG09_SG12_DATA

    You get the idea… Just remember that all drives and mount points need to be separated by a semicolon and you should be good.
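To make the input format concrete, here is a minimal Python sketch of how a semicolon-separated drive answer could be split up and matched against LogicalDisk instance names. The function names and the case-insensitive matching rule are illustrative assumptions, not PAL's actual implementation.

```python
def parse_drive_answer(answer):
    """Split an answer like 'E:;F:;G:' (or mount-point paths) into a clean list."""
    return [d.strip().rstrip("\\") for d in answer.split(";") if d.strip()]

def is_database_drive(instance, database_drives):
    """Match a LogicalDisk instance (e.g. 'S:\\SG01_SG04_DATA') against the
    answered database drives, ignoring case."""
    return instance.strip().lower() in (d.lower() for d in database_drives)

drives = parse_drive_answer("S:\\SG01_SG04_DATA;S:\\SG05_SG08_DATA;S:\\SG09_SG12_DATA")
print(is_database_drive("s:\\sg01_sg04_data", drives))  # True
```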

    Now it’s important to note that we have included a catch-all Generic Disk analysis in case any of the drive questions were not answered. So, if you ran a report and forgot to enter any drive information, you will get output similar to the following in the Table of Contents. At first glance this may lead you toward suspecting an actual disk-related problem due to the number of times an analysis crossed over a threshold. You will see that 527 disk samples were taken in this perfmon log and that the database, log, and page file drive analyses all have the same alert count. This is actually expected: a tripped threshold is now logged for each drive-type-specific analysis whose question was unanswered, and the report falls through to the Generic Disk analysis. If you see this, go directly to the Generic Disk analysis to review your disks.

    image

    For each threshold that tripped where drive letters were not entered, you will see an entry in the table similar to the following, stating that no data was entered for the questions. You can either ignore this and view the Generic Disk analysis, or re-run the analysis with the questions answered correctly for a more accurate report.

    image

    The same holds true for the Hub Transport and Client Access server disk analysis.

    Another question that was added to the Mailbox server role analysis was ClientMajority, which specifies whether the majority of the clients are running in cached mode. This setting directly affects the analysis of the MSExchange Database(Information Store)\Database Cache % Hit counter.

    image

    Database Cache % Hit is the percentage of database file page requests that were fulfilled by the database cache without causing a file operation, i.e. without having to read the Exchange database to retrieve the page. If this percentage is too low, the database cache size may be too small.

    Here are the thresholds that were added for this particular analysis.

    • WARNING - Checks to see if majority of the clients are in Online Mode and if the Database Cache Hit % is less than 90%
    • ERROR - Checks to see if majority of the clients are in Online Mode and if the Database Cache Hit % is less than 75%
    • WARNING - Checks to see if majority of the clients are in Cached Mode and if the Database Cache Hit % is less than 99%
    • ERROR - Checks to see if majority of the clients are in Cached Mode and if the Database Cache Hit % is less than 85%
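The four thresholds above can be sketched as a small decision function. This is a minimal illustration of the logic, assuming hypothetical names; the real rules live in the PAL threshold XML.

```python
def cache_hit_alert(client_majority_cached, hit_pct):
    """Return the alert level for Database Cache % Hit, following the
    thresholds listed above (error checked first, then warning)."""
    if client_majority_cached:
        if hit_pct < 85:
            return "ERROR"
        if hit_pct < 99:
            return "WARNING"
    else:
        if hit_pct < 75:
            return "ERROR"
        if hit_pct < 90:
            return "WARNING"
    return "OK"

print(cache_hit_alert(True, 98))   # WARNING
print(cache_hit_alert(False, 70))  # ERROR
```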

    The last question that was added was CCRInUse. This question helps differentiate analysis for CopyQueueLength and ReplayQueueLength between CCR and LCR replication since we have different recommended values for each configuration.

    image

    There was also an update for the HUB and HUB/CAS role threshold files where you can now specify drive information for both the Exchange Transport queue file drives and the Page File Drives.

    image

    Additionally, the 64-bit question was removed from all the Exchange Server 2007 PAL threshold files, since Exchange 2007 is only supported in production on a 64-bit Windows operating system.

    It’s probably also important to point out that while we’ve managed to correct and update all of the thresholds and add a number of new analysis rules, we haven’t necessarily managed to update or include all of the associated rule and troubleshooting text that goes with each analysis rule. These will be updated as we get more time; for now, it is more important to migrate all the PAL 1.0 Exchange content to the new PAL 2.0, which will be available sometime in the near future.

    To download the latest XML files, go to the XML update page here or use the direct download here.

    If you are interested in the other changes that were made to the three threshold files, here they are:

    MBX:

    • Change RPC slow packets (>2s) more than 0 to only trigger on average value as per online documentation.
    • Updated RPC Average Latency to warn on 25ms average (as per online guidance), warn on 50ms max and critical on 70ms max or average.
    • Added MSExchangeIS\RPC Client Backoff/sec to warn on greater than 5.
    • Modified MSExchangeIS Client: RPCs Failed: Server Too Busy to only create a warning for greater than 50 and removed the error alert for greater than 10 seeing as this counter is mostly useful to know if Server Too Busy RPC errors have ever occurred (since it is calculated since store startup)
    • Modified MSExchangeIS\RPC Requests to warn on 50 instead of 70 as higher than 50 is already too high and to then error on 70.
    • Removed the MSExchangeWS\Request/Sec counter from Web Related as MSExchangeWS does not exist on a MBX server.
    • Added _Total to instance exclusions for disk analysis.
    • Added _Total to instance exclusions for MSExchange Search Indices counters.
    • Added _Total to instance exclusions for various other counters.
    • Created a generic disk analysis for when the log drives, database drives, or pagefile drives are unknown.
    • Added in a warning alert threshold for Calendar Attendant Requests Failed when it is greater than 0.
    • Removed the System process exclusion for Process(*)\% Processor Utilization analysis as we do want to know if this is using excessive amounts of CPU as it can indicate a hardware issue
    • Configured the Privileged Mode CPU Analysis to work on _Total instead of individual processors.
    • Updated the Privileged Mode CPU Analysis to not continue if the Total Processor Time is not greater than 0, previously it did not continue if the Privileged Mode Time was not greater than 0. This meant we could get a divide by 0.
    • Updated the Privileged Mode CPU Analysis to warn on greater than 50% of total CPU and Total CPU is between 20 and 50
    • Added a warning alert for Processor\% User Time to fire if % User Time is greater than 75% as per online guidance.
    • Corrected Memory\Pages/Sec text of "Spike in pages/sec - greater than 1000" to read "greater than 5000"
    • Added IPv4\Datagrams/sec and IPv6\Datagrams/sec
    • Added TCPv4\Connection Failures and TCPv6\Connection Failures
    • Added TCPv4\Connections Established and TCPv6\Connections Established
    • Added TCPv4\Connections Reset and TCPv6\Connections Reset and set a threshold for both to warn on an increasing trend of 30
    • Added TCPv4\Segments Received/sec and TCPv6\Segments Received/sec
    • Updated MSExchange Database(Information Store)\Version buckets allocated to alert on greater than 11468 instead of 12000 i.e. 70% of 16384.
    • Collapsed all MSExchange ADAccess counters under MSExchange ADAccess category
    • Added _Global_ as an exclusion to .Net Related\Memory Leak Detection in .Net
    • Added _Global_ as an exclusion to .Net Related\.NET CLR Exceptions / Second
    • Updated .Net Related\.NET CLR Exceptions / Second to warn on greater than 100 exceptions per second.
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated Network Packets Outbound Errors to alert on greater than 0 instead of 1
    • Updated Network Utilization Analysis to error on greater than 70%
    • Updated Memory\Page Reads/Sec to only warn on 100 average instead of 100 max, other thresholds of 1000 and 10000 still remain the same
    • Updated Memory\Pages Input/Sec's warning to read "More than 1000 pages read per second on average"
    • Updated Memory\Pages Input/Sec to not warn on max of 1000 (it is too low to warn on 1000 max)
    • Updated Memory\Pages Output/Sec's warning to read "More than 1000 pages written per second on average"
    • Updated Memory\Pages Output/Sec to not warn on max of 1000 (it is too low to warn on 1000 max)
    • Added a content indexing section for the Exchange 2007 indexing counters
    • Added analysis for ExSearch processor usage to warn on more than 1% and error on more than 5%
    • Added analysis for MSFTEFD* processor usage to warn on using more than 10% of the Store.exe processor usage
    • Updated .Net CLR Memory\% Time in GC to include * for process and exclude _Global. Removed 5% threshold and made 10 and 20% threshold warning and error conditions respectively.
    • Updated MSExchange Replication\ReplayQueueLength and CopyQueueLength Counters to exclude _Total
    • Modified MSExchange ADAccess Processes(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max
    • Added threshold alerts for MSExchange ADAccess Processes(*)\LDAP Read Time
    • Added threshold alerts for MSExchange ADAccess Domain Controllers(*)\LDAP Read Time
    • Modified MSExchange ADAccess Domain Controllers(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max and only if number of Search Calls/Sec is greater than 1
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    HUB:

    • Removed the System process exclusion for Process(*)\% Processor Utilization analysis as we do want to know if this is using excessive amounts of CPU as it can indicate a hardware issue
    • Configured the Privileged Mode CPU Analysis to work on _Total instead of individual processors.
    • Updated the Privileged Mode CPU Analysis to not continue if the Total Processor Time is not greater than 0, previously it did not continue if the Privileged Mode Time was not greater than 0. This meant we could get a divide by 0.
    • Updated the Privileged Mode CPU Analysis to warn on greater than 50% of total CPU and Total CPU is between 20 and 50
    • Added a warning alert for Processor\% User Time to fire if % User Time is greater than 75% as per online guidance.
    • Removed Process\%Processor Time from the Process category as it is already included as part of Processor\Excessive Processor Use By Process
    • Modified Memory\Available MBytes to warn on less than 100MB and critical on less than 50MB
    • Added threshold alerts for Memory\% Committed Bytes in Use to warn on greater than 85% and critical on more than 90%
    • Added Memory\Committed Bytes
    • Corrected Memory\Pages Input/Sec to warn on greater than 1000 as it was set to warn on greater than 10
    • Added threshold alert for Memory\Pages Output/Sec to warn on greater than 1000
    • Corrected Memory\Pages/Sec text of "Spike in pages/sec - greater than 1000" to read "greater than 5000"
    • Modified Memory\Transition Pages Repurposed/Sec to warn on spikes greater than 1000 instead of 100
    • Modified Memory\Transition Pages Repurposed/Sec to critical on averages greater than 500 instead of 1000
    • Modified Memory\Transition Pages Repurposed/Sec to critical on spikes greater than 3000 instead of 1000
    • Added IPv4\Datagrams/sec and IPv6\Datagrams/sec
    • Added TCPv4\Connection Failures and TCPv6\Connection Failures
    • Added TCPv4\Connections Established and TCPv6\Connections Established
    • Added TCPv4\Connections Reset and TCPv6\Connections Reset and set a threshold for both to warn on an increasing trend of 30
    • Added TCPv4\Segments Received/sec and TCPv6\Segments Received/sec
    • Modified MSExchange ADAccess Processes(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max
    • Added threshold alerts for MSExchange ADAccess Processes(*)\LDAP Read Time
    • Added threshold alerts for MSExchange ADAccess Domain Controllers(*)\LDAP Read Time
    • Modified MSExchange ADAccess Domain Controllers(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max and only if number of Search Calls/Sec is greater than 1
    • Added MSExchangeTransport Queues(_total)\Messages Queued for Delivery Per Second
    • Removed all MSExchangeMailSubmission Counters as they are only on MBX
    • Removed MSExchange Database ==> Instances Log Generation Checkpoint Depth - MBX as this was for MBX role
    • Modified MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\Log Threads Waiting to warn on greater than 10 and error on 50
    • Added an error alert for MSExchange Extensibility Agents(*)\Average Agent Processing Time (sec) to error on greater than 60 average
    • Collapsed all Database counters under MSExchange Database category
    • Collapsed all MSExchange ADAccess counters under MSExchange ADAccess category
    • Moved Process(EdgeTransport)\IO* counters into EdgeTransport IO Activity category
    • Updated MSExchange Database(*)\Database Page Fault Stalls/sec to MSExchange Database(edgetransport)\Database Page Fault Stalls/sec
    • Updated MSExchange Database ==> Instances(*)\I/O Database Reads Average Latency to MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\I/O Database Reads Average Latency
    • Updated MSExchange Database ==> Instances(*)\I/O Database Writes Average Latency to MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\I/O Database Writes Average Latency
    • Added _Total exclusions where necessary
    • Removed 64bit question
    • Added a question for pagefile drive
    • Added edgetransport as an exclusion to Memory\Memory Leak Detection
    • Added _Global_ as an exclusion to .Net Related\Memory Leak Detection in .Net
    • Added _Global_ as an exclusion to .Net Related\.NET CLR Exceptions / Second
    • Updated .Net Related\.NET CLR Exceptions / Second to warn on greater than 100 exceptions per second.
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated Network Packets Outbound Errors to alert on greater than 0 instead of 1
    • Updated Network Utilization Analysis to error on greater than 70%
    • Updated Memory\Page Reads/Sec to only warn on 100 average instead of 100 max, other thresholds of 1000 and 10000 still remain the same
    • Updated Memory\Pages Input/Sec's warning to read "More than 1000 pages read per second on average"
    • Updated Memory\Pages Input/Sec to not warn on max of 1000 (this is too low to warn on 1000 max)
    • Updated Memory\Pages Output/Sec's warning to read "More than 1000 pages written per second on average"
    • Updated Memory\Pages Output/Sec to not warn on max of 1000 (this is too low to warn on 1000 max)
    • Updated .Net CLR Memory\% Time in GC to include * for process and exclude _Global. Removed 5% threshold and made 10 and 20% threshold warning and error conditions respectively.
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds 

    CAS:

    • Created a new CAS file based off of the common updates in the new MBX xml
    • Updated ASP.NET\Request Wait Time to warn on greater than 1000 max and error on 5000 max
    • Updated ASP.NET Applications(__Total__)\Requests In Application Queue to error on 3000 rather than 2500
    • Updated MSExchange Availability Service\Average Time to Process a Free Busy Request to warn on 5 avg or max and error on 25 avg or max
    • Updated MSExchange Availability Service\Average Time to Process a Cross-Site Free Busy Request to warn on 5 avg or max and error on 25 avg or max
    • Updated MSExchange OWA\Average Response Time to warn on max greater than 100 and more than 2 OWA requests per second on average
    • Updated MSExchange OWA\Average Search Time to warn on max greater than 31000
    • Updated MSExchangeFDS:OAB(*)\Download Task Queued to warn on avg greater than 0
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated ASP.Net Requests Current to warn on greater than 1000 and error on greater than 5000 (max size it can get to is 5000 before requests are rejected)
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    HUB/CAS:

    • Combined both HUB and CAS XMLs for analysis of combined roles.
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds
  • The case of the slow Exchange 2003 Server – Lessons learned

    Recently we received a case in support with an Exchange 2003 server where message delivery was slow and the Local Delivery queue was getting backed up. The Local Delivery queue was actually reaching in to the two thousand range and would fluctuate around that number for extended periods of time.

    So we collected some performance data and all RPC latencies, disk latencies, CPU utilization and many of the other counters that we looked at did not show any signs of any problems. <Scratching Head>

    This is actually a common problem that I have seen where the server is responding OK to clients and everything else appears to be operating normally except for the local delivery queue that continually rises. Even disabling any Anti-virus software on the server including any VSAPI versions does not resolve the problem. So we essentially have a case of a slow Exchange server with no signs of performance degradation using any normal troubleshooting methods.

    The reason may not be readily obvious, so let me show you the common problem I have seen in these situations. This applies not only to Exchange 2003, but also to later versions of Exchange.

    In some companies, messages need to be journaled to holding mailboxes, either on the same server or a different server, to maintain a copy of all messages sent in the organization for compliance purposes. These journaling mailboxes can get quite large and require a special level of attention to ensure that their mailbox sizes and item counts are kept within reasonable levels. They somewhat defy our normal recommendations/guidance because item counts in these folders can reach tens of thousands of items rather quickly, depending on the amount of mail sent within your organization.

    Generally, the special level of attention I mentioned earlier for journaling mailboxes is often overlooked. For each journaling mailbox, you need a process that not only backs up the items in these folders, but also purges the data out of the mailbox once the backup has been taken. This purging process is necessary to maintain acceptable performance levels on an Exchange server. If these mailboxes are on their own server, user mailboxes are not normally affected. If these mailboxes are on the same server as user mailboxes, then this is where you might run into some problems.

    In the case that we received, we found a journaling mailbox with almost 1.5 million items that was 109GB in size, as shown in the screenshot below. Wow!! That is a lot of items in one mailbox.

    huge journal mailbox-fixed

    If you tried to log on to this mailbox using Outlook, the client would most likely hang for 5-10 minutes querying the number of rows in the message table to generate the view that Outlook is trying to open. Once this view is created, you should be able to view the items and regain control of the Outlook client. You might think you could simply go in and start removing/deleting items from this mailbox to lower its overall size. Try as you might, you will most likely end up at it for days, since the performance impact of this many items in the mailbox makes it a very painful process. Making any modifications to the messages in these folders causes the message tables to be updated, which for this many items is simply going to take an exorbitant amount of time.

    Our standard recommendation for Exchange mailboxes on Exchange 2003 servers is to have item counts under 5,000 items per folder. This guidance can be found in the Understanding the Performance Impact of High Item Counts and Restricted Views whitepaper here.

    A simple troubleshooting step would be to dismount the mailbox store that this mailbox resides in to see if the message delivery queues go down. If all of the queues flush for all other mailbox stores, you have now found your problem.

    If you absolutely need to get into the mailbox to view some of the data, an Outlook client may not be the way to do your housecleaning. An alternative is the MFCMAPI tool, which can be configured to return only a certain number of items at any given time. If you pull up MFCMAPI’s options screen, you can change the throttling section to limit the number of rows that are displayed. If you were to enter 4800 in the highlighted section below, you would limit the number of rows (messages) queried when the folder is opened to that value. This will make viewing some of the information a little easier, but it would still be very cumbersome.

    clip_image002

    There are a couple of workarounds that you can do to clean this mailbox out.

    • If the data in the mailbox is already backed up, you could disable mail for that mailbox, run the cleanup agent, and then create a new mailbox for the user. Note: the size of the database will still be huge and will increase backup and restore times even after you recreate the mailbox. If backups are taking a long time, consider using the dial tone database in the next suggestion, or possibly moving the mailboxes on this store to a new database AFTER you have cleaned out the problem mailbox and then retiring the old database.
    • If the Mailbox Database houses only this one mailbox, you could simply dial tone that database starting with a fresh database. Instructions on how to do this can be found here
    • Purging the data out of the mailbox using Mailbox Manager or some 3rd-party tool may work, but keep in mind that you will most likely experience a performance problem on the server while the information is cleaned out of the mailbox, and it could possibly take hours to run

    Long live that 109GB/1.5million item mailbox!!! :)

    Another way to find the high item count user is to use the PFDavAdmin tool to export item counts in users’ mailboxes. Steps on how to do this can be found here.

    These cases are sometimes very tough to troubleshoot, as any performance tool you might use to determine where the problem lies will not show anything on the surface. Even though the Exchange server is still responding to RPC calls in a timely fashion, an expensive call such as a query rows operation will surely slow things down. If things are slow on your Exchange 2003 server and perfmon does not show anything glaring, one of the first things I check is item counts in users’ mailboxes, looking for the top high item count offenders. Exchange 2007 can have other reasons for this slowness, but that would be another blog post in and of itself.

    So the moral of the story: if you have large mailboxes in your organization that are used as journaling mailboxes, resource mailboxes, or by some automated email processing application that uses Inbox rules to manipulate data in the mailbox, then regardless of whether those mailboxes are backed up, the item counts in their folders need to be kept to a reasonable size, or they will bring an Exchange server to a crawl trying to process email.

    Just trying to pass on some of this not so obvious information…….

  • How to fix/repair broken Exchange 2007 counters

    I commonly get calls on the inability to see performance counters in Performance Monitor (perfmon) and the inability to query them through WMI. I thought I would take some time to write about how to look for any problems with Exchange Performance Counters and then provide some high level insight on how to possibly fix them. Most of this information applies to Windows 2003 servers.

    If the counters are not being shown at all, the first place to check is the registry to see if the counters have been disabled. Here is a snippet of what one of the registry keys would look like:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
    "Close"="ClosePerformanceData"
    "Collect"="CollectPerformanceData"
    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
    "Open"="OpenPerformanceData"
    "PerfIniFile"="eseperf.ini"

    If, in addition to the above default entries, you also see a Disable Performance Counters value set to a nonzero value, the counters at one point had a problem loading and the operating system disabled them. Set the value to 0, then close and reopen Perfmon to see if the counters are visible again. More information on the Disable Performance Counters setting can be found here. If this works for you, then whew, that was an easy one….

    If the Performance key is missing for a particular service, then we have bigger problems. I am not sure what causes this key to be removed, but without it, Perfmon and WMI do not know how to load the counters. There are a few required pieces you need to understand before we can load any performance counter, not just Exchange’s. The key pieces needed to reload any performance counter are the following:

    • Performance key must be created under the specified service
    • Library path must be specified to the appropriate DLL for the service
    • A PerfIniFile value must be specified, which is the name of the ini file that will reload that specific service’s performance counters
    • Lastly, we need the Close, Collect, and Open values, which specify the methods used to retrieve the performance counter data. Note: these are unique to each service, so they will not always contain the same information
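The required pieces above can be checked with a small sketch. This is a minimal illustration that models the Performance key as a plain dict of value names to data; real code would read the values with the winreg module, and the function name is hypothetical.

```python
# Value names that must exist under a service's Performance key.
REQUIRED = ("Library", "PerfIniFile", "Close", "Collect", "Open")

def performance_key_ok(values):
    """Check that all required values are present and that the OS has not
    flagged the counters as disabled (nonzero 'Disable Performance
    Counters' means a load failure disabled them)."""
    if any(name not in values for name in REQUIRED):
        return False
    return values.get("Disable Performance Counters", 0) == 0

ese = {
    "Close": "ClosePerformanceData",
    "Collect": "CollectPerformanceData",
    "Library": r"C:\Program Files\Microsoft\Exchange Server\bin\perf\AMD64\eseperf.dll",
    "Open": "OpenPerformanceData",
    "PerfIniFile": "eseperf.ini",
}
print(performance_key_ok(ese))  # True
```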

    If we have these key pieces of information in the registry, we can reload that service’s performance counters. Taking the ESE example above: if we open a command prompt, navigate to the C:\Program Files\Microsoft\Exchange Server\bin\perf\AMD64 directory, and type lodctr eseperf.ini, the counters for ESE will be reloaded. If the counters loaded successfully, we should now see that the First Counter, First Help, Last Counter, and Last Help values have been added, as shown below. These values correspond to specific data that was loaded into the Perflib library.

    image

    If everything went well and you reopen Perfmon, you should now see the counters loaded. If they have not loaded, refresh the registry to see if the Disable Performance Counters value shows back up. If not, check the application log for Perflib errors, which should provide additional information on why the counters did not load successfully.

    If you don’t know already, on Windows 2003 servers you can actually pull up performance counters using the command Perfmon /WMI. If you do not see the newly added counters there, they have not been synchronized with the WMI repository yet. To force this along, you can run wmiadap /f to reload all counters into the WMI repository.

    If this was successful, you will now see some additional Wbem entries as shown in the below pictorial.

    image

    Pulling up Perfmon /WMI again should hopefully show the counters that you are looking for. In some cases, monitoring software still may not pick up the newly added counters until the WMI service (Windows Management Instrumentation) has been restarted.

    If you ever wanted to unload performance counters, you might think you could simply run unlodctr eseperf.ini. Unfortunately, this will not work because the unlodctr utility requires a service name rather than the ini file. To find the actual service name, open eseperf.ini; at the top of the file, you should notice an entry similar to the following:

    [info]
    drivername=ESE

    Ahh, there is the service name. Now if I run unlodctr ESE, it will succeed, and doing so will remove the First Counter, First Help, Last Counter, and Last Help values from the registry.
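Pulling the service name out of the ini file can also be scripted. Here is a small sketch using Python's configparser, assuming the file follows the [info]/drivername layout shown above; the function name is illustrative.

```python
import configparser

def service_name_from_ini(ini_text):
    """Return the drivername value from a counter .ini's [info] section;
    this is the name unlodctr expects instead of the .ini filename."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser["info"]["drivername"]

print(service_name_from_ini("[info]\ndrivername=ESE\n"))  # ESE
```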

    Hopefully you are still with me at this point. Now, what happens if the performance registry keys for all of your services go missing? What do you do then: reinstall, or flatten the box and rebuild to get them back? Unfortunately, there is no direct way to recreate these registry keys, as they are created during the installation of Exchange.

    What most folks do is export the data from another server, clean out any data that references performance counter values from the old server, and then import the result on the affected server. This does in fact work, and it is how I am going to show you to recover from a complete Performance key meltdown.

    Attached to this post is a zip file that contains all of the Performance keys across various role combinations such as MBX, CAS, HUB, HUB/CAS, and HUB/CAS/MBX. I’ve done all of the dirty work for you, so all you have to do is make some very simple modifications to the files and you are in business.

    CAUTION!!!: DO NOT IMPORT these registry keys if the Performance registry keys already exist, as doing so will overwrite the data currently in the registry and could break Performance counters that are currently working. If you only need to reload the Performance key for a single service, pull out the data for that specific service, save it to a .reg file, and import only that data. In other words, use these files as a reference point to get you back up and running.

    If all of the performance keys are missing for all services and you do need these .reg import files, simply open the file that pertains to the roles you have installed and verify that the paths point to the correct library files. By default, Exchange is installed in the C:\Program Files\Microsoft\Exchange Server directory, so if Exchange was installed outside of the default directory, you will need to update the file manually. Let’s take the ESE performance key below:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
    "Close"="ClosePerformanceData"
    "Collect"="CollectPerformanceData"
    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
    "Open"="OpenPerformanceData"
    "PerfIniFile"="eseperf.ini"

    Here you will see that library has the following value:

    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"

    What you need to do is replace the path with the path where you actually installed Exchange. If you installed Exchange on D: (D:\\Program Files\\Microsoft\\Exchange Server\\bin), you only need to change the first part of the path from C:\\ to D:\\; a quick find and replace will hit all of the Performance keys. If you installed into a directory outside of the default paths, you have a little more work to do to replace the path information. Just remember that each backslash (\) must be written as a double backslash (\\) for the .reg files to import properly.
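    If you would rather script the find and replace, here is a hedged sketch in Python. The helper name is my own; only the path format (with the .reg-style doubled backslashes) comes from the ESE key shown above.

```python
# Hedged sketch: scripting the find-and-replace on an exported Performance .reg
# file. The helper name is my own; only the path format (with the .reg-style
# doubled backslashes) comes from the ESE key shown above.
reg_text = r'"Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"'

def fix_reg_paths(text: str, old_root: str, new_root: str) -> str:
    # .reg files escape every backslash, so the roots are given in \\ form too
    return text.replace(old_root, new_root)

fixed = fix_reg_paths(
    reg_text,
    r"C:\\Program Files\\Microsoft\\Exchange Server",
    r"D:\\Program Files\\Microsoft\\Exchange Server",
)
print(fixed)
```

    In practice you would read the whole exported file, apply the replacement, and write it back out before importing it.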

    There are only a handful of entries to modify manually, so this really shouldn’t take too long. Once you have the paths changed, save the appropriate file as a .reg file and import it by double-clicking it, then open Registry Editor to verify that the Performance keys are valid.

    Once the keys have been verified in the registry and look good, you can run the PowerShell script to reload all of the Exchange performance counters. Simply copy the ReinstallAllPerCounters.ps1.txt file to the Exchange server and remove the .txt extension. Open the Exchange Management Shell and run the script. The screenshot below shows each ini file being loaded; of course, on my server all of the performance keys were already present, so the script simply reported that the counters were already installed.

    clip_image002[6]

    Note: If you would like to transfer this data to WMI, simply type Y when asked.

    Once this has completed, be sure to check the application event log for details on any counters that failed to load. If everything went well, voila, you should have most if not all of your Exchange Performance Counters back once again.

    If the counters are still not showing up in WMI for whatever reason, you can run winmgmt /clearadap to clear the WMI ADAP cache and then winmgmt /resyncperf to re-sync the counters and hopefully kick-start things once again.

    See http://msdn.microsoft.com/en-us/library/aa394525(VS.85).aspx for more information on some of the additional commands included with the winmgmt command.

    Hopefully this will help you out trying to get your Exchange performance counters going once again.

  • How to monitor and troubleshoot the use of Nonpaged pool memory in Exchange Server 2003 or in Exchange 2000 Server

    This article is a high-level overview of how to troubleshoot Nonpaged pool memory usage on an Exchange server. It explains what can be done to mitigate some of the underlying problems that consume Nonpaged pool memory, and demonstrates tools that can be used to track down the processes or drivers consuming the most memory.

    Nonpaged pool memory is a limited resource on 32-bit architecture systems. Its size depends on how the server is set up to manage memory and is calculated at system startup. The amount of Nonpaged pool allocated on a given server is a function of overall memory, loaded drivers, and whether the /3GB switch has been added to the boot.ini file.

    Nonpaged pool memory is used for objects that cannot be paged out to disk and must remain in memory as long as they are allocated; examples include network card drivers, video drivers, and antivirus filter-level drivers. By default, without the /3GB switch, the OS allocates 256MB for Nonpaged pool. When the /3GB switch is added and the server rebooted, the amount of Nonpaged pool memory is halved to 128MB. The Windows Performance team has a table in http://blogs.technet.com/askperf/archive/2007/03/07/memory-management-understanding-pool-resources.aspx that discusses the maximum pool memory resources on any given server; that post also discusses how to view the maximum amount of pool memory using Process Explorer. For Exchange servers, it is recommended to add the /3GB switch to the boot.ini file, with the exception of pure HUB or Front-End (FE) servers, to allocate more memory to user processes. As you can see, this limits how much can be loaded within that memory space. If this memory is exhausted, the server will start becoming unstable and may become inaccessible, and since this memory cannot be paged in and out, the problem cannot be resolved without rebooting the server.
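    The figures above can be summarized in a trivial sketch. These are just the numbers quoted in this post, not a general formula for every configuration:

```python
# The figures quoted above as a tiny helper: the default 32-bit Nonpaged pool
# maximum, halved when /3GB is present in boot.ini. These are the numbers from
# this post, not a general formula for every configuration.
def max_nonpaged_pool_mb(three_gb_switch: bool) -> int:
    return 128 if three_gb_switch else 256

print(max_nonpaged_pool_mb(False))  # -> 256
print(max_nonpaged_pool_mb(True))   # -> 128
```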

    On Microsoft Windows 2003 64-bit operating systems, Kernel Nonpaged pool can be as large as 128GB, depending on configuration and RAM, which essentially removes this limitation. See 294418 for a list of the differences in memory architecture between 32-bit and 64-bit versions of Windows. Currently, the only version of Exchange supported on a 64-bit operating system is Exchange 2007, so when working with earlier versions of Exchange we may still run into this Nonpaged pool limitation.

    Symptoms

    When Nonpaged pool memory has been depleted or is nearing its maximum on an Exchange server, the following functionality may be affected, because these features require HTTP/HTTPS to function:

    1. Users connecting via Outlook Web Access may experience “Page cannot be displayed” errors.

      The issue occurs when nonpaged pool memory is no longer sufficient on the server to process new requests.  More information on troubleshooting this issue is available in the following KB article:
      Error message when you try to view a Web page that is hosted on IIS 6.0: "Page cannot be displayed"
      http://support.microsoft.com/?id=933844

      Note: If this resolves your OWA issue, it is recommended to determine what is consuming nonpaged pool memory on the server. See the Troubleshooting section of this document for help in determining what is consuming this memory.
    2. RPC over HTTP connections are slow or unavailable.

      If you experience difficulties when you use an Outlook client computer to connect to a front-end server that is running Exchange Server 2003, this can indicate a depletion of Nonpaged pool memory. HTTP.sys stops accepting new connections when available Nonpaged pool memory drops below 20MB. More information on troubleshooting this issue is available in the following KB article:

      You experience difficulties when you use an Outlook client computer to connect to a front-end server that is running Exchange Server 2003
      http://support.microsoft.com/?id=924047
    3. The IsAlive check fails on a cluster

      The cluster IsAlive checks for the Exchange HTTP resource on a cluster server may fail, causing service outages or failovers; this is the most common scenario we see for Exchange 2003 clusters. When there is less than 20MB of Nonpaged pool memory available, http.sys will start rejecting connections, affecting the IsAlive check.

      When Nonpaged pool is nearly exhausted, the IsAlive check fails, causing the resource to fail. Depending on your recovery settings for the HTTP resource in Cluster Administrator, the cluster will try either to restart the resource or to fail over the group. By default, it will try restarting the resource 3 times before affecting the group; if this threshold is hit, the entire group will fail over to another cluster node.
      To verify whether Nonpaged pool has been depleted, you can look in two possible locations: one is the cluster.log file and the other is the httperr.log.

      Cluster.log
      For the cluster.log file, you may see an entry similar to the following:

      00000f48.00000654::2007/05/16-17:16:52.435 ERR Microsoft Exchange DAV Server Instance <Exchange HTTP Virtual Server Instance 101 (EXVSNAME)>: [EXRES] DwCheckProtocolBanner: failed in receive. Error 10054.

      Error 10054 is equivalent to WSAECONNRESET, which indicates that http.sys is rejecting the connection.

      Httperr.log
      In the httperr.log, located in the %windir%\system32\logfiles\httperr directory on the Exchange server, you may see entries similar to the following:

      2007-05-16 16:44:56 - - - - - - - - - 1_Connections_Refused -
      2007-05-16 16:50:42 - - - - - - - - - 3_Connections_Refused -
      2007-05-16 16:50:47 - - - - - - - - - 2_Connections_Refused -
      2007-05-16 17:16:35 - - - - - - - - - 5_Connections_Refused -

      This confirms that http.sys is rejecting the connection to the server. Additional information regarding this logging can be found in the following article:

      Error logging in HTTP API
      http://support.microsoft.com/?id=820729

      Additional information for this issue is available in the following KB:

      Users receive a "The page cannot be displayed" error message, and "Connections_refused" entries are logged in the Httperr.log file on a server that is running Windows Server 2003, Exchange 2003, and IIS 6.0
      http://support.microsoft.com/?id=934878
    4. Random Server Lockups or Hangs
    5. Certain operations fail because there is not enough memory to support new operations.
      Check the Application and System logs for common operations that might be failing.
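    If you need to scan a large httperr.log for the Connections_Refused entries shown in symptom 3 above, a small script can do the counting. This is a hedged sketch; the parsing simply mirrors the sample lines (space-separated fields, date and time first, reason near the end).

```python
# Hedged sketch: scan httperr.log text for Connections_Refused entries like
# the ones shown in symptom 3. The parsing simply mirrors the sample lines;
# a real httperr.log also contains comment lines starting with '#'.
SAMPLE = """\
2007-05-16 16:44:56 - - - - - - - - - 1_Connections_Refused -
2007-05-16 16:50:42 - - - - - - - - - 3_Connections_Refused -
"""

def refused_entries(log_text):
    hits = []
    for line in log_text.splitlines():
        parts = line.split()
        for field in parts:
            if field.endswith("Connections_Refused"):
                hits.append((parts[0], parts[1], field))
    return hits

for date, time, reason in refused_entries(SAMPLE):
    print(date, time, reason)
```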
    Potential Workaround to provide immediate/temporary relief

    If immediate relief is needed to prevent these rejections on a cluster server, you can add the EnableAggressiveMemoryUsage registry key for temporary relief. When this key is set, http.sys does not start rejecting connections until less than 8MB of Nonpaged pool memory is available, overriding the 20MB default. See 934878 for more information on setting this key. Note: Please use this only as a temporary method to get the Exchange cluster resources back online while you investigate what is consuming the most Nonpaged pool memory on the server. Ideally, overall Nonpaged pool consumption on any given server should be 100MB or less.

    Nonpaged Pool Memory Depletion events

    When pool memory has been depleted, you may start receiving the following error in the System event log stating that a specific pool is empty.

    Event ID 2019
    Event Type: Error
    Event Source: Srv
    Event Category: None
    Event ID: 2019
    Description:
    The server was unable to allocate from the system NonPaged pool because the pool was empty.

    If you are getting these events, the server is most likely already very unstable or soon will be. Immediate action is required to bring the server back to a fully functional state, such as moving the cluster resources to another node or rebooting the affected server.

    Troubleshooting

    There are a couple of different ways to view real-time pool memory usage, and the easiest is Task Manager. Open Task Manager, click the Performance tab, and look in the lower right-hand corner for the highlighted pool memory usage. If Nonpaged pool is at 106MB or more, the cluster IsAlive checks for the HTTP resource may be failing or close to failing.

    image

    You can also view Nonpaged and Paged pool usage per process on the Processes tab in Task Manager. I’ve added the Paged Pool column since the same basic rules apply there too. To do this, select the Processes tab, select View on the menu, and then Select Columns. Add the Non-paged Pool, Paged Pool, and Handles columns as shown below.

    image

    Once these columns are added, you can view pool usage per process, which may help you track down which process is consuming the most memory; you can sort each column to look for the highest consumer. The Handles column helps determine whether a process with a very large number of handles is consuming a larger amount of Nonpaged pool memory. (Note: A high handle count may affect either paged or Nonpaged pool memory, so keep this in mind when analyzing data.)
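    The same triage can be expressed in code. The process names and numbers below are invented for illustration (“sqmon.exe” is a hypothetical third-party agent), and the 100,000-handle cutoff is an arbitrary example, not a documented threshold.

```python
# Sketch of the triage described above: sort per-process samples by Nonpaged
# pool usage and flag unusually high handle counts. Names and numbers are
# invented; "sqmon.exe" is a hypothetical third-party agent.
procs = [
    {"name": "store.exe",    "npp_kb": 5200,  "handles": 9000},
    {"name": "sqmon.exe",    "npp_kb": 48000, "handles": 210000},
    {"name": "inetinfo.exe", "npp_kb": 900,   "handles": 1200},
]

# Highest Nonpaged pool consumer first
top = sorted(procs, key=lambda p: p["npp_kb"], reverse=True)
print(top[0]["name"])  # -> sqmon.exe

# Processes whose handle counts look suspiciously large (arbitrary cutoff)
high_handles = [p["name"] for p in procs if p["handles"] > 100_000]
print(high_handles)  # -> ['sqmon.exe']
```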

    image

    Another way of looking at handles for any given process is to use Process Explorer, available here. To add the handle count column, select View on the menu, then “Select Columns”, click the Process Performance tab, check “Handle Count”, and click OK.

    image

    If you can’t determine from there what is consuming the memory, this may be a kernel-related problem rather than application-specific, and additional tools will be needed to determine what is affecting Nonpaged pool memory.

    One of the first things to look for is drivers more than two years old that had issues in the past that were resolved in later driver releases. Running the Exchange Best Practices Analyzer tool (ExBPA), located here, can report drivers that are outdated or known to have had issues. If ExBPA does not report any problems with the configuration of the server or any driver-related problems, further troubleshooting is necessary.

    If the Windows Support Tools are installed, you can use a tool called Poolmon to view which specific tags are consuming memory. More information on Poolmon can be found in the Windows Support Tools documentation here. To run Poolmon, open a command prompt, type “Poolmon”, and then press the “b” key to sort by overall byte usage (Bytes), highest at the top. Anything highlighted means there was a change in memory for that specific tag.

    In this view, you want to look at the top five consumers of memory, which will be listed at the top. For the most part you will be looking at the first two columns, Tag and Type. The Tag is specific to a particular driver, and the Type column indicates what type of memory is being used: Nonpaged pool (Nonp) or paged pool (Paged). You will also be looking at the Bytes column (shown in yellow), which shows the bytes in use for that particular tag.

    clip_image005

    The Allocs and Frees columns can be used to determine whether a tag is leaking memory. If there is a large difference between these two columns for a particular tag, that tag may be leaking and should be investigated.
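    That Allocs-versus-Frees check is easy to express in code. The tags and counts below are made up, and the outstanding-allocation threshold is an arbitrary illustration, not a Microsoft guideline.

```python
# The Allocs-versus-Frees check in code. Tags and counts below are made up,
# and the outstanding-allocation threshold is an arbitrary illustration.
rows = [
    {"tag": "MmCm", "type": "Nonp", "allocs": 500_000, "frees": 120_000},
    {"tag": "File", "type": "Nonp", "allocs": 300_000, "frees": 299_500},
]

def suspected_leaks(rows, outstanding_threshold=100_000):
    # A tag whose allocations far outpace its frees is worth investigating
    return [r["tag"] for r in rows
            if r["allocs"] - r["frees"] > outstanding_threshold]

print(suspected_leaks(rows))  # -> ['MmCm']
```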

    The file Pooltag.txt lists the pool tags used for pool allocations by kernel-mode components and drivers supplied with Windows, the associated file or component (if known), and the name of the component.

    Where to get Pooltag.txt?

    After installing the Debugging Tools for Windows, located here, pooltag.txt can be found in the C:\Program Files\Debugging Tools for Windows\triage directory and normally has the most recent list of pool tags.

    Pooltag.txt can also be obtained from the Windows Resource Kit:

    http://www.microsoft.com/downloads/details.aspx?FamilyID=9D467A69-57FF-4AE7-96EE-B18C4790CFFD&displaylang=en

    If the specific tag in question is not listed in pooltag.txt and is leaking memory, you can search for pool tags used by third-party drivers using the steps in the following article:

    How to find pool tags that are used by third-party drivers
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;298102

    Once you find which tag pertains to a specific driver, contact the vendor of that driver to see whether they have an updated version that may alleviate the memory leak.

    Recommended remediation

    1. Install the recommended hotfixes for Windows 2003 server based clusters from 895092
    2. Run the Exchange Best Practices Analyzer (ExBPA) tool to ensure that the Exchange server is configured optimally (i.e., the SystemPages registry setting, any outdated network card drivers, video drivers or storage drivers (storport.sys or SAN drivers), mount point drivers (mountmgr.sys), boot.ini settings, etc.)
    3. Ensure that Windows 2003 SP2 is installed. If SP2 is not installed, at a minimum, you need to apply the hotfix in 918976
    4. Ensure that the Scalable Networking Pack features have been disabled. See http://msexchangeteam.com/archive/2007/07/18/446400.aspx for more information on how this can affect Exchange Servers
    5. Upgrade ExIFS.sys to the version listed in 946799
    6. If using MPIO, ensure 923801 at a minimum is installed. 935561 is recommended. Also see 961640 for another known memory leak issue
    7. If Emulex drivers are installed, be sure to upgrade to the version listed here to help with nonpaged pool memory consumption.
    8. Disable any unused NICs to lower overall NPP memory consumption
    9. Update network card drivers to the latest version.
        • If Jumbo Frames are being used, be sure to set this back to the default setting or lower the overall frame size to help reduce NPP memory usage.
        • If Broadcom Drivers are being utilized and are using the Virtual Bus Device (VBD) drivers, be sure to update the drivers to a driver version later than 4.x. Check your OEM manufacturers website for updated versions or go to the Broadcom download page here to check on their latest driver versions.
        • Any changes to the Network Card receive buffers or Receive Descriptors from the default could increase overall NPP memory. Set them back to the default settings if at all possible. This can be seen in poolmon with an increase in MmCm pool allocations.
    10. Update video card drivers to the latest version. If any accelerated graphics drivers are enabled, go ahead and uninstall these drivers and switch the display driver to Standard VGA. Add the /basevideo switch to the boot.ini file and reboot the server.
    11. Check to see if the EnableDynamicBacklog setting is being used on the server which can consume additional nonpaged pool memory. See 951162.

    If you are still having problems with NonPaged pool memory at this point, then I would recommend calling Microsoft Customer Support for further assistance with this problem.

    Additional Reading

    Nonpaged Pool is over the warning threshold (ExBPA Rule)
    http://technet.microsoft.com/en-us/library/aa996269(EXCHG.80).aspx

    Understanding Pool Consumption and Event ID: 2020 or 2019
    http://blogs.msdn.com/ntdebugging/archive/2006/12/18/Understanding-Pool-Consumption-and-Event-ID_3A00_--2020-or-2019.aspx

    3GB switch
    http://blogs.technet.com/askperf/archive/2007/03/23/memory-management-demystifying-3gb.aspx

     

  • New ADAccess Performance counters included with Exchange 2007 SP2

    Exchange 2007 SP2 includes a new set of ADAccess performance counters that show performance data only from domain controllers in the same site as the Exchange server. The new object is MSExchange ADAccess Local Site Domain Controllers. Previously, you had to use the MSExchange ADAccess Domain Controllers(*)\Local site flag counter in Performance Monitor to detect whether a domain controller was in the local site.

    Here is a listing of the new counters. They are very similar to the MSExchange ADAccess Domain Controllers counters, but only for local DCs.

    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read calls/Sec
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search calls/Sec
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Searches timed out per minute
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Fatal errors per minute
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Disconnects per minute
    \MSExchange ADAccess Local Site Domain Controllers(*)\User searches failed per minute
    \MSExchange ADAccess Local Site Domain Controllers(*)\Bind failures per minute
    \MSExchange ADAccess Local Site Domain Controllers(*)\Long running LDAP operations/Min
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Pages/Sec
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP VLV Requests/Sec
    \MSExchange ADAccess Local Site Domain Controllers(*)\Number of outstanding requests
    \MSExchange ADAccess Local Site Domain Controllers(*)\DsGetDcName elapsed time
    \MSExchange ADAccess Local Site Domain Controllers(*)\gethostbyname elapsed time
    \MSExchange ADAccess Local Site Domain Controllers(*)\Kerberos ticket lifetime
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP connection lifetime
    \MSExchange ADAccess Local Site Domain Controllers(*)\Reachability bitmask
    \MSExchange ADAccess Local Site Domain Controllers(*)\IsSynchronized flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\GC capable flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\PDC flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\SACL right flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\Critical Data flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\Netlogon flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\OS Version flag
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read Time
    \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search Time

    Unfortunately, upgrading to SP2 from an earlier service pack does not reload the MSExchange ADAccess counters, so you will have to do this manually. If you install Exchange using the SP2 binaries, you will have these new counters by default. To reload the MSExchange ADAccess counters, do the following:

    • Ensure that no other monitoring software is currently collecting performance counter data
    • Open a command prompt and change directory to the \Program Files\Microsoft\Exchange Server\Bin\perf\AMD64 directory
    • To unload the performance counters, type the following:
      unlodctr “MSExchange ADAccess”
    • To reload the counters, type the following:
      lodctr dscperf.ini
    • Restart the Exchange services to complete the counter reload. Note: This step is very important, as Exchange holds open file handles to the original counters that can only be released by restarting the Exchange services.

    For those of you collecting performance counters via WMI, you may notice that these new counters do not appear to be loaded. You can verify this by running perfmon /wmi to see if they are there. If they are not, you can transfer the PDH settings over to WMI by running wmiadap /f.

    Enjoy!!