Exchange 2007 SP2 has a new set of ADAccess Performance counters that only shows performance data from domain controllers in the same site as the Exchange Server. This new object is MSExchange ADAccess Local Site Domain Controllers. Previously, you had to use MSExchange ADAccess Domain Controllers(*)\Local site flag to detect if the server was local via Performance monitor.
Here is a listing of the new counters. They are very similar to the MSExchange ADAccess Domain Controllers counters, but only for local DCs.
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read calls/Sec \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search calls/Sec \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Searches timed out per minute \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Fatal errors per minute \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Disconnects per minute \MSExchange ADAccess Local Site Domain Controllers(*)\User searches failed per minute \MSExchange ADAccess Local Site Domain Controllers(*)\Bind failures per minute \MSExchange ADAccess Local Site Domain Controllers(*)\Long running LDAP operations/Min \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Pages/Sec \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP VLV Requests/Sec \MSExchange ADAccess Local Site Domain Controllers(*)\Number of outstanding requests \MSExchange ADAccess Local Site Domain Controllers(*)\DsGetDcName elapsed time \MSExchange ADAccess Local Site Domain Controllers(*)\gethostbyname elapsed time \MSExchange ADAccess Local Site Domain Controllers(*)\Kerberos ticket lifetime \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP connection lifetime \MSExchange ADAccess Local Site Domain Controllers(*)\Reachability bitmask \MSExchange ADAccess Local Site Domain Controllers(*)\IsSynchronized flag \MSExchange ADAccess Local Site Domain Controllers(*)\GC capable flag \MSExchange ADAccess Local Site Domain Controllers(*)\PDC flag \MSExchange ADAccess Local Site Domain Controllers(*)\SACL right flag \MSExchange ADAccess Local Site Domain Controllers(*)\Critical Data flag \MSExchange ADAccess Local Site Domain Controllers(*)\Netlogon flag \MSExchange ADAccess Local Site Domain Controllers(*)\OS Version flag \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read Time \MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search Time
Unfortunately. upgrading to SP2 from an earlier service pack does not reload the MSExchange ADAccess counters, so you will have to do this manually. If you are installing Exchange using the SP2 binaries, you will have these new counters by default. To reload the MSExchange ADAccess counters, do the following:
For all of you that are collecting performance counters via WMI, you may notice that these new counters will not appear to be loaded. You can verify this by running perfmon/wmi to see if they are there. If they are not, you can transfer the PDH settings over to WMI by running wmiadap /f.
Enjoy!!
Exchange 2007 SP2 has introduced some new Mailbox Access Auditing features to help log events when users access folders and messages either in their own mailbox or another users mailbox. I wrote a whitepaper on these new features at http://technet.microsoft.com/en-us/library/ee331009.aspx. This new access auditing will log accesses to messages and folders which some customers have been wanting for a long time. So if you attempt to access another users folder and open or read a message, Exchange will now log events in the new Exchange auditing log on the server. This only shows you the path of access to message and folders, but does not specifically log deletions of messages in users folders.
The whitepaper also discusses how you can setup auditing to track configuration changes to Exchange related objects in Active Directory, so that if an administrator made a change to an Exchange configuration object that caused an outage, these events will now be logged on the domain controllers security event log. If your DC’s are Windows 2008, you can see what the previous values were and what the newly changed value is, so if you need to change it back to the way it was before the outage, you have a rolling log of all of these changes.
If you have some time and wanted to read more about it, see the above link for more details. This took a lot of time and effort on my part to pull this together and test most of the configuration auditing pieces to ensure that we were logging the correct data. Hope you enjoy it.
In support, we get a lot of statements stating that Exchange is using all or most of the memory on a given server. Some may say that is a bad thing, but for Exchange 2007 that is actually a good thing. In this blog, I would like to explain some of the misconceptions of Exchange’s memory usage with relationship to overall/allocated memory and the Paging file and it’s usage. I previously blogged about Exchange 2007 memory usage at Understanding Exchange 2007 Memory Usage and its use of the Paging File, but it appears that more clarification is needed in this area. I am also going to show some real world screenshots of customers actual perfmon log files that show good and bad behavior and how this might help you in troubleshooting what type of a memory issue you might have, if any.
So let’s start with the paging file and it’s usage as that appears to be a common question that comes up all of the time. Some of the questions stem around PF Usage in Task Manager as shown below on a Windows 2003 server and server monitoring software reporting this as a problem. PF Usage in Task Manager is the total number committed pages in the system. This is not how much is currently being used, it is merely the amount of page file space that has been allocated should the OS need to page out currently committed bytes
In Windows 2008, Task Manager now shows different terminology as PF Usage has been removed and has been replaced with just the word Memory
There are other counters that show PF usage as well, one is Paging File\% Usage which shows overall usage and Process\Page File Bytes which shows per process Page file allocation. The % Usage counter is about the same as what Task Manager PF Usage shows. It is just the amount of space that has been allocated should committed bytes need to get paged out and doesn’t indicate if the PF is currently being utilized. Paging File\% Usage is a counter that monitoring software commonly shows that could be a potential problem, but in all reality it might not be. Other factors needs to be looked at other than the amount of page file usage to get a clear indication if there is truly a problem or not.
Generally, Page file usage should remain under 80% at all times, but there are times when the OS needs to make use of the paging file and one example is a working set trim operation. The following picture shows an example of this working set trim operation for store.exe where the Memory\% PF Usage shows that we increase the PF usage at the same time the working sets are getting trimmed to satisfy some other application or driver request for allocating a contiguous memory block. You will also notice that PF usage never really bounces back either after something like this happens and remains around a sustained average for the remainder of the server being online or until the server is rebooted. Unless you are getting close to the max Memory\% Committed Bytes In Use, we shouldn’t be too concerned with the PF Usage unless we are seeing some high paging activity going on.
With that said, you would not use PF usage in Task Manager or Paging File\% Usage to determine if the paging file is currently being used. What you would use to monitor this is the amount of Memory\Pages/Sec that are occurring. This counter is a combination of both Memory\Pages Input/sec and Memory\Pages Output/Sec counters which also includes access to the system cache for file based operations to resolve hard page faults. Hard page faults occur when a page from a process is requested but does not exist in memory. This then means that we have to pull that data directly from the paging or backing file. If the page is elsewhere in memory, then this is called a soft fault. These two counters will help you understand if you are writing data (Pages Output) to the Paging file or you are reading data (Pages Input) from the paging file which might be affecting overall Exchange Server performance. Hard Page Faults can result is significant delays in processing data on the server.
Counter Definitions Memory\Pages/Sec - Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) non-cached mapped memory files.
Memory\Pages Input/sec - Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk. When a page is faulted, the system tries to read multiple contiguous pages into memory to maximize the benefit of the read operation. Compare the value of Memory\\Pages Input/sec to the value of Memory\\Page Reads/sec to determine the average number of pages read into memory during each read operation.
Memory\Pages Output/Sec - Pages Output/sec is the rate at which pages are written to disk to free up space in physical memory. Pages are written back to disk only if they are changed in physical memory, so they are likely to hold data, not code. A high rate of pages output might indicate a memory shortage. Windows writes more pages back to disk to free up space when physical memory is in short supply. This counter shows the number of pages, and can be compared to other counts of pages, without conversion.
Recommended guidance states that the size of the paging file should be RAM+10MB for optimal performance and should be of static size and not system managed. Having a paging file set to system managed could cause page file fragmentation which could affect performance in memory pressure conditions, but Exchange generally should not be making use of the paging file for normal operations. If virtual memory is shown to be problematic or high due to other applications on the servers requiring it, you can increase the size of the paging file to RAM * 1.5 to help alleviate some of this memory pressure on the server to help back all of the committed pages in memory. If you are still having problems at this point, check for potential memory leaks within the processes on the server.
High paging in excess of 10,000/sec or more could indicate severe memory pressure or a working set trimming problem that I talked about previously in http://blogs.technet.com/mikelag/archive/2007/12/19/working-set-trimming.aspx.
The amount of available memory is another question that comes up regularly. The main performance counter to monitor for available memory is Memory\Available MBytes. This is the amount of physical memory that is available for process or system use. It is the sum of Free, Zero, and Standby (cached) page lists. If you are on a Windows 2008 server and run Process Explorer viewing System Information, you will see these page lists referenced. Available RAM on any given Exchange 2007 server should not go below 100MB. After crossing the 100MB threshold, you are putting your server in a state vulnerable for working set trims when the Virtual Memory manager needs to process a memory allocation request in which sufficient RAM is not available to service that request. Another counter to check to cross correlate why available memory is low is Memory\System Cache Resident Bytes. Memory\System Cache Resident Bytes is part of the overall System cache which is viewable via the Memory\Cache Bytes counter.
The above picture is a depiction of how System cache can affect available memory leading up to a working set trim. Notice in yellow that the Store cache remains consistent prior to the trim, so we know that Exchange did not cause this, but rather some other application. This could be some application making use of the file cache causing this increase. A simple file copy operation of a very large file from this server to another server will cause this problem. You can tame this system cache problem by using the Windows Dynamic Cache service shown at http://blogs.msdn.com/ntdebugging/archive/2009/02/06/microsoft-windows-dynamic-cache-service.aspx. In the above case, it was Antivirus software making use of memory mapped files.
Note: If available RAM is about 100MB, please do not RDP in to the server and fire up the EMC for administration purposes. This will exhaust all RAM on the server and cause working set trim issues. Got to love that one, eh?
Next, I would like to talk about Committed Memory. There are two main counters that I look at when troubleshooting memory related issues to determine if we are truly running out of memory on a server. These counters are Memory\Committed Bytes and Memory\Commit Limit.
Memory\Committed Bytes is the amount of committed virtual memory, in bytes. Committed memory is the physical memory which has space reserved on the disk paging file(s). This counter displays the last collected value and is not an average.
Memory\Commit Limit is the amount of virtual memory that can be committed without having to extend the paging file(s). It is measured in bytes. Committed memory is the physical memory which has space reserved on the disk paging files. There can be one paging file on each logical drive). If the paging file(s) are be expanded, this limit increases accordingly. This counter displays the last collected value and is not an average. The Commit Limit is calculated by taking the amount of total RAM and adding that to the Paging File sizes. This sum will give you your overall Commit Limit on any given server.
There are a few ways to view the current values of the Commit Limit and Committed Bytes. In Task Manager, you could view the Commit Charge (K) area as shown in the above screenshot. You can view these counters in Perfmon, and of course using Process Explorer shown below.
Now that we have all of this knowledge, let’s take a look at some real world examples here.
The below picture shows a normal working server where the Store working set remains consistent throughout the lifetime of the store process due to the cache being warmed up or fully populated. This is where you get maximum performance from your Exchange server since you are caching all of the data in memory instead of having to rely on paging information to/from the disks. You will also notice that available memory is just under 2GB. The amount of committed bytes is also no where close to the Commit limit on the server.
The following example shows that our Committed Bytes is just about equal to the overall Commit Limit on the server. Any new memory allocations could be failing causing instability of your Exchange server. This problem was attributed to an incorrectly configured paging file on the server.
The next example shows an actual Store memory leak. As you can see, the Committed Bytes (blue), Private Bytes (pink) and Virtual Bytes (yellow) for Store is also increasing upward to the overall Commit Limit (green). This occurred due to a recursive operation within the store process exhausting all of the memory on the server. A recursive operation can be thought of a process that is being performed where one of the operations is to repeat the process. This is similar to a loop with no ending or a way to break out of the loop.
I hope this clears up some of the misconceptions behind the command phrase “Exchange is using all the memory”.
Recently we had a case in which an Exchange 2003 server would hang and no longer accept any new RPC connections to the Information Store. The rest of the server seemed to be operating just fine, but it was the Store that was ultimately having the problems.
I took a look at the perfmon data that was provided and didn’t see anything out of the ordinary except for a small amount of RPC Operations taking place on the server. The server did look like it was processing data though, so this was quite intriguing to me now. I did notice that one DC had a number outstanding LDAP requests for an extended period of time as shown below.
We ended up taking some dumps of the Store, IIS processes, and LSASS to see what might be going on. The store and IIS dumps were not that interesting. Looking at the LSASS dumps was an eye opener. We saw that over 150 threads were hung up calling (SECUR32!CALLSPM) in to the Security Provider Manager (SPM). The beginning of the stacks were showing secur32!LsaAcceptSecurityContext calls which were mostly client authentication calls to the server. More info on the AcceptSecurityContext calls can be found here
There was almost 200 other threads that were calling netlogon!NlpUserValidateHigher which essentially means that we are trying to send a user validation request to a higher authority for authentication requests over the secure channel. Once we accept this validation request, we then attempt to connect to the DC over RPC to handle the request. Debug analysis can be found on Dave Goldman’s blog here.
By default, Netlogon only allows 2 concurrent API calls for these authentication requests which is controlled by a semaphore. If the 2 semaphore objects are tied up waiting for a response from the DC, all other requests will start queuing, thus having this mysterious hang affect on the Exchange Server. This was our problem since the debug analysis showed that we hit our maximum of 2 concurrent requests most likely to an overloaded DC, leaving a backlog of requests for authentication traffic. This request queue is controlled by the MaxConcurrentApi setting. Each request has a default timeout of 45 seconds, so if there were requests that were timing out, this is surely going to cause some delays for other users. On healthy servers with good network connectivity, these authentication requests are extremely fast.
At this point, we knew that we were tied up in authentication calls to DC’s, but we couldn’t find out what users were trying to logon which were taking the most amount of time. The debug information only shows a point in time. It is possible that a user could be trying to authenticate to a down-level domain across a slow WAN link, not sure at this time.
To move forward, we enabled Netlogon Debug logging per http://support.microsoft.com/kb/109626 and let the problem occur again.
We opened the netlogon.log file and started reviewing the information. Prior to the problem we can see that responses are returning in a timely manner. Notice the time intervals happen within the same second
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Entered 07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Returns 0x0 07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered 07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0 07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered 07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Entered 07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Returns 0x0
07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered 07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered 07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0
As traffic increases, the response times are starting to get a little slower
07/31 11:53:56 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered 07/31 11:54:14 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0 07/31 11:53:57 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered 07/31 11:54:17 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
07/31 11:53:56 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered 07/31 11:54:14 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0
07/31 11:53:57 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered 07/31 11:54:17 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
Now we see a response time right at 45 second timeout below
07/31 11:57:02 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered 07/31 11:57:47 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
Here is where our first netlogon timeout hit
07/31 11:57:03 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered 07/31 11:57:48 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258 07/31 11:57:48 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot. 07/31 11:57:48 [SESSION] I_NetLogonGetAuthData called: (null) EXDOMAIN(Flags 0x1) 07/31 11:57:48 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0xC000005E 0xC000005E = STATUS_NO_LOGON_SERVERS
07/31 11:57:03 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered 07/31 11:57:48 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258 07/31 11:57:48 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot. 07/31 11:57:48 [SESSION] I_NetLogonGetAuthData called: (null) EXDOMAIN(Flags 0x1) 07/31 11:57:48 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0xC000005E
0xC000005E = STATUS_NO_LOGON_SERVERS
Now we are seeing that we cannot allocate a Client API slow because the max request queue is busy servicing other requests
07/31 11:58:55 [CRITICAL]EXDOMAIN: NlAllocateClientApi timed out: 0 258 07/31 11:58:55 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot. 07/31 12:38:08 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258 07/31 12:38:08 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 11:58:55 [CRITICAL]EXDOMAIN: NlAllocateClientApi timed out: 0 258 07/31 11:58:55 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 12:38:08 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258 07/31 12:38:08 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
Now we get to an actual DC timeout error as shown below.
08/01 17:21:24 [CRITICAL] NlPrintRpcDebug: Couldn't get EEInfo for I_NetLogonSamLogonEx: 1761 (may be legitimate for 0xc0000064) 08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: timeout call to \\DC1.domain.com. Count: 2 08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: dropping the session to \\DC1.domain.com 08/01 17:21:24 [CRITICAL] EXDOMAIN: NlSetStatusClientSession: Set connection status to c000005e
We can see clearly now that DC1 was having problems servicing authentication requests to this Exchange server. This does not always mean that the DC is overloaded, it could be a down level network trust that is really slow that is causing this problem, so additional investigation needs to be performed at this point. We just know that Exchange is the victim and the problem is elsewhere now.
Troubleshooting methodologies
So what can we do at this point?
We can test secure channels for different domains to see which domains might be failing. First we will need to obtain the DC in which the secure channel is currently formed on the Exchange server by running nlttest /sc_query:<domain> replacing domain with the domain name where the Exchange Server resides in.
Once that DC is found, you will then run a command similar to the following for each of the domains:
nltest /server: DC1 /sc_query:ASIA nltest /server: DC1 /sc_query:EUROPE nltest /server: DC1 /sc_query:US
This will help fish out any down level domains that could be causing authentication delays.
You can also enable netlogon debug logging on the DC’s to help understand the traffic patterns there.
Installing the Server Performance Advisor on the Windows 2003 DC’s or using the Active Directory Diagnostics Data Collector in the Windows 2008 Reliability and Performance monitor will help fish out any potential bottlenecks.
Take netmon captures and search for NetrLogonSamLogonEx entries for Netlogon requests
For Windows 2003 servers, you can install the following hotfix to help track these type issues faster. This hotfix adds new performance counters to help track access to these semaphores better. Windows 2008 servers already have this built in to the OS
New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication http://support.microsoft.com/default.aspx/kb/928576
The main ones you want to look at at the following
Semaphore Holders: How many threads on average are holding the client semaphore This is the number of threads trying to get a netlogon session to a DC that are Blocked. Blocked could be locked open by a process, network down, etc. when semaphore waiters is non 0, some local process is waiting on lsass for a response and the lsass thread is blocked. This correlates to the MaxConcurrentApi setting By default this value should be less than 2 at any given time. If values about 2 are sustained, then either the Exchange server or DCs are overloaded. Average Semaphore Hold Time: The average wait time for a thread to acquire the semaphore These values should normally be very quick. Longer hold times mean that a potential bottleneck is occurring. Semaphore Waiters: The average number of waiters waiting on the semaphore. This value should remain at 0 at all times. Short bursty spikes are OK to see as that simply means that we had a large amount of requests which were handled in a short period of time.
Semaphore Holders: How many threads on average are holding the client semaphore
This is the number of threads trying to get a netlogon session to a DC that are Blocked. Blocked could be locked open by a process, network down, etc. when semaphore waiters is non 0, some local process is waiting on lsass for a response and the lsass thread is blocked. This correlates to the MaxConcurrentApi setting
By default this value should be less than 2 at any given time. If values about 2 are sustained, then either the Exchange server or DCs are overloaded.
Average Semaphore Hold Time: The average wait time for a thread to acquire the semaphore
These values should normally be very quick. Longer hold times mean that a potential bottleneck is occurring.
Semaphore Waiters: The average number of waiters waiting on the semaphore.
This value should remain at 0 at all times. Short bursty spikes are OK to see as that simply means that we had a large amount of requests which were handled in a short period of time.
In some instances on heavily loaded servers, you may want to adjust MaxConcurrentApi to a value of 5 on both the Exchange Servers and DC’s to help widen the pipe or increase the amount of auth requests that can occur at any given time. Bumping this setting up may help alleviate this problem altogether, but could also prolong the issue due to some other underlying issue that has now been masked. It’s always best to understand where the problem is coming from before making any major changes such as this which may increase overall processor utilization on the Exchange server and your domain controllers.
In this instance, we set MaxConcurrentApi to 5 on the DC’s and Exchange Servers and this appears to have reduced the amount of occurrences of this problem.
This particular problem not only affects Exchange servers, but also affects other applications such as ISA server. More information on this can be found here
I hope this provides some insight in to some of the underlying dependency problems that you may seen in Exchange.
That is all for now.