In many organizations the skill sets that administer Exchange, Active Directory (AD), Domain Naming System (DNS), and the network infrastructure are often segregated. Since Exchange is heavily dependent on communication with AD and that communication must traverse the network, the ability for the Exchange Administrator to have a basic understanding of how each of these components can be troubleshot is useful. Furthermore, often times the DNS, network, and AD administrators are not familiar with how Exchange is dependent on their services and thus are not fully prepared to help an Exchange administrator isolate production issues. As such, by the end of this it is hoped that there will be enough information to allow the Exchange, DNS, network, and AD administrators to each collect the data necessary to identify the majority of the issues.
There is some required reading that is necessary in order to understand much of the content that follows. Rather than reiterate all of the pre-requisite knowledge here, reviewing the following content might be helpful:
To start with, an understanding of what the performance counters "LDAP Read Time" and "LDAP Search Time" track is very useful (as identified in Ruling Out Active Directory-Bound Problems and Monitoring Common Counters: Exchange 2007 Help). One knows that once these counters approach or exceed the recommended thresholds, troubleshooting outside of Exchange needs to begin. Therefore, the question, "What are the potential contributors to high values for these counters?" is thus raised. Since these counters are installed with Exchange and are tracked within the DSAccess /ADAccess components, depending on the version of Exchange, there are a number of dependencies that Exchange does not have visibility into that the troubleshooter needs to be cognizant of when evaluating where the source of the delay is incurred. In order to provide a clearer picture of what needs to be considered, below is a walkthrough of a somewhat simplified list of events that happens after the clock starts ticking (reference How Active Directory Searches Work for greater detail):
1. WLDAP32.DLL receives the request from one of the Exchange processes. It has to locate a Global Catalog (GC) first. 2. DNS query traverses the network. Unresponsiveness or degraded responsiveness from DNS servers will degrade the overall performance of the query. 3. WLDAP32.DLL submits the query to the GC. 4. If not already started, a Transmission Control Protocol (TCP) session is established, the Lightweight Directory Access Protocol (LDAP) query traverses the wire.Note: since TCP requires a 3 stage handshake before the session and in turn windowing can work, multiply network latencies by 3 to figure out how much time it takes just to establish the session (thus that 10 ms latency becomes 30 ms delay even before the Exchange server submits the query). 5. The TCP data transmitted up the networking stack to LSASS.exe listening on the LDAP ports. 6. The query is processed by the GC and searches through the database to return the results. 7. The data is sent out the Network Interface Card (NIC) on the GC. 8. The query is received via WLDAP32.DLL from the GC. a. If there are multiple pages all pages need to be returned and stored in a data structure within DSAccess / ADAccess. Each of these requires traversal of the network and resubmission of the query. b. If there are a large quantity of values in the requested attribute (think group memberships), these all need to be retrieved and returned. This results in multiple queries to the GC. This may require steps 1 - 5 to recur.
1. WLDAP32.DLL receives the request from one of the Exchange processes. It has to locate a Global Catalog (GC) first.
2. DNS query traverses the network. Unresponsiveness or degraded responsiveness from DNS servers will degrade the overall performance of the query.
3. WLDAP32.DLL submits the query to the GC.
4. If not already started, a Transmission Control Protocol (TCP) session is established, the Lightweight Directory Access Protocol (LDAP) query traverses the wire.Note: since TCP requires a 3 stage handshake before the session and in turn windowing can work, multiply network latencies by 3 to figure out how much time it takes just to establish the session (thus that 10 ms latency becomes 30 ms delay even before the Exchange server submits the query).
5. The TCP data transmitted up the networking stack to LSASS.exe listening on the LDAP ports.
6. The query is processed by the GC and searches through the database to return the results.
7. The data is sent out the Network Interface Card (NIC) on the GC.
8. The query is received via WLDAP32.DLL from the GC.
a. If there are multiple pages all pages need to be returned and stored in a data structure within DSAccess / ADAccess. Each of these requires traversal of the network and resubmission of the query.
b. If there are a large quantity of values in the requested attribute (think group memberships), these all need to be retrieved and returned. This results in multiple queries to the GC. This may require steps 1 - 5 to recur.
As can be seen from the above set of steps a lot of the performance delays are heavily dependent on network latency and network performance. In fact the only portion of that number that is actually impacted by the GC performance itself is step 5. As such, in addition to troubleshooting any potential performance issues on the GC, it is highly advantageous to collect network trace data when issues do occur, preferably from both sides of the conversation.
Summary of Data needed to troubleshoot
Exchange Server - Having this data from both sides is important for comparison purposes (i.e. is the processor spike on the DC correlating with the degraded performance in Exchange)
Note: To ease collection of the Exchange performance data reference "Mike Lagase : Perfwiz replacement for Exchange 2007"
AD Server -
DNS Server - not all organizations use Microsoft DNS servers, but for the purposes of this article the assumption that a Microsoft DNS server is in use.
How to triage where troubleshooting needs to happen
Next, identify bottlenecks within the GC (note: for the most part these basics apply to troubleshooting DNS performance as well):
While there are a large variety of potential issues, there are some basic things that can be eliminated before advanced troubleshooting need take place. The below guidance are the most basic remediation suggestions for the largest variety of these types of issues. Essentially this comes down to, "Ok, so the issue is identified, now what do I do?"
Since there is plenty of published information on how to isolate disk, processor, and/or network issues, please reference the excellent articles linked at the beginning for the recommended thresholds. Also, check out Performance Analysis of Logs (PAL), which automates the analysis of the performance data collected to help determine where the system is bottlenecking.
Disk bound - Once the issue is isolated to being disk bound, there are essentially two options, add memory or allow for increased disk IO (i.e. add spindles for direct attached storage).
Processor bound:
Network:
Identifying the source of the GC load
The possibility exists that, despite the DC being either disk or processor bound, that the load is unnecessary. To determine this, the load on the box must be analyzed in greater detail than just the performance can provide. To that end, there are two main strategies for identifying the load on the box, both of which can be used concurrently. How to analyze this data is somewhat out of scope of this article, but the below bullets contain links to useful articles on how to use this data:
Conclusion
Data collection is key! Getting the right data when the problem occurs is absolutely critical to the restoring the health of the environment. Furthermore, as Exchange is the product suffering when its dependencies are not responding adequately, it falls on the Exchange administrator to do the leg work collecting the right data from the perspective of Exchange, and then isolating where the network, DNS, and AD administrators might need to look and start collecting data. Without this information, the network, DNS, and AD admins will be looking for a needle in a haystack and it is highly unlikely that they will find anything of use. Thus, once the Exchange admin has isolated the dependencies of concern, using the information provided here, it will be much easier to collect the data necessary from those systems to remediate any issues.
Thanks to Rod Fournier and Mike Lagase for their endless suggestions for improvement of this blog post.
- Ken Brumfield