Reposting some great content from a fellow PFE, that I believe should resonate with this audience!
Microsoft Exchange Server is the leading messaging platform in the world today, and can be found in SMBs, large enterprises and also behind cloud services of unimaginable scale. Exchange has been with us since its release in April 1996 and from that first version, it has evolved to offer businesses and consumers outstanding features that they have come to heavily rely upon.
This is fantastic! Until something happens…. That is, something bad happens, or something really bad happens.
When people are unable to access the services that they have become dependent upon, then: “Houston, we have a problem!” (Well, technically that should read “Houston we’ve had a problem here”, but… oh well!)
There are multiple preventative steps that we can take to help ensure that bad things don’t happen.
Are you an IT or business professional who needs to ensure:
If the answer to any of these questions is yes, then the chances are you may be wondering how best to achieve that! Microsoft has an offering that can help you and your business address these challenges!
So, what is it? The answer you have been seeking is the Exchange Risk Assessment Program (ExRAP).
The ExRAP is designed to review an enterprise’s Exchange organization. An ExRAP engagement helps identify both existing problems and risks of future problems by reviewing performance, operational processes and Exchange configuration settings.
ExRAP can help you solve the challenges above, and many more. By engaging with a certified Microsoft Exchange Premier Field Engineer (PFE) you have access to the ExRAP toolset, with all the solutions and knowledge that it contains.
Microsoft Premier Support helps thousands of customers each year achieve higher uptime and performance from their Exchange servers. Premier Field Engineering conducts ExRAPs against environments of all sizes, ranging from organisations with only a couple of Exchange servers to large corporates with tens or hundreds of servers. The largest ExRAP to date was for over 900 servers! Most ExRAPs cover far fewer servers than this, typically in the 10 – 30 server range, but it varies by geography and customer request.
Your Technical Account Manager (TAM) or Service Delivery Manager (SDM) will typically have discussed and positioned the ExRAP with your organisation, and a PFE is normally contacted once the engagement has been agreed upon.
At this time you and your TAM should have discussed the size of the Exchange environment, as that directly correlates to the amount of time (hours) needed to analyse it. As you can imagine, the larger the environment, the longer it takes, and thus the cost increases. ExRAP pricing is broken into distinct tiers, and it is the tier that dictates the number of hours required.
The ExRAP is then booked into the engagement system, and an accredited PFE is contacted to deliver it.
The next step is to acquire scoping information, using a scoping tool. An on-site delivery cannot commence until valid scoping details have been sent to Microsoft, and this is required for two reasons:
So submission of the scoping data is mandatory before the engagement can continue. Other questions that sometimes arise at this point may be something like:
While the team behind ExRAP at Microsoft certainly understands why customers ask these questions, they unfortunately defeat the design principles of ExRAP: ExRAP is designed to work against a holistic view of the Exchange environment, and as such we look at the entire Exchange organisation - not just a subset. Time and time again we have seen issues caused by one server impact other Exchange servers, and if we were to analyse only a portion of the environment we would not be able to accurately determine its true risk and health status. In some other RAPs, like SQL, SharePoint or Clustering, isolated instances are scanned, but ExRAP and ADRAP (for Active Directory) both require scanning of all of the relevant servers in a forest.
Now that we have reviewed some of the background to ExRAP, what are some of the top items that we see commonly reoccurring? Let’s look at the top 5 critical severity and top 5 high severity issues that are typically found globally during ExRAPs.
If you can address these top issues then you have already started to make fantastic inroads into maintaining Exchange server uptime!
Failure to keep your core infrastructure servers updated with the latest security updates and service packs, especially those with a "critical" severity rating, leaves the entire environment at extreme risk of service outages, data loss and exposure, and other malicious activities. An outage of the Exchange infrastructure can be a business loss-generating event; a quiet compromise of the Exchange infrastructure could be even more critical.
Your Microsoft TAM will send out an email every month describing the upcoming updates that will be released on “patch Tuesday” and then a more detailed email after the updates are available. You can also sign up for the update notification service yourself, and pass this on to others. Microsoft strongly recommends proactively reviewing these bulletins for applicability, testing the updates in a lab, and once validated, installing them into production in a defined maintenance window.
The Microsoft Baseline Security Analyzer (MBSA) is a useful tool for both scanning and reporting security status for a single server or across the computing environment. It scans for common incorrect configurations, overlooked default options, and the installation status of the latest security hot fixes from Microsoft. It also fully integrates with Microsoft Windows Server Update Services to scan systems according to a predetermined configuration.
The MBSA tool and its documentation can be obtained from the Microsoft Security Web site.
Microsoft Windows Server Update Services (WSUS) is available within the latest versions of Windows Server.
Not having a clearly documented disaster recovery SLA can have a number of consequences, including the following:
In consultation with management, administrators should develop disaster recovery SLAs that accurately reflect the business needs and requirements of the user community. These plans should then be documented to provide clear guidelines for handling a disaster recovery incident. Disaster recovery SLAs also allow administrators to set appropriate expectations during a disaster recovery incident, which can help provide administrators with time to perform basic analysis of the root cause of an issue, to prevent reoccurrence.
Disaster recovery service level agreements should take into account the recovery levels and the various business requirements of the user community. At a minimum, a disaster recovery plan should also include the following:
Establishing a Service Level Agreement
Service Level Agreements (SLAs) are negotiated agreements between IT and end customers. These agreements should contain several service targets including availability targets and windows of measurement.
Sometimes customers and IT do not effectively communicate such details, and as a consequence, misunderstandings regrettably occur. Typically this results in customers expecting 100% availability, whether or not that was funded or is even possible in the environment.
Another consequence of having no SLA is that IT does not have a mark to shoot for and, by extension, no bar against which to measure success. By having at least minimal service level agreements with customers, IT can improve the relationship with customers and also set expectations that they can manage.
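To make the idea of an availability target and a window of measurement concrete, here is a minimal sketch of the arithmetic. The function name, the targets and the 30-day window are illustrative assumptions, not values taken from any ExRAP guidance.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, window: timedelta) -> timedelta:
    """Return the maximum downtime a target allows within a measurement window."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be greater than 0 and at most 100")
    return window * (1 - availability_pct / 100)


# Hypothetical targets measured over a hypothetical 30-day window.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget(target, timedelta(days=30))
    minutes = budget.total_seconds() / 60
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Putting a number like this in front of customers is one way to move the conversation away from an unfunded expectation of 100% availability.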
Exchange Service Management Guide
MOF Service Level Management Service Management Function Guide
MOF Availability Management Service Management Function Guide
ITIL Service Delivery Publication
Disaster recovery is one of the most important functions of Exchange Server administrators. Recovering from many disasters frequently requires the coordination of multiple individuals, perhaps across multiple teams. Without a predefined plan for activating and coordinating these critical resources, the success of your recovery is left to chance and circumstance.
A well-documented Disaster Recovery Plan reduces the time spent deciding what to do, helps keep those involved up-to-date, and ensures that your organization can recover as quickly and efficiently as possible. Your plan should also ensure that the services and infrastructure upon which Exchange Server relies are available, reliable, and recoverable. An additional benefit in creating a Disaster Recovery Plan is that, during the plan-development process, you may discover areas where your systems are vulnerable. These vulnerabilities can then be reduced or removed to make your systems more robust and recoverable.
Create, test and maintain a detailed DR plan. Ensure that all documentation and the various prerequisites are available should the primary site totally cease to function. There have been several cases where issues were observed because documentation and files were stored only in the primary site. When that datacentre failed, all access to the required documentation was lost.
Understanding High Availability and Site Resilience in Exchange 2010
Disaster Recovery for Microsoft Exchange Server 2007
Exchange Server 2003 High Availability Guide
Service Level Agreements are agreements between IT and the customer, while Operational Level Agreements (OLAs) are agreements between the messaging team and the other groups that own the services supporting Exchange. Creating and maintaining SLAs and OLAs for your organization is a critical first step in being able to measure your own rate of success with Exchange Server. If SLAs and the corresponding OLAs are not present, then it is difficult to design or accurately predict the outcome of an Exchange Server implementation.
· Negotiate initial operational level agreements
· Measure and report achievement(s)
· Strive to improve the agreements over time
Monitoring an Exchange Server environment is a critical aspect of running a successful Exchange organisation. Ineffective or absent monitoring can lead to negative effects on performance, availability, and security.
Ensure that all relevant performance counters are monitored. This includes not just RPC latency, store RPC latency, disk latency and RPC operations, but all other relevant counters as well. Installing a monitoring tool and then assuming that its default installation will successfully monitor Exchange is a mistake: some counters will need to be added and others tuned down. TechNet has values for the counters but, as Captain Jack Sparrow often says, they are guidelines rather than rules. For example, organisations that run Outlook in online mode exclusively will be far less tolerant of disk IO blips, and thus the thresholds must be considered in the milieu of a given organisation.
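As a rough illustration of treating published thresholds as guidelines rather than rules, the sketch below evaluates the same counter samples against a default set of thresholds and against a tightened set. The counter labels, the threshold values and the sample data are all hypothetical; they are not TechNet figures and not the output of any particular monitoring tool.

```python
from statistics import mean

# Hypothetical default thresholds in milliseconds (illustrative values only).
DEFAULT_THRESHOLDS_MS = {
    "RPC Averaged Latency": 50.0,
    "Disk Read Latency": 20.0,
}


def evaluate(samples, thresholds):
    """Return {counter: average} for counters whose average exceeds its threshold."""
    breaches = {}
    for counter, values in samples.items():
        limit = thresholds.get(counter)
        if limit is None or not values:
            continue  # counter is not monitored, or there is no data to judge
        average = mean(values)
        if average > limit:
            breaches[counter] = average
    return breaches


# An organisation running Outlook purely in online mode might choose a
# tighter disk threshold than the default (again, a made-up value).
online_mode_thresholds = {**DEFAULT_THRESHOLDS_MS, "Disk Read Latency": 10.0}

samples = {
    "RPC Averaged Latency": [12, 18, 25],
    "Disk Read Latency": [9, 14, 16],
}

print("Default thresholds:", evaluate(samples, DEFAULT_THRESHOLDS_MS))
print("Online-mode thresholds:", evaluate(samples, online_mode_thresholds))
```

The same samples pass under one set of thresholds and are flagged under the other, which is the point: the thresholds encode what a given organisation can tolerate.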
Troubleshooting Microsoft Exchange Server 2003 Performance
Monitoring without System Center Operations Manager (Exchange 2007)
Monitoring Exchange server 2010
DNS is the primary name resolution mechanism for Active Directory, Exchange Server and Outlook clients. Invalid DNS data will break AD replication, authentication and resource lookups. A direct result of this will be that Exchange cannot locate global catalog servers and Outlook clients cannot communicate with Exchange.
Use automated tools to ensure that all of the records required by DCs are registered in DNS. This includes checking both SRV and A records; it is not sufficient to just ping a DC, as this does not fully validate the records used by NetLogon. DNSLint and DCDIAG.exe /test:DNS are highlighted below. Other monitoring tools and platforms are also able to achieve the same.
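To show why a ping is not enough, here is a minimal sketch of the kind of check such tools perform: build a list of records a domain is expected to have, then report the ones that do not resolve. The record list is only a small subset of what NetLogon registers, the domain and host names are hypothetical, and the lookup is a stand-in function over made-up zone data rather than a real DNS query. It is not a replacement for DCDIAG or DNSLint.

```python
def expected_records(domain, forest, dc_hosts):
    """Return (type, name) pairs for a subset of the records AD relies on."""
    records = [
        ("SRV", f"_ldap._tcp.{domain}"),
        ("SRV", f"_kerberos._tcp.{domain}"),
        ("SRV", f"_ldap._tcp.dc._msdcs.{domain}"),
        ("SRV", f"_kerberos._tcp.dc._msdcs.{domain}"),
        ("SRV", f"_ldap._tcp.gc._msdcs.{forest}"),
    ]
    # Each DC also needs a host (A) record of its own.
    records += [("A", host) for host in dc_hosts]
    return records


def missing_records(records, lookup):
    """Return the records for which lookup(record_type, name) finds nothing."""
    return [record for record in records if not lookup(*record)]


# Hypothetical environment: a single-domain forest with two DCs.
dcs = ["dc01.corp.example.com", "dc02.corp.example.com"]
wanted = expected_records("corp.example.com", "corp.example.com", dcs)

# Made-up zone contents standing in for a real resolver: the GC locator
# record and the A record for dc02 were never registered.
zone = set(wanted) - {
    ("SRV", "_ldap._tcp.gc._msdcs.corp.example.com"),
    ("A", "dc02.corp.example.com"),
}


def fake_lookup(record_type, name):
    """Stand-in for a DNS query; a real check would ask the DNS servers."""
    return (record_type, name) in zone


for record_type, name in missing_records(wanted, fake_lookup):
    print(f"Missing {record_type} record: {name}")
```

In this made-up zone dc01 would still answer a ping, yet the missing global catalog locator record is exactly the sort of gap that stops Exchange finding a global catalog server.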
Troubleshooting DNS using the DCDIAG Tool
Description of the DNSLint Utility
Using the DNSLint Utility
Testing updates before deployment helps minimize the risk of adverse effects that an update might introduce in your environment. Although the depth of testing should depend on the importance of email to your business, some level of testing needs to be performed.
Based on the business importance of the messaging environment, you can then create a test plan that works to meet the SLA. At a bare minimum, a test Exchange server should be established and the update installed to verify that no obvious issues arise. Patches typically go untested because the customer has no test lab at all, the test lab is badly out of sync with production, or the test lab has been so cannibalised for parts and resources that it is ineffective.
Create a test environment that ideally mirrors production. Some organisations choose to deploy the test environment on virtual machines (VMs), which is fine; the crucial aspect is ensuring that the relevant parts of the environment are tested, for example mail flow, client connectivity, server-to-server interoperability and, ideally, 3rd party services. Note that this should be a separate forest. How can you test schema extensions on a “test” machine that is in the corporate forest? The answer is that you cannot, as the schema will be updated and replicated to all servers regardless of whether or not they are deemed “test”.
MOF Release Management Service Management Function Guide
ITIL Service Support Publication
Complexity and inconsistencies are two demons that will challenge any infrastructure. Complexity for complexity’s sake is generally a poor idea, and simplicity will win out as it is easier to support in the long run. If there are no build standards this will result in servers having different configurations depending upon who built them and what they had for breakfast. As a direct result of this, failed changes will increase. Thus the time (and cost) of troubleshooting will also increase.
The first step in bringing servers under control and minimizing complexity is to create and follow a step-by-step build document for servers. By following a detailed build document, servers will be consistent when they go into production. Change management will then assist in keeping them consistent during their lifecycle.
Detailed build documentation should be created to document a server’s entire configuration. This is to include hardware specific configuration, OS, Exchange, service pack and update levels and also all of the 3rd party components that make up your messaging ecosystem.
Additionally, drift from the known configuration should be proactively tracked and monitored. This is called Desired Configuration Management (DCM) and is a function of SCCM. If you are interested in obtaining assistance from PFE with this, please speak to your TAM, as we have a specific offering that meets this need.
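The idea behind tracking drift can be sketched in a few lines: compare what a server reports against the documented build and list every difference. This is only an illustration of the concept, not how SCCM implements it, and the setting names and values below are hypothetical.

```python
def find_drift(baseline, actual):
    """Return {setting: (expected, found)} for settings that differ from the baseline.

    Settings present on the server but absent from the baseline are ignored;
    a setting missing from the server is reported with a found value of None.
    """
    drift = {}
    for setting, expected in baseline.items():
        found = actual.get(setting)
        if found != expected:
            drift[setting] = (expected, found)
    return drift


# Hypothetical build standard taken from a build document.
build_standard = {
    "service_pack": "SP2",
    "page_file": "fixed",
    "antivirus_exclusions": "configured",
}

# Hypothetical inventory gathered from one server.
server_inventory = {
    "service_pack": "SP1",
    "page_file": "fixed",
}

for setting, (expected, found) in find_drift(build_standard, server_inventory).items():
    print(f"{setting}: expected {expected!r}, found {found!r}")
```

A check like this is only as good as the build document behind it, which is why the documentation has to come first.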
SCCM Desired Configuration Monitoring
Exchange 2010 build documents template
Measuring, reporting, and publishing availability data for the Messaging service is essential and assists with:
There is no point in having an SLA and not measuring whether you are meeting it. This can be called “driving in the dark” or not “keeping yourself honest”; either way, you do not know if you are actually meeting the SLA requirements.
Leverage an automated toolset that calculates and reports on messaging and email availability as defined by your SLA. These reports should be available within your organisation and can then be used to drive improvements to the messaging services that you provide to end users. Should you find that SLAs are not being met, a conversation can now happen with the business to ask for funding or development time, or to modify the SLA. It may not be ideal, but all parties know how the environment is performing.
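As a minimal sketch of the calculation behind such a report, the example below turns a list of recorded outages into a measured availability figure and compares it with the SLA target. The dates, outage durations and the 99.9% target are hypothetical, and the function assumes that the recorded outages do not overlap one another.

```python
from datetime import datetime, timedelta


def measured_availability(window_start, window_end, outages):
    """Return the percentage of the window during which the service was up.

    outages is a list of (start, end) datetime pairs, assumed not to overlap.
    """
    window = window_end - window_start
    downtime = timedelta()
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        start = max(start, window_start)
        end = min(end, window_end)
        if end > start:
            downtime += end - start
    return 100 * (1 - downtime / window)


# Hypothetical 30-day measurement window with two recorded outages.
window_start = datetime(2024, 4, 1)
window_end = datetime(2024, 5, 1)
outages = [
    (datetime(2024, 4, 9, 2, 0), datetime(2024, 4, 9, 3, 30)),     # 90 minutes
    (datetime(2024, 4, 20, 14, 0), datetime(2024, 4, 20, 14, 30)),  # 30 minutes
]

sla_target = 99.9  # hypothetical target
availability = measured_availability(window_start, window_end, outages)
status = "met" if availability >= sla_target else "missed"
print(f"Measured availability {availability:.3f}% against {sla_target}% target: {status}")
```

What counts as an outage (planned maintenance, a single database, a single site) is a matter for the SLA itself; the code only does the arithmetic once that has been agreed.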
Now that we have gone through the top 10 issues, you are prepared to work to address them within your organisation, and by doing so can improve the uptime and reliability of messaging services in your environment.
You may have noticed that the majority of this article has been around the “softer” side of managing Exchange, and not just gnarly and arcane technical facts. Why is that, you may ask?
In a nutshell: technology does not exist in isolation.
You may have seen the MOF diagram that discusses people, processes and technology? Of these three areas the smallest is technology. I have personally seen customers have better uptime from well-maintained standalone systems, compared with others that have badly maintained “highly-available” clusters.
Because of this, the biggest impact can often be generated from improving the policies, processes and management practices within the messaging environment. By creating the necessary documentation and processes, you ensure everyone knows how the Exchange environment should look, how it is meant to be administered and the level of services that end users should expect!