• Released: Update Rollup 5 for Exchange 2010 Service Pack 3 and Update Rollup 13 for Exchange 2007 Service Pack 3

    The Exchange team is announcing the availability of the following updates:

    Exchange Server 2010 Service Pack 3 Update Rollup 5 resolves customer reported issues and includes previously released security bulletins for Exchange Server 2010 Service Pack 3. A complete list of the issues resolved in this rollup is available in KB2917508.

    Exchange Server 2007 Service Pack 3 Update Rollup 13 provides recent DST changes and adds the ability to publish a 2007 Edge Server from Exchange Server 2013. Update Rollup 13 also contains all previously released security bulletins, fixes, and updates for Exchange Server 2007 Service Pack 3. More information on this rollup is available in KB2917522.

    Neither release is classified as a security release, but customers are encouraged to deploy these updates to their environments once proper validation has been completed.

    Note: KB articles may not be fully available at the time of publishing of this post.

    The Exchange Team

  • The Preferred Architecture

    During my session at the recent Microsoft Exchange Conference (MEC), I revealed Microsoft’s preferred architecture (PA) for Exchange Server 2013. The PA is the Exchange Engineering Team’s prescriptive approach to what we believe is the optimum deployment architecture for Exchange 2013, and one that is very similar to what we deploy in Office 365.

    While Exchange 2013 offers a wide variety of architectural choices for on-premises deployments, the architecture discussed below is our most scrutinized one ever. Other deployment architectures are supported, but they are not recommended.

    The PA is designed with several business requirements in mind. For example, requirements that the architecture be able to:

    • Include both high availability within the datacenter, and site resilience between datacenters
    • Support multiple copies of each database, thereby allowing for quick activation
    • Reduce the cost of the messaging infrastructure
    • Increase availability by optimizing around failure domains and reducing complexity

    The specific prescriptive nature of the PA means of course that not every customer will be able to deploy it (for example, customers without multiple datacenters). And some of our customers have different business requirements or other needs that necessitate an architecture different from the one shown here. If you fall into one of those categories and you want to deploy Exchange on-premises, there are still advantages to adhering as closely as possible to the PA, deviating only where your requirements differ widely. Alternatively, you can consider Office 365, where you can take advantage of the PA without having to deploy or manage servers.

    Before I delve into the PA, I think it is important that you understand a concept that is the cornerstone for this architecture – simplicity.

    Simplicity

    Failure happens. There is no technology that can change this. Disks, servers, racks, network appliances, cables, power substations, generators, operating systems, applications (like Exchange), drivers, and other services – there is simply no part of an IT services offering that is not subject to failure.

    One way to mitigate failure is to build in redundancy. Where one entity is likely to fail, two or more entities are used. This pattern can be observed in Web server arrays, disk arrays, and the like. But redundancy by itself can be prohibitively expensive (a simple multiplication of cost). For example, the cost and complexity of the SAN-based storage system that was at the heart of Exchange until the 2007 release drove the Exchange Team to step up its investment in the storage stack and to evolve the Exchange application to integrate the important elements of storage directly into its architecture. We recognized that every SAN system would ultimately fail, and that implementing a highly redundant system using SAN technology would be cost-prohibitive. In response, Exchange has evolved from requiring expensive, scaled-up, high-performance SAN storage and related peripherals, to now being able to run on cheap, scaled-out servers with commodity, low-performance SAS/SATA drives in a JBOD configuration with commodity disk controllers. This architecture enables Exchange to be resilient to any storage-related failure, while enabling you to deploy large mailboxes at a reasonable cost.

    By building the replication architecture into Exchange and optimizing Exchange for commodity storage, the failure mode is predictable from a storage perspective. This approach does not stop at the storage layer; redundant NICs, power supplies, etc., can also be removed from the server hardware. Whether it is a disk, controller, or motherboard that fails, the end result should be the same: another database copy is activated and takes over.

    The more complex the hardware or software architecture, the more unpredictable failure events become. Managing failure at any scale is all about making recovery predictable, which drives the necessity of having predictable failure modes. Examples of complex redundancy are active/passive network appliance pairs, aggregation points on the network with complex routing configurations, network teaming, RAID, multiple fiber pathways, and so on. Removing complex redundancy seems counterintuitive on its face: how can removing redundancy increase availability? Moving away from complex redundancy models to a software-based redundancy model creates a predictable failure mode.

    The PA removes complexity and redundancy where necessary to drive the architecture to a predictable recovery model: when a failure occurs, another copy of the affected database is activated.

    The PA is divided into four areas of focus:

    1. Namespace design
    2. Datacenter design
    3. Server design
    4. DAG design

    Namespace Design

    In the Namespace Planning and Load Balancing Principles articles, I outlined the various configuration choices that are available with Exchange 2013. From a namespace perspective, the choices are to either deploy a bound namespace (having a preference for the users to operate out of a specific datacenter) or an unbound namespace (having the users connect to any datacenter without preference).

    The recommended approach is to utilize the unbound model, deploying a single namespace per client protocol for the site resilient datacenter pair (where each datacenter is assumed to represent its own Active Directory site - see more details on that below). For example:

    • autodiscover.contoso.com
    • For HTTP clients: mail.contoso.com
    • For IMAP clients: imap.contoso.com
    • For SMTP clients: smtp.contoso.com

    Figure 1: Namespace Design

    Each namespace is load balanced across both datacenters in a configuration that does not leverage session affinity, resulting in fifty percent of traffic being proxied between datacenters. Traffic is equally distributed across the datacenters in the site resilient pair, via DNS round-robin, geo-DNS, or a similar solution you may have at your disposal. From our perspective, DNS round-robin is the simplest solution and the easiest to manage, so that is our recommendation.
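
    If you are using Windows DNS, round-robin is simply two A records for the same name. Here is a minimal sketch (the DnsServer module is assumed, and the two addresses are placeholders for each datacenter's load-balanced virtual IP):

    # Sketch: DNS round-robin for mail.contoso.com across the datacenter pair.
    Add-DnsServerResourceRecordA -ZoneName 'contoso.com' -Name 'mail' -IPv4Address '192.0.2.10'
    Add-DnsServerResourceRecordA -ZoneName 'contoso.com' -Name 'mail' -IPv4Address '198.51.100.10'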

    In the event that you have multiple site resilient datacenter pairs in your environment, you will need to decide whether you want a single worldwide namespace, or whether you want to control the traffic to each specific datacenter pair by using regional namespaces. Ultimately your decision depends on your network topology and the cost associated with using an unbound model; for example, if you have datacenters located in North America and Europe, the network link between these regions might not only be costly, but it might also have high latency, which can introduce user pain and operational issues. In that case, it makes sense to deploy a bound model with a separate namespace for each region.

    Site Resilient Datacenter Pair Design

    To achieve a highly available and site resilient architecture, you must have two or more datacenters that are well-connected (ideally, you want a low round-trip network latency, otherwise replication and the client experience are adversely affected). In addition, the datacenters should be connected via redundant network paths supplied by different operating carriers.

    While we support stretching an Active Directory site across multiple datacenters, for the PA we recommend having each datacenter be its own Active Directory site. There are two reasons:

    1. Transport site resilience via Shadow Redundancy and Safety Net can only be achieved when the DAG has members located in more than one Active Directory site.
    2. Published Active Directory guidance states that subnets should be placed in different Active Directory sites when the round-trip latency between the subnets is greater than 10 ms.

    Server Design

    In the PA, all servers are physical, multi-role servers. Physical hardware is deployed rather than virtualized hardware for two reasons:

    1. The servers are scaled to utilize eighty percent of resources during the worst-failure mode.
    2. Virtualization adds an additional layer of management and complexity, which introduces additional recovery modes that do not add value, as Exchange provides equivalent functionality out of the box.

    By deploying multi-role servers, the architecture is simplified as all servers have the same hardware, installation process, and configuration options. Consistency across servers also simplifies administration. Multi-role servers provide more efficient use of server resources by distributing the Client Access and Mailbox resources across a larger pool of servers. Client Access and Database Availability Group (DAG) resiliency is also increased, as there are more servers available for the load-balanced pool and for the DAG.

    Commodity server platforms (e.g., 2U servers that hold 12 large form-factor drive bays within the server chassis) are used in the PA. Additional drive bays can be deployed per server depending on the number of mailboxes, mailbox size, and the server’s scalability.

    Each server houses a single RAID1 disk pair for the operating system, Exchange binaries, protocol/client logs, and transport database. The rest of the storage is configured as JBOD, using large-capacity 7.2K RPM Serial Attached SCSI (SAS) disks (while SATA disks are also available, the SAS equivalent provides better IO and a lower annualized failure rate). BitLocker is used to encrypt each disk, thereby providing data encryption at rest and mitigating concerns around data theft via disk replacement.
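
    As a rough illustration (assuming the BitLocker cmdlets available on Windows Server 2012 R2; the mount point is a placeholder), each database volume might be encrypted like this:

    # Sketch: BitLocker-encrypt a JBOD database volume with a recovery password protector.
    # 'E:' is a placeholder; repeat per database volume.
    Enable-BitLocker -MountPoint 'E:' -EncryptionMethod Aes256 -RecoveryPasswordProtector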

    To ensure that the capacity and IO of each disk is used as efficiently as possible, four database copies are deployed per disk. The normal run-time copy layout (calculated in the Exchange 2013 Server Role Requirements Calculator) ensures that there is no more than a single copy activated per disk.

    Figure 2: Server Design

    At least one disk in the disk pool is reserved as a hot spare. AutoReseed is enabled and quickly restores database redundancy after a disk failure by activating the hot spare and initiating database copy reseeds.
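
    For reference, AutoReseed's volume and database root paths and the copies-per-volume count are DAG properties. Here is a minimal sketch matching the four-copies-per-disk layout above (the DAG name and paths are placeholders):

    # Sketch: AutoReseed mount-point configuration for a DAG.
    Set-DatabaseAvailabilityGroup -Identity 'DAG1' `
        -AutoDagVolumesRootFolderPath 'C:\ExchangeVolumes' `
        -AutoDagDatabasesRootFolderPath 'C:\ExchangeDatabases' `
        -AutoDagDatabaseCopiesPerVolume 4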

    Database Availability Group Design

    Within each site resilient datacenter pair you will have one or more DAGs.

    DAG Configuration

    As with the namespace model, each DAG within the site resilient datacenter pair operates in an unbound model with active copies distributed equally across all servers in the DAG. This model provides two benefits:

    1. Ensures that each DAG member’s full stack of services is being validated (client connectivity, replication pipeline, transport, etc.).
    2. Distributes the load across as many servers as possible during a failure scenario, thereby only incrementally increasing resource utilization across the remaining members within the DAG.

    Each datacenter is symmetrical, with an equal number of member servers within a DAG residing in each datacenter. This means that each DAG contains an even number of servers and uses a witness server for quorum arbitration.

    The DAG is the fundamental building block in Exchange 2013. With respect to DAG size, a larger DAG provides more redundancy and resources. Within the PA, the goal is to deploy larger DAGs (typically starting with an eight-member DAG and adding servers as needed to meet your requirements), and to create new DAGs only when scalability introduces concerns over the existing database copy layout.

    DAG Network Design

    Since the introduction of continuous replication in Exchange 2007, Exchange has recommended multiple networks to separate client traffic from replication traffic. Deploying two networks allows you to isolate certain traffic along different network pathways and ensure that during certain events (e.g., reseed events) the network interface is not saturated (which is an issue with 100Mb, and to a certain extent, 1Gb interfaces). However, for most customers, having two networks operating in this manner was only a logical separation, as the same copper fabric was used by both networks in the underlying network architecture.

    With 10Gb networks becoming the standard, the PA moves away from the previous guidance of separating client traffic from replication traffic. A single network interface is all that is needed, because ultimately our goal is to achieve a standard recovery model regardless of the failure: whether a server failure occurs or a network failure occurs, the result is the same, and a database copy is activated on another server within the DAG. This architectural change simplifies the network stack and obviates the need to eliminate heartbeat cross-talk.

    Witness Server Placement

    Ultimately, the placement of the witness server determines whether the architecture can provide automatic datacenter failover capabilities or whether it will require a manual activation to enable service in the event of a site failure.

    If your organization has a third location with a network infrastructure that is isolated from network failures that affect the site resilient datacenter pair in which the DAG is deployed, then the recommendation is to deploy the DAG’s witness server in that third location. This configuration gives the DAG the ability to automatically failover databases to the other datacenter in response to a datacenter-level failure event, regardless of which datacenter has the outage.

    Figure 3: DAG (Three Datacenter) Design

    If your organization does not have a third location, then place the witness server in one of the datacenters within the site resilient datacenter pair. If you have multiple DAGs within the site resilient datacenter pair, then place the witness server for all DAGs in the same datacenter (typically the datacenter where the majority of the users are physically located). Also, make sure the Primary Active Manager (PAM) for each DAG is also located in the same datacenter.
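
    In either case, the witness placement is just a pair of DAG properties. A minimal sketch (the server, directory, and DAG names are placeholders):

    # Sketch: point the DAG at its witness server and directory.
    Set-DatabaseAvailabilityGroup -Identity 'DAG1' `
        -WitnessServer 'FS01.contoso.com' -WitnessDirectory 'C:\DAG1Witness'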

    Data Resiliency

    Data resiliency is achieved by deploying multiple database copies. In the PA, database copies are distributed across the site resilient datacenter pair, thereby ensuring that mailbox data is protected from software, hardware and even datacenter failures.

    Each database has four copies, with two copies in each datacenter, which means that at a minimum the PA requires four servers. Out of these four copies, three are configured as highly available. The fourth copy (the copy with the highest activation preference number) is configured as a lagged database copy. Due to the server design, each copy of a database is isolated from its other copies, thereby reducing failure domains and increasing the overall availability of the solution, as discussed in DAG: Beyond the “A”.

    The purpose of the lagged database copy is to provide a recovery mechanism for the rare event of system-wide, catastrophic logical corruption. It is not intended for individual mailbox recovery or mailbox item recovery.

    The lagged database copy is configured with a seven-day ReplayLagTime. In addition, Replay Lag Manager is enabled to provide dynamic log file play down for lagged copies. This feature ensures that the lagged database copy can be automatically played down and made highly available in the following scenarios (a configuration sketch follows the list):

    • When a low disk space threshold is reached
    • When the lagged copy has physical corruption and needs to be page patched
    • When there are fewer than three available healthy copies (active or passive) for more than 24 hours
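
    As a configuration sketch (the database, server, and DAG names are placeholders; the lag and preference values mirror the description above):

    # Sketch: add the lagged fourth copy with a seven-day replay lag,
    # then enable Replay Lag Manager for dynamic play down.
    Add-MailboxDatabaseCopy -Identity 'DB01' -MailboxServer 'EX04' `
        -ReplayLagTime 7.00:00:00 -ActivationPreference 4
    Set-DatabaseAvailabilityGroup -Identity 'DAG1' -ReplayLagManagerEnabled $true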

    When using the lagged database copy in this manner, it is important to understand that it is not a guaranteed point-in-time backup. The lagged database copy will have an availability threshold, typically around 90%, due to periods where the disk containing the lagged copy is lost to disk failure, periods where the lagged copy becomes an HA copy (due to automatic play down), and periods where the lagged database copy is rebuilding the replay queue.

    To protect against accidental (or malicious) item deletion, Single Item Recovery or In-Place Hold technologies are used, and the Deleted Item Retention window is set to a value that meets or exceeds any defined item-level recovery SLA.
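
    A minimal sketch (the mailbox identity and the 30-day window are examples, not PA requirements):

    # Sketch: enable Single Item Recovery and a 30-day deleted item retention window.
    Set-Mailbox -Identity 'kim@contoso.com' -SingleItemRecoveryEnabled $true -RetainDeletedItemsFor 30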

    With all of these technologies in play, traditional backups are unnecessary; as a result, the PA leverages Exchange Native Data Protection.

    Summary

    The PA takes advantage of the changes made in Exchange 2013 to simplify your Exchange deployment, without decreasing the availability or the resiliency of the deployment. And in some scenarios, when compared to previous generations, the PA increases availability and resiliency of your deployment.

    Ross Smith IV
    Principal Program Manager
    Office 365 Customer Experience

  • Exchange 2010 SP1 FAQ and Known Issues

    Last week we released Exchange Server 2010 Service Pack 1. It has received some great feedback and reviews from customers, experts, analysts, and the Exchange community.

    The starting point for SP1 setup/upgrade should be the What's New in SP1, SP1 Release Notes, and Prerequisites docs. As with any new release, there are some frequently asked deployment questions, as well as known issues or issues reported by some customers. You may not encounter these in your environment, but we're posting them here along with some workarounds so you're aware of them as you test and deploy SP1.

    1. Upgrade order

      The order of upgrade from Exchange 2010 RTM to SP1 hasn’t changed from what was done in Exchange 2007. Upgrade server roles in the following order:

      1. Client Access server
      2. Hub Transport server
      3. Unified Messaging server
      4. Mailbox server

      The Edge Transport server role can be upgraded at any time; however, we recommend upgrading Edge Transport either before all other server roles have been upgraded or after all other server roles have been upgraded. For more details, see Upgrade from Exchange 2010 RTM to Exchange 2010 SP1 in the documentation.

    2. Exchange 2010 SP1 Prerequisites

      Exchange 2010 SP1 requires the installation of 4-5 hotfixes, depending on the operating system (Windows Server 2008 or Windows Server 2008 R2). To install the Exchange 2010 SP1 administration tools on Windows 7 or Windows Vista, you need two hotfixes.

      Note: Due to the shared code base for these updates, Windows Server 2008 and Windows Vista share the same updates. Similarly, Windows Server 2008 R2 and Windows 7 share the same updates. Make sure you select the x64 versions of each update to be installed on your Exchange 2010 servers.

      Update 2/11/2011: Windows 2008 R2 SP1 includes all the required hotfixes listed in this table — 979744, 983440, 979099, 982867 and 977020. If you're installing Exchange 2010 SP1 on a server running Windows 2008 R2 SP1, you don't need to install these hotfixes separately. For a complete list of all updates included in Windows 2008 R2 SP1, see Updates in Win7 and WS08R2 SP1.xls.

      Here’s a matrix of the updates required, including download locations and file names.

      979744: A .NET Framework 2.0-based Multi-AppDomain application stops responding when you run the application
        Download: MSDN or Microsoft Connect
        Windows Server 2008: Windows6.0-KB979744-x64.msu (CBS: Vista/Win2K8)
        Windows Server 2008 R2: Windows6.1-KB979744-x64.msu (CBS: Win7/Win2K8 R2)
        Windows 7 & Windows Vista: N/A

      983440: An ASP.NET 2.0 hotfix rollup package is available for Windows 7 and for Windows Server 2008 R2
        Download: Request from CSS
        Windows Server 2008: N/A
        Windows Server 2008 R2: Yes
        Windows 7 & Windows Vista: N/A

      977624: AD RMS clients do not authenticate federated identity providers in Windows Server 2008 or in Windows Vista. Without this update, Active Directory Rights Management Services (AD RMS) features may stop working
        Download: Request from CSS
        Windows Server 2008: Select the download for Windows Vista for the x64 platform
        Windows Server 2008 R2: N/A
        Windows 7 & Windows Vista: N/A

      979917: Two issues occur when you deploy an ASP.NET 2.0-based application on a server that is running IIS 7.0 or IIS 7.5 in Integrated mode
        Download: MSDN
        Windows Server 2008: Windows6.0-KB979917-x64.msu (Vista)
        Windows Server 2008 R2: N/A
        Windows 7 & Windows Vista: N/A

      973136: FIX: ArgumentNullException exception error message when a .NET Framework 2.0 SP2-based application tries to process a response with zero-length content to an asynchronous ASP.NET Web service request: "Value cannot be null"
        Download: Microsoft Connect
        Windows Server 2008: Windows6.0-KB973136-x64.msu
        Windows Server 2008 R2: N/A
        Windows 7 & Windows Vista: N/A

      977592: RPC over HTTP clients cannot connect to the Windows Server 2008 RPC over HTTP servers that have RPC load balancing enabled
        Download: Request from CSS
        Windows Server 2008: Select the download for Windows Vista (x64)
        Windows Server 2008 R2: N/A
        Windows 7 & Windows Vista: N/A

      979099: An update is available to remove the application manifest expiry feature from AD RMS clients
        Download: Download Center
        Windows Server 2008: N/A
        Windows Server 2008 R2: Windows6.1-KB979099-x64.msu
        Windows 7 & Windows Vista: N/A

      982867: WCF services that are hosted by computers together with a NLB fail in .NET Framework 3.5 SP1
        Download: MSDN
        Windows Server 2008: Windows6.0-KB982867-v2-x64.msu (Vista)
        Windows Server 2008 R2: Windows6.1-KB982867-v2-x64.msu (Win7)
        Windows 7 & Windows Vista: x64: Windows6.1-KB982867-v2-x64.msu; x86: Windows6.1-KB982867-v2-x86.msu

      977020: FIX: An application that is based on the Microsoft .NET Framework 2.0 Service Pack 2 and that invokes a Web service call asynchronously throws an exception on a computer that is running Windows 7
        Download: Microsoft Connect
        Windows Server 2008: N/A
        Windows Server 2008 R2: x64: Windows6.1-KB977020-v2-x64.msu
        Windows 7 & Windows Vista: x64: Windows6.1-KB977020-v2-x64.msu; x86: Windows6.1-KB977020-v2-x86.msu
      Some of the hotfixes would have been rolled up in a Windows update or service pack. Because the Exchange team released SP1 earlier than originally planned and announced, the release did not align with some of the related work on the Windows platform. As a result, some hotfixes are available from MSDN/Connect, and some require that you request them online using the links in the corresponding KB articles. The administrator experience when initially downloading these hotfixes may be a little odd. However, once you download the hotfixes, and receive two of the hotfixes from CSS, you can reuse them for subsequent installs on other servers. In due course, all these updates may become available on the Download Center, and also through Windows Update.

      These hotfixes have been tested extensively as part of Exchange 2010 SP1 deployments within Microsoft and by our TAP customers. They are fully supported by Microsoft.

    3. Prerequisite download pages linked from SP1 Setup are unavailable

      When installing Exchange Server 2010 SP1, the prerequisite check may turn up some required hotfixes to install. The message includes a link to click for help; however, clicking this link redirects you to a page saying that the content does not exist.

      We're working to update the linked content.

      Meanwhile, please refer to the TechNet article Exchange 2010 Prerequisites to download and install the prerequisites required for your server version (the hotfixes are linked in the table above, but you'll still need to install the usual prerequisites such as the .NET Framework 3.5 SP1, Windows Remote Management (WinRM) 2.0, and the required OS components).

    4. The Missing Exchange Management Shell Shortcut

      Some customers have reported that after upgrading an Exchange Server 2010 server to Exchange 2010 SP1, the Exchange Management Shell shortcut is missing from program options. Additionally, the .ps1 script files associated with the EMS may also be missing.

      We’re actively investigating this issue. Meanwhile, here’s a workaround:

      1. Verify that the following files are present in the %ExchangeInstallPath%\bin directory:
        • CommonConnectFunctions.ps1
        • CommonConnectFunctions.strings.psd1
        • Connect-ExchangeServer-help.xml
        • ConnectFunctions.ps1
        • ConnectFunctions.strings.psd1
        • RemoteExchange.ps1
        • RemoteExchange.strings.psd1

        NOTE: If these files are missing, you can copy the files from the Exchange Server 2010 Service Pack 1 installation media to the %ExchangeInstallPath%\bin directory. These files are present in the \setup\serverroles\common folder.

      2. Click Start -> Administrative Tools, right-click Windows PowerShell Modules, and select Send to -> Desktop (as shortcut)
      3. Open the Properties of the shortcut and replace the Target with: C:\WINDOWS\system32\WindowsPowerShell\v1.0\powershell.exe -noexit -command ". 'C:\Program Files\Microsoft\Exchange Server\V14\bin\RemoteExchange.ps1'; Connect-ExchangeServer -auto"

        Note: if the Exchange installation folder or drive name is different than the default, you need to change the path accordingly.

    5. Upgrading Edge Transport on Forefront Threat Management Gateway (TMG) and Forefront Protection for Exchange 2010

      If you upgrade a server running the Edge Transport server role with Forefront Threat Management Gateway (TMG) and Forefront Protection for Exchange (FPE) enabled for SMTP protection, the Forefront TMG Managed Control Service may fail to start, and e-mail policy configuration settings cannot be applied.

      The TMG team is working on this issue. See Problems when installing Exchange 2010 Service Pack 1 on a TMG configured for Mail protection on the Forefront TMG (ISA) Team Blog. The Exchange 2010 SP1 Release Notes have been updated with the above information.

      The Forefront TMG product team has released a software update to address this issue. See Software Update 1 for Microsoft Forefront Threat Management Gateway (TMG) 2010 Service Pack 1 now available for download.

    6. Static Address Book Service Port Configuration Changes

      The location for setting the port used by the Address Book service has changed in SP1. In Exchange 2010 RTM, you had to edit Microsoft.exchange.addressbook.service.exe.config to configure the service port. In SP1, you must use the following registry value:
      Path: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\MSExchangeAB\Parameters
      Value name: RpcTcpPort
      Type: REG_SZ (String)

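      If you prefer to script this, here's a minimal sketch (the port number is an example only; pick an unused static port for your environment):

      # Sketch: create the static Address Book service port value (REG_SZ).
      $path = 'HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeAB\Parameters'
      if (-not (Test-Path $path)) { New-Item -Path $path | Out-Null }
      New-ItemProperty -Path $path -Name 'RpcTcpPort' -Value '59532' -PropertyType String -Force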

      When you apply SP1 to a machine where you had previously configured a static port by editing the Microsoft.exchange.addressbook.service.exe.config file, the upgrade process will not carry forward your static port assignments. Following a restart, the Address Book Service will revert to using a dynamic port instead of a static port specified in the config file. This may cause interruptions in service.

      As with all upgrades where servers are in load-balanced pools, we recommend a rolling upgrade: drain connections from a server, remove it from the pool, upgrade it, and then shift load onto the newly upgraded machines before repeating the process on the remaining servers.

      There are times when these approaches may not be possible. You can maintain your static port configuration, and have it take effect the moment the Address Book service starts for the first time after the service pack is applied, by creating the registry value BEFORE you apply SP1 to your server. The registry value has no impact pre-SP1, so configuring it before you apply the service pack avoids the need to set the port post-install, and avoids any service interruptions.

    7. iPhone, OWA Premium and POP3 & IMAP4 issues due to invalid accepted domain

      After applying E2010 SP1:

      1. iPhone users may not be able to view the content of incoming messages in their Inboxes, and when they try to open a message, they get an error saying:

        This message has not been downloaded from the server.

        Admins may see the following event logged in the Application Event Log on Exchange 2010 CAS Server:

        Watson report about to be sent for process id: 1234, with parameters: E12, c-RTL-AMD64, 14.01.0218.011, AirSync, MSExchange ActiveSync, Microsoft.Exchange.Data.Storage.InboundConversionOptions.CheckImceaDomain, UnexpectedCondition:ArgumentException, 4321, 14.01.0218.015.

      2. OWA Premium users may not be able to reply or forward a message. They may see the following error in OWA:

        An unexpected error occurred and your request couldn't be handled. Exception type: System.ArgumentException, Exception message: imceaDomain must be a valid domain name.

      3. POP3 & IMAP4 users may also not be able to retrieve incoming mail, and admins will see the following event logged in the Event Log:

        ERR Server Unavailable. 21; RpcC=6; Excpt=imceaDomain must be a valid domain name.

      Resolution

      Run the following command in the Exchange Management Shell and verify that there is one domain marked as ‘Default’ and that its DomainName and Name values are valid domain names. We were able to reproduce the issue by setting a domain name with a space in it, like “aa bb”.

      Get-AcceptedDomain | fl

      If you also have an invalid domain name there (for example, a domain name with a space in it), then removing the space and restarting the server will fix the EAS (iPhone), OWA, POP3 & IMAP4 issues as mentioned above.

      Command to run under EMS would be:

      Set-AcceptedDomain -Identity "<AcceptedDomain>" -Name "<ValidSMTPDomainName>"

      These examples update the Name parameter of the "My Company" and "ABC Local" accepted domains (the space is removed from both):

      Set-AcceptedDomain -Identity "My Company" -Name "MyCompany.Com"
      Set-AcceptedDomain -Identity "ABC Local" -Name "ABC.Local"

    8. Error when adding or removing a mailbox database copy

      If a server running Exchange 2010 RTM (or Exchange 2010 SP1 Beta) is upgraded to Exchange 2010 SP1, administrators may experience an error when using the Add-MailboxDatabaseCopy or Remove-MailboxDatabaseCopy cmdlets to add or remove mailbox database copies.

      When you try to add a database copy, you may see the following error:

      Add-MailboxDatabaseCopy DAG-DB0 -MailboxServer DAG-2

      The result:

      WARNING: An unexpected error has occurred and a Watson dump is being generated: Registry key has subkeys and recursive removes are not supported by this method.

      Registry key has subkeys and recursive removes are not supported by this method.
      + CategoryInfo : NotSpecified: (:) [Add-MailboxDatabaseCopy], InvalidOperationException
      + FullyQualifiedErrorId : System.InvalidOperationException,Microsoft.Exchange.Management.SystemConfigurationTasks.
      AddMailboxDatabaseCopy

      The command does not succeed in adding the copy or in updating Active Directory to show the copy was added. This happens due to the presence of the DumpsterInfo registry key.

      Workaround: Delete the DumpsterInfo key, as shown below.

      1. Identify the GUID of the database that is being added using this command:

        Get-MailboxDatabase DAG-DB0 | fl name,GUID

        The result:

        Name : DAG-DB0
        Guid : 8d3a9778-851c-40a4-91af-65a2c487b4cc

      2. On the server specified in the add command, using the database GUID identified, remove the following registry key:
        HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v14\Replay\State\<db-guid>\DumpsterInfo

        The GUID identified in this case is 8d3a9778-851c-40a4-91af-65a2c487b4cc. With this information you can now export and delete the DumpsterInfo key on the server where you are attempting to add the mailbox database copy. This can be easily done using the registry editor, but if you have more than a handful of DAG members, this is best automated using the Shell.

        This example removes the DumpsterInfo key from the 8d3a9778-851c-40a4-91af-65a2c487b4cc key:

        Remove-Item HKLM:\Software\Microsoft\ExchangeServer\V14\Replay\State\8d3a9778-851c-40a4-91af-65a2c487b4cc\DumpsterInfo

        To automate this across all servers in your organization, use the DeleteDumpsterRegKey.ps1 script.

        File: deletedumpsterregkey_ps1.txt
        Description: The DeleteDumpsterRegkey.ps1 script can be used to delete the offending DumpsterInfo registry keys that can cause this problem on all Exchange 2010 SP1 Mailbox servers in the organization. Rename the file to DeleteDumpsterRegkey.ps1 (remove the .txt extension).

        For more info, see Tim McMichael’s blog post Exchange 2010 SP1: Error when adding or removing a mailbox database copy.

    Thanks to all the folks in CSS and Exchange teams who helped identify, validate and provide workarounds for some of the issues mentioned above, and to the Exchange community and MVPs for their feedback.

    Bharat Suneja, Nino Bilic
    M. Amir Haque, Greg Taylor,
    & Tim McMichael

    Updates:

    • 9/7/2010: Updated list of files for the missing Exchange Management Shell shortcut issue
    • 9/15/2010: Updated pre-reqs table:
      - 982867 required on Windows 2008 SP2
      - 983440 not required on Windows 2008 SP2
      - 977020 required on Windows 2008 R2
    • 9/21/2010: Added link to Software Update 1 for ForeFront Threat Management Gateway (TMG) 2010 Service Pack 1
      - Replaced "Request from CSS..." verbiage for KB 979917 with link to KB 979917 download on MSDN
    • 9/22/2010: Updated the default Exchange install path in the 'The Missing Exchange Management Shell Shortcut' section: C:\WINDOWS\system32\WindowsPowerShell\v1.0\powershell.exe -noexit -command ". 'C:\Program Files\Microsoft\Exchange Server\V14\bin\RemoteExchange.ps1'; Connect-ExchangeServer -auto"
  • Exchange, Firewalls, and Support… Oh, my!

    Over the years, Exchange Server architecture has gone through a number of changes. As a product matures over time, you may see us change what is supported as we react to changes in the product architecture, the state of technology as a whole, or major support issues that come in through our support infrastructure.

    A large volume of support calls has ultimately been caused by communication issues between Exchange servers, or between Exchange servers and domain controllers. Often this results from a network device between the servers not allowing some port or protocol to communicate with the other servers.

    I tried to get Harrison Ford to co-write this article with me given his specific talents, but alas he was busy and regretfully couldn’t partake. Please allow me to start with the short version up front, so there is no confusion about what we currently DO and DO NOT support, before I lose some of you to:

    Image courtesy of: http://knowyourmeme.com/memes/tldr

    Starting with Exchange Server 2007 and current as of Exchange Server 2013, having network devices blocking ports/protocols between Exchange servers within a single organization or between Exchange servers and domain controllers in an organization is not supported. A network device may sit in the communication path between the servers, but a rule allowing “ANY/ANY” port and protocol communication must be in place allowing free communication between Exchange servers as well as between Exchange servers and domain controllers.

    For Exchange Server 2010 this is already articulated at http://technet.microsoft.com/en-us/library/bb331973(v=EXCHG.141).aspx under Client Access Server Connectivity in the Client Access Server section in the following paragraph.

    “In addition to having a Client Access server in every Active Directory site that contains a Mailbox server, it’s important to avoid restricting traffic between Exchange servers. Make sure that all defined ports that are used by Exchange are open in both directions between all source and destination servers. The installation of a firewall between Exchange servers or between an Exchange 2010 Mailbox or Client Access server and Active Directory isn’t supported. However, you can install a network device if traffic isn’t restricted and all available ports are open between the various Exchange servers and Active Directory.”

    Why has this seemingly simple support statement become so muddied and confusing over the years? Maybe we didn’t make it blunt enough to start with, but there could be some other compounding points adding to the confusion.

    Confusion point #1…

    Exchange Server 2003 was the last version of Exchange Server to allow deploying (at the time) a Front-End server in a perimeter network (aka DMZ) while locating the Back-End server in the intranet. While this could be made to work, it required a specialized set of rules that essentially turned your perimeter network security model into the following:

    Image courtesy of: The Internet

    During the time of Exchange Server 2003, adoption of reverse proxies within perimeter networks was on the rise. Reverse proxies allowed customers to more securely publish Exchange Server for remote access while only allowing a single port and protocol to traverse from the Internet to the perimeter network, and then a single port and protocol to traverse from the perimeter network to the intranet.

    You could go from something complicated like this with endless port and protocol requirements….

    Figure 1: Legacy Exchange 2003 deployment with Front-End server in a perimeter network. What a mess. Who’s hungry?

    To something simple like this…

    Figure 2: Reverse proxy in the perimeter network and all Exchange servers within the intranet. Simplicity at its best!

    The resulting increase in simplicity as well as the drop in support cases was strong enough for Microsoft to determine, during the lifecycle of the next major version of Exchange Server, 2007, that we would no longer support deploying what is now the Client Access Server role in a perimeter network. From that time on, all Exchange servers, except for the Edge Transport server role, were to be deployed on the intranet with unfettered access to each other. We have this documented in http://technet.microsoft.com/en-us/library/bb232184.aspx.

    Confusion point #2…

    TechNet includes a number of articles that document many if not all of the ports and protocols Exchange Server utilizes during the course of normal operation. These documents are often misunderstood as “configure your firewall this way” articles. The information is only being provided for educational purposes on the inner-workings of Exchange Server, or to aid with the configuration of load balancing or service monitoring mechanisms which often require specific port/protocol definitions to perform their functions correctly. In case you come across them in the future, here is a list of most of those articles.

    Exchange Network Port References

    Understanding Protocols, Ports, and Services in Unified Messaging

    I don’t trust my clients and I like restricting their access. Is that supported?

    This is a different story, and yes, there are things you can do here to remain supported. Exchange Server has for a number of revisions supported configuring static client communication ports for Windows-based Outlook clients. After the client contacts the endpoint mapper service running on Windows on TCP port 135, it is handed back the static TCP port you have chosen to use in your environment. For Exchange Server 2010, you may be familiar with the following article describing how to configure static client communication ports for the Address Book service and the RPC Client Access service, thereby leaving you with four ports required for clients to operate in MAPI/RPC mode:

    • TCP 135 for the Endpoint Mapper
    • TCP 443 for Autodiscover/EWS/ECP
    • TCP <your choice #1> for the Address Book service
    • TCP <your choice #2> for the RPC Client Access service
    • UDP ANY from CAS to the Outlook 2003 client if you’re in online mode and utilizing UDP notifications.

    http://social.technet.microsoft.com/wiki/contents/articles/864.configure-static-rpc-ports-on-an-exchange-2010-client-access-server.aspx

    TechNet also has resources for versions prior to Exchange Server 2010: http://support.microsoft.com/kb/270836.
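
    If you script your CAS builds, the static RPC Client Access port from the wiki article above is also just a registry value. Here's a minimal sketch assuming the documented MSExchangeRPC\ParametersSystem location (59531 is only an example, matching the diagrams below):

    # Sketch: pin the RPC Client Access service to static TCP port 59531.
    $rpc = 'HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeRPC\ParametersSystem'
    if (-not (Test-Path $rpc)) { New-Item -Path $rpc | Out-Null }
    New-ItemProperty -Path $rpc -Name 'TCP/IP Port' -Value 59531 -PropertyType DWord -Force
    Restart-Service MSExchangeRPC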

    Starting in Exchange Server 2013, the only protocol supported for Windows Outlook clients is RPC over HTTPS. This architectural change reduces your required port count to one: TCP 443 for HTTPS, utilized by Autodiscover, Exchange Web Services, and RPC over HTTPS (aka Outlook Anywhere). This is going to make your life easy, but don’t tell your boss as they’ll expect you to start doing other things as well. It’ll be our secret. Promise.

    I’ll go through some examples of supported deployments, but will keep it easy and only use Outlook clients. The same ideas apply to other POP/IMAP/EAS clients as well; just don’t restrict Exchange servers from talking to each other. A setup like the following Outlook 2010 / Exchange 2010 diagram, with a firewall between the clients and the servers, would be entirely supported. In all of the following examples I have chosen static TCP port 59531 for my RPC Client Access service on CAS and Mailbox, and static TCP port 59532 for my Address Book service on CAS. UDP notifications are also thrown in for fun for those of you running Outlook 2003 in Online Mode, which I hope is very few and far between these days. Domain controllers were left out of these diagrams to focus on communication directly between clients and Exchange, and load balancers were also kept out for simplicity.

    Figure 3: Firewall between clients and all Exchange servers. Supported if firewall is configured correctly to allow all necessary client access. AD not shown.

    However, if you attempted to do something naughty like the following diagram and, for reasons unknown to us, put a firewall between CAS and Mailbox, then there had better be ANY/ANY rules in place allowing conversations to originate from either side between Exchange servers.

    Figure 4: Firewall between CAS and other Exchange servers. Supported only if the firewall is configured for unfettered access between Exchange servers, and Exchange servers and AD. AD not shown.

    Well, what if you have multiple datacenters with Exchange and want to firewall everything everywhere because you believe that as the number of firewalls goes up, your security must increase exponentially? We’ve got you covered there too: deploy it like the following diagram, where you’ll see both MAPI/RPC and RPC/HTTPS user examples. I didn’t bother putting load balancers or domain controllers into any of these diagrams, by the way. I’m putting faith in all of you that you know where those go.

    Figure 5: Firewalls between users and Exchange servers as well as between datacenters. Supported if the firewalls are configured to allow unfettered access between Exchange servers, between Exchange servers and AD, and appropriate client rules. AD not shown.

    Boy this is going to be easy when all of you migrate to Exchange Server 2013 and are only dealing with RPC/HTTPS connections from clients and SMTP or HTTPS between servers. Except for maybe those pesky POP/IMAP/UM clients…

    Figure 6 below depicts what Exchange 2013 network conversations may look like at a high level. A load balancer and an additional CAS were introduced to show that we don’t care which CAS a client’s traffic goes through, as it all ends up at the same Mailbox server where the user’s database is mounted. You may have read previously that Exchange Server 2013 does not require affinity for client traffic, and hopefully this visual helps show why.

    The one tricky bit to consider, if placing a firewall between clients and Exchange Server 2013, is UM traffic, as it is not all client-to-CAS in nature. In Exchange Server 2013, a telephony device first makes a SIP connection through CAS (orange arrows); after speaking with the UM service on the Mailbox server, CAS redirects the device so it may establish a direct SIP+RTP session (blue arrow) with the Mailbox server holding the user’s active database copy.

    Figure 6: Showing at a high level SMTP, Windows Outlook Client, and UM traffic with a firewall between users and Exchange Server 2013.

    So, Microsoft, if you’re saying this should be simple then what can I do and remain in a supported state?

    The key here is to not block traffic between Exchange servers, or between Exchange servers and domain controllers. As long as no traffic blocking is performed between these servers, you will be in a fully supported deployment and will not have to waste time with our support staff proving you really do have all necessary communications open before you can start to troubleshoot an issue. We know many customers will continue to test the boundaries of supportability regardless, but be aware this may drag out your troubleshooting experience and possibly extend an active outage. We prefer to help our customers resolve any and all issues as fast as possible. Staying within support guidelines does in fact help us help you as expeditiously as possible, and in the end will save you time, support costs, labor costs, and, last but not least, aggravation.

    Brian Day
    Program Manager
    Exchange Customer Experience

  • Database Maintenance in Exchange 2010

    Over the last several months there has been significant chatter about what background database maintenance is and why it is important for Exchange 2010 databases. Hopefully this article will answer those questions.

    What maintenance tasks need to be performed against the database?

    The following tasks need to be routinely performed against Exchange databases:

    Database Compaction

    The primary purpose of database compaction is to free up unused space within the database file (however, it should be noted that this does not return that unused space to the file system). The intention is to free up pages in the database by compacting records onto the fewest number of pages possible, thus reducing the amount of I/O necessary. The ESE database engine does this by walking the database metadata (the information within the database that describes its tables) and, for each table, visiting each page in the table and attempting to move records onto logically ordered pages.

    Maintaining a lean database file footprint is important for several reasons, including the following:

    1. Reducing the time associated with backing up the database file.
    2. Maintaining a predictable database file size, which is important for server/storage sizing purposes.

    Prior to Exchange 2010, database compaction operations were performed during the online maintenance window. This process produced random IO as it walked the database and re-ordered records across pages. In a sense, the process was too effective in previous versions: by freeing up database pages and re-ordering the records, the pages always ended up in a random order. Coupled with the store schema architecture, this meant that any request to pull a set of data (like downloading items within a folder) always resulted in random IO.

    In Exchange 2010, database compaction was redesigned such that contiguity is preferred over space compaction. In addition, database compaction was moved out of the online maintenance window and is now a background process that runs continuously.

    Database Defragmentation

    Database defragmentation is new to Exchange 2010 and is also referred to as OLD v2 and B+ tree defragmentation. Its function is to compact as well as defragment (make sequential) database tables that have been marked/hinted as sequential. Database defragmentation is important to maintain efficient utilization of disk resources over time (make the IO more sequential as opposed to random) as well as to maintain the compactness of tables marked as sequential.

    You can think of the database defragmentation process as a monitor that watches other database page operations to determine if there is work to do. It monitors all tables for free pages, and if a table reaches a threshold where a significantly high percentage of the total B+ tree page count is free, it gives the free pages back to the root. It also works to maintain contiguity within tables created with sequential space hints (a table created with a known sequential usage pattern). If database defragmentation sees a scan/pre-read on a sequential table and the records are not stored on sequential pages within the table, the process will defragment that section of the table by moving all of the impacted pages to a new extent in the B+ tree. You can use the performance counters (mentioned in the monitoring section) to see how little work database defragmentation performs once a steady state is reached.

    Database defragmentation is a background process that analyzes the database continuously as operations are performed, and then triggers asynchronous work when necessary. Database defragmentation is throttled under two scenarios:

    1. The maximum number of outstanding tasks. This keeps database defragmentation from doing too much work on the first pass if massive change has occurred in the database.
    2. A latency throttle of 100 ms. When the system is overloaded, database defragmentation will start punting defragmentation work. Punted work gets executed the next time the database goes through the same operational pattern; nothing remembers which defragmentation work was punted in order to go back and execute it once the system has more resources.
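
    To watch this behavior, you can sample the defragmentation counters from the shell. A minimal sketch (the counter names below are as published for Exchange 2010 and should be treated as assumptions if your build differs):

    # Sketch: sample the background defragmentation task counters.
    Get-Counter -Counter '\MSExchange Database ==> Instances(*)\Defragmentation tasks',
                         '\MSExchange Database ==> Instances(*)\Defragmentation tasks pending'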

    Database Checksumming

    Database checksumming (also known as Online Database Scanning) is the process where the database is read in large chunks and each page is checksummed (checked for physical page corruption). Checksumming’s primary purpose is to detect physical corruption and lost flushes that may not be getting detected by transactional operations (stale pages).

    With Exchange 2007 RTM and all previous versions, checksumming operations happened during the backup process. This posed a problem for replicated databases, as the only copy to be checksummed was the copy being backed up. For the scenario where the passive copy was being backed up, this meant that the active copy was not being checksummed. So in Exchange 2007 SP1, we introduced a new optional online maintenance task, Online Maintenance Checksum (for more information, see Exchange 2007 SP1 ESE Changes – Part 2).

    In Exchange 2010, database scanning checksums the database and performs cleanup operations after an Exchange 2010 Store crash. Space can be leaked due to crashes, and online database scanning finds and recovers lost space. Database checksumming reads approximately 5 MB per second for each actively scanning database (both active and passive copies) using 256 KB IOs. The I/O is 100 percent sequential. The system in Exchange 2010 is designed with the expectation that every database is fully scanned once every seven days.

    If the scan takes longer than seven days, an event is recorded in the Application Log:

    Event ID: 733
    Event Type: Information
    Event Source: ESE
    Description: Information Store (15964) MDB01: Online Maintenance Database Checksumming background task is NOT finishing on time for database 'd:\mdb\mdb01.edb'. This pass started on 11/10/2011 and has been running for 604800 seconds (over 7 days) so far.

    If it takes longer than seven days to complete the scan on the active database copy, the following entry will be recorded in the Application Log once the scan has completed:

    Event ID: 735
    Event Type: Information
    Event Source: ESE
    Description: Information Store (15964) MDB01 Database Maintenance has completed a full pass on database 'd:\mdb\mdb01.edb'. This pass started on 11/10/2011 and ran for a total of 777600 seconds. This database maintenance task exceeded the 7 day maintenance completion threshold. One or more of the following actions should be taken: increase the IO performance/throughput of the volume hosting the database, reduce the database size, and/or reduce non-database maintenance IO.

    In addition, an in-flight warning will also be recorded in the Application Log when it takes longer than 7 days to complete.
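
    To check for these events from the shell, here's a minimal sketch (ESE writes them to the Application log):

    # Sketch: find ESE online maintenance checksum duration events (733 and 735).
    Get-EventLog -LogName Application -Source ESE |
        Where-Object { $_.EventID -eq 733 -or $_.EventID -eq 735 }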

    In Exchange 2010, there are now two modes to run database checksumming on active database copies:

    1. Run in the background 24×7. This is the default behavior. It should be used for all databases, especially for databases that are larger than 1 TB. Exchange scans the database no more than once per day. This read I/O is 100 percent sequential (which makes it easy on the disk) and equates to a scanning rate of about 5 megabytes (MB)/sec on most systems. The scanning process is single threaded and is throttled by IO latency. The higher the latency, the more database checksumming slows down, because it waits longer for the last batch to complete before issuing another batch scan of pages (8 pages are read at a time).
    2. Run in the scheduled mailbox database maintenance process. When you select this option, database checksumming is the last task. You can configure how long it runs by changing the mailbox database maintenance schedule. This option should only be used with databases smaller than 1 terabyte (TB) in size, which require less time to complete a full scan.

    Regardless of the database size, our recommendation is to leverage the default behavior and not configure database checksum operations against the active database as a scheduled process (i.e., don’t configure it as a process within the online maintenance window).
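
    For reference, the mode is controlled per database. A minimal sketch (MDB01 is a placeholder name):

    # Sketch: keep the default 24x7 background checksumming enabled; setting this
    # to $false moves checksumming into the scheduled maintenance window instead.
    Set-MailboxDatabase -Identity 'MDB01' -BackgroundDatabaseMaintenance $true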

    For passive database copies, database checksums occur during runtime, continuously operating in the background.

    Page Patching

    Page patching is the process where corrupt pages are replaced by healthy copies. As mentioned previously, corrupt page detection is a function of database checksumming (in addition, corrupt pages are also detected at run time when the page is stored in the database cache). Page patching works against highly available (HA) database copies. How a corrupt page is repaired depends on whether the HA database copy is active or passive.

    Page patching process

    On active database copies:

    1. A corrupt page(s) is detected.
    2. A marker is written into the active log file. This marker indicates the corrupt page number and that the page requires replacement.
    3. An entry is added to the page patch request list.
    4. The active log file is closed.
    5. The Replication service ships the log file to the passive database copies.
    6. The Replication service on a target Mailbox server receives the shipped log file and inspects it.
    7. The Information Store on the target server replays the log file up to the marker, retrieves its healthy version of the page, invokes the Replay service callback, and ships the page to the source Mailbox server.
    8. The source Mailbox server receives the healthy version of the page, confirms that there is an entry in the page patch request list, then writes the page to the log buffer; correspondingly, the page is inserted into the database cache.
    9. The corresponding entry in the page patch request list is removed.
    10. At this point the database is considered patched (at some later point the checkpoint will advance, the database cache will be flushed, and the corrupt page on disk will be overwritten).
    11. Any other copy of this page (received from another passive copy) will be silently dropped, because there is no corresponding entry in the page patch request list.

    On passive database copies:

    1. On the Mailbox server where the corrupt page(s) is detected, log replay is paused for the affected database copy.
    2. The Replication service coordinates with the Mailbox server that is hosting the active database copy and retrieves the corrupted page(s) and the required log range from the active copy’s database header.
    3. The Mailbox server updates the database header for the affected database copy, inserting the new required log range.
    4. The Mailbox server notifies the Mailbox server hosting the active database copy of which log files it requires.
    5. The Mailbox server receives the required log files and inspects them.
    6. The Mailbox server injects the healthy versions of the database pages it retrieved from the active database copy. The pages are written to the log buffer, and correspondingly, each page is inserted into the database cache.
    7. The Mailbox server resumes log replay.

    Page Zeroing

    Database Page Zeroing is the process where deleted pages in the database are written over with a pattern (zeroed) as a security measure, which makes discovering the data much more difficult.

    With Exchange 2007 RTM and all previous versions, page zeroing operations happened during the streaming backup process, and because they occurred during that process, they were not logged operations (that is, page zeroing did not generate log files). This posed a problem for replicated databases: the passive copies never had their pages zeroed, and the active copies only had their pages zeroed if you performed a streaming backup. So in Exchange 2007 SP1, we introduced a new optional online maintenance task, Zero Database Pages during Checksum (for more information, see Exchange 2007 SP1 ESE Changes – Part 2). When enabled, this task would zero out pages during the Online Maintenance Window, logging the changes so they would be replicated to the passive copies.

    With the Exchange 2007 SP1 implementation, there is a significant lag between when a page is deleted and when it is zeroed, because the zeroing process occurs during a scheduled maintenance window. So in Exchange 2010 SP1, the page zeroing task is now a runtime event that operates continuously, typically zeroing pages at transaction time when a hard delete occurs.

    In addition, database pages can also be scrubbed during the online checksum process. The pages targeted in this case are:

    • Deleted records which couldn’t be scrubbed during runtime, either because tasks were dropped (when the system is too overloaded) or because the Store crashed before the tasks could scrub the data;
    • Deleted tables and secondary indices. When these are deleted, their contents are not actively scrubbed, so the online checksum process detects that these pages no longer belong to any valid object and scrubs them.

    For more information on page zeroing in Exchange 2010, see Understanding Exchange 2010 Page Zeroing.

    Why aren’t these tasks simply performed during a scheduled maintenance window?

    Requiring a scheduled maintenance window for page zeroing, database defragmentation, database compaction, and online checksum operations poses significant problems, including the following:

    1. Having scheduled maintenance operations makes it very difficult to manage 24x7 datacenters that host mailboxes from various time zones and have little or no time available for a scheduled maintenance window. Database compaction in prior versions of Exchange had no throttling mechanism, and since its IO is predominantly random, it could lead to a poor user experience.
    2. Exchange 2010 Mailbox databases deployed on lower tier storage (e.g., 7.2K SATA/SAS) have a reduced effective IO bandwidth available to ESE to perform maintenance window tasks. This is an issue because IO latencies will increase during the maintenance window, preventing the maintenance activities from completing within the desired period of time.
    3. The use of JBOD provides an additional challenge to the database in terms of data verification. With RAID storage, it's common for an array controller to background scan a given disk group, locating and re-assigning bad blocks. A bad block (aka sector) is a block on a disk that cannot be used due to permanent damage (e.g. physical damage inflicted on the disk particles). It's also common for an array controller to read the alternate mirrored disk if a bad block was detected on the initial read request. The array controller will subsequently mark the bad block as “bad” and write the data to a new block. All of this occurs without the application knowing, perhaps with just a slight increase in the disk read latency. Without RAID or an array controller, both of these bad block detection and remediation methods are no longer available. Without RAID, it's up to the application (ESE) to detect bad blocks and remediate (i.e., database checksumming).
    4. Larger databases on larger disks require longer maintenance periods to maintain database sequentiality/compactness.

    Due to the aforementioned issues, it was critical in Exchange 2010 that the database maintenance tasks be moved out of a scheduled process and be performed during runtime continuously in the background.

    Won’t these background tasks impact my end users?

    We’ve designed these background tasks such that they're automatically throttled based on activity occurring against the database. In addition, our sizing guidance around message profiles takes these maintenance tasks into account.

    How can I monitor the effectiveness of these background maintenance tasks?

    In previous versions of Exchange, events in the Application Log would be used to monitor things like online defragmentation. In Exchange 2010, there are no longer any events recorded for the defragmentation and compaction maintenance tasks. However, you can use performance counters to track the background maintenance tasks under the MSExchange Database ==> Instances object:

    • Database Maintenance Duration: The number of seconds that have passed since maintenance started for this database. If the value is 0, maintenance has finished for the day.
    • Database Maintenance Pages Bad Checksums: The number of non-correctable page checksums encountered during a database maintenance pass.
    • Defragmentation Tasks: The count of background database defragmentation tasks that are currently executing.
    • Defragmentation Tasks Completed/Sec: The rate at which background database defragmentation tasks are being completed.

    You'll find the following page zeroing counters under the MSExchange Database object:

    • Database Maintenance Pages Zeroed: The number of pages zeroed by the database engine since the performance counter was invoked.
    • Database Maintenance Pages Zeroed/sec: The rate at which pages are zeroed by the database engine.
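    You can also sample these counters from the Shell with Get-Counter instead of Performance Monitor. A minimal sketch follows; the wildcard instance names are assumptions, so list the available instances first if you're unsure how they're named on your server:

    (Get-Counter -ListSet "MSExchange Database ==> Instances").PathsWithInstances

    Get-Counter -Counter "\MSExchange Database ==> Instances(*)\Database Maintenance Duration"

    Get-Counter -Counter "\MSExchange Database(*)\Database Maintenance Pages Zeroed"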

    How can I check whitespace in a database?

    You will need to dismount the database and use ESEUTIL /MS to check the available whitespace in a database. For an example, see http://technet.microsoft.com/en-us/library/aa996139(v=EXCHG.65).aspx (note that you have to multiply the reported number of pages by 32KB to get the size).
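    As a sketch, the whole check can be run from the Shell; the database name, file path, and eseutil location below are placeholders for illustration:

    Dismount-Database -Identity MDB01 -Confirm:$false

    & "$env:ExchangeInstallPath\Bin\eseutil.exe" /ms "d:\mdb\mdb01.edb"

    Mount-Database -Identity MDB01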

    Note that there is a status property available on databases within Exchange 2010, but it should not be used to determine the amount of total whitespace available within the database:

    Get-MailboxDatabase MDB1 -Status | FL AvailableNewMailboxSpace

    What AvailableNewMailboxSpace tells you is how much space is available in the root tree of the database. It does not factor in the free pages within mailbox tables, index tables, and so on, so it is not representative of the whitespace within the database.

    How can I reclaim the whitespace?

    Naturally, after seeing the available whitespace in the database, the question that always ensues is – how can I reclaim the whitespace?

    Many assume the answer is to perform an offline defragmentation of the database using ESEUTIL. However, that's not our recommendation. When you perform an offline defragmentation, you create an entirely new database, and the operations performed to create it are not logged in transaction logs. The new database also has a new database signature, which means that you invalidate the database copies associated with the original database.

    In the event that you do encounter a database that has significant whitespace and you don't expect that normal operations will reclaim it, our recommendation is:

    1. Create a new database and associated database copies.
    2. Move all mailboxes to the new database.
    3. Delete the original database and its associated database copies.
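    A minimal Shell sketch of these steps, assuming a source database MDB1, a new database MDB2, and Mailbox servers EX1 and EX2 (all names and paths are placeholders; adjust the copy layout to your design):

    New-MailboxDatabase -Name "MDB2" -Server EX1 -EdbFilePath "d:\mdb\mdb02.edb"
    Mount-Database -Identity MDB2
    Add-MailboxDatabaseCopy -Identity MDB2 -MailboxServer EX2

    Get-Mailbox -Database MDB1 -ResultSize Unlimited | New-MoveRequest -TargetDatabase MDB2

    Once all moves are complete (any arbitration mailboxes on the source database must also be moved), clear the move requests and remove the old database:

    Get-MoveRequest -MoveStatus Completed | Remove-MoveRequest -Confirm:$false
    Remove-MailboxDatabaseCopy -Identity MDB1\EX2 -Confirm:$false
    Remove-MailboxDatabase -Identity MDB1 -Confirm:$false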

    A terminology confusion

    Much of the confusion lies in the term background database maintenance. Collectively, all of the aforementioned tasks make up background database maintenance. However, the Shell, EMC, and JetStress all refer to database checksumming as background database maintenance, and that's what you're configuring when you enable or disable it using these tools.


    Figure 1: Enabling background database maintenance for a database using EMC

    Enabling background database maintenance using the Shell:

    Set-MailboxDatabase -Identity MDB1 -BackgroundDatabaseMaintenance $true


    Figure 2: Running background database maintenance as part of a JetStress test

    My storage vendor has recommended I disable Database Checksumming as a background maintenance task. What should I do?

    Database checksumming can become an IO tax burden if the storage is not designed correctly (even though its IO is sequential), as it performs 256KB read IOs and generates roughly 5MB/sec of read throughput per database.

    As part of our storage guidance, we recommend you configure your storage array stripe size (the size of stripes written to each disk in an array; also referred to as block size) to be 256KB or larger.

    It's also important to test your storage with JetStress and ensure that the database checksum operation is included in the test pass.

    In the end, if a JetStress execution fails due to database checksumming, you have a few options:

    1. Don’t use striping. Use RAID-1 pairs or JBOD (which may require architectural changes) and get the most benefit from the sequential IO patterns available in Exchange 2010.
    2. Schedule it. Configure database checksumming to run as a scheduled process rather than a background process (see the sketch after this list). When we implemented database checksumming as a background process, we understood that some storage arrays would be so optimized for random IO (or have bandwidth limitations) that they wouldn't handle the sequential read IO well. That's why we built it so it could be turned off (which moves the checksum operation to the maintenance window).

      If you do this, we recommend smaller database sizes. Also keep in mind that the passive copies will still perform database checksumming as a background process, so you still need to account for this throughput in your storage architecture. For more information on this subject, see Jetstress 2010 and Background Database Maintenance.

    3. Use different storage or improve the capabilities of the storage. Choose storage that is capable of meeting Exchange best practices (256KB+ stripe size).
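    If you do choose to schedule it, the switch is the same property shown earlier, set the other way, optionally paired with an adjusted maintenance window. A minimal sketch (MDB1 and the schedule value are placeholders):

    Set-MailboxDatabase -Identity MDB1 -BackgroundDatabaseMaintenance $false

    Set-MailboxDatabase -Identity MDB1 -MaintenanceSchedule "Sun.1:00 AM-Sun.5:00 AM"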

    Conclusion

    The architectural changes to the database engine in Exchange Server 2010 dramatically improve its performance and robustness, but they also change the behavior of database maintenance tasks from previous versions. Hopefully this article helps you understand what background database maintenance is in Exchange 2010.

    Ross Smith IV
    Principal Program Manager
    Exchange Customer Experience