(This is a follow-on to the previous post on measuring business impact, and to the first post on the business case for Exchange 2007; these are my own thoughts on the case for moving to Exchange 2007. It's part of a series of posts which I'm trying to keep succinct, though they tend to be a bit longer than usual. If you find them useful, please let me know...)

GOAL: Reduce the backup burden

Now I'm going to start by putting on the misty rose-tinted specs and think back to the good old days of Exchange 4.0/5.x. When server memory was measured in megabytes and hard disk capacity in the low Gbs, there were much lower bottlenecks to performance than exist today.

Lots of people deployed Exchange servers with their own idea of how many users they would "fit" onto each box - in some cases, it would be the whole organisation; in others, it would be as many users as that physical site would have (since good practice was then to deploy a server at every major location); in others still, it would be determined by how many mailboxes that server could handle before it ran out of puff. As wide area networks got faster, more reliable and less expensive, and as server hardware got better and cheaper, the bottleneck for lots of organisations stopped being about how many users the server could handle, and became more about how many users IT was comfortable having the server handle.

On closer inspection, this "comfort" level would typically come about for 2 reasons:

  • Spread the active workload - If the server goes down (either planned or unplanned), I only want it to affect a percentage of the users rather than everyone. This way, I'd maybe have 2 medium-sized servers and put 250 users on each, rather than 500 users on one big server.
  • Time to Recovery is lower - If I had to recover the server because of a disaster, I only have so many hours (as the SLA might state) to get everything back up and running, and it would take too long to restore that much data from tape. If I split the users across multiple servers, then the likelihood of a disaster affecting more than one server may be lower, and, in the event of total site failure, the recovery of multiple servers can at least be done in parallel.

(Of course, there were other reasons, initially - maybe people didn't believe the servers would handle the load, so played safe and deployed more than they really needed... or third party software, like Blackberry Enterprise Server, might have added extra load so they'd need to split the population across more servers).

So the ultimate bottleneck is the time it takes for a single database or single server's data to be brought back online in the event of total failure. The target for this time is often referred to in mumbo-jumbo whitepaper speak as the "RTO" or Recovery Time Objective, and whether you can meet it will be a function of how fast the backup media is (older DAT type tape backup systems might struggle to do 10Gb/hr, whereas a straight-to-disk backup might do 10 or 20 times that rate). If you've only got 6 hours before you need to have the data back online, and your backup media can restore at 20Gb/hr, then at a maximum you could only afford to have 120Gb to be recovered and still have a hope of meeting the SLA.
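The arithmetic behind that 120Gb figure can be sketched as a quick calculation. This is only an illustration, not a sizing tool: the function name max_recoverable_gb is made up for this example, and it assumes the restore rate stays constant for the whole recovery window.

```python
def max_recoverable_gb(rto_hours, restore_rate_gb_per_hour):
    """Largest volume of data (in Gb) that can be restored within the RTO.

    Assumption: the restore rate is constant, and there is no setup time
    (finding tapes, rebuilding the server, replaying logs and so on).
    """
    if rto_hours < 0 or restore_rate_gb_per_hour < 0:
        raise ValueError("RTO and restore rate must not be negative")
    return rto_hours * restore_rate_gb_per_hour


# The example from the text: a 6 hour window, restoring at 20Gb/hr
print(max_recoverable_gb(6, 20))  # 120
```

In practice the usable window will be shorter than the headline RTO, since some of it gets eaten by everything that has to happen before the restore can even start.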

There are a few things that can be done to mitigate this requirement:

  • Agree a more forgiving RTO.
  • Accept a lower RPO (Recovery Point Objective is, in essence, the stage you need to get to - eg have all the data back up and running, or possibly have service restored but with no historical data, such as with dial-tone recovery in Exchange).
  • Reduce the volume of data which will need to be recovered in series - by separating out into multiple databases per server, or by having multiple servers.
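To see why the last of those options helps, here's a rough sketch comparing a restore done in series with one done in parallel. The model is hypothetical: it assumes every restore stream (a separate tape drive or disk target, say) runs at the full rate, which real hardware may well not manage, and the name restore_hours is invented for the example.

```python
def restore_hours(database_sizes_gb, rate_gb_per_hour, parallel_streams=1):
    """Estimate the wall-clock hours needed to restore a set of databases.

    Assumed model: each stream restores one database at a time at the full
    rate, and the next database goes to whichever stream is least loaded.
    """
    if parallel_streams < 1:
        raise ValueError("need at least one restore stream")
    if rate_gb_per_hour <= 0:
        raise ValueError("restore rate must be positive")
    streams = [0.0] * parallel_streams
    # Largest first gives a reasonable (not guaranteed optimal) spread
    for size in sorted(database_sizes_gb, reverse=True):
        least_loaded = streams.index(min(streams))
        streams[least_loaded] += size / rate_gb_per_hour
    return max(streams)


databases = [30, 30, 30, 30]  # four 30Gb databases, 120Gb in total
print(restore_hours(databases, 20))                      # 6.0
print(restore_hours(databases, 20, parallel_streams=4))  # 1.5
```

The same 120Gb that takes the full 6 hours as one serial restore comes back in a quarter of the time if it can be restored as four streams at once.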

Set realistic expectations

Now, it might sound like a non-starter to say that the RTO should be longer, or the RPO less functional - after all, the whole point of backup & disaster recovery is to carry on running even when bad stuff happens, right?

It's important to think about why data is being backed up in the first place: it's a similar argument to using clustering for high availability. You need to really know if you're looking for availability, or recoverability. Availability means that you can keep a higher level of service, by continuing to provide service to users even when a physical server or other piece of infrastructure is no longer available, for whatever reason. Recoverability, on the other hand, is the ease and speed with which service and/or data can be brought online following a more severe failure.

I've spoken with lots of customers over the years who think they want clustering, but in reality they don't know how to operate a single server in a well-managed and controlled fashion, so adding clusters would make things less reliable, not more. I've also spoken with customers who think they need site resilience, so if they lose their entire datacenter, they can carry on running from a backup site.

Since all but the largest organisations tend to run their datacenters in the same place where their users are (whether that "datacenter" is a cupboard under the stairs or the whole basement of their head office), in the event that the entire datacenter is wiped out, it's quite likely that they'll have lots of other things to worry about - like where the users are going to sit? How is the helpdesk going to function, and communicate effectively with all those now-stranded users? What about all the other, really mission critical applications? Is email really as important as the sales order processing system, or the customer-facing call centre?

In many cases, I think it is acceptable to have a recovery point objective of, within a reasonable time, delivering a service that will enable users to find each other and to send & receive mail. I don't believe it's always worth the effort and expense that would be required to bring all the users' email online at the same time - I'd rather see mail service restored within an hour, even if it takes 5 days for the historical data to come back, compared to 8 hours for restoring any kind of service which included all the old data.

How much data to fit on each server in the first place

Microsoft's best practice advice has been to limit the size of each Exchange database to 50Gb (in Exchange 2003), to make the backup & recovery process more manageable. If you built Exchange 2003 servers with the maximum number of databases, this would set the size "limit" of each server to 1Tb of data. In Exchange 2007, this advisory "limit" has been raised to 100Gb maximum per database, unless the server is replicating the data elsewhere (using the Continuous Replication technology), in which case it's 200Gb per database. Oh, and Exchange 2007 raises the total number of databases to 50, so in theory, each server could now support 10Tb of data and still be recoverable within a reasonable time.
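Those per-server figures are just the advisory database size multiplied by the number of databases, as this small sketch shows. Note that the count of 20 databases for Exchange 2003 is inferred from the 1Tb figure above rather than stated there, and 1Tb is treated as 1000Gb for simplicity; the function name is made up for the example.

```python
def advisory_server_capacity_gb(database_count, max_gb_per_database):
    """Advisory data 'limit' per server: databases x advised size of each."""
    return database_count * max_gb_per_database


# Exchange 2003: 50Gb per database; 20 databases is an inferred figure
print(advisory_server_capacity_gb(20, 50))   # 1000

# Exchange 2007: 50 databases, 100Gb each without Continuous Replication
print(advisory_server_capacity_gb(50, 100))  # 5000

# Exchange 2007: 50 databases, 200Gb each with Continuous Replication
print(advisory_server_capacity_gb(50, 200))  # 10000
```

The 10Tb headline number only applies to the replicated case; without replication, the same advice works out at 5Tb per server.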

The total amount of data that can be accommodated on a single server is often used to make a decision about how many mailboxes to host there, and how big they should be - it's pretty common to see sizes limited to 200Mb or thereabouts, though it does vary hugely (see the post on the Exchange Team blog from a couple of years ago to get a flavour). Exchange 2007 now defaults to having a mailbox quota of 10 times that size: 2Gb, made possible through some fundamental changes to the way Exchange handles and stores data.

Much of this storage efficiency now derives from Exchange 2007 running on 64-bit (x64) servers, meaning there's potentially a lot more memory available for the server to cache disk contents in. A busy Exchange 2003 server (with, say, 4000 users), might only have enough memory to cache 250Kb of data for each user - probably not even enough for caching the index for the user's mailbox, let alone any of the data. In Exchange 2007, the standard recommendation would be to size the server so as to have 5Mb or even 10Mb of memory for every user, resulting in dramatically more efficient use of the storage subsystem. The pay-off is that the traditional performance bottleneck on Exchange, the I/O throughput of the storage subsystem, is reduced considerably.
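To put those per-user figures into whole-server terms, here's a rough calculation for the 4000-user example. It only covers the memory implied by the per-user numbers above (a real server needs more on top for the operating system and Exchange itself), and the function name is invented for the illustration.

```python
def per_user_memory_total_gb(users, kb_per_user):
    """Total memory, in Gb, implied by a per-user allowance given in Kb."""
    return users * kb_per_user / (1024 * 1024)


users = 4000

# Exchange 2003 example: roughly 250Kb of cache per user
print(f"{per_user_memory_total_gb(users, 250):.1f}")        # 1.0

# Exchange 2007 guidance: 5Mb or 10Mb per user
print(f"{per_user_memory_total_gb(users, 5 * 1024):.1f}")   # 19.5
print(f"{per_user_memory_total_gb(users, 10 * 1024):.1f}")  # 39.1
```

So the same population of users goes from about 1Gb of cache to somewhere between 20Gb and 40Gb, which is the kind of jump that only makes sense once the server is 64-bit.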

NET: Improvements in the underlying storage technology within Exchange 2007 mean that it is feasible to store a lot more data on each server, without performance suffering and without falling foul of your RTO/SLA goals.

I've posted before about Sizing Exchange 2007 environments.

What to back up and how?

When looking at backup and recovery strategies, it's important to consider exactly what is being backed up, how often, and why.

Arguably, if you have a 2nd or 3rd online (or near-online) copy of a piece of data, then it's less important to back it up in a more traditional fashion, since the primary point of recovery will be another of the online copies. The payoff for this approach is that it no longer matters as much if it takes a whole weekend to complete writing the backup to whatever medium you're using (assuming some optical or magnetic media is still in play, of course), and that slower backup is likely to be used only for long-term archival or for recovery in a true catastrophe when all replicas of the data are gone.

Many organisations have sought to reduce the volume of data on Exchange for the purposes of meeting their SLAs, or because keeping large volumes of data on Exchange was traditionally more expensive due to the requirements for high-speed (and often shared) storage. Because a 64-bit Exchange server can have more memory, the hit on I/O performance can be much lower, meaning that a 2007 server could host more data with the same set of disks than an equivalent 2003 server would (working on the assumption that Exchange will have historically hit disk I/O throughput bottlenecks before running out of disk space). The simplest way to reduce the volume of data stored on Exchange (and therefore, data which needs to be backed up and recovered on Exchange), is to reduce the mailbox quota of the end users.

In the post, Exchange mailbox quotas and 'a paradox of thrift', I talked about the downside of trying too hard to reduce mailbox sizes - the temptation is for the users to stuff everything into a PST file and have that being backed up (or risk being lost!) outside of Exchange. Maybe it's better to invest in keeping more data online on Exchange, such that it's always accessible from any client, replicated to users' PCs when running in Cached Mode, and indexed for easy retrieval by either the Exchange Server or the client PC (unlike some archiving systems which require client-side software, thereby rendering the data inaccessible to non-Outlook clients).

NET: Taking data off Exchange and into either users' PST archive files, or a centralised archiving system, may reduce the utility of the information by making it less easy to find and access, and could introduce more complex data management procedures as well as potential additional costs of ownership.

Coming to a datacenter near you

An interesting piece of "sleeper" technology may help cut short the discussions of backup technique: System Center Data Protection Manager, to give it its full title, or simply DPM. DPM has been available for a while, targeted at backing up and restoring file server data, but the second release (DPM 2007) is due soon and adds support for Exchange (as well as SharePoint and SQL databases). In essence, DPM is an application running on Windows Server that manages snapshots of the data source(s) it's been assigned to protect. The server will happily take snapshots at regular intervals, and can keep them in a near-line state or move them to offline (ie tape) storage for archival.

With very low cost but high-capacity disks (such as Serial-Attached SCSI arrays or even SATA disks deployed in fault-tolerant configurations), it could be possible to have DPM servers capable of backing up many Tbs of data as the first or second line of backup, before spooling off to tapes on an occasional basis for offsite storage. A lot of this technology has been around in some form for years (with storage vendors typically having their own proprietary mechanisms to create & manage the snapshots), but with a combination of Windows' Volume Shadow Copy Service (VSS), Exchange's support for VSS, and DPM's provision of the back-end to the whole process, the cost of entry could be significantly lower.

NET: Keeping online snapshots of important systems doesn't need to be as expensive as in the past, and can provide a better RTO and RPO than alternatives.

So, it's important to think about how you back up and restore the Exchange servers in your organisation, but by using Exchange 2007, you could give the users a lot more quota than they've had before. Using Managed Folders in Exchange, you could cajole the users into keeping this data more free of stuff they don't need to keep, and make it easier for them to keep the stuff they do. All the while, it's now possible to make sure the data is backed up quickly and at much lower cost than would have been previously possible with such volumes of data.