Robert's Rules of Exchange is a series of blog posts in which we take a fictitious company, describe their existing Exchange implementation, and then walk you through the design, installation, and configuration of their Exchange 2010 environment. See Robert's Rules of Exchange: Table of Blogs for all posts in this series.
In this post, we want to discuss some of the thought process you should go through as you decide on your storage infrastructure for Exchange 2010. Further, we will discuss how and why you should test your storage. These processes belong early in the planning stages for Exchange 2010, because a storage problem found after users have been moved to that storage is a lot more painful to fix than one found and fixed before any users are affected.
Historically, Exchange deployments have seen storage represent a significant portion of deployment costs, both capital expenditures (purchase cost of the storage) and operational expenditures (ongoing maintenance of the storage). In the Exchange 2003 timeframe, to get any sort of "high availability", you needed what we now call "single copy clusters" using shared storage, and that typically meant a SAN infrastructure, which is expensive and complex. We would also see companies that would aggressively keep mailbox sizes small because of the cost of disks when requiring RAID 10 or other similar solutions to get the IO performance out of the disk subsystems. Customers would buy as much space as they could afford, and then set mailbox sizes small enough that they wouldn't run out of space on the SAN.
When users would complain about mailbox sizes, customers would go to great lengths to give more space. Thus was born the third party "stubbing" archival solution. These products would take attachments and old emails out of Exchange, replace them with a very small message "stub", and put the email data into a separate system, which had a separate storage infrastructure. So, we're adding the cost of the archival software, the cost of additional storage, the cost of teaching someone to manage that system, and the complexity of another whole system that must be monitored and managed.
Something had to change, and Microsoft heard our customers loud and clear. With Exchange 2007 we reduced IOPS/mailbox (Input/Output Operations per Second - a measure of the disk load generated) by 70% compared to an equivalent Exchange 2003 user. With Exchange 2010, we reduced IOPS/mailbox by another 70% compared to Exchange 2007. That means that Exchange 2010 generates approximately 10% of the IO requests that an equally loaded Exchange 2003 system would.
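As a quick check of that arithmetic, here is a minimal sketch that compounds the two reductions. The function name is purely illustrative; the only inputs are the two 70% figures quoted above.

```python
def remaining_io_fraction(reductions):
    """Return the fraction of the original IO load left after applying
    each fractional reduction in turn (0.70 means a 70% reduction)."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= (1.0 - reduction)
    return remaining

# Exchange 2003 -> 2007 (70% reduction), then 2007 -> 2010 (70% again).
fraction = remaining_io_fraction([0.70, 0.70])
print(f"Exchange 2010 IO relative to Exchange 2003: {fraction:.0%}")
```

Two successive 70% reductions leave 0.3 x 0.3 = 0.09 of the original load, which is where the "approximately 10%" figure comes from.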
This opens up a whole new set of possibilities utilizing very large, very slow, very cheap disks such as 7200RPM, 2TB SATA or SAS disks.
One of the most important things to remember as you are planning your Exchange 2010 architecture (or any other solution, for that matter) is to keep things simple. Any time you add some sort of complexity to your solution, you raise the chance of deployment failure, you introduce the possibility that the capital expenditures or deployment costs will be higher, and you most likely raise the operational costs of the solution as well. The more complexities you add to your system, the higher the chances of failure or increased costs. So, for every single design decision we make, we will drive down our complexity factor as much as possible. For this discussion, we should consider the complexity of our storage infrastructure.
When you design a SAN infrastructure, the key is to provide enterprise storage that is highly available. This need for high availability drives complexity high. Typically, for a SAN-based storage infrastructure, your servers are connected via multiple fibre channel HBA connections to multiple, redundant fibre channel switches. Then these redundant switches are connected via redundant connections to multiple SAN I/O modules. The SAN modules are then connected (via more redundant connections) to "frames" that hold many, many disks. There are all kinds of computers and controllers and software and firmware throughout this storage infrastructure, and they all need to be managed. So if you deploy a SAN and have true high availability requirements (24x7 datacenter requirements), you end up with a staff of people trained to do nothing other than manage your storage infrastructure, including in many cases full-time consultants from the SAN vendor for the 3-5 years that you will be using that storage infrastructure. SANs have some very interesting replication technologies and backup technologies built in, but every single thing you want to add on is at a cost and at an addition of complexity.
If you contrast that with a JBOD solution, using simple servers and simple SCSI connections, where we have provided the redundancy through multiple copies of your data (let's say 2 copies in the local datacenter, and 2 copies in the remote datacenter, which seems to be a quite popular scenario), you have a very different picture. You don't need redundant network or fibre channel connections. You don't need RAID controllers (well, you do need to RAID the OS drives, for instance, but not all those Exchange data drives). You don't need redundant switches for both the MAPI network and the storage area network. You can greatly simplify the solution by having a single MAPI network connection per server, a replication network or two per server, and a single connection to a set of disks that are not in a RAID configuration. We then allow Exchange to fail databases or entire servers over, if necessary. If 2 copies in a single datacenter aren't good enough for you, go to 3 copies. Or go to 4 copies. Whatever works for your environment and meets your requirements.
The significant point here is that we don't want any complexity that we can avoid. When having discussions about disks with my customers, I always start at JBOD. Not every customer is going to deploy on JBOD, but we must have the discussion. We must understand the ramifications that the complexity of anything other than a simple JBOD deployment brings to the table. I push hard to not move away from the simplest solution unless there are real requirements to do so. (And I am fond of saying that "because we don't do it that way in our shop" is not a requirement!)
Most people have low confidence in the idea of not using RAID to store their important email data. But you should be aware that somewhere in your personal life, you are probably storing your precious data on a JBOD system - either for your email or for some other important data. Google uses JBOD in their Google File System - that's right, your Gmail email account is stored without RAID. Much of Microsoft's Office 365 cloud service utilizes JBOD for storage (the older Exchange 2007 architecture leverages RAID storage infrastructure, but the Exchange 2010 infrastructure in Office 365 is deployed using JBOD). Microsoft's own internal deployment of Exchange 2010 has over 180,000 users on a JBOD infrastructure. Further, Microsoft's TerraServer uses JBOD to store all of the data presented through that web site, so this isn't just email moving in this technological direction.
Another thing customers "push back" on is the idea that "nearline" SATA/SAS class disks fail more often than enterprise class fibre channel (FC) or SCSI disks, and that a single disk failure in a RAID solution doesn't affect service availability, so it must be a better solution. The reasoning goes that if I have enterprise class disks in a RAID solution, they don't fail as often and when they do the user impact is lessened. To answer that, we want to look at the two claims separately. First, do midline/nearline disks fail more often than enterprise FC or SCSI disks? According to many studies ("Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?", "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?", "Failure Trends in a Large Disk Drive Population", and "Empirical Measurements of Disk Failure Rates and Error Rates"), nearline disks do not fail more often than enterprise disks, but rather all disks seem to fail more often than their published MTTF (mean time to failure) numbers.
Note that in Microsoft's Live@EDU infrastructure, we utilize nearline 7.2K SATA drives and we see a 5% annual failure rate (AFR), while in MSIT we leverage nearline 7.2K SAS drives and we see a 2.75% AFR there. Microsoft therefore recommends that if you are considering these nearline drives for a JBOD architecture, you choose the 7.2K RPM SAS drives rather than SATA.
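To put those AFR numbers in operational terms, the rough sketch below converts an annual failure rate into an expected number of disk replacements per year. The fleet size of 1,000 disks is a hypothetical round number chosen for illustration, not a figure from either Microsoft environment, and the calculation assumes failures are spread evenly across the fleet.

```python
def expected_annual_failures(disk_count, annual_failure_rate):
    """Expected number of disk failures per year for a fleet, given an
    AFR expressed as a fraction (0.05 means 5%)."""
    return disk_count * annual_failure_rate

FLEET_SIZE = 1000  # hypothetical fleet size, for illustration only

sata_failures = expected_annual_failures(FLEET_SIZE, 0.05)    # 5% AFR (SATA)
sas_failures = expected_annual_failures(FLEET_SIZE, 0.0275)   # 2.75% AFR (SAS)

print(f"7.2K SATA: about {sata_failures:.1f} failed disks per year")
print(f"7.2K SAS:  about {sas_failures:.1f} failed disks per year")
```

Under these assumptions, the SAS fleet would need a little over half as many disk swaps per year as the SATA fleet.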
What about the impact of disk failures with RAID systems vs non-RAID systems? Do we see less user impact from a RAIDed disk failure vs a JBOD disk failure? With RAID, we have to engineer for a RAID "rebuild" factor. RAID solutions have the capability to have spares available that the system can automatically "swap in", meaning make them live and part of the RAID array. During the period while the RAID array is being rebuilt, there is an impact to the IO capabilities as seen by the operating system and therefore, by Exchange. This just makes sense because the data from the remaining members of the RAID array is being read so that the data on the failed drive can be reconstructed on the newly promoted "spare", so you have an unusually heavy read cycle on the disks that did not fail, and unusually heavy write cycle on the new disk in the array. To properly size an Exchange system, you have to plan for this RAID rebuild process to occur during the heaviest utilization period of the day, so there is a cost associated with this outside the cost of the RAID system itself - you need more disks to provide the appropriate IOPS during the rebuild period. You can see the worst case impact of this on the number of disks required in the RAID Rebuild numbers of the Exchange 2010 Mailbox Server Role Requirements Calculator. Build a RAID solution in the calculator and go to the "Storage Design" tab and change the appropriate RAID rebuild overhead percentages to see what that impact really is. Note that to get the "right" RAID rebuild overhead percentage, you should work with your storage vendor.
With Exchange 2010, we have moved most of the data resilience solution away from RAID (where the redundancy is disk based and not data aware) into the Exchange 2010 high availability architecture (where the application is certainly data aware across all instances of the data). As detailed in the TechNet article New High Availability and Site Resilience Functionality, Exchange 2010 implements the capability to have multiple copies of your email data (up to 16 copies, which is generally a bit excessive - we will typically see 2-6 copies, with both 2 and 6 being extreme cases), and even goes to great lengths to "fix" a given copy of the data if a database page is found to be inconsistent in a database using our single page restore technology. Worst case scenario is that a single copy of your data becomes corrupted to the point that it can no longer be used, and if that was the "live" copy of the database, you have a single database fail to another server. This generally will cause an outage of less than 30 seconds, and since the users are connected to the CAS server now and not the mailbox server, they might not even notice the outage!
You still might say, "But the RAID solution will automatically put a disk into the array and rebuild it for me, and JBOD just won't do that." This is correct. Somehow you have to get a spare disk online and available, formatted at the OS level, and reseed the database that was lost. The storage subsystem in some cases will be able to automatically bring a spare disk online and present it to the appropriate server. And it is a fairly straightforward process to script the mounting and formatting of the LUN, and the reseeding of the Exchange database (from passive, mind you, thus not affecting users while the seed process takes place). It isn't automatic, but it can certainly be automated! And, this would allow you to hire a young intern, get them A+ certified and let them be the person walking the datacenter floor looking for the blinking red lights and replacing disks rather than some US$200/hour consultant.
I have talked to customers who claim that storage is half of their Exchange deployment cost on a per mailbox basis. Microsoft did a very large amount of work to let you make storage decisions that drive those costs down very aggressively. In a blog post like this, I can't tell you which storage architecture to choose. Microsoft supports SAN and DAS+RAID along with DAS+JBOD because no single storage architecture is right for every customer out there.
What I can say with confidence is that unless you are in a situation where money is no object and will continue to be no object, you must seriously evaluate the storage solutions available to you. If you just deploy Exchange 2010 on a storage architecture because that's the way you did Exchange 2003, or you choose your storage architecture because the corporate policy states that you must have all data use an enterprise storage solution, then you are doing yourself and your company an injustice.
Larger, cheaper disks mean that an organization that moves aggressively to a JBOD (RAID-less) direct attached storage infrastructure utilizing "big, cheap" disks can drive "per mailbox per month" costs down. When online "cloud" email providers such as Google and Microsoft are offering email services in the $5/month/mailbox range, organizations that wish to host their own email should also be looking to lower their costs at least into that same ballpark. Whether you are a company that has shareholders you answer to, or a government agency that has to answer to the taxpayer, driving costs down to better utilize the money you have in your budget while still providing a better solution for your users should be of paramount importance.
With the economy the way it is, we all need to save money everywhere we can, and I certainly would hate to be the guy who made the decision to go with the complex and expensive solution when we could have saved significant storage costs and provided a simpler, more available and more manageable solution.
Sizing storage for Exchange has always been about the tradeoff between size and performance. You use a tool like the Exchange Profile Analyzer to find your users' profiles - how many messages they send/receive per day, how large those messages are, how much data they have in their mailboxes, etc. From this information, we can estimate very closely what your IOPS requirements will be on your disk subsystems. Then you define your requirements for how large you want the mailboxes to be. We take those together and throw them into the Exchange 2010 Mailbox Server Role Requirements Calculator, and that tells us how many disks we need to meet our goals, and what type of RAID or JBOD system we need. We can then balance size vs. performance by changing the spreadsheet inputs (disk size, mailbox size, number of users that have what IO profile, etc.) to see the impacts.
We can also do "what-if scenarios" around small fast disks (using RAID) vs large slow disks (not using RAID). To see this, let's take a copy of the calculator downloaded directly from the web site and make as few changes as possible. The default numbers in the calculator have a 6-node DAG with 3 copies, all in a single datacenter. The default mailbox profile is 24,000 mailboxes, 100 messages, 75k, 2GB mailboxes and 14 day deleted item retention. Let's just play with the disks for now and not change anything else.
First, let's look at smaller, faster disks in a RAID array. I'll turn off "consider JBOD", and set all disks to be 15K RPM FC disks in a 600GB size. This solution (remember, I only changed the disks here) will take 666 disks, and this will be a size constrained solution. In fact, by adding a 2.0 IOPS multiplier for every user, we won't change the number of disks required in this solution. Removing the IOPS multiplier and halving the mailbox size to 1GB reduces the number of disks to 378. (Dropping the mailbox size to 256MB still leaves us needing 246 disks.)
Now, let's take the exact same scenario, but look at the 7.2K 2TB drives utilizing JBOD. Turn "consider JBOD" back on, and change the drives to our 7.2K RPM 2TB drives. This results in 168 drives (2GB mailboxes for 78 disks fewer than the 256MB mailboxes above!). This configuration looks fairly balanced between IOPS and disk space utilization. If I cut the mailbox size in half, the number of disks doesn't change, meaning that I am pretty much IO bound on this configuration. I can add half a GB to each mailbox (2.5GB per mailbox) and it only bumps us to 186 drives in the organization.
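The terms "size constrained" and "IO bound" used above can be illustrated with a deliberately simplified sketch: the disk count is whichever is larger, the number of disks needed for capacity or the number needed for IOPS. This is not the calculator's method (it ignores database copies, RAID overhead, free-space headroom, and much else), and the per-mailbox IOPS and per-disk IOPS figures are made-up placeholders, so the results will not match the calculator's numbers.

```python
import math

def disks_required(mailboxes, mailbox_gb, iops_per_mailbox, disk_gb, disk_iops):
    """Return (disk_count, limiting_factor) for a naive sizing model that
    takes the larger of the capacity-driven and IOPS-driven disk counts."""
    by_capacity = math.ceil(mailboxes * mailbox_gb / disk_gb)
    by_iops = math.ceil(mailboxes * iops_per_mailbox / disk_iops)
    limiting = "capacity" if by_capacity >= by_iops else "IOPS"
    return max(by_capacity, by_iops), limiting

MAILBOXES = 24000          # mailbox count from the calculator defaults above
IOPS_PER_MAILBOX = 0.125   # hypothetical placeholder, not a measured profile

# Small, fast disk: 600GB with a hypothetical 150 IOPS per disk.
fast = disks_required(MAILBOXES, 2, IOPS_PER_MAILBOX, 600, 150)

# Large, slow disk: 2000GB with a hypothetical 50 IOPS per disk.
slow = disks_required(MAILBOXES, 2, IOPS_PER_MAILBOX, 2000, 50)

# Halving the mailbox size on the large, slow disk.
slow_half = disks_required(MAILBOXES, 1, IOPS_PER_MAILBOX, 2000, 50)

print("600GB fast disks:", fast)
print("2TB slow disks:  ", slow)
print("2TB, 1GB mailbox:", slow_half)
```

In this toy model the small fast disks are limited by capacity, while the large slow disks are limited by IOPS, so shrinking the mailboxes on the large disks leaves the disk count unchanged. That is the same pattern the calculator shows in the walkthrough above.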
Exchange 2010 is still a random database by nature, and because of this, we are still concerned with the IOs generated by the mailbox when we design our storage infrastructure. But in the above example, even though it is possibly quite simplistic and only looks at the disks themselves, you can look at the costs (and I'll allow you to get your own costs for disks from your favorite vendors) like this:
168 disks (7.2K 2TB SATA) * (cost per disk) = (total disk cost)
666 disks (15K 600GB FC) * (cost per disk) = (total disk cost)
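To fill in those two lines with your own numbers, a sketch like the following is enough. The per-disk prices here are placeholders invented for illustration, not real quotes; substitute the prices from your own vendors.

```python
def total_disk_cost(disk_count, cost_per_disk):
    """Total purchase cost for a number of identical disks."""
    return disk_count * cost_per_disk

# Placeholder prices only; replace with real quotes from your vendor.
NEARLINE_COST_PER_DISK = 300   # hypothetical price for a 7.2K 2TB disk
FC_COST_PER_DISK = 600         # hypothetical price for a 15K 600GB FC disk

jbod_total = total_disk_cost(168, NEARLINE_COST_PER_DISK)
raid_total = total_disk_cost(666, FC_COST_PER_DISK)

print(f"168 nearline disks: {jbod_total}")
print(f"666 FC disks:       {raid_total}")
print(f"Difference:         {raid_total - jbod_total}")
```

Even if the two disk types cost the same per disk, the FC design would cost roughly four times as much simply because it needs 666 disks instead of 168.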
Most of the customers I work with are astounded at the cost difference. PSTs are a problem (can't get to them from OWA, Windows Phone 7, iPhone or Blackberries, can't store them on a file share in a supported way, can't back up from desktops easily, can't easily search for compliance scenarios, etc), users are screaming for larger mailboxes, and a third party archive solution to relieve the pressure of mailbox size on the smaller disks carries its own cost. There are just so many reasons that utilizing the larger, slower and less expensive nearline disks is very attractive.
Exchange 2010 provides a new capability with the Archive Mailbox. This is, at the most basic definition, a second mailbox associated with a given user. These archive mailboxes are only available to "online" users (Outlook 2007 and Outlook 2010 users currently connected to the network, and OWA 2010 users). They are not available to older Outlook clients (Outlook 2007 support requires the Outlook 2007 December 2010 Cumulative Update or later), to offline Outlook users no matter the version or whether they have cached mode configured, or to any POP, IMAP, Exchange ActiveSync, Blackberry, or most Exchange Web Services clients (support is there in EWS, but clients are limited at this time). Once again, as of January 2011, the only clients that work are OWA 2010, Outlook 2007, and Outlook 2010 when it is able to connect to the network where you have Exchange published (Internet for Outlook Anywhere or on the corporate network for RPC/MAPI communications). That's it. This means that there are some limitations. But the capabilities are still quite interesting and attractive.
For instance, what if your users have extremely large mailboxes - say 25 or 40 GB mailboxes - and they need to travel with laptops? OST files to support mailboxes of that size would probably not work well on your typical 5400 or 7200 RPM laptop drive (but will work great on the latest generation solid state drive). It is possible that you could partition their mailboxes to give them a 2 or 5 GB primary mailbox that can be synced to the client with cached mode, and then a 20 or 35 GB archive mailbox for older, less accessed data, which would be available when the user is network connected.
Or, for another very attractive scenario, let's assume that you contract with Microsoft Office 365 to host those 35GB archive mailboxes. You host the primary mailbox, and you have Microsoft Office 365 host the online archive mailbox. That, to me, sounds like a quite interesting possibility for quite a few customers!
One thing to keep in mind is that for almost every function typically associated with the archive mailbox, those functions are also available with a simple primary mailbox. Retention policies (which are policies that delete messages based upon their age), as well as things like "legal hold" or "single item recovery" that some customers seem to associate with the archive mailbox, are all available to you even without an archive mailbox. Very few features are specific to the archive mailbox, other than the actual separation of data, and the ability to move those archive mailboxes to other physical servers, other DAGs, or even into the cloud with Microsoft's Office 365 services.
For Robert's Rules, the use of these archive mailboxes for the majority of our users is out of scope. Our idea of simplicity of deployment dictates that if we don't have a requirement driving us to implement a feature, we should not do so. We will implement some archive mailboxes to show how policies can be used to manipulate data in the separate mailboxes, as well as to show how users would use these mailboxes, and hopefully to show how I can have a hybrid solution with some archive mailboxes stored in the Office 365 cloud, but that will be just for demonstration purposes and will not be part of the primary deployment for our Robert's Rules customer.
This is another discussion I have with a lot of customers, and one where Exchange 2010 has a fantastic "story". As I mentioned above, one thing that customers did to control mailbox size in the Exchange 2003 timeframe was to implement an archival solution that utilized a "stubbing" technology of some sort or other. Based on the size of the message and/or the age of the message, the email payload in Exchange could be extracted from the Exchange database itself and stored in another system, leaving only a "pointer" or "message stub" in Exchange. This stub could be utilized to retrieve the message payload when the user wanted to open that message.
At Microsoft, our customers have told us of a few limitations of this type of system. One thing I've heard many times is that users don't like the stubbing solutions. It makes data access difficult, and the data access is never the same between two systems (say OWA and Outlook). Quite often the users cannot access their stubbed data from mobile devices, or the implementation for the mobile devices lags behind the main implementation.
Another issue we have with these stubbing or archival systems is that they bring complexity and the associated costs. As we look at our overall messaging service and try to drive down complexity and cost, this is certainly one thing we can look at as a "low hanging fruit". With Exchange 2010, Microsoft has done a lot of work to enable large mailboxes. 5GB, 10GB, even 25GB mailboxes are becoming more common at our customer locations. Utilizing the simple JBOD storage and the large, slow, cheap drives (7200 RPM 2TB drives, for instance), we can implement these large mailboxes cheaper and with a higher availability than we have ever been able to do with a previous version of Exchange.
So, the question becomes this: If you can implement your storage on the cheapest storage infrastructure possible, and you can provide your users with the capability to store everything in their mailboxes for 5 or 10 years, why would you want to add a stubbing solution, raising complexity by adding another software system and storage system to maintain?
Please keep in mind that we aren't talking about a compliance situation here. We are not talking about a situation where we need to keep every message sent through the system for 7 or 10 years, where we need complex case management systems or similar. This would be a journaling system, which Exchange does provide the technologies to interface with. That would be a very different discussion, and one that we aren't going to get into in this blog post (that's for another time).
Robert's Rules doesn't have a requirement for any third party archival products - we will be implementing large mailboxes on simple, inexpensive storage for this solution.
After you have made your storage related decisions, designed your disk subsystem, purchased and implemented it all, you still need to test it. The beauty of that is that we have a great tool called Jetstress that utilizes the Exchange ESE database binaries to drive a database on disks in the same manner that Exchange will in your production environment.
What is great is that just recently, a good friend of mine named Neil Johnson released his Jetstress Field Guide document on the Exchange Team Blog. What a fantastic document to help you understand exactly what you should do to test your storage subsystem, and why you need to do the things you need to do. I certainly used this when re-familiarizing myself with Jetstress just before going to my Exchange 2010 MCM rotation. I can't recommend this documentation enough for those of you that will need to run Jetstress! There is nothing I can add here that is not in that doc, so go grab it if you haven't already.
Storage and storage performance are important to Exchange 2010, just as they have been to every other version of Exchange. This is why we start all of our Exchange planning with the mailbox role servers - the storage itself and the processing power and memory necessary to provide access to that storage. It all keys from the mailbox servers.
So, as you are thinking about your storage infrastructure, remember to start simple. Design as simply as you possibly can. Only add complexity when you absolutely must, and only when based on messaging requirements. Try to break the mold of thinking about Exchange 2010 as if it needs to be designed like an Exchange 2003 environment. Try to leverage the new capabilities in Exchange 2010 to provide the functionality that your users need to better do their jobs.
Thanks to Andrew Ehrensing, Ross Smith IV and Matt Gossage for some of the links and storage information above. And as always, thanks to Ross (again) and Bharat Suneja for tech review and formatting/publishing help. I may not say it every time, but gentlemen, it is appreciated every time!