Note: I have recently changed roles to become part of the Information Protection Team at Microsoft (the group responsible for building AD RMS and related technologies), where I will be acting as a Sr. Program Manager. Since the team already has a blog on AD RMS, I have decided to concentrate my efforts on that blog, which you can find at http://blogs.technet.com/rms. My previous blog posts have already been moved there, and in the future you should go to that blog for updates and news (quite a few of them are coming!).
You can find this particular post at http://blogs.technet.com/b/rms/archive/2012/04/16/ad-rms-redundancy-and-fault-tolerance-part-1.aspx.
A very common question when deploying a service in the enterprise is how to make it resilient. In the case of AD RMS this typically means setting up the infrastructure so server failures don’t affect users’ ability to consume content. In this post we will cover the back-end side of the solution, and in a future post we will discuss making the AD RMS servers themselves resilient.
The first thing to consider when designing AD RMS for high availability is that, once activated, clients in many cases don’t need to reach the AD RMS server infrastructure for the RMS-related operations they perform. Content protection with Office applications is always performed offline on the client, so even if a client can’t reach an AD RMS server, the user will be able to protect a document or email without problems. When protecting a document, the client also issues itself an Author license, which allows the user who has just protected the document to continue consuming it without contacting the server, even after closing and reopening it.
Users consuming content protected by others might also be able to use the content without first contacting the server. That depends on several factors: whether they have acquired a use license before, whether that license is cacheable and whether it is still within its validity period. Whether a license can be cached is defined either in the template used to protect the document or by the “Require a connection to verify a user’s permission” setting available when manually protecting a document (this setting can also be pre-set on the author’s machine via registry overrides). Cacheable licenses are valid for up to one year, but a shorter maximum duration can be specified in a template.
A user might have acquired a license before if the document (email or attachment) was pre-licensed by Microsoft Exchange. In these cases, Exchange will acquire a license on behalf of the user before delivering the email and any attachments to the client, so the user won’t have to get a license at the time the content needs to be consumed.
In any of these cases, when the user holds a valid license on the machine, the client won’t need a connection to the AD RMS server in order to use the document, so whether the server is working and accessible won’t have an impact on the user’s ability to consume the content.
But for the initial consumption of a non-prelicensed piece of content, or for consuming a document after any previously acquired license has expired, the client will have to contact the AD RMS server. A client will also need to contact the AD RMS server the first time it is used on a machine, or when it is time to renew the machine’s or the user’s certificates, which typically last one year.
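The rules above can be summarized in a short sketch. This is purely illustrative: the class, field names and defaults below are hypothetical stand-ins, since the real decision logic lives inside the RMS client, not in anything you would write yourself.

```python
from datetime import datetime, timedelta

# Hypothetical model of a cached use license; the real RMS client keeps
# licenses in its own local license store, not in a class like this.
class CachedLicense:
    def __init__(self, issued, validity_days, cacheable):
        self.issued = issued
        self.validity_days = validity_days  # up to 365 by default
        self.cacheable = cacheable          # False if "require a connection" was set

def needs_server_contact(machine_activated, license_):
    """Return True if the client must reach an AD RMS server to open content."""
    if not machine_activated:
        return True   # first use on this machine: certificates must be obtained
    if license_ is None:
        return True   # no previously acquired use license for this content
    if not license_.cacheable:
        return True   # author required a connection to verify permission
    expiry = license_.issued + timedelta(days=license_.validity_days)
    return datetime.utcnow() >= expiry  # cached license past its validity period

# Example: an activated machine with a fresh, cacheable license works offline.
lic = CachedLicense(datetime.utcnow() - timedelta(days=30), 365, True)
print(needs_server_contact(True, lic))  # False: content opens with no server contact
```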
Since those situations are not all that uncommon, you should in most cases ensure your AD RMS infrastructure is highly available.
Let’s review which components need to be made highly available for AD RMS as a whole to be highly available.
Besides the client components, the key elements that make AD RMS tick are the AD RMS servers themselves, the AD RMS database and Active Directory. It should be fairly obvious that any AD infrastructure needs to be made highly available, and most AD deployments are made with that in mind. One thing that is often missed, though, is that an AD RMS server will only talk to an AD Global Catalog server for tasks such as group expansion, which means the AD RMS servers must always have access to a GC in order to work effectively. While the AD caching database that’s part of the AD RMS database infrastructure can provide some ability to continue operating when a GC is not available, it is no substitute for a GC located close to the AD RMS servers, and since any DC might be out of service at some point, you should have at least two GCs in the same network or site as the AD RMS servers.
The next element to consider is the AD RMS database. AD RMS uses the database for multiple tasks, including:

- Storing the cluster’s configuration and keys
- Logging usage information
- Caching Active Directory group-expansion results
- Storing and retrieving copies of the users’ Rights Account Certificates (RACs)
- Storing rights policy templates and supporting administrative tasks such as reporting
The first thing to note is that AD RMS retrieves the server’s configuration and the cluster keys when the service is started. So once an AD RMS server is up and running it will continue to work even when the database is unreachable.
Also, most of the write operations between an AD RMS server and its database server are performed through Message Queuing running on the AD RMS host. That means that if the database is unavailable at some point, AD RMS will keep performing these operations, holding the information in a local queue, and it will flush the queued data into the database once it becomes available again. As a result, AD RMS can continue to work for long periods of time without access to the AD RMS database, and logging information will continue to be gathered.
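The buffering behavior that Message Queuing provides can be illustrated with a minimal sketch. This is an illustration of the pattern only, under the assumption of a toy in-memory queue; AD RMS actually relies on MSMQ, and none of these class or method names come from the product.

```python
from collections import deque

class FakeDatabase:
    """Stand-in for the logging database; 'available' simulates an outage."""
    def __init__(self):
        self.available = True
        self.rows = []
    def write(self, entry):
        self.rows.append(entry)

class BufferedLogger:
    """Queue writes locally while the database is down, flush when it returns."""
    def __init__(self, db):
        self.db = db
        self.queue = deque()

    def log(self, entry):
        self.queue.append(entry)
        self.flush()  # opportunistically drain if the database is reachable

    def flush(self):
        # Drain the local queue only while the database accepts writes.
        while self.queue and self.db.available:
            self.db.write(self.queue.popleft())

db = FakeDatabase()
logger = BufferedLogger(db)
db.available = False
logger.log("license issued")  # buffered locally; the service keeps running
db.available = True
logger.flush()                # queued entries now reach the database
```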
Regarding the caching database, AD RMS will only use it if it is available. If the RMS database is unreachable, AD RMS will simply contact a Global Catalog for fresh information about users and groups, and other than a potential load increase in AD this shouldn’t affect the performance of the service.
Which leaves us with the last point in the list above: storing and retrieving copies of the users’ RACs. This is the only frequently performed operation that requires access to the RMS database. AD RMS needs the database every time a user is activated on a new machine, both to check for a pre-existing RAC and to save a new RAC when one is created. It also needs access to pre-existing RACs when Exchange pre-licenses content, since in order for Exchange to request a use license on behalf of a user, the server needs to know the public key of the user the license will be issued to.
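The RAC lookups described above amount to keyed reads and writes against the database. A toy sketch, with entirely hypothetical names and a dictionary standing in for the certification database:

```python
# Toy model of the RAC store: the certification server checks for an
# existing RAC before minting a new one, so a user keeps the same key
# material across machines. Real RACs are certificates, not strings.
rac_store = {}  # user -> fake "public key" string

def activate(user):
    """Return the user's RAC, creating and persisting one if needed (needs DB)."""
    if user not in rac_store:
        rac_store[user] = f"rac-pubkey-for-{user}"
    return rac_store[user]

def prelicense(user):
    """Exchange pre-licensing needs the user's existing RAC public key."""
    rac = rac_store.get(user)
    if rac is None:
        # Without database access there is no RAC to bind the license to.
        raise LookupError("no RAC on file; user must activate first")
    return f"use-license-bound-to-{rac}"
```

The design point the sketch mirrors is that both paths hit the database on every request, which is why these are the operations most affected by a database outage.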
So what does this all mean?
It means that if the AD RMS database fails, AD RMS will continue to operate almost normally, with only partial loss of functionality. What functionality will be lost?

- Users activating AD RMS for the first time on a new machine won’t be able to obtain a RAC
- Exchange won’t be able to pre-license content for users
- Usage logs will accumulate in local queues instead of being written to the database
- Lookups normally served by the caching database will fall back to the Global Catalogs
- Administrators won’t be able to create reports or modify policy templates
Of these, the first two should normally be the limiting factors, with the other issues being generally tolerable in most cases for short periods of time. That means that if the AD RMS database is down or inaccessible for a few minutes, or in many cases even for a few hours, most users won’t notice a problem and will continue to work mostly unaffected.
Does this mean that the AD RMS database is not important? No! It is of utmost importance to the AD RMS system. If the database is gone for good, so is your protected data, so the RMS database needs a reasonable level of protection from failure. But it does mean that the database’s continuous availability is not as essential as it would be for many other applications relying on a database.
So let’s review some of the most obvious alternatives for protecting the RMS DB against failure.
A failover cluster seems like a good option at first glance, but when we look at it in more depth we can see it provides the sort of protection that AD RMS doesn’t really need. The strength of a failover cluster is that it provides almost immediate (sub-minute) recovery from hardware, operational or application failures, and it allows some types of server maintenance, such as patching, to be performed without affecting the service. But a failover cluster doesn’t provide much protection against data-centric failures, such as a storage unit failure or an operational or software error that corrupts the data. Granted, these events are very infrequent, or at least they should be, but they are the sort of events that take a long time to recover from. Since AD RMS isn’t seriously affected by brief interruptions in the database, instantaneous recovery of the database provides only marginal value, whereas something that protects the service against longer interruptions would be highly recommended. So we conclude that using a failover cluster for hosting the AD RMS database is not the most efficient use of resources: it is an expensive configuration that adds little value and doesn’t protect us against the type of events that should concern us most.
A warm standby provides a different type of protection. While it can in theory recover the database to a working state in a few minutes, most of these solutions require some manual intervention, so it is a fair assumption that up to one hour, sometimes more, could pass before the database is fully recovered from a serious failure. But that’s generally not a problem for AD RMS. Users can typically go for one hour without Exchange pre-licensing as long as they can acquire a license when they want to consume the content. There shouldn’t be many new users, or users setting up a new machine, during the specific hour when the service is operating in contingency mode, and since they will typically have other issues during the first few hours after setting up a machine anyway, not being able to activate AD RMS on the first try, while a nuisance, shouldn’t be a big problem. And while you are trying to recover AD RMS from a catastrophic database failure, not being able to create AD RMS consumption reports or modify policy templates should be the least of your concerns.
So we can see that this is a valid and very useful configuration that can protect us against the type of problem we should be concerned about at a generally lower cost than a failover cluster.
But so will a good backup strategy. As long as your backup solution can recover a server within a few hours, you might be able to do quite well without a hot or warm standby solution. Of course, your backup system needs to be a well-oiled process, your backups need to be tested, and you have to have some sort of hardware spares to rebuild the database server when needed. It is troubling to see how many companies have a thorough backup solution in place but don’t have a good recovery plan for when things fail. But as long as you can trust that your backups work and that you will be able to restore them to working hardware within a few hours (and maybe even apply the last transaction logs from the original database, if the data is still accessible and the database is set to the full recovery model), you should be fine. AD RMS will continue to run after reconnecting to the database, and most users shouldn’t even notice the interruption.
One caveat here: in order to be able to recover the AD RMS database on another system, you need to have configured AD RMS to use a DNS alias to refer to the database server. This is a general recommendation that should always be followed with AD RMS: don’t call the database server by its proper name; use an alias (a CNAME or a manually created A record in DNS) during setup to point AD RMS to the database server. You will regret it if you don’t, when recovering from a server failure and maybe even during future upgrade and migration processes. And the same applies to the AD RMS servers themselves: always use an alias, and not the servers’ own names, for the AD RMS server URLs. This will avoid some grief in the future, guaranteed.
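A quick way to sanity-check such an alias is to resolve it and confirm it points at the host you expect. A small sketch, assuming a hypothetical alias name ("rmsdb.contoso.com" below is an example, not a name from the post):

```python
import socket

def resolve_alias(alias):
    """Return the canonical host name and IPv4 addresses an alias resolves to."""
    # gethostbyname_ex follows CNAMEs: 'canonical' is the real host behind
    # the alias, so after repointing DNS it should show the new DB server.
    canonical, aliases, addresses = socket.gethostbyname_ex(alias)
    return canonical, addresses

# Example (requires the record to exist in your DNS):
# print(resolve_alias("rmsdb.contoso.com"))
```

After a recovery you repoint the alias at the rebuilt database server, and AD RMS keeps using the same configured name without any change on the AD RMS servers themselves.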
So now we know what we need to do to have an AD RMS database and a directory server that is resilient enough to provide support to an AD RMS server infrastructure that’s highly available. So what about the AD RMS servers themselves? Well, this post is long enough already, so let’s leave that for my next post.
great information. Thank you.
One more thing. How is the template scenario handled? Are they all cached in memory or the file system on the RMS servers so that the server can still evaluate templates in case of an unavailable DB?
I think one point you should make clearer is that all DB solutions except the failover cluster require manual actions to re-establish the DB connection. So if you have a global company w/o 24/7 support and the failure takes place on a Friday afternoon, you might have a longer downtime for new user activation. In case of a cluster, the admin comes back into the office on Monday and re-establishes redundancy of the cluster first thing in the morning.
Or have I overlooked something here?
This is a great blog. When will part 2 be available? We are in the process of planning our deployment, including redundancy.
Sorry about the delay in responding.
Part 2 is available now, so I hope it will still be useful to you.
Again, sorry about the delay.
Regarding the templates question: yes, templates are cached in each server's memory after being accessed for the first time, so template-based license issuance should remain available if the DB fails after the server has been in use for a while.
Good point about log shipping requiring manual failover, which might not be available in organizations without 24/7 support. Keep in mind, though, that the most common types of DB failures with modern hardware are not hardware related. Redundant power supplies, SANs and RAID arrays, ECC memory and other redundant hardware components make hardware errors quite rare in a typical database server these days, and most downtime is either planned or due to human error, which in some cases is not completely addressed by a failover cluster. Log shipping can take much longer and require manual intervention, but there are very few failure modes it can't address.
YMMV, of course, and it is up to each company to analyze their failure patterns and decide whether a failover cluster, log shipping or some other mechanism will provide better results for them.
When in doubt, nothing prevents implementing both a failover cluster AND log shipping (to either another cluster or a stand-alone server), getting you the best of both worlds (at a price).
Log shipping provides cost effectiveness but not 100% surety of continuous service. Often the requirement from companies is that the service should be available 24/7. If the company can support the O&M cost, failover is a good option.
That is a good point, but the key here is that for certain functionality 24/7 operation can be achieved without 100% database uptime.
Exchange pre-licensing and related operations do require that the AD RMS database in the certification cluster be available (more on that in a future post). But in environments where end users perform protection and consumption from Outlook and there's no Exchange server-side licensing, the database can be down for a good while without users feeling the impact. For those environments, log shipping can be an ideal option.