After some positive comments arising from my High Availability post I’ve decided to Blog on DR earlier than I had intended. I hadn’t intended to blog on DR for a while because it’s a difficult subject and I was hoping it might get easier if I left it to stew a little longer.
Okay so let’s start by having a clear definition of Disaster Recovery. (Disaster Recovery sounds like an oxymoron; if you can recover from a disaster, was it really a disaster in the first place?) A disaster is the loss of a data centre. This could be a loss of power, flood, fire, earthquake or train/plane crash. This is to separate DR from High availability (HA). High availability refers to the recovery from a probable hardware failure or even more probable human moment of lunacy. (It’s a sad fact that non-failover clustered systems have a better up time than failover clustered systems, the biggest influence here being the ignorance of those people with access to the systems.)
DR is all about getting backups of your data offsite. The only real questions you have to ask are: how much data are you prepared to loose in the event of a disaster, how long can you afford to be ‘down’ and how much money do you have to spend? (Similar to the questions you ask when planning your HA).
So simple DR, used by many, is to simply take a backup every so often and take it off site. Obviously there’s plenty of scope for loosing data and being down for a good while, but hey a disaster’s a disaster isn’t it. Yeh – you all know about this easy stuff, what you all want to know about is the zero data loss, dual data centre strategy, don’t you.
There are two dual data centre strategies, Remote Mirrored data and Geo-Cluster (Stretch Cluster). Both of these strategies currently depend on a SAN infrastructure to replicate the data. The Mirrored data strategy relies on human intervention to invoke the failover while, as the name suggests, a Geo-Cluster automates the failover like a normal failover cluster.
Independent SQL Server at each site
SAN to SAN data copy
User databases (including active log) copied in real time
System Catalog not copied. User procedures required to keep them in sync
For setup & fail-over
Third party & user procedures for automation
Mount volumes at secondary site host & attach to SQL Server
The independent SQL Server at each site can be MSCS clusters
For detail refer http://www.microsoft.com/technet/prodtechnol/sql/2000/deploy/hasog05.mspx
Third party extensions to MSCS cluster offering
Listed in Windows Server Cataloghttp://www.microsoft.com/windows/catalog/server/default.aspx?xslt=categoryproduct&subid=22&pgn=b55095f4-71f3-4b26-98b1-05f3a9506d0d
Virtual SQL Server offered to clients
SAN to SAN data copy including system catalog
Requires synchronous copy
Third party & user procedures to augment Microsoft provided procedures
Majority Node Clusters and ‘split brains’
Big problem with geo-clusters has traditionally been with the concept of the ‘split brain’. Split brain occurs when the heart beat between the cluster nodes is broken by a failure in the network (man drives JCB through cable). If this occurs the failover node will try to come up whilst the primary is still functioning – ooops. Majority node clusters overcome this problem by maintaining an odd number of servers in the cluster such that if the heart beat network fails the data center with the majority of the nodes assumes responsibility for the database. Please see
Kinda sounds simple so far? Well it all starts to get more complicated when you start to look at how the data is mirrored. There several ways to remotely copy (mirror) your data:
Data pages are written synchronously at either end of the mirrored set and SQL Server isn’t told of a successful write until the writes are completed at both ends. This is the only mode to use with Geo-Clusters.
Some kind of weird scenario that doesn’t seem to curry much favour
Acknowledgments for writes are given to SQL Server as soon as the data is written at the primary site. Data is written at the secondary site as and when.
Software-based Remote Copy
Self explanatory – Veritas etc
Adaptive, Remote, Peer-to-Peer
No inherent D/R capability on its own – lets not go here.
Multi-Hop, Three Data-Center
Designed for long distance DR. A synchronous copy to nearby storage and then an asynchronous copy to a distant storage. Means the source system doesn’t have to incur the latency to the distant storage.
Each method is usually called something different by the big four SAN Vendors (HP, Hitachi, EMC and IBM) and the implementation can vary too, so you can see this is now getting complicated.
But hang on which methods are supported by Microsoft? Remote mirroring is supported with the caveat that any data integrity issues related to data replication should be addressed by the customer to the hardware vendor and not to Microsoft.
SQL Server requires that ‘write ahead logging’ be honoured, there be no torn pages and pages arrive in the correct order. The mirroring configuration needs to support this, but don’t take my word as a definitive statement of the requirements – you must discuss this with the hardware vendor and make sure they’re prepared to support you. I suggest you properly test the configuration by pulling the plug on the primary site.
Key considerations and comparison for Remote Mirroring & Geo-Clustering
Dependent on 3rd. party enhancements
Data Integrity requires that 3rd. party enhancements must ensure “Write Sequence Dependency” & protection from “Torn Pages”
Offers automatic failure detection & fail-over automation
Requires less work – no extra logins & schema synchronization – single image due to virtual SQL Server
Offers an additional option for client redirection – use of virtual name
Requires down time for SQL Server 2000 maintenance
Requires user developed procedures for failure detection & fail-over
Requires more work for logins & schema synchronization – independent SQL Servers
Offers an additional option for data copy – Asynchronous (with less adverse performance impact, but with potential data loss) besides synchronous
Offers potential continuous availability during maintenance with planned fail-over. Stretch Clustering along with Log Shipping offers similar functionality
This has to get a mention as a concern. Naturally you’ll be looking to do a commit at the remote site and apparently you can expect this to take 3ms or so I’ve heard. Then you’ve got to add the amount of time it takes to get to the remote sight and back at the speed of light. At 3*10^8 ms light would take 6ms seconds to go to a remote site 100km away. Shouldn’t be too bad, but I gather its SQL flushing the cache that puts the system under stress, merely because of the volume data that suddenly decides to flow down the pipe.
Danfoss in Denmark are using data mirroring between two towns – this case study shows their SAP architecture:
Akzo Nobel, great Geo-Cluster example with two unmanned data centres using IBM kit shark storage (ESS). This cases study is amusing: we found it on the IBM website, I gave them a ring to find out more about it, they took it off their website within 24 hours. I guess they might have decided it was a better case study for Microsoft than it was for IBM, after I had come sniffing around.
For more information about EMC SRDF solutions, visit the following EMC Web site:
EMC: Symmetrix Remote Data Facility (SRDF) Connectivity Solutionshttps://powerlink.emc.com/HighFreq/P_C852.2_srdf_connectivity_sol.wp_ldv.pdf
Hitachi: Disaster Recovery Issues and Solutions http://www.hds.com/products_services/storage_mgmt_software/business_continuity/
HP: StorageWorks Data Replication Manager http://h18004.www1.hp.com/products/sanworks/drm/index.html
I’d be very interested in your experiences with this type of technology, if you’d care to share them.
Interesting post ...
BTW, Note that the Hitachi link above is broken:
The problem I keep running into with this type of setup is that everyone wants it, but noone wants to PAY for it. These setups are NOT cheap. Can anyone setup anything like this for less than $100K? If so, let me know!