Mat Stephen's SQL Server WebLog

All things SQL Server. Complied by Matthew Stephen - SQL Server Product Specialist, Microsoft UK

Blogs

SQL Server Disaster Recovery (DR) and SANs, Remote Mirrors and Geo-Clusters

  • Comments 2
  • Likes

After some positive comments arising from my High Availability post I’ve decided to Blog on DR earlier than I had intended.  I hadn’t intended to blog on DR for a while because it’s a difficult subject and I was hoping it might get easier if I left it to stew a little longer.

 

Okay so let’s start by having a clear definition of Disaster Recovery.  (Disaster Recovery sounds like an oxymoron; if you can recover from a disaster, was it really a disaster in the first place?)  A disaster is the loss of a data centre.  This could be a loss of power, flood, fire, earthquake or train/plane crash.  This is to separate DR from High availability (HA).  High availability refers to the recovery from a probable hardware failure or even more probable human moment of lunacy.  (It’s a sad fact that non-failover clustered systems have a better up time than failover clustered systems, the biggest influence here being the ignorance of those people with access to the systems.)

 

DR is all about getting backups of your data offsite.  The only real questions you have to ask are: how much data are you prepared to loose in the event of a disaster, how long can you afford to be ‘down’ and how much money do you have to spend? (Similar to the questions you ask when planning your HA).

 

So simple DR, used by many, is to simply take a backup every so often and take it off site.  Obviously there’s plenty of scope for loosing data and being down for a good while, but hey a disaster’s a disaster isn’t it.   Yeh – you all know about this easy stuff, what you all want to know about is the zero data loss, dual data centre strategy, don’t you.

 

There are two dual data centre strategies, Remote Mirrored data and Geo-Cluster (Stretch Cluster).  Both of these strategies currently depend on a SAN infrastructure to replicate the data.  The Mirrored data strategy relies on human intervention to invoke the failover while, as the name suggests, a Geo-Cluster automates the failover like a normal failover cluster.

 

Remote Mirroring:

 

      Independent SQL Server at each site

      SAN to SAN data copy

      User databases (including active log) copied in real time

      System Catalog not copied. User procedures required to keep them in sync

      For setup & fail-over

      Third party & user procedures for automation

      Mount volumes at secondary site host & attach to SQL Server

      The independent SQL Server at each site can be MSCS clusters

      For detail refer http://www.microsoft.com/technet/prodtechnol/sql/2000/deploy/hasog05.mspx

 

Geo-Cluster

 

      Third party extensions to MSCS cluster offering

      Listed in Windows Server Catalog
http://www.microsoft.com/windows/catalog/server/default.aspx?xslt=categoryproduct&subid=22&pgn=b55095f4-71f3-4b26-98b1-05f3a9506d0d

      Virtual SQL Server offered to clients

      SAN to SAN data copy including system catalog

      Requires synchronous copy

      For setup & fail-over

      Third party & user procedures to augment Microsoft provided procedures

      For detail refer http://www.microsoft.com/technet/prodtechnol/sql/2000/deploy/hasog05.mspx

 

Majority Node Clusters and ‘split brains’

 

Big problem with geo-clusters has traditionally been with the concept of the ‘split brain’.  Split brain occurs when the heart beat between the cluster nodes is broken by a failure in the network (man drives JCB through cable).  If this occurs the failover node will try to come up whilst the primary is still functioning – ooops.  Majority node clusters overcome this problem by maintaining an odd number of servers in the cluster such that if the heart beat network fails the data center with the majority of the nodes assumes responsibility for the database.  Please see

http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/qsvclust.mspx

 

Kinda sounds simple so far?  Well it all starts to get more complicated when you start to look at how the data is mirrored.  There several ways to remotely copy (mirror) your data:

 

      Synchronous

Data pages are written synchronously at either end of the mirrored set and SQL Server isn’t told of a successful write until the writes are completed at both ends.  This is the only mode to use with Geo-Clusters.

      Semi-synchronous

Some kind of weird scenario that doesn’t seem to curry much favour

      Asynchronous

Acknowledgments for writes are given to SQL Server as soon as the data is written at the primary site.  Data is written at the secondary site as and when.

      Software-based Remote Copy

Self explanatory – Veritas etc

      Adaptive, Remote, Peer-to-Peer

No inherent D/R capability on its own – lets not go here.

      Multi-Hop, Three Data-Center

Designed for long distance DR.  A synchronous copy to nearby storage and then an asynchronous copy to a distant storage.  Means the source system doesn’t have to incur the latency to the distant storage.

 

Each method is usually called something different by the big four SAN Vendors (HP, Hitachi, EMC and IBM) and the implementation can vary too, so you can see this is now getting complicated.

 

But hang on which methods are supported by Microsoft?  Remote mirroring is supported with the caveat that any data integrity issues related to data replication should be addressed by the customer to the hardware vendor and not to Microsoft.

SQL Server requires that ‘write ahead logging’ be honoured, there be no torn pages and pages arrive in the correct order.  The mirroring configuration needs to support this, but don’t take my word as a definitive statement of the requirements – you must discuss this with the hardware vendor and make sure they’re prepared to support you.  I suggest you properly test the configuration by pulling the plug on the primary site.

 

Key considerations and comparison for Remote Mirroring & Geo-Clustering

      Both

      Dependent on 3rd. party enhancements

      Data Integrity requires that 3rd. party enhancements must ensure “Write Sequence Dependency” & protection from “Torn Pages”

      Geo-Clustering

      Offers automatic failure detection & fail-over automation

      Requires less work – no extra logins & schema synchronization – single image due to virtual SQL Server

      Offers an additional option for client redirection – use of virtual name

      Requires down time for SQL Server 2000 maintenance

      Remote Mirroring

      Requires user developed procedures for failure detection & fail-over

      Requires more work for logins & schema synchronization – independent SQL Servers

      Offers an additional option for data copy – Asynchronous (with less adverse performance impact, but with potential data loss) besides synchronous

      Offers potential continuous availability during maintenance with planned fail-over. Stretch Clustering along with Log Shipping offers similar functionality

 

Latency

 

This has to get a mention as a concern.  Naturally you’ll be looking to do a commit at the remote site and apparently you can expect this to take 3ms or so I’ve heard.  Then you’ve got to add the amount of time it takes to get to the remote sight and back at the speed of light.  At 3*10^8 ms light would take 6ms seconds to go to a remote site 100km away.  Shouldn’t be too bad, but I gather its SQL flushing the cache that puts the system under stress, merely because of the volume data that suddenly decides to flow down the pipe.  

 

Examples:

Danfoss in Denmark are using data mirroring between two towns – this case study shows their SAP architecture:

http://www.microsoft-sap.com/docs/Danfoss.doc

Akzo Nobel, great Geo-Cluster example with two unmanned data centres using IBM kit shark storage (ESS).  This cases study is amusing: we found it on the IBM website, I gave them a ring to find out more about it, they took it off their website within 24 hours.  I guess they might have decided it was a better case study for Microsoft than it was for IBM, after I had come sniffing around.

http://www.microsoft-sap.com/docs/AKZO%20Nobel.pdf

 

 

For more information about EMC SRDF solutions, visit the following EMC Web site:

http://www.emc.com/products/networking/srdf.jsp

 

EMC: Symmetrix Remote Data Facility (SRDF) Connectivity Solutions
https://powerlink.emc.com/HighFreq/P_C852.2_srdf_connectivity_sol.wp_ldv.pdf

 

Hitachi: Disaster Recovery Issues and Solutions http://www.hds.com/products_services/storage_mgmt_software/business_continuity/

 

HP: StorageWorks Data Replication Manager http://h18004.www1.hp.com/products/sanworks/drm/index.html

 

I’d be very interested in your experiences with this type of technology, if you’d care to share them.

Comments
  • Interesting post ...
    BTW, Note that the Hitachi link above is broken:

    http://www.hds.com/pdf/wp117_disaster_recov.pdf#view=FitH&pagemode=bookmarks

  • The problem I keep running into with this type of setup is that everyone wants it, but noone wants to PAY for it. These setups are NOT cheap. Can anyone setup anything like this for less than $100K? If so, let me know!