The host node in the failover cluster performs a "looks alive" check every 5 seconds. An IsAlive check is performed every 60 seconds using SELECT @@SERVERNAME. If this fails the IsAlive retries 5 times and then attempts to reconnect to the instance of SQL. If all fail, then the SQL Server resource fails. Depending on the failover threshold, configuration of SQL resource, Windows Clustering will either attempt to restart on same node or failover to another available node.
During failover, Windows Clustering starts the SQL Server service for that instance on the new node, and goes through the recovery process to start the databases. After the service is started and the master database is online, the SQL Server resource is considered to be up. User databases will then go through the normal recovery process: any completed transactions in the t-log are rolled forward, and any incomplete transactions are rolled back. The length of the recovery process is dependent on how much activity must be rolled forward or rolled back upon startup.
Set the recovery interval of the server to a low number to avoid long recovery times and to speed up the failover process. SQL Server generates automatic checkpoints based on the "recovery interval" setting. Long running transactions can lead to much longer restart times than specified in the recovery interval option.
Failover/Failback Strategies
The cluster group containing SQL Server can be configured for automatic failback to the primary node when it becomes available again. By default, this is set to off.
To Configure:
Configure Node Failover Preferences
When you use more than 2 nodes, it's important to consider which nodes should own resources in the event of a failover. For example, in an n+1 configuration, each SQL Server group should have the idle node second in the list of preferred owners. N.B. Do not use cluster admin to remove nodes from the resource definition. USe SQL Server setup for that functionality.
Configure Thresholds for a Resource
Configure Thresholds for a Group
Cluster Resource Dependencies
Cluster Heartbeat
Cluster nodes use the "heartbeat" signal to check whether each node is alive at both the OS level and SQL Server level. The node hosting the SQL Server resources uses the Service Control Manager to check every 5 seconds whether the SQL Server service appears to be running. This "LooksAlive" check does not impact performance but does not perform a thorough check; the check will succeed if the service appears to be running even though it might not be operational. As a result, a deeper check must be performed; this "IsAlive" check runs every 60 seconds.
IsAlive:
If IsAlive Check fails:
The IsAlive query tolerates a few errors, but ultimately it fails if it's threshold is exceeded
During failover of the SQL Server instance, SQL Server resources start up on the new node and SQL Server goes through the recovery process to restart the databases. After the service is started and the master database is alive, the SQL Server resource is considered to be up. User databases will go through the normal recovery process. Completed transactions in the transaction log are rolled forward (the Redo phase), incomplete transactions are rolled back (the Undo phase).
In SQL Server 2005 Enterprise Edition, each user database is available to the user once the Redo phase is complete. For all other editions (and all 2000 editions), each user database is unavailable until the Undo phase completes. Length of recovery process depends on how much activity needs to be rolled forward or back upon startup.
The 'recovery interval' sp_configure option of the server can be set to a low number to avoid longer Redo recovery times and to speed up the failover process. Undo recovery time can be reduced by using shorter transactions so that uncommitted transactions do not have much to roll back.
Recommended Heartbeat Configurations
Recommended Configuration for Private Adaptor in Windows 2000 and Windows 2003