In part 1 of this blog we discussed the architecture and design of the Lync 2013 pChat HA/DR components. Now we will discuss how this design behaves during various failures within a Lync infrastructure. These failure scenarios are based on our Disaster Recovery diagram from part 1 (Figure 1). We will cover the following failure scenarios:
Scenario 1A: Lync FE Pool failure (Figure 1)
Figure 1: EEPool1 fails in East Datacenter
In this scenario our Front End Pool (EEPool1) in the East Datacenter has a complete failure. In order for users to get reconnected to pChat1 we will need to fail over our paired pool from the East Datacenter to the West Datacenter. This will allow EEPool2 to route pChat connections to our pChat pool. This scenario would have the same failover steps regardless of whether all pChat servers are active in one datacenter or split between the two (discussed in part 1).
1 The same steps will be performed regardless of whether all of the pChat servers are active in a single datacenter or they are active in both datacenters.
2 The –DisasterMode parameter is used with the Invoke-CsPoolFailOver cmdlet because our Pool has completely failed.
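As a sketch, the failover itself is a single cmdlet; the pool FQDN below is a placeholder for illustration, so substitute your own pool's FQDN:

```powershell
# Fail EEPool1 over to its paired pool in the West Datacenter.
# -DisasterMode is required here because EEPool1 has completely
# failed and cannot participate in a graceful failover.
Invoke-CsPoolFailOver -PoolFqdn "eepool1.contoso.com" -DisasterMode
```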
Once the failover is complete users can reconnect and all pChat functionality will be restored.
Scenario 1B: Lync FE Pool failback
Once our Front End Pool (EEPool1) in the East Datacenter comes back online and all services are restored we need to perform a failback.
1 Although failback isn't required (traffic will continue to route through the FE pool in the secondary datacenter), it is recommended.
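The failback is roughly the mirror image of the failover; again, the pool FQDN is a placeholder:

```powershell
# Move users homed on EEPool1 back once the pool and all of its
# services are healthy again.
Invoke-CsPoolFailBack -PoolFqdn "eepool1.contoso.com"
```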
Scenario 2A: Complete Site failure (Figure 2)
Figure 2: Entire East Datacenter failure – all services in this Datacenter are down (FE, pChat, SQL, File Store, Edge)
Next, we look at a complete Lync site failure. This encompasses all Lync services, including FE, pChat, SQL, File Store, and Edge. In order to restore all user services, including pChat, we will need to activate all of these services in the secondary datacenter. To keep this post brief, a link to the Edge failover process is provided at the end.
1 Add all pChat servers (using FQDNs) in the secondary datacenter (West Datacenter), separated by commas, with each server name in quotes.
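A minimal sketch of the pChat portion of the site failover is below; all FQDNs are placeholders for this example, and your active-server list should contain whichever West Datacenter pChat servers you deployed:

```powershell
# 1. Fail the Front End pool over to its paired pool in the
#    West Datacenter (-DisasterMode because the site is down).
Invoke-CsPoolFailOver -PoolFqdn "eepool1.contoso.com" -DisasterMode

# 2. Activate the pChat servers in the West Datacenter. All servers
#    that should become active are passed as a comma-separated list
#    of quoted FQDNs.
Set-CsPersistentChatActiveServer -Identity "pchatpool.contoso.com" `
    -ActiveServers "pchat3.contoso.com","pchat4.contoso.com"
```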
Once the failover is complete users can reconnect and all pChat functionality will be restored using services from the secondary (paired) datacenter.
Scenario 2B: Complete Site failback
Once the East datacenter comes back online and all services come online we need to perform a failback.
Set-CsPersistentChatActiveServer (no switches)
If you enabled DB mirroring on the backup DB in the secondary datacenter, disable mirroring.
1 Add all pChat servers (using FQDNs) that were active before the failover, separated by commas.
2 Although failback isn't required (traffic will continue to route through the FE pool in the secondary datacenter), it is recommended.
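The failback steps above can be sketched as follows; all FQDNs are placeholders, and the active-server list should match the servers that were active before the failover:

```powershell
# 1. Fail users back to EEPool1 in the East Datacenter.
Invoke-CsPoolFailBack -PoolFqdn "eepool1.contoso.com"

# 2. Re-activate the pChat servers that were active before
#    the failover.
Set-CsPersistentChatActiveServer -Identity "pchatpool.contoso.com" `
    -ActiveServers "pchat1.contoso.com","pchat2.contoso.com"
```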
Scenario 3A: pChat Pool Server failure (Figure 3)
Figure 3: pChat Pool failure in East Datacenter
The third failure scenario that we will explore is a pChat Pool Server failure in which all active pChat servers are located in the East Datacenter1. We will need to failover all pChat services to the secondary datacenter. Once we complete the failover of the pChat services, EEPool1 will route pChat traffic to the pChat servers located in the secondary datacenter.
1 In this scenario if we have active pChat servers split between the datacenters pChat functionality would continue to work. For optimal performance you would still want to follow the steps above in order to failover the backend pChat DB. This should be done during off hours so that production user impact is minimized.
2 Add all pChat servers (using FQDNs) in the secondary datacenter (West Datacenter), separated by commas, with each server name in quotes.
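Assuming the pChat backend DB is mirrored to the West Datacenter, the failover can be sketched as below; all FQDNs are placeholders for illustration:

```powershell
# 1. Fail the pChat backend database over to its mirror in the
#    West Datacenter.
Invoke-CsDatabaseFailover -PoolFqdn "pchatpool.contoso.com" `
    -DatabaseType PersistentChat -NewPrincipal Mirror

# 2. Activate the pChat servers in the West Datacenter so that
#    EEPool1 routes pChat traffic to them.
Set-CsPersistentChatActiveServer -Identity "pchatpool.contoso.com" `
    -ActiveServers "pchat3.contoso.com","pchat4.contoso.com"
```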
Once the failover is complete EEPool1 will reconnect the users to the pChat servers located in the secondary (paired) datacenter.
Scenario 3B: pChat Pool Server failback
Once the pChat servers in the East datacenter come back online we should failback1. This should be done during off hours so that production user impact is minimized.
1 The pChat services will continue to function in this failed-over state even after the pChat servers in the East Datacenter come back online. The reason we want to fail back is so that we return to the same production state we were in prior to the pChat server failures in the East Datacenter. This ensures we are in a "Normal" state instead of "Failed Over".
2 Add all pChat servers (using FQDNs) that should become active, separated by commas, with each server name in quotes.
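The failback is the reverse of the failover; as before, all FQDNs below are placeholders:

```powershell
# 1. Fail the pChat backend database back to its original
#    principal in the East Datacenter.
Invoke-CsDatabaseFailover -PoolFqdn "pchatpool.contoso.com" `
    -DatabaseType PersistentChat -NewPrincipal Primary

# 2. Re-activate the pChat servers in the East Datacenter,
#    returning the pool to its "Normal" state.
Set-CsPersistentChatActiveServer -Identity "pchatpool.contoso.com" `
    -ActiveServers "pchat1.contoso.com","pchat2.contoso.com"
```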
Scenario 4A: pChat SQL Data Loss
The final failure scenario covers the loss of data from the backend SQL pChat DB (mgc) or user error (accidental deletion). This database holds the pChat room content, principals, and access permissions for the pChat rooms. The pChat data can be backed up in one of the following two ways: with the Export-CsPersistentChatData cmdlet, or with a SQL Server backup of the database.
Figure 4: Export-CsPersistentChatData cmdlet & ZIP file contents
Data that is created by using SQL Server backup requires significantly more disk space—possibly 20 times more—than that created by Export-CsPersistentChatData, but SQL Server backup is more likely to be a procedure that administrators are familiar with. The export in Figure 4 utilizing the Export-CsPersistentChatData cmdlet totaled 150 KB vs. a 4 MB SQL backup.
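A minimal export sketch is below; the SQL instance and file path are placeholders for illustration:

```powershell
# Export the pChat room content, principals, and access
# permissions from the backend (mgc) database to a ZIP file.
Export-CsPersistentChatData -DBInstance "sql01.contoso.com\pchat" `
    -FileName "C:\Backup\PersistentChatExport.zip"
```

Because the export is so compact, it can be scheduled frequently without meaningful storage cost, while a SQL Server backup of the mgc database remains a reasonable alternative for teams already standardized on SQL backup procedures.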
Scenario 4B: pChat SQL Data Recovery
To restore the data that we backed up above, perform the steps below.
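If the backup was taken with Export-CsPersistentChatData, the restore can be sketched as below; the SQL instance and file path are placeholders matching the export example:

```powershell
# Import the previously exported pChat data back into the
# backend (mgc) database.
Import-CsPersistentChatData -DBInstance "sql01.contoso.com\pchat" `
    -FileName "C:\Backup\PersistentChatExport.zip"
```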
Hopefully this helps everyone understand what process they will need to perform based on their specific failure.
Edge Failover Process - http://technet.microsoft.com/en-us/library/jj721897.aspx