Resynchronization of virtual machines in Hyper-V Replica


What is resynchronization and why is it needed?

Hyper-V Replica protects VMs by tracking and replicating changes to their virtual hard disks (VHDs). Hyper-V Replica runs 24 hours a day, 365 days a year; for any VM that has been enabled for replication, it ensures that the data on the primary site and the Replica site is kept as closely in sync as possible.

To begin with, Hyper-V Replica (HVR) requires that the data on the virtual hard disks (VHDs) of the primary and replica VMs be the same. This is achieved through the process of initial replication, which establishes a baseline on which replicated changes can be applied. However, due to factors beyond the control of the administrator, such as faulty hardware and OS bugchecks, the primary and Replica VMs can fall out of sync.

Thus in a rainy-day scenario (details in the following section), when HVR determines that the replica VM can no longer be kept in sync with the primary by applying the replicated changes, resynchronization is required. Resynchronization (or resync) is the process of re-establishing the baseline by ensuring that the primary and replica VHDs store exactly the same data.

(NOTE: In this post we will use a VM named “RESYNC VM” in all examples and screenshots.)

 

 

When does resynchronization happen?

As the table below shows, resync is not expected to occur regularly; in the normal course of replication it is a rare event. The VM enters the “Resynchronization Required” state when any one of the following conditions is encountered:

| Site | Condition | Scenario example |
| --- | --- | --- |
| Primary | Modify VHD when VM is turned off | Mount/modify VHD outside the VM, Edit disk, Offline patching |
| Primary | Size of tracking log files > 50% of total VHD size for a VM | Network outage causes logs to accumulate |
| Primary | Write failure to tracking log file | VHD and logs are on SMB and connectivity to the SMB storage is flaky |
| Primary | Tracking log file is not closed gracefully | Host crash with primary VM running. Applicable to VMs in a cluster also. |
| Primary | Reverting the volume to an older point in time; reverting the VM to an older snapshot | Volume/snapshot backup and restore |
| Secondary | Out-of-sequence or invalid log file is applied | Restoring a backed-up copy of the Replica VM; importing an older VM copy when migrating by using export-import; reverting the volume to an older point in time using volume backup and restore; reverting the VM to an older snapshot |
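The 50% condition in the table is simple arithmetic, and a small sketch makes it concrete. The helper below is hypothetical (`Test-ResyncLogThreshold` is not a Hyper-V cmdlet), and it assumes you have already measured the combined size of the tracking logs and of the VHDs yourself; how HVR measures "total VHD size" internally is not stated in this post.

```powershell
# Hypothetical helper, not part of the Hyper-V module.
# Returns $true when the tracking logs exceed the given fraction of the total VHD size.
function Test-ResyncLogThreshold {
    param(
        [Parameter(Mandatory)] [long] $LogBytes,       # combined size of the VM's tracking logs
        [Parameter(Mandatory)] [long] $TotalVhdBytes,  # combined size of the VM's VHDs
        [double] $Threshold = 0.5                      # 50%, per the table above
    )
    if ($TotalVhdBytes -le 0) { throw 'TotalVhdBytes must be positive.' }
    return ($LogBytes / $TotalVhdBytes) -gt $Threshold
}

# 45 GB of accumulated logs against 80 GB of VHDs is about 56%, so this returns True.
Test-ResyncLogThreshold -LogBytes 45GB -TotalVhdBytes 80GB
```

A check like this could be run periodically during a long network outage to see how close a VM is to needing resync.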

  

When the VM enters the “Resynchronization Required” state, the replication health becomes “Critical” and the VM is scheduled for resynchronization. At the same time, HVR stops tracking the guest writes for the VM and nothing is replicated.

The replication health will also show this message:

[Screenshot: resync 002]
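To find affected VMs without opening each replication health screen, the output of Get-VMReplication can be filtered by state. The sketch below is a hypothetical filter (`Select-ResyncPending` is my own name). The three state names are the values I would expect the State property to report for resync-related states; treat them as an assumption and confirm them against the output on your own hosts.

```powershell
# Hypothetical filter: passes through replication objects whose State is resync-related.
# The state names below are assumptions; verify them against real Get-VMReplication output.
function Select-ResyncPending {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Replication
    )
    begin {
        $pendingStates = 'WaitingForStartResynchronize', 'Resynchronizing', 'ResynchronizeSuspended'
    }
    process {
        # Compare on the string form so both enum values and plain strings work.
        if ("$($Replication.State)" -in $pendingStates) { $Replication }
    }
}

# Usage on a Hyper-V host (commented out so the sketch runs anywhere):
# Get-VMReplication | Select-ResyncPending | Format-Table VMName, State, Health
```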

 

 

 

Initiating and scheduling resynchronization

Depending on the VM setting, the user might have to trigger the resynchronization operation explicitly. When that is required, follow the instructions as given in the replication health screen:

  1. Right-click on the VM for the options
  2. Under Replication, select the Resume Replication option

You will be presented with the screen to schedule the resynchronization operation:

[Screenshot: resync 003]

To start the resync operation from PowerShell, use the Resume-VMReplication cmdlet:

Resume-VMReplication -VMName "RESYNC VM" -Resynchronize -ResynchronizeStartTime "04/15/2013 12:00:00"
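Rather than typing a fixed date, the start time can be computed. The sketch below uses a hypothetical helper, `Get-NextOccurrence`, that returns the next time a given time of day occurs; the 01:00 start in the usage comment is only an example value.

```powershell
# Hypothetical helper: returns the next DateTime at which the given time of day occurs.
function Get-NextOccurrence {
    param(
        [Parameter(Mandatory)] [TimeSpan] $TimeOfDay,
        [DateTime] $Now = (Get-Date)
    )
    $candidate = $Now.Date.Add($TimeOfDay)
    # If that time has already passed today, use tomorrow instead.
    if ($candidate -le $Now) { $candidate = $candidate.AddDays(1) }
    return $candidate
}

# Usage on a Hyper-V host (commented out so the sketch runs anywhere):
# $start = Get-NextOccurrence -TimeOfDay '01:00:00'
# Resume-VMReplication -VMName "RESYNC VM" -Resynchronize -ResynchronizeStartTime $start
```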

 

User-initiated resynchronization is also possible, but unless absolutely necessary it should be avoided. In order to explicitly force resynchronization on a VM that is not in the “Resynchronization Required” state, first suspend the replication and then initiate resync:

Suspend-VMReplication -VMName "RESYNC VM"
Resume-VMReplication -VMName "RESYNC VM" -Resynchronize

 

The scheduling of the resynchronization operation can be configured for each VM:

  1. On the primary site, open the Hyper-V Manager
  2. Right-click on the desired VM, and select the Settings… option
  3. In the left hand pane under Replication, select the Resynchronization option

[Screenshot: resync 006]

The default option is to schedule the resynchronization operation during off-peak hours. Because the operation is resource-intensive, scheduling it this way reduces the impact on running VMs.

The same can be configured in PowerShell using the Set-VMReplication cmdlet:

# Manual resync: the administrator must start resynchronization explicitly
Set-VMReplication -VMName "RESYNC VM" -AutoResynchronizeEnabled $false

# Automatic resync: the window spans the whole day, so resync can start at any time
Set-VMReplication -VMName "RESYNC VM" -AutoResynchronizeEnabled $true -AutoResynchronizeIntervalStart 00:00:00 -AutoResynchronizeIntervalEnd 23:59:59

# Scheduled resync: resync starts only within the off-peak window (00:00 to 06:00)
Set-VMReplication -VMName "RESYNC VM" -AutoResynchronizeEnabled $true -AutoResynchronizeIntervalStart 00:00:00 -AutoResynchronizeIntervalEnd 06:00:00

 

To see the resynchronization settings in PowerShell, use the Get-VMReplication cmdlet and look for the AutoResynchronizeEnabled, AutoResynchronizeIntervalStart, and AutoResynchronizeIntervalEnd fields:

Get-VMReplication -VMName "RESYNC VM" | Format-List *
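Once the interval fields are known, it can be useful to check whether a given moment falls inside the configured window. The sketch below is a hypothetical helper (`Test-InResyncWindow`). It treats a start time later than the end time as a window that wraps past midnight, which is my assumption for illustration and not documented HVR behavior.

```powershell
# Hypothetical helper: is the given time of day inside the resync window?
function Test-InResyncWindow {
    param(
        [Parameter(Mandatory)] [TimeSpan] $IntervalStart,
        [Parameter(Mandatory)] [TimeSpan] $IntervalEnd,
        [TimeSpan] $TimeOfDay = (Get-Date).TimeOfDay
    )
    if ($IntervalStart -le $IntervalEnd) {
        # Ordinary window within a single day, e.g. 00:00:00 to 06:00:00.
        return ($TimeOfDay -ge $IntervalStart) -and ($TimeOfDay -le $IntervalEnd)
    }
    # Assumption: a window such as 22:00:00 to 06:00:00 wraps past midnight.
    return ($TimeOfDay -ge $IntervalStart) -or ($TimeOfDay -le $IntervalEnd)
}

# Usage on a Hyper-V host (commented out so the sketch runs anywhere):
# $r = Get-VMReplication -VMName "RESYNC VM"
# Test-InResyncWindow -IntervalStart $r.AutoResynchronizeIntervalStart -IntervalEnd $r.AutoResynchronizeIntervalEnd
```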

 

 

 

The process of resynchronization

When the resync operation is triggered – either automatically or by the user – the following high-level sub-operations are executed in sequence:

  1. Check the VHD characteristics of primary and replica VMs: before resync can be done, these have to match. Hyper-V Replica checks the geometry and size of the disk before starting resync. Top of the list of exceptions to watch out for are size mismatches, caused by resizing either a primary or replica VHD without appropriately resizing the other one.
  2. Start tracking the VHDs:
    1. The guest writes are tracked into the log file, but these changes are not replicated until resync is completed.
    2. If resync takes too long, you might hit the “50% of total VHD size for a VM” condition and send the VM into the “Resynchronization Required” state again.
    3. Event number 29242 is logged, specifying the VM, VHDs, start block, and end block.
  3. Create diff disks for the replica VHDs: this allows the resync operation to be cancelled without leaving the underlying VHD in an inconsistent state. The diff disk with all the resynced changes is then merged back into the VHD at the end of the resync operation.
  4. Compare and sync the VHDs: the VHDs are compared block by block and only the blocks that differ are sent across the network. This can reduce the data sent over the network, depending on how different the two VHDs are. While this operation is going on:
    1. Pause Replication will stop the current resync operation. A later Resume Replication will continue the resync comparison from where it left off.
    2. Planned Failover and Test Failover are not possible.
    3. The user can do an Unplanned Failover at any point, but this will cancel the resync operation.
    4. Resync can be cancelled at any point. This keeps the VM in the “Resynchronization Required” state, and the next time replication is resumed, resync starts from the beginning.
  5. Completion of compare and sync: HVR logs event number 29244 once the compare and sync operation is done. The event specifies the VHD, VM, blocks sent, time taken, and result of the operation.
  6. Merge the resync changes to the VHD: after this operation completes, the resync operation cannot be cancelled or undone.
  7. Delete the recovery points: this is a significant side effect of resync. The recovery points are built upon the VHD as a baseline. However, resync effectively changes that baseline and makes the data stored in those recovery points invalid. After resync completes, the recovery points are built again over a period of time.
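The compare-and-sync idea in step 4 can be illustrated with a toy example. The sketch below is not HVR's implementation: `Compare-ByteBlock` is a hypothetical function, the inputs are small in-memory byte arrays standing in for VHDs, and the 4-byte block size is chosen only to keep the example readable. It returns the indices of the blocks that differ, which are the only blocks that would need to cross the network.

```powershell
# Toy illustration of a block-by-block comparison; not how HVR is actually implemented.
function Compare-ByteBlock {
    param(
        [Parameter(Mandatory)] [byte[]] $Primary,
        [Parameter(Mandatory)] [byte[]] $Replica,
        [int] $BlockSize = 4   # hypothetical block size, chosen for readability
    )
    if ($Primary.Length -ne $Replica.Length) {
        throw 'Sizes differ; resync requires matching disk sizes (see step 1).'
    }
    $differing = New-Object System.Collections.Generic.List[int]
    for ($offset = 0; $offset -lt $Primary.Length; $offset += $BlockSize) {
        $end = [Math]::Min($offset + $BlockSize, $Primary.Length)
        for ($i = $offset; $i -lt $end; $i++) {
            if ($Primary[$i] -ne $Replica[$i]) {
                # Record the block index once, then move on to the next block.
                $differing.Add([int]($offset / $BlockSize))
                break
            }
        }
    }
    return ,$differing.ToArray()
}

$primary = [byte[]](1, 2, 3, 4,  5, 6, 7, 8,  9, 10, 11, 12)
$replica = [byte[]](1, 2, 3, 4,  5, 0, 7, 8,  9, 10, 11, 12)
Compare-ByteBlock -Primary $primary -Replica $replica   # outputs 1 (only the middle block differs)
```

When the two copies are nearly identical, as here, only a small fraction of the blocks has to be sent, which is why resync can be much cheaper than a fresh initial replication.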

 

 

Resynchronization performance

Resynchronization performance was tested and compared against the performance of Online Initial Replication (IR). The setup consisted of a standalone server with 4 running VMs: 2 file servers and 2 SQL servers running typical workloads. Two VMs were replicated to a standalone Replica server. The network bandwidth was varied to see the impact. Approximately 80 GB of data was replicated during Online IR.

| Scenario | Network speed | Online IR size | Online IR time | Resync size | Resync time |
| --- | --- | --- | --- | --- | --- |
| Resync, offline scheduling | 1 Gbps | ~80 GB | ~1.5 hrs | ~5.5 GB | ~2 hrs |
| Resync, immediate | 1 Gbps | ~80 GB | ~1 hr | ~100 MB | ~1 hr |
| Resync, offline scheduling | 1.5 Mbps | ~80 GB | 4 days | ~10 GB | ~1 day |
| Resync, immediate | 1.5 Mbps | ~80 GB | 4 days | ~78 MB | ~1 hour |

The tests indicate that resync is preferable to Online IR in low speed networks. When the two sites are connected by a high speed network, resync works well for low churn workloads.
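A back-of-the-envelope calculation shows why the amount of data sent dominates on slow links. The sketch below uses a hypothetical helper, `Get-TransferTimeEstimate`, that divides size by bandwidth. It assumes decimal units (1 GB = 10^9 bytes) and ignores compression, protocol overhead, and the time resync spends comparing blocks, so it will not reproduce the measured figures exactly.

```powershell
# Hypothetical helper: naive wire-time estimate, size divided by bandwidth.
# Ignores compression, protocol overhead, and block-comparison time.
function Get-TransferTimeEstimate {
    param(
        [Parameter(Mandatory)] [double] $SizeBytes,
        [Parameter(Mandatory)] [double] $BitsPerSecond
    )
    if ($BitsPerSecond -le 0) { throw 'BitsPerSecond must be positive.' }
    $seconds = ($SizeBytes * 8) / $BitsPerSecond
    return [TimeSpan]::FromSeconds($seconds)
}

# ~80 GB over a 1.5 Mbps link: roughly 4.9 days of raw transfer.
(Get-TransferTimeEstimate -SizeBytes 80e9 -BitsPerSecond 1.5e6).TotalDays

# ~10 GB over the same link: roughly 14.8 hours.
(Get-TransferTimeEstimate -SizeBytes 10e9 -BitsPerSecond 1.5e6).TotalHours
```

Both estimates are in the same range as the measured times in the table, and the gap between them reflects the reduction in data sent that makes resync attractive on low-speed networks.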

There is also a perfmon counter for measuring the resynchronized bytes:  \Hyper-V Replica VM\Resynchronized Bytes.

 

Conclusion

Disks going out of sync is a rainy-day event in Hyper-V Replica. With the resynchronization operation, however, the product handles it gracefully, minimizing the administrative overhead and the resources used in bringing the disks back into sync.

Comments
  • Hyper-V Replica runs 24 hours, 365 days in a year ::-> I see a leap year bug right there ;)

  • Question: Is it possible to run a separate VM on the replica server that isn't part of the replication topology? For example, I want to replicate a critical production server to remote site, and at the remote replica site, also run a VM that isn't being replicated on the replica server.

  • Chris-Arch:  Absolutely! A host that has been enabled as a Replica server is first and foremost a Hyper-V host, and being a Hyper-V host it can run VMs.

    Your question also points to a well known deployment style. Since the Replica VMs are turned off during the normal course of replication, there are system resources that can be put to good use - and customers do run VMs on these hosts to utilize these resources. Prevents Replica servers from sitting idle (figuratively speaking).

    A word of caution here:  going down this path requires you to be aware of the resources required at the time of failover. If the Replica servers cannot accommodate all VMs when running then there has to be a plan to handle that situation.

  • Great writeup. We had a power outage and the Replica 2012 R2 servers rebooted. However, one of the running 'source' VMs of the Replica came up with file corruption that was not caught immediately; people had to log on and work for a few minutes before it became apparent. In that time the target VM that it was replicating to also became corrupted. Now in theory we could have a day's worth of snapshots, but my preference would be that Replica not auto-start on a reboot. I have not had any luck yet in finding out how that could be done. Your input is appreciated in advance. Arlester.

  • Hi Arlester, Let me answer this in a few stages. The short answer to your question is: there is no out-of-the-box way to do this today. Any solution to solve your problem will involve some amount of scripting.

    1) Regarding file corruption: The way Hyper-V Replica works is that if there is a possible VHD corruption detected, we set the VMs to resynchronize. So if your VM starts replicating on host restart, it means that the VHD is okay. Of course, that says nothing about the application using the VHD... and it is quite possible that there are inconsistencies at an application-level that would still be consistent at the VHD level. A good practice here would be to schedule the resynchronization or make it manual so that it is not triggered immediately. The settings for resynchronization can be found under VM Settings --> Replication --> Resynchronization.

    2) Pausing replication: In the scenario that you have described, you seem to be looking for a way to pause the replication on a host restart. There are two places where you can pause the replication: on the source or on the destination. The suggestion I would give is to write a script to monitor your source and pause VMs on the destination when the source is unreachable. Pausing replication on the destination will reject incoming packets from the source, while pausing replication on the source will stop sending the packets itself. The reason I think pausing on the destination is better is because you will need a VM object to trigger this action... and on the source this needs to be done as soon as the service is up. This can lead to timing issues which can be easily avoided by pausing on the destination site. You can use PowerShell scripts or create a SCO runbook to monitor and pause the VMs.

    Hope this helps.

  • I resized my disk (VHD), and now it is saying: Cannot Perform operation for Virtual Machine as virtual size of one or more virtual hard disk are different between primary and replica servers. Delete and re enable replication.

    What are the steps?

  • Hi Geoffrey, I would suggest reading the blog post about online resize: http://blogs.technet.com/b/virtualization/archive/2013/11/14/online-resize-of-virtual-disks-attached-to-replicating-virtual-machines.aspx

    Also, please check what your OS version is. Online resize is supported only from Windows Server 2012 R2 onwards.