I'd like to share some tips to mitigate a few frustrating intermittent problems with Hyper-V and System Center.

1. Data Protection Manager 2010

BEFORE installing DPM agents on systems running Hyper-V R2, you must install hotfixes KB975921 and KB975354. Note that if you had a beta or RC version of DPM, upgrading the DPM server will NOT upgrade the agents, and you will not be prompted to do so. The only symptom you will see is "inconsistent replicas".

The first KB addresses a situation where the volume snapshot provider for your clustered volume crashes and never returns a "completed" signal. The volume is then not returned to its original state, the cluster resource fails, and all its virtual machines fail with it. You may be lucky and never encounter this problem, but the quality of snapshot providers varies widely, so I suggest you install the patch anyway.

The second KB addresses a common scenario where you want to back up virtual machines running on different nodes of a cluster at the same time. If the machines sit on the same cluster volume, ownership of the volume must be transferred to the node requesting the snapshot at that point. Unfortunately, this transfer may happen before the post-snapshot steps of another virtual machine backup are complete, which leaves the replica inconsistent. If at all possible, I suggest building the DPM protection groups and timing the backups in such a way that frequent transfers of volumes are avoided.

2. Virtual Machine Manager and Operations Manager

VMM and OM agents rely heavily on WMI. When you consolidate dozens of virtual machines on a cluster and want to take advantage of the PRO functionality (and hence run both agents on all nodes), the WMI service is heavily loaded and may crash. It restarts without too much fuss and no data is lost, but anything depending on it fails. In your VMM console you may notice that all machines on a node seem to fail at the same time, or that the connection to the VMM agent running on that node times out. Operations Manager will also issue critical alerts. Live migrations and deployments may be interrupted. To mitigate the problem, you must install KB974930 and KB981314.

The first KB addresses a memory leak of the Win32_Service WMI class. The second KB addresses timeouts in WMI queries about a failover cluster object.
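With four hotfixes to track across every node, a small check can help spot the ones that are still missing. The following is only a sketch: it assumes you have captured a text listing of installed updates on each node (for example from a command such as "wmic qfe get HotFixID", which is my assumption and not part of the original procedure), and the function name and sample listing are hypothetical.

```python
import re

# Hotfixes named in this post: the first two for DPM agents,
# the last two for the VMM and OM agents.
REQUIRED_HOTFIXES = ("KB975921", "KB975354", "KB974930", "KB981314")


def missing_hotfixes(installed_listing, required=REQUIRED_HOTFIXES):
    """Return the required KB ids that do not appear in the listing text."""
    # Collect every KB-style id found in the text, ignoring case.
    found = re.findall(r"KB\d+", installed_listing, re.IGNORECASE)
    installed = {kb.upper() for kb in found}
    # Preserve the order of the required list in the result.
    return [kb for kb in required if kb not in installed]


# Hypothetical listing, made up for illustration only.
SAMPLE_LISTING = """HotFixID
KB975921
KB974930
KB0000001
"""

if __name__ == "__main__":
    for kb in missing_hotfixes(SAMPLE_LISTING):
        print("Missing hotfix:", kb)
```

Run against the sample listing, the check reports KB975354 and KB981314 as missing. An empty result only means the ids appear in the listing you supplied, so it is worth confirming on the node itself before installing the agents.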

You may also want to increase the number of concurrent operations and the timeout period for the Windows Remote Management service (WinRM). To do so, you could run this script on all nodes of the cluster:

winrm set winrm/config/Service @{MaxConcurrentOperationsPerUser="400"}
winrm set winrm/config @{MaxTimeoutms="1800000"}
net stop vmmagent
net stop winrm
net start winrm
net start vmmagent

The script increases the number of concurrent WinRM operations per user to 400 (the default is 200) and sets the default timeout to 1,800,000 ms, which is 30 minutes (it should be plenty). Note that this is just the default timeout: operations may specify their own.

For the new values to take effect, the script then restarts the vmmagent and winrm services.
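Since you may end up adjusting these numbers later, it can be convenient to generate the command lines from the values you actually care about (operations per user, and a timeout in minutes) rather than editing milliseconds by hand. This is a minimal sketch, not part of the original script: the function name and its parameters are hypothetical, and it only builds the text of the commands without running anything.

```python
def winrm_tuning_commands(max_ops_per_user=400, timeout_minutes=30):
    """Build the command lines of the tuning script as a list of strings."""
    # WinRM expects the timeout in milliseconds: minutes * 60 s * 1000 ms.
    timeout_ms = timeout_minutes * 60 * 1000
    return [
        'winrm set winrm/config/Service '
        '@{MaxConcurrentOperationsPerUser="%d"}' % max_ops_per_user,
        'winrm set winrm/config @{MaxTimeoutms="%d"}' % timeout_ms,
        # Stop the VMM agent before WinRM, then start them in reverse order,
        # mirroring the sequence used in the script above.
        "net stop vmmagent",
        "net stop winrm",
        "net start winrm",
        "net start vmmagent",
    ]


if __name__ == "__main__":
    # Print the script so it can be saved to a file and reviewed
    # before being run on each node.
    for line in winrm_tuning_commands():
        print(line)
```

With the defaults it reproduces the values used in this post (400 operations and a 30 minute timeout). Passing, say, timeout_minutes=45 yields a timeout of 2,700,000 ms.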

In my experience, applying these fixes reduces the frequency and duration of such timeouts but does not eliminate them completely. You may have to tweak the numbers over time to improve the situation further.