This post covers Windows Server 2008 servers with the Hyper-V role installed that are protected by System Center Data Protection Manager 2007.  There may be one or many virtual machines on each Host/Parent partition, and they may be running Windows 2003 and/or Windows 2008.  If the DPM agent is installed only on the Host/Parent partition of the Hyper-V server, you may find that DPM jobs fail intermittently on the 2003 VMs while jobs on the 2008 VMs complete successfully.  The following error may be encountered:

Type: Recovery point
Status: Failed
Description: DPM encountered a retryable VSS error. (ID 30112 Details:
Unknown error (0x800423f3) (0x800423F3))
End time: 4/23/2009 3:37:22 PM
Start time: 4/23/2009 3:36:38 PM
Time elapsed: 00:00:44
Data transferred: 0 MB
Cluster node -
Recovery Point Type Express Full
Source details: \Backup Using Child Partition Snapshot\%ServerName%
Protection group: %ProtectionGroupName%
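
The hexadecimal code in the job details above can be matched against the VSS writer error values. The following Python sketch is only an illustration and is not part of DPM: the function name is hypothetical, and the small lookup table is an assumption taken from the Windows VSS SDK headers rather than from this post.

```python
import re

# Assumption: a small subset of the VSS writer-error HRESULTs from the
# Windows VSS SDK headers. These names do not appear in the DPM job details.
VSS_WRITER_ERRORS = {
    0x800423F2: "VSS_E_WRITERERROR_TIMEOUT",
    0x800423F3: "VSS_E_WRITERERROR_RETRYABLE",
    0x800423F4: "VSS_E_WRITERERROR_NONRETRYABLE",
}

def decode_vss_codes(description):
    """Return the distinct 8-digit hex codes in a job description, with names when known."""
    seen = []
    for match in re.findall(r"0x[0-9a-fA-F]{8}", description):
        code = int(match, 16)
        if code not in seen:
            seen.append(code)
    return [("0x%08X" % code, VSS_WRITER_ERRORS.get(code, "unknown")) for code in seen]

# Description text copied from the failed job shown above.
sample = ("DPM encountered a retryable VSS error. (ID 30112 Details: "
          "Unknown error (0x800423f3) (0x800423F3))")
print(decode_vss_codes(sample))
```

For the description above, the code maps to the retryable writer error, which matches the wording DPM itself uses.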

We found that these jobs fail when the Volume Shadow Copy Service (VSS service) on the guest VM is in a “Stopping” state; the only way to return the service to a good condition is to kill the process or reboot the VM.  If the VSS service is in this “Stopping” state, the next DPM job will fail.  If you first verify that the VSS service is in a correct state (running or stopped), the DPM job will run successfully.  However, once the DPM job is done you may again see the VSS service stuck in the “Stopping” state.  The service should stop automatically after 3 minutes of idle time, but intermittently it does not.  We saw this behavior across several Hosts and almost all VMs in one particular environment.  The behavior is random, but a few VMs experience the problem more frequently than others.  We also noticed that a rebooted VM will likely work without issues for a few days before the problem recurs.
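
One way to check the service state before a job runs is to read the output of the built-in "sc query VSS" command inside the guest. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the sample output is illustrative rather than captured from a real system, and it assumes that the “Stopping” state shown in the Services console is reported by sc as STOP_PENDING.

```python
import re
import subprocess

def parse_sc_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOPPED, STOP_PENDING) from `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None

def vss_ready_for_backup(sc_output):
    """True only when the service is running or stopped; any pending state is 'not ready'."""
    return parse_sc_state(sc_output) in {"RUNNING", "STOPPED"}

def query_vss_service():
    """Run `sc query VSS` in the guest (Windows only; defined here but not called)."""
    return subprocess.run(["sc", "query", "VSS"], capture_output=True, text=True).stdout

# Illustrative output only, not captured from a real system.
stuck = """SERVICE_NAME: VSS
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 3  STOP_PENDING
        WIN32_EXIT_CODE    : 0  (0x0)
"""
print(parse_sc_state(stuck), vss_ready_for_backup(stuck))
```

A check like this only reports the state; it does not clear a hung service, which still requires killing the process or rebooting the VM as described above.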


When running the “vssadmin list writers” command on the host (see http://technet.microsoft.com/en-us/library/cc754968(WS.10).aspx), the “Microsoft Hyper-V VSS Writer” appeared in a “Failed” state with a “Retryable error” last error whenever the job failed.  Ordinarily the writer shows a “Stable” state and “No error”, as follows.

[Image: vssadmin output showing the Microsoft Hyper-V VSS Writer in a “Stable” state with “No error”]

When the jobs fail, the above command will return:

Writer name: 'Microsoft Hyper-V VSS Writer'
Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
Writer Instance Id: {59f449f9-2413-494d-b679-965bc56129fd}
State: [8] Failed
Last error: Retryable error
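
To spot a failed writer without reading the full listing by eye, the text of “vssadmin list writers” can be parsed. This is a rough Python sketch, not a Microsoft tool: the function names are hypothetical, and it assumes each writer's fields are printed as "Key: value" lines after the "Writer name:" line, as in the output above.

```python
import re

def parse_writers(vssadmin_output):
    """Parse `vssadmin list writers` text into a list of dicts, one per writer."""
    writers = []
    for line in vssadmin_output.splitlines():
        line = line.strip()
        name = re.match(r"Writer name: '(.*)'", line)
        if name:
            writers.append({"name": name.group(1)})
        elif writers and ":" in line:
            # Field lines such as "State: [8] Failed" belong to the latest writer.
            key, _, value = line.partition(":")
            writers[-1][key.strip().lower()] = value.strip()
    return writers

def unhealthy_writers(writers):
    """Return writers whose state is not Stable or whose last error is not 'No error'."""
    return [w for w in writers
            if "Stable" not in w.get("state", "") or w.get("last error") != "No error"]

# Writer entry copied from the failed case shown above.
sample = """Writer name: 'Microsoft Hyper-V VSS Writer'
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   Writer Instance Id: {59f449f9-2413-494d-b679-965bc56129fd}
   State: [8] Failed
   Last error: Retryable error
"""
for w in unhealthy_writers(parse_writers(sample)):
    print(w["name"], "|", w["state"], "|", w["last error"])
```

If the command prints only the “Waiting for responses” message and no writers, the parsed list is empty, which is itself a sign that something is wrong.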

After installing Service Pack 2 for Windows 2008, or hotfix KB967560 (see the Resolution section below), running another DPM job on a different VM whose VSS service is not in the “Stopping” state will succeed and place the Hyper-V VSS Writer back into a “Stable” state.

Quick reference for different possible scenarios:


VM in good state + Host in good state = Good backup
VM in bad state + Host in good state = Failed backup
VM in bad state + Host in bad state = Failed backup
VM in good state + Host in bad state = Good backup

Another possible symptom: while a DPM job is running, the Hyper-V Management screen may display the message “Creating VSS Snapshot Set…” next to the VM.  The message continues to be displayed, and the Volume Shadow Copy Service inside the W2K3 VM may show as stopped.  Additionally, running “vssadmin list writers” on the Host displays the following message but lists no writers:

“Waiting for responses. These may be delayed if a shadow copy is being prepared.”

When this condition occurs you may not be able to cancel the DPM job.  If you try to cancel it, you may see the following in the details pane:


Type: Recovery point
Status: Attempting to cancel
End time: -
Start time: 4/28/2009 10:00:21 AM
Time elapsed: 01:01:22
Data transferred: -
Cluster node -
Recovery Point Type Express Full
Source details: \Backup Using Child Partition Snapshot\%Servername%
Protection group: %ProtectionGroupName%


The System event log on the W2K3 VM may be clean for the times when the DPM jobs failed, but the Application event log may be filled with the VSS errors below:


Event Type: Error
Event Source: VSS
Event Category: None
Event ID: 8193
Date: 2/24/2009
Time: 8:46:34 AM
User: N/A
Computer: %SystemName%
Description:
Volume Shadow Copy Service error: Unexpected error calling routine IEventSystem::Store. hr = 0x80040206.


Event Type: Error
Event Source: VSS
Event Category: None
Event ID: 12302
Date: 2/24/2009
Time: 5:58:49 AM
User: N/A
Computer: %SystemName%
Description:
Volume Shadow Copy Service error: An internal inconsistency was detected in trying to contact shadow copy service writers. Please check to see that the Event Service and Volume Shadow Copy Service are operating properly.


On the DPM server, neither the System nor the Application event log has any entries for the times of the job failures, but the DPM event log may have the following entry:


Event Type: Error
Event Source: DPM-EM
Event Category: None
Event ID: 2
Date: 2/19/2009
Time: 9:20:20 AM
User: N/A
Computer: %DPMServername%
Description:
Creation of recovery points for Backup Using Child Partition Snapshot\%ProtectedServerName-VM% on %HOSTName% have failed. The last recovery point creation failed for the following reason: (ID: 3159) DPM encountered a retryable VSS error. (ID: 30112)
DPM ID: 2^|^%DPMServername%^|^Recovery point creation failures^|^DPM^|^Backup^|^%HOSTName% ^|^a48c6c91-f4ae-4ed3-b5da-a3c22d980a48

 

RESOLUTION

The Hyper-V issue seems to be the result of the underlying state of VSS.  VSS is hung in the "stopping" state because the registry writer is hung attempting to unregister a COM+ event subscription.  This subscription listens for COM messages from other VSS components.  Analysis of the logs captured during the problem showed that the unsubscribe function had been waiting eight minutes when the trace ended, and it still had not completed.

The machine may be having COM issues. The VSS service will not process subsequent jobs successfully until this unsubscribe completes.  If you experience any of the symptoms mentioned above, you should perform all of the action items noted below.


Action Item #1:


Verify that all the prerequisites for protecting Hyper-V with DPM are met:

Prerequisites and Known Issues with Hyper-V Protection
http://technet.microsoft.com/en-us/library/dd347840.aspx

Action Item #2:


Online backups are possible only if all of the following conditions are met.  Verify that all the W2K3 VMs meet these requirements:

1. Hyper-V Integration Components are installed and are the latest version.

NOTE: On the Host/Parent partition, check that VMMS.exe is version 6.0.6001.22352 (or newer); in the guest, check that vmbus.sys is version 6.0.6001.22334 (or newer).

2. No Dynamic disks inside the guest.

3. All volumes are NTFS

4. All NTFS volumes must be >1GB and have >300MB free space.

5. Shadow copy storage for each volume within the VM is on that same volume, or shadow copies are disabled.

6. VM is in a running state.

NOTE: Offline Backups of Windows 2000 Guest VMs fail. Cause: A synthetic SCSI Controller was configured for the VM with no drives attached. Windows 2000 Guests do not support the SCSI Controller, so it is not needed.
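
The version and volume requirements in the list above lend themselves to a quick scripted comparison. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, the sample values are made up, and it assumes binary units (1 GB = 1024^3 bytes). Gathering the real file versions and volume details from each VM is left out.

```python
MIN_VMMS = "6.0.6001.22352"    # host VMMS.exe minimum, from the note above
MIN_VMBUS = "6.0.6001.22334"   # guest vmbus.sys minimum, from the note above
GB = 1024 ** 3                 # assumption: binary units
MB = 1024 ** 2

def version_tuple(version):
    """Turn '6.0.6001.22334' into (6, 0, 6001, 22334) so parts compare numerically."""
    return tuple(int(part) for part in version.split("."))

def version_ok(found, minimum):
    """True when the found file version is the minimum version or newer."""
    return version_tuple(found) >= version_tuple(minimum)

def volume_problems(volume):
    """volume is a hypothetical dict: name, filesystem, dynamic, size_bytes, free_bytes."""
    problems = []
    if volume["dynamic"]:
        problems.append("dynamic disk")
    if volume["filesystem"].upper() != "NTFS":
        problems.append("not NTFS")
    if volume["size_bytes"] <= 1 * GB:
        problems.append("volume is not larger than 1 GB")
    if volume["free_bytes"] <= 300 * MB:
        problems.append("free space is not above 300 MB")
    return problems

# Made-up example values for one guest.
c_drive = {"name": "C:", "filesystem": "NTFS", "dynamic": False,
           "size_bytes": 20 * GB, "free_bytes": 250 * MB}
print(version_ok("6.0.6001.18000", MIN_VMBUS))
print(volume_problems(c_drive))
```

Comparing versions as integer tuples avoids the string-comparison trap where "6.0.6001.9999" would sort after "6.0.6001.22334".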

Action Item #3:


The root cause of the symptoms noted above appears to be COM related. After verifying the action items above, install the following COM updates:


KB934016 "Availability of Windows Server 2003 Post-Service Pack 2 COM+ 1.5 Hotfix Rollup Package 12"

http://support.microsoft.com/default.aspx?scid=kb;EN-US;934016

KB965230 "FIX: The COM+ Event System does not deliver timely or reliable statistics to subscribers of the IComTrackingInfoEvents event interface in Windows Server 2003"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;965230

KB968447 "The COM+ Event System stops processing the query for matching subscriptions when it detects a corrupted subscription on a Windows Server 2003-based computer"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;968447

Action Item #4:

Install the following two W2K3 VSS updates on the W2K3 virtual machines:


KB940349 “Availability of a Volume Shadow Copy Service (VSS) update rollup package for Windows Server 2003 to resolve some VSS snapshot issues”

http://support.microsoft.com/default.aspx?scid=kb;EN-US;940349


KB969219 “RPC 0x800706ba and 0x800706bf errors occur when backup software tries to create VSS shadow copies on a computer that is running Windows Server 2003 SP2”
http://support.microsoft.com/default.aspx?scid=kb;EN-US;969219


Install the latest VSS/Volsnap update on the W2K3 VMs. If the Host is also running W2K3, it is a good idea to install the following as well:
KB967551 “Rollup update for the volsnap.sys driver in Windows Server 2003”
http://support.microsoft.com/default.aspx?scid=kb;EN-US;967551
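
To see which of the guest updates from Action Items #3 and #4 are still missing, a hotfix listing from each W2K3 VM can be compared against the KB numbers above. This Python sketch is a hedged illustration: the function names are hypothetical, the sample listing is made up, and it assumes each update is reported under the same KB number as its article (for example in the output of "wmic qfe get HotFixID"), which may not hold for every package.

```python
import re

# KB numbers named in Action Items #3 and #4 for the Windows Server 2003 guests.
GUEST_REQUIRED = ["KB934016", "KB965230", "KB968447", "KB940349", "KB969219"]

def installed_kbs(hotfix_listing):
    """Collect every KB number that appears anywhere in a hotfix listing."""
    return {kb.upper() for kb in re.findall(r"KB\d{6,7}", hotfix_listing, re.IGNORECASE)}

def missing_kbs(hotfix_listing, required=GUEST_REQUIRED):
    """Return the required KB numbers that the listing does not mention, in order."""
    present = installed_kbs(hotfix_listing)
    return [kb for kb in required if kb not in present]

# Made-up listing for illustration only.
listing = """HotFixID
KB934016
KB940349
KB968447
"""
print(missing_kbs(listing))
```

An empty result only means the KB numbers were found in the listing; it does not confirm that the updated binaries are actually in use, which typically requires a reboot.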


Action Item #5:


If possible, install W2K8 SP2, since it includes the most recent Hyper-V writer updates. Where installing SP2 is not an option, you can instead install KB967560 and KB971394 on the Windows Server 2008 Host machine.

The KB967560 update is more recent than KB959978, which addresses a known issue when you run a Windows Server 2003-based virtual machine on a Windows Server 2008 Hyper-V-based computer:


KB967560 “A backup operation fails on a two-node failover cluster that is running Windows Server 2008 after one of the disk resources is moved”

http://support.microsoft.com/default.aspx?scid=kb;EN-US;967560

KB971394 "A backup of virtual machines fails when you use the Hyper-V VSS writer to back up virtual machines concurrently on a computer that is running Windows Server 2008"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;971394


How to obtain the latest service pack for Windows Server 2008
http://support.microsoft.com/kb/968849


ADDITIONAL INFORMATION:


Virtualization with Hyper-V: Supported Guest Operating Systems
http://www.microsoft.com/windowsserver2008/en/us/hyperv-supported-guest-os.aspx

Author:
Tom O’Malley
Microsoft Enterprise Support
Sr. Support Escalation Engineer