How to use and troubleshoot the Auto-heal features in DPM 2010 - The Official System Center Data Protection Manager Team Blog - Site Home - TechNet Blogs

Hi everyone, Shane Brasher here. In my opinion, the biggest enhancement from DPM 2007 to DPM 2010 is the auto-healing capabilities that enable protected datasources to remain reliably consistent. System Center Data Protection Manager 2010’s auto-healing features include Auto-grow, Auto-CC, and Auto-rerun.

Auto-grow

Auto-grow is the capability to grow a volume once the volume threshold is reached.

Example: You have a datasource that is 10 GB in size and a replica volume with 15 GB allocated. The datasource grows to 16 GB. Auto-grow, if enabled, will increase the volume size to accommodate the new datasource size.

Enabling Auto-grow

Auto-grow can be enabled either during or after the creation of the protection group.

a.) During the creation of the protection group, select the Auto-grow option in the "Create New Protection Group" wizard.

[Screenshot: Auto-grow option in the "Create New Protection Group" wizard]

b.) If you did not enable Auto-grow during the creation of the protection group, then you can still enable it by selecting the datasource and choosing "modify disk allocation". From that screen, you can enable Auto-grow.

[Screenshot: Auto-grow option on the "Modify Disk Allocation" screen]

Auto-grow Key Points

1.) Auto-grow needs to be enabled for the datasource.
2.) DPM grows the replica or recovery point volume by 10 GB or by 25%, whichever is higher.
3.) To prevent race conditions, DPM does not grow a volume within 15 minutes of the last growth.
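
The sizing and cooldown rules above can be sketched in a few lines of Python. This is only an illustration, not DPM's actual code: the function and constant names are made up, and it assumes the 25% is taken from the current volume size, which the key points do not spell out.

```python
from datetime import datetime, timedelta
from typing import Optional

MIN_GROWTH_GB = 10.0                 # fixed floor from the key points
GROWTH_FRACTION = 0.25               # 25%, assumed to be of the current volume size
COOLDOWN = timedelta(minutes=15)     # no second grow within 15 minutes


def growth_increment_gb(current_size_gb: float) -> float:
    """Return the larger of 10 GB and 25% of the current volume size."""
    return max(MIN_GROWTH_GB, current_size_gb * GROWTH_FRACTION)


def may_grow(now: datetime, last_growth: Optional[datetime]) -> bool:
    """Allow a grow only if no growth happened in the last 15 minutes."""
    return last_growth is None or now - last_growth >= COOLDOWN
```

Under these assumptions, a 15 GB replica volume would grow by the 10 GB floor (25% is only 3.75 GB), while a 100 GB volume would grow by 25 GB.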

Instances where Auto-grow may not happen

1.) Auto-grow is not enabled
2.) Disk threshold is exceeded. For example, if you have 500GB for the storage pool and you have already used 499.9GB, then naturally you cannot grow beyond the space available.
3.) An attempt to Auto-grow is performed less than 15 minutes after the last attempt.
4.) DPM is close to reaching the LDM limit – see note below.

Note: The Logical Disk Manager (LDM) database is stored in a hidden partition and contains information about all dynamic disks and volumes installed on the computer. The LDM database can store only 2960 records; beyond that, volumes can no longer be created or extended. It is important to remember that the LDM limitation is a Windows limitation rather than a DPM limitation.

Auto-CC
Auto-CC, or auto consistency check, is an enhancement that triggers another consistency check attempt 15 minutes after a previously failed consistency check.

Key Points

1.) Auto-CC is a Protection Group property that needs to be enabled.
2.) Auto-CC triggers a consistency check 15 minutes after a replica invalid alert is raised.
3.) Auto-CC will be attempted only once by default. Note: This is configurable via the registry entry below.

Instances where Auto-CC may not happen

1.) The maximum number of Auto-CC attempts has been reached.
2.) Auto-CC is not enabled for the protection group (see screenshot below).
3.) Auto-CC will not be performed for application datasources if the job that marked the replica inconsistent was a CC.

[Screenshot: Auto-CC option for the protection group]
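
The three conditions above can be read as a simple gate. Here is a minimal Python sketch of that reading; the function and parameter names are hypothetical, and treating "maximum reached" as attempts made being equal to or greater than the limit is an assumption based on the trace examples later in this post.

```python
def should_auto_cc(enabled: bool,
                   attempts_made: int,
                   max_attempts: int = 1,
                   is_application_datasource: bool = False,
                   failed_job_was_cc: bool = False) -> bool:
    """Return True if an automatic consistency check would be attempted."""
    if not enabled:
        return False  # Auto-CC is off for the protection group
    if attempts_made >= max_attempts:
        return False  # maximum number of Auto-CC attempts reached
    if is_application_datasource and failed_job_was_cc:
        return False  # a CC job itself marked the replica inconsistent
    return True
```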

Auto-CC Registry Entry

HKLM\Software\Microsoft\Microsoft Data Protection Manager\Configuration

Value: AutoCCNumberOfAttempts
Type: DWORD
Settings Controlled: The number of times DPM will try to fix an inconsistent replica before giving up if it consistently fails.
Implications: Default is 1. Increasing may increase load on system.

Note: Altering AutoCCNumberOfAttempts affects consistency checks for all datasources. Setting it to a high value will increase the load on the DPM server.
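
One way to set a DWORD value like this is the built-in reg.exe tool. The Python sketch below only builds the command line and does not run it; the helper name reg_add_dword is hypothetical, and the key path is the one given above.

```python
from typing import List

DPM_CONFIG_KEY = (
    r"HKLM\Software\Microsoft\Microsoft Data Protection Manager\Configuration"
)


def reg_add_dword(value_name: str, data: int) -> List[str]:
    """Build a 'reg add' command that writes a DWORD under the DPM key."""
    if not 0 <= data <= 0xFFFFFFFF:
        raise ValueError("a DWORD must be between 0 and 4294967295")
    return [
        "reg", "add", DPM_CONFIG_KEY,
        "/v", value_name,      # value name
        "/t", "REG_DWORD",     # value type
        "/d", str(data),       # value data
        "/f",                  # overwrite without prompting
    ]
```

For example, reg_add_dword("AutoCCNumberOfAttempts", 2) returns an argument list that could be passed to subprocess.run from an elevated prompt on the DPM server.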

Auto-rerun
Auto-rerun is an enhancement in DPM 2010 that will automatically rerun jobs that fail.

Key Auto-rerun points

1.) Delay for Auto-rerun is 1 hour by default.
2.) Auto-rerun will not be attempted if another job is scheduled to run within 4 hours, or for any ad hoc jobs.
3.) The default number of retries is 1. Note that this can be changed via the registry.
4.) Auto-rerun will be triggered when the following alerts are raised:
a.) Consolidation of recovery points of the replica failed.
b.) Recovery point creation failed.
c.) Synchronization failures.
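
The rules in points 2 and 3 can be sketched as a small decision function. This is an illustration only: the names are hypothetical, and measuring the 4-hour window from the current time is an assumption that the key points do not confirm.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_auto_rerun(now: datetime,
                      next_scheduled_run: Optional[datetime],
                      is_adhoc: bool,
                      attempts_made: int,
                      max_attempts: int = 1,
                      lookahead: timedelta = timedelta(hours=4)) -> bool:
    """Return True if a failed job would be rerun automatically."""
    if is_adhoc:
        return False  # ad hoc jobs are never rerun
    if attempts_made >= max_attempts:
        return False  # retry limit reached
    if next_scheduled_run is not None and now <= next_scheduled_run <= now + lookahead:
        return False  # a scheduled job will run soon anyway
    return True
```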

Auto-rerun registry entries
HKLM\Software\Microsoft\Microsoft Data Protection Manager\Configuration

Value: AutoRerunNumberOfAttempts
Type: DWORD
Settings Controlled: The number of times a failed job will be retried before giving up if it consistently fails.
Implications: Default is 1. Increasing this value may increase the load on your system. Reruns are spaced AutoRerunDelay apart.

Note: Altering AutoRerunNumberOfAttempts affects all jobs for all datasources. Setting it to a high value will increase the load on the DPM server.

Value: AutoRerunDelay
Type: DWORD
Settings Controlled: The delay before DPM attempts to automatically rerun failed jobs.
Implications: This should be changed if your typical production server or network downtime exceeds the default value of 60 minutes. If multiple reruns are configured, this is also the gap between reruns. If set to zero, the job will rerun immediately.
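
To see how the two values interact, the Python sketch below lists when reruns would start after a failure. It is a simplification with hypothetical names: it assumes AutoRerunDelay is expressed in minutes (the post only gives the default as 60 mins) and it ignores how long each rerun takes before failing again.

```python
from datetime import datetime, timedelta
from typing import List


def rerun_times(failure_time: datetime,
                delay_minutes: int = 60,
                attempts: int = 1) -> List[datetime]:
    """Return the start time of each rerun, spaced delay_minutes apart."""
    return [
        failure_time + timedelta(minutes=delay_minutes * n)
        for n in range(1, attempts + 1)
    ]
```

With the defaults, a job that fails at 11:00 would be rerun once at 12:00. With two attempts configured, the reruns would fall at 12:00 and 13:00 under these assumptions.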

What triggers Auto-heal to take place?

Auto-heal is triggered when one of the following alerts is raised or updated:

a.) Auto-grow
Disk threshold exceeded
Recovery point volume threshold exceeded

b.) Auto-rerun
Consolidation of recovery points of the replica failed
Recovery point creation failed
Synchronization failures

c.) Auto-CC
Replica is invalid

Troubleshooting Auto-heal

The traces for Auto-heal will be present in: <DPMInstallPath>\DPM\Temp\DPMAccessManager*.*
The examples below show typical entries from the Auto-heal traces.
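
If you want to pull only the Auto-heal entries out of those files, a small script helps. The Python sketch below is one way to do it: the function name is hypothetical, the directory is passed in rather than guessed, and the file encoding is an assumption you may need to change for your logs.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Substrings that appear in the Auto-heal trace examples in this post.
MARKERS = ("AutoHeal:", "CorrectiveAction=")


def autoheal_lines(temp_dir: str,
                   encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (file name, line) for each trace line that mentions Auto-heal."""
    for path in sorted(Path(temp_dir).glob("DPMAccessManager*.*")):
        with path.open(encoding=encoding, errors="replace") as handle:
            for line in handle:
                if any(marker in line for marker in MARKERS):
                    yield path.name, line.rstrip("\r\n")
```

Point temp_dir at the Temp folder under your DPM install path and loop over the results to print or save them.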

Auto-grow trace examples
a.) Disk threshold reached

Received alert not present in cache; AlertId=[95b6c1f0-5c81-4a3a-84e5-7d75f3ca52ee], AlertType=[DiskThresholdCrossedAlert], Resolution=[InvisibleAndActive], CorrectiveAction=[AutoGrow], DatasourceId=[5c715a3a-d505-458f-833e-65ef87a0e6d6], DatasourceName=[DPMDB]

b.) Auto-grow failure due to not being enabled for the datasource example:

AutoHeal: AutoGrow not enabled for datasource. AlertId=[95b6c1f0-5c81-4a3a-84e5-7d75f3ca52ee]

c.) Auto-grow failure due to hitting the LDM limit example:

AutoHeal: LDM database within error threshold, skipping grow; AlertId=[95b6c1f0-5c81-4a3a-84e5-7d75f3ca52ee]

d.) Auto-grow failure due to trying to grow too soon after the previous Auto-grow example:

AutoHeal: A grow alert was recently resolved; AlertId=[95b6c1f0-5c81-4a3a-84e5-7d75f3ca52ee]

Auto-rerun trace examples

a.) Maximum retry has been reached example:

AutoHeal: Maximum number of attempts exceeded. Hence skipping action;JobId = [911e26a3-7c44-47e2-9212-e92a737f6dbd] RetryAttemptNumber = [2] Number of attempts allowed = [2] AlertId=[a377df73-5f63-43e5-9738-facb8ed26ccd]

b.) Failure due to a pending scheduled job example:

AutoHeal: Number of scheduled jobs found between 22-01-2010 11:24:43 and 22-01-2010 16:24:42 is 6 jobDefId 61225cc0-bd2e-4127-a1ec-533c78b49f7f
AutoHeal: There is a scheduled job up for execution shortly.Skipping action; AlertId=[e917c3e3-d61a-469a-b0e4-b684529a8df7], JobdefId=[61225cc0-bd2e-4127-a1ec-533c78b49f7f]

c.) No retry is attempted for ad hoc jobs example:

AutoHeal: GetRunTimes threw ScheduleNotFoundException, skipping autoheal for adhoc job; JobDefId=[a13d98a1-4543-4b36-8e10-ce524bd63c00]
AutoHeal: There is a scheduled job up for execution shortly.Skipping action; AlertId=[be668e55-fdf5-44b4-8973-9e4323e57f97], JobdefId=[a13d98a1-4543-4b36-8e10-ce524bd63c00]

Auto-CC trace examples

a.) DPM performs Auto-CC when the “replica inconsistent” alert is raised. The following is traced whenever this alert is received:

Received alert not present in cache; AlertId=[c40dad51-eb07-4ead-af4a-003c38d3f47], AlertType=[ShadowCopyFailedAlert], Resolution=[InvisibleAndActive], CorrectiveAction=[AutoCC], JobId=[f8d60eb4-5d3a-4b25-b0b3-92a2f107a7c7], TaskDefId=[a8c60b1d-f510-4b3c-9220-f4dd13671bbe], DatasourceId=[18eb0c09-1437-4b7d-8e08-32cdba7c3565], DatasourceName=[SharepointSrv-\server1_srv_01_SharePoint_Config]

b.) Failure because the maximum number of CC attempts has been reached:

AutoHeal: Maximum number of attempts exceeded. Hence skipping action;JobId = [663de05f-32c5-4123-94c3-b90fba738755] RetryAttemptNumber = [1] Number of attempts allowed = [1] AlertId=[91a56814-f45e-45d0-8df9-da2013824036]

c.) Failure due to Auto-CC not being enabled example:

AutoHeal: AutoCC is not enabled; AlertId=[2097a57f-d875-4354-ab4e-da8ce941995], PgId=[2fb98abf-c81a-4af8-8805-7bf7ca225475]

d.) For application datasources, Auto-CC will not be performed if the CC job itself marked the replica inconsistent:

AutoHeal: Not triggering CC for Apps because a CC task only marked replica inconsistent; AlertId=[e56a67ac-c93a-42d0-b900-e2f8a525bfed]

e.) When DPM decides to perform a CC and starts the timer, one of the following traces can be found:

AutoHeal: Adding wait for EndOfTask action; TaskDefId=[8001702e-9c05-485b-92fa-e574f76e3c73], Wait=[15 mins]

AutoHeal: Adding wait before EndOfJob action; JobId=[f8d60eb4-5d3a-4b25-b0b3-92a2f107a7c7], Wait=[60 mins]
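
Most of the trace entries above carry their details as Name=[value] pairs, so they are easy to pick apart with a regular expression. The Python sketch below is one way to do that. The function name is hypothetical, and it assumes each trace entry sits on a single line in the actual log file.

```python
import re
from typing import Dict

# Matches pairs such as AlertId=[...] or JobId = [...].
FIELD_PATTERN = re.compile(r"(\w+)\s*=\s*\[([^\]]*)\]")


def parse_trace_fields(line: str) -> Dict[str, str]:
    """Return the Name=[value] pairs found in one trace line."""
    # Multi-word labels are keyed by their last word only, so
    # "Number of attempts allowed = [1]" is stored under "allowed".
    return {name: value.strip() for name, value in FIELD_PATTERN.findall(line)}
```

Running it on the "Maximum number of attempts exceeded" entry gives you the JobId, RetryAttemptNumber and AlertId as dictionary keys, which makes it easier to match a trace line to the alert shown in the DPM console.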

Additional Resources:

How Dynamic Disks and Volumes Work: http://technet.microsoft.com/en-us/library/cc758035(WS.10).aspx

About DPM 2010 Scalability and New Features: http://blogs.technet.com/b/dpm/archive/2010/02/26/about-dpm2010-scalability-and-new-features.aspx

How to Modify Disk Allocation: http://technet.microsoft.com/en-us/library/ff399705.aspx

Shane Brasher | Senior Support Escalation Engineer

The App-V Team blog: http://blogs.technet.com/appv/
The WSUS Support Team blog: http://blogs.technet.com/sus/
The SCMDM Support Team blog: http://blogs.technet.com/mdm/
The ConfigMgr Support Team blog: http://blogs.technet.com/configurationmgr/
The SCOM 2007 Support Team blog: http://blogs.technet.com/operationsmgr/
The SCVMM Team blog: http://blogs.technet.com/scvmm/
The MED-V Team blog: http://blogs.technet.com/medv/
The DPM Team blog: http://blogs.technet.com/dpm/
The OOB Support Team blog: http://blogs.technet.com/oob/
The Opalis Team blog: http://blogs.technet.com/opalis
The Service Manager Team blog: http://blogs.technet.com/b/servicemanager
The AVIcode Team blog: http://blogs.technet.com/b/avicode
The System Center Essentials Team blog: http://blogs.technet.com/b/systemcenteressentials
The Server App-V Team blog: http://blogs.technet.com/b/serverappv

Comments
  • Can you tell us what the DiskThresholdCrossedAlert value is, and separately whether it is changeable? We understand the 10 GB/25% can't be changed, but we were wondering if we could change the threshold, since SCOM alerts and pages on low disk space for volumes that DPM is going to take care of. That is, if we know what the value is, even if we can't change it, we can tweak SCOM appropriately.

  • Great Info.