Service Management Automation: Checkpoint, Suspend, and Resume Runbooks

One of the key features of Windows PowerShell Workflow is support for checkpointing: the ability to persist the state of a workflow so that if the workflow is interrupted, whether intentionally or by an error or crash, it can later be resumed at or near the point of interruption. Service Management Automation (SMA) uses PowerShell Workflow as the engine for running runbooks, so checkpointing is a powerful feature that you will want to use in your SMA runbooks. Thoughtful use of checkpointing lets you create runbooks that dependably automate long-running IT processes, reliably access many different networked systems, avoid repeating actions that are not idempotent or are expensive to repeat, and can be intentionally interrupted to include manual steps.

In this blog post, I will talk about why, when, and how you should use checkpointing in your SMA runbooks. The existing documentation on checkpointing in PowerShell Workflow is useful background; brushing up on it will help with your understanding.

What is a Checkpoint?

A checkpoint is a snapshot of the current state of a runbook job, including the current values of variables, any output, and other serializable state information. Each checkpoint gets saved to storage. If a runbook is suspended, either intentionally or unintentionally, and then resumed, the workflow engine uses the data in the latest checkpoint to restore and resume the runbook.

Checkpointing in SMA

In SMA, when you persist a runbook job, a checkpoint is created and stored in the SMA database. Only the latest checkpoint for each job is kept: each new checkpoint replaces the previous one. If the runbook is suspended and then resumed, the stored checkpoint is used to restore and resume the runbook.

Unlike PowerShell Workflow, which stores checkpoints on the hard drive of the machine hosting the workflow session, SMA stores checkpoints in the SMA database. If you deploy the SMA database and runbook workers on separate machines and the worker running your runbook crashes, the same worker (once restarted) or another worker can pick up the job and use the last checkpoint in the database to resume it.

Why Checkpoint?

Here are some reasons to use checkpointing in your runbooks.

  • Ensure that certain actions are not repeated
    • Checkpointing is useful for guaranteeing that non-repeatable (non-idempotent) actions are not repeated if a runbook suspends and then resumes. One example is to checkpoint a runbook right after creating a VM so that a duplicate VM is not created if the runbook job is suspended and then resumed.
  • Protect long-running tasks
    • In the real world, errors happen. Long-running tasks with multiple steps are vulnerable to interruption due to network issues, machine reboots or crashes, timeouts, power outages, etc. To avoid redoing expensive work, checkpoint the runbook at critical points so that a resumed runbook does not redo that work.
  • Allow planned or manual interruptions
    • There are scenarios where you may want to intentionally suspend a running runbook. Examples include suspending a runbook job in order to wait for approval to continue, or suspending a runbook job to wait for fixes to unexpected or planned system issues.

How to Add Checkpoints to a Runbook

Checkpoint-Workflow Activity

The Checkpoint-Workflow activity (alias Persist) is a standard PowerShell workflow activity and can be used in a runbook to create a checkpoint at a particular point. The checkpoint is made at the point in the runbook where the Checkpoint-Workflow activity occurs.


Download-Updates
Reboot-VM
Checkpoint-Workflow
Email-Team
Checkpoint-Workflow

-PSPersist Activity Common Parameter

Whenever you call an activity you can include the -PSPersist common parameter. This forces the creation of a checkpoint immediately after the activity completes.


Download-Updates
Reboot-VM -PSPersist $True
Email-Team -PSPersist $True

$PSPersistPreference Workflow Preference Variable

In a runbook, you can include the statement $PSPersistPreference = $True. The effect of this is to cause a checkpoint to be taken after each activity which follows the preference statement. If you set this preference at the start of the runbook, then a checkpoint will be made after each activity in the runbook. You can turn off the automatic checkpointing by including the statement $PSPersistPreference = $False (which is the runbook default), after which activities will run without automatic checkpoints.

Note that, for both performance and strategic reasons, persisting after each activity may not be the best approach. Each checkpoint requires processing to serialize the workflow state and store it in the database. Also, there are scenarios (one is described later) where, if the runbook is suspended, you will want several activities to be repeated together when it resumes.


$PSPersistPreference = $True
Download-Updates
Update-VM
Email-Team
$PSPersistPreference = $False

Suspend-Workflow Activity

When a runbook reaches the Suspend-Workflow activity, the runbook is checkpointed immediately and then suspended. You would use this activity, for example, when you need the runbook to do some work and then wait for approval before continuing.


Download-Updates
# Get permission to apply updates
Suspend-Workflow
# Continue if resumed
Reboot-VM -PSPersist $True
Email-Team -PSPersist $True

Where to Add Checkpoints

In general, it is best to be explicit about where you want to persist your workflow. Rather than setting the $PSPersistPreference variable to get blanket checkpointing after each activity, it is typically better to be strategic and use the Checkpoint-Workflow or Suspend-Workflow activities or the -PSPersist parameter in those places in your workflow where persistence makes the most sense. There are places where you definitely want to persist a workflow, and there are places where you definitely do not (examples below). Also, keep in mind that persisting a workflow requires work from the system and will have some effect on workflow performance.

Best Practice: You will want to add checkpoints in your workflow in these cases:

  • After any activity that you do not want to repeat (not idempotent).
  • Before any activity that has a higher than normal probability of issues that could lead to failure and workflow suspension. You want the activity to be repeated when the workflow resumes to ensure that its work gets done. Examples include activities that access remote systems and may be susceptible to network issues.
  • After any long-running or expensive activity that you would not want to repeat due to cost.

Illustrative Scenario: Update VM

  1. Download the latest patches from Windows Update
  2. Restart the VM to apply the patches
    • Checkpoint
  3. Email the team to report that updates were applied
    • Checkpoint

In this scenario, it is OK to repeat step 1 (it is idempotent), but not steps 2 or 3. Thus, checkpoints are needed after steps 2 and 3. Automatically persisting after each activity would also work; however, the extra checkpoint after step 1 would add unnecessary work to the system.

Illustrative Scenario: Notify Customers

  1. Get list of customers from database
  2. Email customers about new policy
    • Checkpoint
  3. Email management that customer email went out
    • Checkpoint

Sometimes you have groups of activities that you don’t want to repeat, but only if all activities in that group succeed. In this scenario, steps 1 and 2 should always be run together, to ensure that the list of customers retrieved is up to date when the email goes out. Thus, if the runbook worker crashes before step 2 (sending the customer emails), we want the runbook job to start from step 1 again (retrieving the customer list) when it resumes. However, if there is a crash or suspension just before step 3, then we want to ensure that step 2 is not repeated (we don’t want to email the customers again).
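As a rough sketch, the checkpoint placement for this scenario might look like the runbook below. Get-CustomerList, Send-CustomerEmail, and Send-ManagementEmail are hypothetical activities (for example, child runbooks you would author yourself), and the -Customers parameter is likewise an assumption; only Checkpoint-Workflow is a standard workflow activity.

```powershell
workflow Notify-Customers
{
    # Step 1 (hypothetical activity): retrieve the current customer list.
    # There is deliberately no checkpoint after this step, so an interruption
    # before step 2 finishes causes the list to be retrieved again on resume.
    $customers = Get-CustomerList

    # Step 2 (hypothetical activity): email customers about the new policy.
    Send-CustomerEmail -Customers $customers

    # Checkpoint after step 2 so the customers are not emailed a second time.
    Checkpoint-Workflow

    # Step 3 (hypothetical activity): tell management that the email went out.
    Send-ManagementEmail

    # Checkpoint after step 3 so management is not emailed a second time.
    Checkpoint-Workflow
}
```

Because no checkpoint separates steps 1 and 2, they behave as a single unit on resume: either both have completed and been checkpointed, or both run again.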

Best Practice: It is important to remember that you cannot add checkpoints within InlineScript blocks or functions in a workflow. This is because the code in InlineScript blocks and functions runs as pure PowerShell script and not as workflow. Thus, in order to take advantage of workflow persistence, as a best practice you should split your workflow code into multiple modular activities to allow you to add checkpoints between activities, or if you need InlineScript then use multiple InlineScript blocks to allow checkpointing between them.
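A minimal sketch of the "multiple InlineScript blocks" approach is shown below. The workflow name and the work done inside each block are illustrative assumptions; the point is only that the checkpoint sits between the two InlineScript blocks rather than inside either one.

```powershell
workflow Split-InlineScriptSketch
{
    # First block: runs as plain PowerShell, so no checkpoint can be taken inside it.
    # The result is returned to the workflow and stored in a workflow variable.
    $topProcesses = InlineScript {
        Get-Process |
            Sort-Object -Property CPU -Descending |
            Select-Object -First 5 -ExpandProperty Name
    }

    # Checkpoint between the blocks: $topProcesses is saved with the workflow state,
    # so a resume from here does not re-run the first block.
    Checkpoint-Workflow

    # Second block: $Using: passes the workflow variable into the InlineScript scope.
    InlineScript {
        "Top processes: " + ($Using:topProcesses -join ", ")
    }
}
```

If the job is interrupted during the second block, it resumes after the checkpoint and only the second block is repeated.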

Suspending and Resuming Runbooks

Checkpoints and suspending/resuming runbooks go hand in hand. You add checkpoints to a runbook so that if the runbook is suspended the runbook can be resumed from the latest checkpoint.

A runbook job in SMA can be suspended in several ways:

  1. Intentionally in the SMA portal UI
    • Using the SMA portal UI you can select to suspend a running runbook job.
    • The job will be suspended at the next checkpoint. If you have not authored any checkpoints into the runbook, then the runbook will continue running to the end, all the while showing a status of “Suspending”.
  2. Intentionally within a runbook using Suspend-Workflow
    • Include the Suspend-Workflow activity in a runbook.
    • The job will be checkpointed and then suspended at the place where Suspend-Workflow is called.
  3. Intentionally using the Suspend-SmaJob cmdlet
    • From a PowerShell script you can use the Suspend-SmaJob cmdlet to suspend a running SMA runbook job.
    • The job will be suspended at the next checkpoint. If you have not authored any checkpoints into the runbook, then the runbook will continue running to the end, all the while showing a status of “Suspending”.
  4. Unintentionally by the SMA workflow engine after a runbook exception
    • When a running job throws an exception, it is unloaded from the runbook worker and its status is set to “Suspended”.
  5. Unintentionally due to a runbook worker crash
    • If a runbook worker crashes, the jobs that are running on that worker will terminate immediately. The state of these jobs in the database will remain as “Running”.
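For the cmdlet-based option in the list above, a call might look roughly like the following. The endpoint URL and job ID are placeholders you would replace with your own values, and the JobStatus property name is my assumption about the object returned by Get-SmaJob, so verify it in your environment.

```powershell
# Placeholder values: substitute your own SMA web service endpoint and job ID.
$webServiceEndpoint = "https://sma.contoso.example"   # hypothetical endpoint
$jobId = "<job GUID>"                                  # ID of a running job

# Request suspension. The job actually suspends at its next checkpoint.
Suspend-SmaJob -Id $jobId -WebServiceEndpoint $webServiceEndpoint

# Inspect the job afterwards (JobStatus is an assumed property name).
$job = Get-SmaJob -Id $jobId -WebServiceEndpoint $webServiceEndpoint
$job.JobStatus
```

If the runbook contains no checkpoints, expect the status to stay at “Suspending” until the runbook finishes, as described above.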

A runbook job in SMA can be resumed in several ways. In all cases, the job will resume from the last checkpoint, or from the beginning if there is no checkpoint.

  1. Manually in the SMA portal UI
    • Using the SMA portal UI you can select to resume a suspended job.
  2. Using the Resume-SmaJob cmdlet
    • From a PowerShell script you can use the Resume-SmaJob cmdlet to resume a suspended job.
  3. Automatically following a runbook worker crash
    • When the worker comes back online, or when another worker is assigned, the worker looks for jobs in the database that are assigned to it. It automatically resumes any jobs that have a state of “Running” but are not yet running on the worker.
    • If a runbook worker crashes and is not recoverable, you can have another worker pick up its jobs by using the New-SmaRunbookWorkerDeployment cmdlet.
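As a sketch of the cmdlet-based resume path, the script below starts a runbook, and later resumes the job only if it is currently suspended (for example, because the runbook called Suspend-Workflow to wait for approval). The endpoint URL and the runbook name "Update-VM" are hypothetical, and the JobStatus property name is an assumption to check against your SMA installation.

```powershell
# Placeholder values: substitute your own endpoint and runbook name.
$webServiceEndpoint = "https://sma.contoso.example"   # hypothetical endpoint
$runbookName = "Update-VM"                             # hypothetical runbook

# Start-SmaRunbook returns the ID of the job it creates.
$jobId = Start-SmaRunbook -Name $runbookName -WebServiceEndpoint $webServiceEndpoint

# ... later, once approval has been given ...
$job = Get-SmaJob -Id $jobId -WebServiceEndpoint $webServiceEndpoint

# Resume only if the job is suspended (JobStatus is an assumed property name).
if ($job.JobStatus -eq "Suspended")
{
    # The job continues from its last checkpoint, or from the beginning if it has none.
    Resume-SmaJob -Id $jobId -WebServiceEndpoint $webServiceEndpoint
}
```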

Summary

As you can see, adding checkpoints to your runbooks is important if you want to take advantage of this key feature of PowerShell Workflow and create interruption-resilient runbooks. Adding checkpoints is easy. With a little forethought during runbook authoring you can protect your long-running and expensive tasks from unexpected interruption and truly create robust, reliable runbooks.
