Tim McMichael

Navigating the world of high availability...and occasionally sticking my head in the cloud...

Exchange and VSS -- My Exchange writer is in a failed retryable state…

In Exchange 2007 and Exchange 2010, many customers use VSS-based backups to retain and protect their Exchange data.  By default, Exchange provides two VSS writers that share the same VSS writer ID but are loaded by two different services.  The first is the Exchange Information Store VSS writer and the second is the Exchange Replication Service VSS writer.  The Information Store writer allows for the backup of active (mounted) databases, and the Replication Service writer allows for the backup of passive databases (when a replicated database model is used).  You can see the writers by running the command VSSADMIN LIST WRITERS from a command prompt.

 

Here is sample output of VSSAdmin List Writers from a Windows 2008 R2 SP1 server with Exchange 2010 SP1.  Note how both writers share the same writer ID within the VSS framework.

 

Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

 

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error
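
If you would rather check writer state from a script than by eye, output in this format is straightforward to parse.  What follows is a minimal Python sketch, not part of any Microsoft tooling.  The function names `parse_vss_writers` and `state_code` and the dictionary keys are my own illustrative choices, and the sketch assumes the English-language output layout shown above.

```python
import re

# Sample text in the layout printed by "vssadmin list writers" (copied from this post).
SAMPLE = """\
Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error
"""


def parse_vss_writers(text):
    """Parse writer blocks into a list of dicts (hypothetical helper)."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        name = re.match(r"Writer name: '(.*)'$", line)
        if name:
            # A "Writer name" line starts a new writer block.
            current = {"name": name.group(1)}
            writers.append(current)
        elif current is not None and ":" in line:
            # Remaining lines are "Key: value" pairs; keys are lower-cased.
            key, _, value = line.partition(":")
            current[key.strip().lower()] = value.strip()
    return writers


def state_code(writer):
    """Return the numeric code from a state value such as '[1] Stable'."""
    match = re.match(r"\[(\d+)\]", writer["state"])
    return int(match.group(1)) if match else None


writers = parse_vss_writers(SAMPLE)
for w in writers:
    print(w["name"], "|", w["state"], "|", w["last error"])
```

On a live server the text would come from running the command itself rather than from a pasted sample.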

In the course of protecting Exchange servers, there may be conditions that cause a backup job to fail.  When an Exchange backup job fails, the VSS framework aborts the backup and Exchange subsequently clears the backup-in-progress settings.  When a failure is encountered, either a single Exchange writer or both Exchange writers may be left in a FAILED RETRYABLE state.  We can run VSSAdmin List Writers again to query the writer status and see these results.  Here is an example showing the Exchange Replication Service writer with state [8] Failed and last error Retryable error.

 

Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [8] Failed
   Last error: Retryable error

 

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error

 

Now the typical question that comes up at this point is how to deal with an Exchange writer that consistently disallows backups.  The answer: restart the service that the writer is associated with and/or fix whatever configuration issue is causing the failures.  For example, given the above output I would restart the Exchange Replication Service in an attempt to return the writer to a Stable / No error state.  (If it had been the Microsoft Exchange Writer, I would have restarted the Exchange Information Store service.)
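
The writer-to-service relationship can be written down as a small lookup.  The Python sketch below is illustrative only and restarts nothing.  The short service names `MSExchangeIS` and `MSExchangeRepl` are, to my knowledge, the names of the Information Store and Replication services, but treat them as an assumption and confirm them on your own server.  The tuple layout and the function name are hypothetical.

```python
# Writer name -> Windows service that loads it.
# Assumption: these short service names are correct for Exchange 2007/2010;
# confirm them on your own server before acting on the result.
WRITER_SERVICE = {
    "Microsoft Exchange Writer": "MSExchangeIS",            # Information Store
    "Microsoft Exchange Replica Writer": "MSExchangeRepl",  # Replication Service
}


def services_for_failed_writers(writers):
    """Return the services behind writers whose state is 'Failed'.

    `writers` is a list of (writer_name, state, last_error) tuples, a
    hypothetical shape chosen for this example.
    """
    services = []
    for name, state, _last_error in writers:
        if "Failed" in state and name in WRITER_SERVICE:
            services.append(WRITER_SERVICE[name])
    return services


observed = [
    ("Microsoft Exchange Replica Writer", "[8] Failed", "Retryable error"),
    ("Microsoft Exchange Writer", "[1] Stable", "No error"),
]
candidates = services_for_failed_writers(observed)
print(candidates)
```

The helper only reports which service sits behind a failed writer; whether a restart is warranted is a separate decision.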

The real question, though, is whether I need to deal with a writer that is in a failed state at all.  Unfortunately, many administrators find themselves dealing with a failed writer because, in their experience, subsequent backup jobs fail while the writer is in that state.  If you review these issues carefully, what you’ll find is that the backup jobs are not failing because of a VSS failure; they are failing because a writer was found in a failed state.  From an Exchange / VSS perspective this is unexpected.  After all, although the writer is failed, the error is RETRYABLE, essentially saying “hey…something failed but come on back and try me again…”

 

Let’s take a look at why this might be happening….

 

Within the VSS framework there are two states that we are interested in: the Session State and the Current State.  When a VSS session is in progress and an administrator runs VSSAdmin List Writers, the state that is displayed is the state of the current session.  When the VSS snapshot creation has completed, the current state becomes a session-specific state and the status of the most recently completed session is copied to the current state.  At this point, when the administrator runs VSSAdmin List Writers, the state of the most recently completed session is displayed.  This is an important distinction: THE SESSION STATE AT THIS POINT REFLECTS THE STATUS OF THE LAST SESSION!  The status of the last session does not imply anything about the success or failure of future sessions.

Now that we know where VSSAdmin List Writers gets its information, we’ll take a look at how the backup process should progress.  (I’m going to attempt to present an overly simplified timeline of an expected backup.)

The process starts with the VSS requester establishing a VSS session. 

 

After the session is established the VSS requester requests metadata from the VSS framework.

 

At this point the VSS requester and the VSS framework advance the snapshot process by determining components and preparing the snapshot set.

 

Once the components and snapshot set have been prepared, the VSS requester issues a PrepareForBackup.  This in turn causes the VSS framework to prepare the components for backup.

 

After PrepareForBackup is called, the individual application-level writers are responsible for the current writer status.  The VSS requester is now allowed to call GatherWriterStatus, which in turn should return the current writer status.  For example, the current writer status at this stage could be FREEZE / THAW / etc.  This holds regardless of whether the previous status was FAILED or HEALTHY.  This is the status that the VSS requester should use to make logic decisions at this point.

 

Once the snapshot is created, the contents can be transferred to the backup media.  Once the transfer is complete, the VSS requester can inform the VSS framework that the backup has completed successfully, and the VSS session is then ended.

 

In summary, if the VSS requester performs operations in the expected order, the writer status should be queried after the framework has received a PrepareForBackup event.  This ensures the writer status reflects that of the CURRENT SESSION IN PROGRESS and not the SESSION STATE OF THE PREVIOUS BACKUP.
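
To make the ordering point concrete, here is a toy Python model.  It does not call the real VSS API; `ToyWriter`, both requester functions, and the status strings are invented for illustration.  The only thing it assumes is the behavior described in this post: once PrepareForBackup has been received, the writer reports on the session in progress rather than on the previous one.

```python
class ToyWriter:
    """Toy stand-in for a VSS writer. This is not the real VSS API."""

    def __init__(self, last_session_status):
        # Status left behind by the most recently completed session.
        self.status = last_session_status

    def prepare_for_backup(self):
        # In this model, preparing for backup starts a new session, so the
        # writer now reports on the session in progress.
        self.status = "IN_PROGRESS_OK"

    def gather_status(self):
        return self.status


def requester_checks_first(writer):
    """Requester that inspects writer status before PrepareForBackup."""
    if writer.gather_status().startswith("FAILED"):
        return "aborted: writer found in failed state"
    writer.prepare_for_backup()
    return "backup proceeded"


def requester_prepares_first(writer):
    """Requester that calls PrepareForBackup, then inspects writer status."""
    writer.prepare_for_backup()
    if writer.gather_status().startswith("FAILED"):
        return "aborted: writer failed in current session"
    return "backup proceeded"


# Both requesters start from a writer left FAILED / RETRYABLE by an earlier job.
early = requester_checks_first(ToyWriter("FAILED_RETRYABLE"))
late = requester_prepares_first(ToyWriter("FAILED_RETRYABLE"))
print(early)
print(late)
```

The first requester gives up because of a stale status from the last session, while the second one proceeds.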

 

The administrator can verify the functionality of the Exchange writer by using the VSHADOW or DISKSHADOW utilities.  These utilities follow the workflow outlined above, in which a failed retryable writer is handled successfully.  If either of these utilities is successful in creating the backup, and the writer in turn is returned to a healthy state, you might consider following up with the backup vendor to ensure VSS calls are being made appropriately.  Microsoft can also assist you in verifying that the calls are made appropriately by assisting with both Exchange and OS VSS tracing.
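
For a DISKSHADOW test, a small script file is usually the easiest route.  The Python sketch below only builds the script text and runs nothing.  The commands it emits (`set verbose on`, `set context persistent`, `writer verify`, `begin backup`, `add volume ... alias`, `create`, `end backup`) are DiskShadow commands as I understand them.  The volume letters, the alias prefix, and the function name are placeholders of mine, so adjust them to the volumes that hold your databases and logs.

```python
# Writer ID shared by both Exchange writers (from the vssadmin output in this post).
EXCHANGE_WRITER_ID = "{76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}"


def build_diskshadow_script(volumes, writer_id=EXCHANGE_WRITER_ID):
    """Return DiskShadow script text for a test snapshot (hypothetical helper)."""
    lines = [
        "set verbose on",
        "set context persistent",
        # Ask DiskShadow to confirm the Exchange writer takes part in the backup.
        f"writer verify {writer_id}",
        "begin backup",
    ]
    for index, volume in enumerate(volumes):
        # Alias names are arbitrary labels chosen for this example.
        lines.append(f"add volume {volume} alias vss_test_{index}")
    lines += ["create", "end backup"]
    return "\n".join(lines) + "\n"


# "D:" and "E:" are placeholder volumes, not taken from any real server.
script = build_diskshadow_script(["D:", "E:"])
print(script)
```

Saved to a file, the script can be passed to DiskShadow in script mode (`diskshadow -s <file>`).  Because it runs a real backup sequence and leaves persistent shadow copies behind until they are deleted, it is worth trying on a test system first.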

Comments
  • Hi...so you are suggesting the logic needs to be fixed by the backup vendor rather than MS? I believe vendors have worked together with MS before releasing their products, and most likely they are already aware of the issues/fixes? This affects all backup products...

  • @TT:

    I am suggesting that there are occasions where this type of issue occurs because of the order of calls the backup vendor is making.

    More importantly, what I'm suggesting is that a failed writer is not necessarily something that needs to be fixed.  In supporting these types of cases, what I've noticed is a lot of attention paid to the state of the writer and to "fixing" the writer from a failed state.  In theory the writer being failed is simply telling you that a previous operation failed; it should not preclude the taking of future backups (and therefore is not something that requires fixing before continuing to troubleshoot a backup issue).

    TIMMCMIC

  • but most of the time after fixing that into stable/no error it works.. :-)

  • @TT:

    In most cases a writer left in a failed state means the previous backup failed.  What you are hopefully looking at is what led up to the failed writer and not the failed writer itself...

    TIMMCMIC

  • @TIMMCMIC: What then, if TSM TDP backup fails consistently for weeks with the following error:

    ANS5261W An attempt to create a snapshot has failed.

    ANS1327W The snapshot operation for 'INT-EXCHDB-01\Microsoft Exchange Writer\{76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}\Mailbox Database 09\c2244697-3994-4587-8234-e2bc9bbd4e79' failed with error code: -1.

    Clearly, the VSS "Last Error" message indicates that something isn't working as intended.

    Writer name: 'Microsoft Exchange Replica Writer'

      Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}

      Writer Instance Id: {15642e98-5677-43ce-9a56-8dd1a6f32745}

      State: [7] Failed

      Last error: Retryable error

    Writer name: 'Microsoft Exchange Writer'

      Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}

      Writer Instance Id: {bddc9190-b6e2-4a64-b07b-58280cd7de33}

      State: [7] Failed

      Last error: Timed out

    What is the backup application supposed to do other than report the problem and ask Microsoft to please fix their product so it doesn't fall over every time someone tries to use it in the real world?

  • Andreas:

    First and foremost I think it's a great assumption on your part that there's some deficiency in either VSS or Exchange that is causing your issue.

    If you read the error you'll see that it indicates a timeout has occurred.  VSS / Exchange has 30 seconds to allocate the snapshot in order to satisfy the backup.  When a timeout error occurs it's usually related to ancillary items:

    1) The incorrect volume formatting is utilized (for example an Exchange server should utilize 64K formatting not the default 4K)

    2)  The volume is defragmented (being exacerbated by item 1 in this list).

    3)  The hardware is having issues at the time of the backup.

    Maybe you should open a support case and allow the issue to be investigated.  I would also have expected the backup vendor to have some insight into these types of issues.

    TIMMCMIC

  • Hi TIMMCMIC,

    what do you mean when you say "The volume is defragmented (being exacerbated by item 1 in this list)."

    did you mean the volume is fragmented?  Please let us know.

  • Been with this for awhile and now again it's happened on my Exchange 2007 and Veeam Backup.

    11/16/2012 8:43:20 PM :: Unable to release guest. Error: Unfreeze error: [Backup job failed.

    Cannot create a shadow copy of the volumes containing writer's data.

    A VSS critical writer has failed. Writer name: [Microsoft Exchange Writer]. Class ID: [{76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}]. Instance ID: [{8ea7190d-337c-448f-b264-3401303b586b}]. Writer's state: [VSS_WS_FAILED_AT_FREEZE]. Error code: [0x800423f2].]

  • @Habibablby:

    Without the full application log it's going to be hard to predict why the freeze of Exchange is failing.

    I would suggest opening a case with PSS if you could.

    TIMMCMIC

  • Tim, first of all, thanks for the great article. The frustrating problem people face, and which I felt is not spelled out well in this article, is that when a VSS writer goes into a retryable state, you can try as many times as you want, but the backup will fail. So it is not what it says it is, if you know what I mean. You seem to be under the impression that the error should be taken literally, but if you have some backup experience, I am pretty sure you will be disappointed to find out that it isn't retryable as it says it is.

    Backup may fail for whatever reason, and we are not asking VSS to fix every backup-related issue, of course. I think the problem with the VSS writer is that even if you have fixed the root cause that put the writer in an error state, the writer won't allow you to back up Exchange. For the VSS requestor software makers, this is a problem because customers keep coming to the backup application vendor and want an answer from them. What customers need is an answer from Microsoft as to why you need to restart VSS writer related services to clear the state.

    So, my experience with Exchange is that ONCE the Microsoft Exchange Writer goes into a bad state, it is not possible to get out of it without restarting a bunch of Microsoft Exchange related services. Now, it seems you have been focusing on the reason why this happened in the first place. That's good, so you can prevent the same problem from happening next time, but it will not clear the state of the writer. Hope that clarifies this for all readers.

    Domenico.

  • @Domenico:

    I appreciate the feedback.  Still, an Exchange writer that is in a failed retryable state does not need to have services restarted in order to "fix" something.

    Take an example that I see pretty regularly.  We're using third party software X.  Third party software X experiences a failure to communicate with a central server.  This causes that backup to "fail" and leaves the writer in a "failed retryable" state.  I then take that same product and attempt to perform a backup with third party software X, and it's successful!  Yes, the operation is completely successful even though the writer was originally found in a failed retryable state.  And yes, in many cases a writer in failed retryable has nothing to do with VSS itself and most likely should be referred to the third party backup software.

    You can also observe the same thing using the VSS tester.  In many cases when the writer is failed retryable, the diskshadow / VSS tester script will work just fine and a backup is taken successfully.  This is further evidence that a failed retryable writer is not necessarily something that needs to be fixed.

    Now, there are several cases where the writer is failed retryable and subsequently all backup attempts fail, not only with third parties but also with the VSS tester script.  This usually indicates some form of legitimate issue within Exchange or VSS.

    TIMMCMIC

  • Okay, but what if it's DISKSHADOW that is having this issue?

  • @8bit_pirate:

    Thanks for the comment.  It would depend on what it means for diskshadow to have an issue.  The writer being in a failed retryable state prior to running diskshadow should not be an issue and should not cause diskshadow to fail.  If the writer is found in this state and diskshadow fails, then it's possible there is an issue with core VSS that needs to be investigated.

    TIMMCMIC

  • Getting the Retryable Error always on the same DB. When I restart the Replication Service, I am able to back up the DB, but during the nightly backup schedule the same DB always gets hung on retryable. Why?

  • Hi Tim, you mentioned "..assisting with both Exchange and OS VSS tracing". It appears it is possible to trace VSS just for Exchange? If so, can you please tell me or point me to a procedure that explains how to perform Exchange VSS tracing? Thanks in advance!
