Tim McMichael

Navigating the world of high availability...and occasionally sticking my head in the cloud...

Exchange and VSS -- My Exchange writer is in a failed retryable state…


In Exchange 2007 and Exchange 2010, many customers rely on VSS-based backups to retain and protect their Exchange data.  By default, Exchange provides two VSS writers that share the same VSS writer ID but are loaded by two different services.  The first is the Exchange Information Store VSS writer, and the second is the Exchange Replication Service VSS writer.  The Information Store writer allows for the backup of active / mounted databases, and the Replication Service writer allows for the backup of passive databases (should a replicated database model be utilized).  You can see the writers by running the command VSSADMIN LIST WRITERS from a command prompt.

 

Here is sample output of VSSAdmin List Writers from a Windows 2008 R2 SP1 server with Exchange 2010 SP1.  Note how both writers share the same writer ID within the VSS framework.

 

Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

 

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error
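
If you want to check this output from a script rather than by eye, here is a minimal Python sketch that parses text in the format shown above.  It assumes English-language output laid out exactly like the sample; the function name parse_writers and the SAMPLE constant are my own for illustration and are not part of any Microsoft tool.

```python
import re

# Sample text copied from the VSSAdmin List Writers output shown above.
SAMPLE = """\
Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error
"""


def parse_writers(text):
    """Turn 'vssadmin list writers' style text into a list of dicts."""
    writers = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Writer name: '(.+)'$", line)
        if match:
            # A 'Writer name' line starts a new writer record.
            current = {"name": match.group(1)}
            writers.append(current)
            continue
        if current is None or ":" not in line:
            continue
        # Remaining lines are 'Key: value' pairs for the current writer.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return writers


if __name__ == "__main__":
    for writer in parse_writers(SAMPLE):
        print(writer["name"], "|", writer["State"], "|", writer["Last error"])
```

Collecting the "Writer Id" values of the parsed records into a set is a quick way to confirm that both Exchange writers report the same writer ID.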

In the course of protecting Exchange servers there may be conditions that cause a backup job to fail.  When an Exchange backup job fails, the VSS framework aborts the backup and Exchange subsequently clears the backup-in-progress settings.  When a failure is encountered, either a single Exchange writer or both Exchange writers may be left in a FAILED RETRYABLE state.  We can use VSSAdmin List Writers again to query the writer status and see these results.  Here is an example showing the Exchange Replication Service writer with a state of 8 FAILED and a last error of RETRYABLE.

 

Writer name: 'Microsoft Exchange Replica Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {17e8df11-a8a2-4ee3-a3fb-e552b7da2d83}
   State: [8] Failed
   Last error: Retryable error

 

Writer name: 'Microsoft Exchange Writer'
   Writer Id: {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}
   Writer Instance Id: {e0ad4b68-8938-4be5-9b88-4c74df2b2d65}
   State: [1] Stable
   Last error: No error

 

Now the typical question that comes up at this point is: how do I actually deal with an Exchange writer that consistently disallows backups?  The answer: restart the service that the writer is associated with and/or fix whatever configuration issue is causing the failures.  For example, given the above output I would restart the Exchange Replication Service in an attempt to return the writer to a Stable / No error state.  (Had it been the Microsoft Exchange Writer, I would have restarted the Exchange Information Store Service.)
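
The writer-to-service relationship can be captured in a small lookup.  This is only a sketch in Python: the writer and service names are the ones used in this post, while the function name and the (name, state) input format are assumptions made for illustration.

```python
# Which service loads which writer, as described in this post.
WRITER_TO_SERVICE = {
    "Microsoft Exchange Writer": "Exchange Information Store Service",
    "Microsoft Exchange Replica Writer": "Exchange Replication Service",
}


def services_for_failed_writers(writers):
    """Given (writer name, state text) pairs, return the services whose
    Exchange writer is not reporting '[1] Stable'."""
    services = []
    for name, state in writers:
        if name in WRITER_TO_SERVICE and not state.startswith("[1]"):
            services.append(WRITER_TO_SERVICE[name])
    return services


if __name__ == "__main__":
    observed = [
        ("Microsoft Exchange Replica Writer", "[8] Failed"),
        ("Microsoft Exchange Writer", "[1] Stable"),
    ]
    print(services_for_failed_writers(observed))
```

For the failed output above, this points at the Exchange Replication Service only.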

The real question, though, is: do I need to deal with a writer that is in a failed state?  Unfortunately, many administrators find themselves having to do so, because in their experience subsequent backup jobs fail while the writer is in a failed state.  If you review the issues carefully, what you’ll find is that the backup jobs are not failing because of a VSS failure; they are failing because a writer was found in a failed state.  From an Exchange / VSS perspective this is unexpected: although the writer is failed, the error is RETRYABLE, essentially saying “hey…something failed but come on back and try me again…”

 

Let’s take a look at why this might be happening….

 

Within the VSS framework there are two states that we are interested in: the Session State and the Current State.  When a VSS session is in progress and an administrator runs VSSAdmin List Writers, the state that is displayed is the current session state.  When the VSS snapshot creation has completed, the current state becomes a session-specific state and the status of the most recently completed session is copied to the current state.  At this point, when the administrator runs VSSAdmin List Writers, the state of the most recently completed session is displayed.  This is an important distinction: THE STATE DISPLAYED AT THIS POINT REFLECTS THE STATUS OF THE LAST SESSION!  The status of the last session does not imply anything about the success or failure of future sessions.

Now that we know where VSSAdmin List Writers gets its information we’ll take a look at how the backup process should progress.  (I’m going to attempt to present an overly simplified timeline of an expected backup)

The process starts with the VSS requester establishing a VSS session. 

 


 

After the session is established the VSS requester requests metadata from the VSS framework.

 


 

At this point the VSS requester and the VSS framework advance the snapshot process by determining components and preparing the snapshot set.

 


 

Once the components and snapshot sets have been prepared, the VSS requester issues a PrepareForBackup.  This in turn causes the VSS framework to prepare the components for backup.

 


 

After PrepareForBackup is called, the individual application-level writers are responsible for the current writer status.  The VSS requester is now allowed to call GatherWriterStatus, which should return the current writer status.  For example, the current writer status at this stage could be FREEZE / THAW / etc., regardless of whether the previous status was FAILED or HEALTHY.  This is the status that the VSS requester should be using to make logic decisions at this point.

 


 

Once the snapshot is created, the contents can be transferred to the backup media.  Once the transfer is complete, the VSS requester can inform the VSS framework that the backup has completed successfully, and the VSS session is subsequently ended.

 


 

In summary, if the VSS requester is performing operations in the expected order, the writer status should be queried after the framework has received a prepare for backup event.  This will ensure the writer status reflects that of the CURRENT SESSION IN PROGRESS and not the SESSION STATE OF THE PREVIOUS BACKUP.
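
To make the ordering issue concrete, here is a deliberately simplified Python simulation.  It does not call the real VSS API: ToyWriter, the status strings, and both requester functions are hypothetical stand-ins that only model the idea that a writer keeps showing the previous session's result until a new session prepares it.

```python
class ToyWriter:
    """Toy stand-in for a VSS writer; not the real VSS API."""

    def __init__(self, last_session_status="STABLE"):
        # Until a new session prepares the writer, the only status
        # available is the one left behind by the previous session.
        self.status = last_session_status

    def prepare_for_backup(self):
        # A new session takes over the writer's status.
        self.status = "PREPARED"

    def gather_status(self):
        return self.status


def status_checked_too_early(writer):
    """Requester that inspects writer status before PrepareForBackup."""
    if writer.gather_status().startswith("FAILED"):
        return "skipped: judged by the previous session's status"
    writer.prepare_for_backup()
    return "snapshot attempted"


def status_checked_after_prepare(writer):
    """Requester that queries status only after PrepareForBackup."""
    writer.prepare_for_backup()
    if writer.gather_status().startswith("FAILED"):
        return "aborted: writer failed in the current session"
    return "snapshot attempted"


if __name__ == "__main__":
    print(status_checked_too_early(ToyWriter("FAILED_RETRYABLE")))
    print(status_checked_after_prepare(ToyWriter("FAILED_RETRYABLE")))
```

In this toy model the first requester never gets as far as exercising VSS when the previous backup failed, while the second one proceeds because it only looks at the status belonging to its own session.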

 

The administrator can verify the functionality of the Exchange writer with the VSHADOW or DISKSHADOW utilities.  These utilities follow the workflow outlined above, in which a failed retryable writer is handled successfully.  If either of these utilities succeeds in creating the backup, and the writer in turn is returned to a healthy state, you might consider following up with the backup vendor to ensure VSS calls are being made appropriately.  Microsoft can also help verify that the calls are being made appropriately by assisting with both Exchange and OS VSS tracing.
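
As a rough illustration of such a test, the Python sketch below only builds the text of a DISKSHADOW script; it does not run anything.  The volume letters are placeholders for wherever your database and log files live, the volatile context is my own choice for a throwaway test, and the writer ID is the one shown in the output earlier in this post.  Treat it as a starting point to adapt, and try it on a test system first, since Exchange may treat a completed full backup like any other backup (for example, when deciding whether to truncate logs).

```python
# Writer ID shared by both Exchange writers, taken from the output above.
EXCHANGE_WRITER_ID = "{76fe1ac4-15f7-4bcd-987e-8e1acb462fb7}"


def build_diskshadow_script(volumes, writer_id=EXCHANGE_WRITER_ID):
    """Return DISKSHADOW script text for a test backup of the given volumes.

    'volumes' is a list such as ["D:", "E:"]; these are placeholders
    and must be replaced with the volumes used by your databases.
    """
    lines = [
        "set verbose on",
        "set context volatile",   # assumption: no need to keep the shadow copy
        "begin backup",
    ]
    lines += [f"add volume {volume}" for volume in volumes]
    lines += [
        f"writer verify {writer_id}",  # require the Exchange writer to take part
        "create",
        "end backup",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_diskshadow_script(["D:", "E:"]), end="")
```

Saved to a file, the resulting text can be passed to DISKSHADOW with its /s switch, after which VSSAdmin List Writers should show whether the writer has returned to a stable state.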

Comments
  • @Adam:

    Yes - that's what we're here for.

    TIMMCMIC

  • hi TIMMCMIC


    Me again. I was thinking about what you wrote about the steps you mentioned yesterday:

    1) Gather metadata.
    2) Call preparation
    3) Prepare snapshot
    4) Present snapshot
    5) Validate snapshot
    6) Copy snapshot to media
    7) Invoke backup complete


    Should I see in the debug logs of the 3rd party backup SW this line :

    IVssBackupComponents::BackupComplete.


    I’ve found this :

    “The requester indicates that the backup has completed by calling IVssBackupComponents::BackupComplete.”

    In this place :

    http://msdn.microsoft.com/en-us/library/aa384323(v=vs.85).aspx



    Actually, what I would like to know is whether any 3rd party backup SW would use those commands?
    thanks for confirmation !


    Regards


    Adam

  • @ TIMMCMIC,
    if you really are here to help, please post some advice for all end users who are hitting the VSS_E_WRITERERROR_RETRYABLE error.
    Simply telling users to go to the third party backup software vendor will not solve the problem, it will delay resolution and it will make lots of people feel frustrated, including yourself as, in the end, they will likely come back to you.
    I have 15 years of experience working with email databases like Lotus Notes and Exchange on Windows, and as I am working at a backup software company I learned how to resolve Microsoft issues, issues which Microsoft should resolve themselves.
    What customers need is an answer from Microsoft as to what they need to do to resolve the case; now that's what I call a helpful attitude. Pushing back is not what I call helpful.
    Thanks.

    Domenico.

  • @Domenico:

    Thanks for taking the time to comment. Considering your stated experience I thought you might appreciate the actual issue that is being highlighted in this article.

    When this article was first authored it was written for two reasons:

    1) Highlight a legitimate issue that existed in third party backup software.
    2) Explain what it means to have a retryable writer.

    At the time, on a weekly basis, we were seeing customers complaining that they had to restart the information store or replication service every time they had a failed backup. The failures had nothing to do with the Exchange or Windows infrastructure - and were mostly linked to the inability to commit to media servers or agents losing connections etc. So of course when this happens the backup fails and the writers go into a failed retryable state. The real complaint here was not why did the backup fail (most customers knew this) but why would Exchange / Windows force us to clear a writer to healthy before allowing us to back up again. In these instances Windows Server Backup or diskshadow would back up without an issue, clearly demonstrating the ability to exercise VSS and take a backup, yet third party products were failing. What was discovered in this investigation was what is highlighted in this article as an issue - the VSS calls in question were being made out of order. The future success of backups should not be determined by the gather writer status prior to issuing an on prepare. The gather writer status should occur after the on prepare (allowing us to determine if the writer went from failed / retryable to preparing -> in which case VSS has successfully started).

    Ironically, we still see advice to customers indicating that restarting services to reset writer status is a necessary prerequisite for backups to function.

    So as to the issue that is documented here - although it is not necessarily running wild any longer (I cannot think of a vendor VSS log that I've seen this in for quite a while) - it is nonetheless still legitimate.

    To the overall concept of what it means to have a writer that is retryable I think this information is also still valid. The prevailing idea that just because a writer is retryable means something is broken and we need to fix that in order to be successful is not correct.

    Feel free to contact me through comments or the contact link on the blog if you'd like to discuss further.

    TIMMCMIC

  • Tim,
    what you are referring to are corner cases that you may have had with badly written VSS requestors;
    however my experience lists hundreds of cases per year with a well written requestor which won't be able to backup exchange simply because one of the exchange writers ('Microsoft Exchange Writer' or 'Microsoft Exchange Replica Writer') decided to sit in a bad state and only a service restart, enable disable, or in many cases an exchange server reboot would resolve the problem for a good while (days or months).
    Our VSS provider and VSS requestors are designed according to Microsoft guidelines on http://msdn.microsoft.com/en-us/library/aa384615%28v=vs.85%29.aspx, so in accordance with what you describe should be the correct way.

    I guess the reason why I keep replying to you is that I don't like to see a well written blog with such a big flaw in it, and I am referring to when you write about "the retryable error is re-tryable and the only thing to do is go to the third party software vendor". Although that may be the case for some other corner cases (that I haven't seen), that's definitely wrong for many others and it makes all backup vendors (including Microsoft partners) unnecessarily look bad.

    You also mentioned that in those cases end users can also seek help at Microsoft and that's great, but I guess the ask is to add some more useful tips on how else to resolve the problem and add a more detailed description of the error. The error is not a retryable error, not in my experience, and you always need to restart some services or reboot to resolve it. If you look this error up on Google, you find tons of hits from people who resolve the case this way. What you described here so far is just a corner case that you have experienced and, don't get me wrong, that's fine, as long as you list the other thousand cases which are generated because of other more valid reasons ;-)
    And just to be clear, I am referring to cases where a Windows backup software backup of the same database copy FAILS too.
    Hope that helps get my point through this time.

  • @Domenico:

    I guess we will need to agree to disagree. At the time this blog post was written I'd say this was far from a corner case.

    To your broader point though - when diskshadow / betest / and Windows Server Backup fail there is plenty of evidence that this is not a third party problem per se. (I find a lot of times we introduce other issues not related to the original question just trying to get some of these things to work).

    I'll still stand by what I've said here though - the retryable error is not what needs to be fixed.

    TIMMCMIC

  • Indeed we do disagree. The retryable error is the worst written error written by Microsoft ever, and I will not point our customers to this thread, as it won't help them. End of story.

  • @Domenico

    No problem - most customers locate this thread on their own without needing direction...

    Should you want to further a discussion offline I can be contacted through the blog.

    TIMMCMIC

  • I'm still confused. If the VSS *is* retryable, but retrying won't get you a backup - what's the solution? You're saying the Backup software is misunderstanding the message. It seems to me what's needed then is a way to clear the warning(?) so the backup software thinks the VSS is Ok. How would one clear this warning so the VSS says No Error?

  • @MD2000...

    Maybe it's time for me to just write a different blog post - this one has taken on a life of its own...

    To answer your question...

    The item addressed in this blog post was written when we saw a rise in VSS cases due to writers being left failed / retryable. When we started tracing we found that if something had previously failed, and the writer was left failed retryable, all backups from that point forward would fail without VSS even being attempted. (IE - the backup software was making a determination that the server was unhealthy based simply on the writer status). Fortunately it's been quite some time since we have seen an issue like that.

    The second thing I was trying to highlight here is that we still see customers call, and references from other vendors, saying that the VSS writer needs to be fixed because it's failed / retryable -> a failed retryable writer is a symptom / result, not something that in itself needs to be fixed.

    So in your particular case you've ended up with a VSS writer that is in a failed / retryable state. This state indicates that something failed -> come try it again. How do you go about fixing it - you go about finding the failure to begin with. (That sounds simple - but sometimes it is not so simple).

    A VSS backup really boils down to a few stages:

    1) Collection of VSS metadata and components.
    2) Execution of the snapshot.
    3) Verification of the exchange data (optional).
    4) Transfer of data to media.
    5) Notification of backup complete.

    To determine why a VSS writer is failed / retryable we need to start by looking at the application log. There is an event sequence that fires for each one of these items. You need to follow the event sequence through and see where the failure actually occurred.

    If the metadata and snapshot complete successfully -> and the errors occur in data transfer -> then we need to consult the backup job log. Actually - we really need to consult the backup job log anyway, to see where it thinks the failure occurred.

    I don't have time at the moment to list the actual event sequence - and it varies by Exchange version - but hopefully this helps you.

    BTW - to get the writer back to no error (which should not be necessary) - you need to restart the service associated with the writer.

    TIMMCMIC
