This blog post is a contribution from Andy Li, an Escalation Engineer with the SharePoint Developer Support team.

This first case that I am going to talk about is a real customer’s issue.  The symptom is very typical and it’s just that the workflow is not creating the next task when the previous one was complete.  But the root cause turned out to be very complicated.  I’ll try to make things easier to follow and understand when I explain what’s going on.

Overview of the workflow code

I don’t want to show you the entire source code of the workflow, which would be an overkill and you will probably stop reading the rest of the article after reading 10 pages of source code.  So, let me just say that the workflow has two main pieces: One is the OnWorkflowItemChanged activities that handles the workflow item changed event.  The other is the OnTaskChanged activity that handles the task changed event.  When a task is complete the onTaskChanged activity will be triggered and the next task will be created and this cycle will go on until code logic determines the workflow can be finished.

What’s the issue with this workflow?

The issue is that after a task was complete the next task failed to be created.

Since the issue was not reproducible I asked the customer to provide us with the ULS logs from the time when he started the workflow until he sees the problem.

I try not to put the entire ULS log here, it is massive and doesn’t do a good job of explaining what’s going on, so I put it into an excel file that shows the timeline of each event and highlighted the ones that are important to understand the life cycle of the workflow.  If you are serious about learning how to troubleshoot SharePoint Workflow issues, download the excel file and follow what I am going to explain below.

The following is the re-constructed picture of the life time of the failed workflow instance:

(Note: all the row numbers refer back to the row number in the excel attachment)

1. On row #4, the workflow is processing the “OnWorkflowActivated” event.  As you know this is the first event that every SharePoint workflow needs to have.  So you can say this is basically the start of the workflow.

2. On row #7, the workflow was “Successfully persisted”.  This marks the end of the “OnWorkflowActivated” event processing.  The reason why the workflow is persisted is because there is no more work in the workflow.

3. On row #13, the workflow starts” processing event OnWorkflowItemChanged” event.  The reason why this event got triggered is because there is an update on the list item that is associated with the workflow.

As I mentioned in the beginning, the customer’s workflow contains several “OnWorkflowItemChanged” activities that will respond to the workflow item changed at some point down the line.  We don’t need to know all the details about the workflow design.  All we know at this point is that there is a “Workflow item change” and we need to deliver the event to the workflow.

4. Anytime SharePoint needs to process the event, it needs to load the workflow instance into memory, that’s why on row #15 there’s “Loaded instance data”.  And any “loaded instance data” should be followed by an ULS event “Successfully persisted workflow instance” a couple rows down to mark the event of the event processing.  This is very important because if there is only the “start” but no “ending”, that means the workflow is not persisted correctly.  This is exactly what happened in this case.

5. On row #18, you will notice that the context switched to OWSTIMER.exe process.  Again, we don’t need to know why it switches, but we know that the timer service “begin processing events for instance”.  And the instance data was loaded into memory on row #21 (“Loaded instance data”) for processing the events.

6. As I said, if there is a “Loaded instance data”, there should be a “Successfully persisted workflow instance” following it and that’s in rown #33.  Notice that we are still in OWSTIMER.exe and the timestamps are very close so we are pretty sure row #33 corresponds to row #21.

7. On row #45, we switched back to “w3wp.exe” and on row #47, we “being processing events”.

Do you see the problem yet?  On step #4 (row #15) we loaded the instance into memory, but we never see a “Successfully persisted workflow” event afterwards.  And timer job jumped in the middle and did a lot of stuffs and persisted the workflow.

In other words, the workflow instance w3wp.exe loaded in row #15 was still hanging around in memory and its data is outdated because timer persisted a new instance into database on row #33.

8. So here comes row #49 and row #51 (both are in w3wp.exe process) where we are trying to delivery OnTaskChanged event to the workflow and failed with QueueNotFound exception.  The reason is very clear because the workflow instance in w3wp.exe process was outdated and it did not reload the new instance which was persisted on row #33.

Summary

With the re-constructed picture of what had happened, I was able to re-create a much simpler reproduction of the issue by using multiple OnWorkflowItemChanged and OnTaskChanged activities.  After reviewing the source code I found there is a problem in the way SharePoint processes OnWorkflowItemChanged event.  The problem is that SharePoint didn’t forcedly unload the workflow instance once it finished processing the OnWorkflowItemChanged event.  This confirms our theory on step #4 and explains why there’s no corresponding “Successfully persisted workflow” event after row #15.

A hotfix for this issue has been included in December 2012 CU for SharePoint 2010 and if I am correct the same fix should also be included in the latest CU release for SharePoint 2013.

Download the ULS logs excel file from here

Hope this case study was helpful in understanding and troubleshooting workflow issues.