Today we're going to cover an issue that is more common than you might think, but can be tricky to troubleshoot.  The first thing you may think when you read the title of this post is, "What is the Performance team doing trying to figure out why the Cluster Service is acting up?"  This kind of issue usually winds up with our Cluster team, but because of the root cause, we do a lot of collaboration with them on it.  Intrigued?  Good, read on ...

In this particular scenario, the problem presented as follows:  an active-passive 2-node Windows Server 2003 print cluster failed over for some reason.  When we dug into what was going on, we found that Node 1 of the cluster had lost communication with the rest of the cluster, even though it had not actually dropped its network connection.  Looking at the state of the Cluster service itself, the service was stuck in a "Starting" state on Node 1.  The problem with being stuck in this state is that there is not much you can do with the service.  Even after we rebooted the server (with the service startup type still set to Automatic), the Cluster service came up and hung in the "Starting" state again.  Meanwhile, Node 2 of the cluster was chugging along happily, servicing print requests and behaving itself - but the high-availability print environment was one bad spooler crash away from an administrator's worst nightmare ...

Since we knew this was a print cluster, the first thought was, "Are there third-party port monitors in play here?"  If so, what might be going on is this: the port monitors run checks against the printers and update the registry for each printer.  The registry information for the printer lives in the cluster hive, and changes to that registry location are replicated up to the quorum.  If a couple of hundred printers are being handled by such a port monitor, the server can get fairly busy.  When a node tries to join the cluster, it is given a sequence number.  The problem is that the server may be so busy that it does not respond in time to the node that handed out the sequence number.  When that happens, the formed node sends out another sequence number to the node that is trying to join - and the endless loop continues.  On the joining node (Node 1 in our example), the Cluster service just sits in a "Starting" state.  So, what's the fix?  There are two possible ways to address this: take the spooler resource offline until the node has successfully joined the cluster, or remove the third-party port monitor.  Obviously, there are problems with either approach - the first means downtime (taking the spooler resource offline), and the second means a fair amount of administrative overhead to clean up the third-party port monitors.
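
One quick aside on the first workaround: you don't have to reach for Cluster Administrator to take the spooler resource offline - cluster.exe can do it from the command line on the node that currently owns the group.  This is just a sketch, and the resource name "Print Spooler" is an assumption, so substitute the actual name of your spooler resource:

  • cluster resource "Print Spooler" /offline (takes the spooler resource offline so it stops generating registry updates while the node joins)
  • cluster resource "Print Spooler" /online (brings the spooler resource back online once the joining node is in the cluster)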

Now, here's the rub ... in this particular instance, there were apparently no third-party port monitors installed on the cluster server.  So what was going on?  Looking at the Cluster Log (which was several MB in size), we found several printers for which notifications were repeated over and over (a sample is below):

00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumSendUpdate:  Locker waiting        type 1 context 4098
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] Thread 0x900 UpdateLock wait on Type 1
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumpDoLockingUpdate: lock was free, granted to 2
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumpDoLockingUpdate successful, Sequence=559445862 Generation=0
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumSendUpdate: Locker dispatching seq 559445862    type 1 context 4098
00000a30.00000900::<DATE>-17:16:38.309 INFO [DM] DmWriteToQuorumLog Entry Seq#=559445862 Type=4098 Size=212
00000a30.00000900::<DATE>-17:16:38.309 INFO [DM] DmUpdateSetValue 
00000a30.00000900::<DATE>-17:16:38.309 INFO [DM] Setting value of DsKeyUpdate for key Resources\<GUID>\Parameters\Printers\<PRINTER> to 0x00000000
00000a30.00000900::<DATE>-17:16:38.309 INFO [API] Notification on port  2bfb520, key 1 of type 64. KeyName Resources\<GUID>\Parameters\Printers\<PRINTER>
00000a30.00000900::<DATE>-17:16:38.309 INFO [API] Notification on port  2d61310, key 1 of type 64. KeyName Resources\<GUID>\Parameters\Printers\<PRINTER>
00000a30.00000900::<DATE>-17:16:38.309 INFO [API] Notification on port   1b6790, key 1 of type 64. KeyName Resources\<GUID>\Parameters\Printers\<PRINTER>
00000a30.00000900::<DATE>-17:16:38.309 INFO [API] Notification on port  2bfb520, key 36 of type 64. KeyName Parameters\Printers\<PRINTER>
00000a30.00000900::<DATE>-17:16:38.309 INFO [API] Notification on port  2d61310, key 36 of type 64. KeyName Parameters\Printers\<PRINTER>
00000a30.00000900::<DATE>-17:16:38.309 INFO [DM] DmWriteToQuorumLog Entry Seq#=559445862 Type=4098 Size=212
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumSendUpdate: Dispatching seq 559445862    type 1 context 4098 to node 1
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumSendUpdate: Locker updating seq 559445862    type 1 context 4098
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumpDoUnlockingUpdate releasing lock ownership
00000a30.00000900::<DATE>-17:16:38.309 INFO [GUM] GumSendUpdate: completed update seq 559445862    type 1 context 4098

Knowing that the various GUIDs were printers, we took a look at the HKEY_LOCAL_MACHINE\Cluster\Resources\<GUID>\Parameters\Monitors registry key for each instance, just to double-check that there were no third-party monitors (one quick way to run this check from the command line is shown after the listing below).  The only thing listed was the Standard TCP/IP Port.  So we went back to the Cluster Log and picked out the first ten printers that appeared in it.  The one I picked for this post happened to be a Lexmark Optra S 1855 (MS) printer.  This was a random selection and does not mean that there is anything specifically wrong with the in-box driver for this printer model.  Below are the printer name, the printer model, and the dependent files:

PRINTERNAME01   =  Lexmark Optra S 1855 (MS) (LMPCLRES.DLL, TTFSUB.GPD, UNIRES.DLL, STDNAMES.GPD)
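
As mentioned above, here is one quick way to dump the Monitors key with reg.exe instead of browsing through Regedit.  Treat it as a rough sketch - <GUID> stands in for the resource GUID as it appears in the Cluster Log:

  • reg query "HKLM\Cluster\Resources\<GUID>\Parameters\Monitors" /s (lists everything under the Monitors key, so any third-party port monitor shows up alongside the Standard TCP/IP Port)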

When we checked the files, everything appeared to be a Windows Server 2003 in-box driver.  To quote Alice in Wonderland, "Curiouser and curiouser" ... so what exactly was causing all of the updates on these printers?  It turned out that the problem was being caused by Active Directory publishing of printers.  With print publishing, when the spooler starts and a printer is published, the printer is checked periodically to ensure that it is still there.  If it responds, all is well.  If not, the printer is pruned from Active Directory.  This periodic checking causes the DsKeyUpdate and DsKeyUpdateForeground registry values to be updated - which in turn causes the Cluster service to hang during startup while it tries to process all of these updates.  On the Active Directory side, when a printer is "pruned" it is effectively removed from AD.  The printer itself is not deleted - you can still connect via \\servername\printer - but you can no longer search for it in Active Directory.  What the policy setting and the command described below do is set a bit on the printer object in Active Directory so that this check is skipped for the printer(s).  This does not affect the ability to print, since we are not touching the spooler service or the printer itself - it's an AD thing!
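
If you're curious how much of this churn is happening on your own cluster, the same DsKeyUpdate values that show up in the Cluster Log can be dumped straight from the cluster hive with reg.exe.  Again, just a sketch - <GUID> is a placeholder for the spooler resource GUID from the Cluster Log:

  • reg query "HKLM\Cluster\Resources\<GUID>\Parameters\Printers" /s /v DsKeyUpdate (lists the DsKeyUpdate value for each printer key under the resource)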

What you can do to alleviate this is to disable the policy setting "Allow pruning of published printers".  You can then use the Resource Kit utility SETPRINTER.EXE to unpublish and republish printers as needed - for example:

  • Setprinter <\\servername> 7 "dwAction=unpublish" (this unpublishes all printers on the server)
  • Setprinter <\\servername> 7 "dwAction=republish" (this republishes all printers on the server)
  • Setprinter <\\servername\printername> 7 "dwAction=unpublish" (this unpublishes the specificied printer)
  • Setprinter <\\servername\printername> 7 "dwAction=republish" (this republishes the specified printer)

So there you go - the Case of the Stuck Cluster Service.  As you can see, our initial assumptions about the case were incorrect, but by using the data, we were able to isolate the problem and come up with a resolution.  Until next time ...

- CC Hameed
