Users were unable to connect to their shares. John discovered that the Cluster service wasn't started, and that any attempts to start it resulted in an error 1068. He attempted to ping the virtual server's IP address and it returned a "request timed out" message. He got the same error when trying to ping the cluster node's public adapter.
When he got to the node he found the Cluster service in a Starting state. He soon discovered that he had no network connectivity to or from either Cluster node, and that their network cards were missing from "Network Connections." The only changes made to the network were a few minor group policy settings to lock down permissions a bit. Maybe that had something to do with this? It looked like it was going to be a long night...
This is another fairly common problem. It is not really just a Cluster problem, but that is usually how it is presented to me. Of course, if networking is not functional, then Cluster isn't going to work either. :) I have worked at least three of these issues in the last two months, and thought it warranted discussion since there isn't a public KB article on this particular scenario yet. I hope to fully document every error encountered here, so that others may find this post when they run into this situation. (KB articles sometimes take a while to get published.)
System event log:
SAM event ID: 12291 "SAM failed to start the TCP/IP or SPX/IPX listening thread"
IPSec event ID: 4292 "The IPSec driver has entered Block mode."
DfsSvc event ID: 14523 "DFS could not contact any DC for Domain DFS operations."
Application event log:
EventSystem event ID: 4609 "The COM+ Event System detected a bad return code during its internal processing. HRESULT was 80004015 from line 142 of d:\nt\com\complus\src\events\tier2\service.cpp."
Other problems discovered with this node:
The Com+ Event System, Network Connections and Shell Hardware Detection services were in a Starting state.
The following services failed to start:
Cluster Service: Error 1068: The dependency service or group failed to start.
File Replication: Error 1068: The dependency service or group failed to start.
--- Viewing the dependencies opens up a window titled "Service Dependencies" and the message is: Win32: Access is denied.
IPSEC Services: Error 1899: The endpoint mapper database entry could not be created.
System Event Notification: Error 1068: The dependency service or group failed to start.
--- Trying to view the dependencies on the server returns the following message: Win32: Access is denied.
Task Scheduler: "The endpoint mapper database could not be loaded"
We have three services failing with "the dependency service or group failed to start." When we try to view the dependencies we get an access denied message.
Let's look in the registry to see what each of these services depend on:
Cluster service: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc
DependOnService:
System Event Notification:
So the common dependencies are RpcSs and EventSystem.
RpcSs is the Remote Procedure Call (RPC) service, and EventSystem is the Com+ Event System service. We know from earlier that Com+ Event System is one of the services stuck in a Starting state, so that is why the File Replication and System Event Notification services haven't started. One of the other dependencies for the Cluster service is NetMan, which is the Network Connections service. Network Connections is also one of the services stuck in a Starting state.
So now the real question is: Why are the Com+ Event System and Network Connections services not starting?
If we view the dependencies for these two services, we just find RpcSs listed. So it all boils down to RPC. However, the Remote Procedure Call (RPC) service is actually started.
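The dependency reasoning above can be sketched in a few lines of Python. This is only an illustration: the map below contains just the dependencies named in this post (the real DependOnService values may list more), and the function names are made up for the example.

```python
from typing import Dict, Iterable, List, Set

# Partial dependency map, limited to what this post states.
# Assumption: the real DependOnService values may contain more entries.
DEPENDS_ON: Dict[str, List[str]] = {
    "Cluster service": ["RpcSs", "EventSystem", "NetMan"],
    "File Replication": ["RpcSs", "EventSystem"],
    "System Event Notification": ["RpcSs", "EventSystem"],
    "EventSystem": ["RpcSs"],  # Com+ Event System
    "NetMan": ["RpcSs"],       # Network Connections
}


def common_dependencies(deps: Dict[str, List[str]],
                        services: Iterable[str]) -> Set[str]:
    """Return the dependencies shared by every service in `services`."""
    sets = [set(deps.get(name, [])) for name in services]
    return set.intersection(*sets) if sets else set()


def blocking_services(deps: Dict[str, List[str]],
                      service: str,
                      stuck: Set[str]) -> Set[str]:
    """Return the stuck services that `service` depends on,
    directly or through another dependency."""
    seen: Set[str] = set()
    stack = list(deps.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        stack.extend(deps.get(dep, []))
    return seen & stuck


if __name__ == "__main__":
    failed = ["Cluster service", "File Replication", "System Event Notification"]
    stuck_in_starting = {"EventSystem", "NetMan"}
    print(common_dependencies(DEPENDS_ON, failed))
    for name in failed:
        print(name, "is blocked by", blocking_services(DEPENDS_ON, name, stuck_in_starting))
```

Because RpcSs itself is running, it never shows up as a blocker here, which is why the investigation has to look at why its dependents cannot finish starting.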
If you do a search in the knowledge base on these errors, you are likely to come across this article:
909444 Systems that have changed the default Access Control List permissions on the %windir%\registration directory may experience various problems after you install the Microsoft Security Bulletin MS05-051 for COM+ and MS DTC
This discusses changes made by a hotfix that would cause these problems. The fix is to correct NTFS permissions on the %SystemRoot%\Registration directory. However the permissions here are the same as in the article.
You may also come across this one:
916254 COM+-related events may be logged in Event Viewer when you install Windows XP Service Pack 2 and join the computer to a domain
Most would come across this second article and instantly dismiss it since it says "Windows XP Service Pack 2." However, we have a lot of the same symptoms, and since XP SP2 and Server 2003 SP1 include a lot of the same security changes, it warrants further investigation. One of the security changes in SP1 for Windows Server 2003 was to change the logon account used for RPC. RPC used to log on as Local System and now uses an account with fewer privileges: Network Service.
The article states that this issue occurs if the SERVICE account is missing from the policy setting "Impersonate a client after authentication."
We can see if SERVICE is missing from this policy by performing the following steps:
1. Open up Local Security Policy in order to see what the effective settings are:
Start, Run, secpol.msc
2. Expand Local Policies, User Rights Assignment and then open up "Impersonate a client after authentication"
At minimum the following should be listed: Administrators and SERVICE
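If you would rather script this check than click through the snap-in, here is a minimal Python sketch. It assumes you have already exported the user rights to a text file (for example with secedit /export /cfg <file> /areas USER_RIGHTS) and read it in as text; the export is typically a Unicode file, so it may need to be opened as UTF-16. The function names are hypothetical, the SIDs are the well-known ones for Administrators (S-1-5-32-544) and SERVICE (S-1-5-6), and the export may not always match the effective settings shown in secpol.msc.

```python
from typing import List, Set

# Well-known SIDs for the two accounts this post says must be present.
REQUIRED = {
    "S-1-5-32-544": "Administrators",
    "S-1-5-6": "SERVICE",
}


def impersonate_holders(inf_text: str) -> Set[str]:
    """Return the entries on the SeImpersonatePrivilege line, or an empty set.

    Entries are SIDs (written with a leading '*') or plain account names.
    """
    for line in inf_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip().lower() == "seimpersonateprivilege":
            return {item.strip().lstrip("*")
                    for item in value.split(",") if item.strip()}
    return set()


def missing_required(inf_text: str) -> List[str]:
    """Return the required accounts that do not hold the privilege."""
    holders = {entry.upper() for entry in impersonate_holders(inf_text)}
    return sorted(
        label for sid, label in REQUIRED.items()
        if sid not in holders and label.upper() not in holders
    )


if __name__ == "__main__":
    # Hypothetical export where only Administrators and one domain account remain.
    sample = (
        "[Privilege Rights]\n"
        "SeImpersonatePrivilege = *S-1-5-32-544,*S-1-5-21-1-2-3-1104\n"
    )
    print(missing_required(sample))
```

An empty result means both accounts are listed. Anything else names the account that needs to be added back to the policy.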
The problem that I have seen recently happens when someone decides to change the "Impersonate a client after authentication" user right in group policy. Typically how it goes is they decide to lock down their servers and give only specific accounts certain privileges. However, after incorrectly removing the SERVICE account from this privilege, the server loses all network connectivity. Fortunately, this problem doesn't show up until after a reboot. (You have an opportunity to identify that the problem exists before causing a major outage of all servers in a large OU.)
The fix is simple for the servers that haven't been restarted:
1. Correct the policy and then force group policy to be reapplied. (gpupdate /force)
(To correct the policy: just add SERVICE and Administrators to this policy setting in addition to the other ones defined)
If you have already rebooted the servers after applying the incorrect policy settings, simply changing the policy back will not correct them, since they have already lost network access and cannot pick up the corrected policy (unless the policy change was made locally to begin with). For those servers:
1. Export the following registry key:
2. In the services snap-in: Change Remote Procedure Call (RPC) to start up with the Local System account instead of Network Service, and then reboot
3. At this point the majority of the services should be started and we should now have network access. Ensure that the offending group policy has been corrected with the proper accounts, force group policy to apply (gpupdate /force), and then reboot.
4. Change the logon account for Remote Procedure Call (RPC) service back to Network Service by importing the reg file that you exported in step one, and then reboot. Alternatively: navigate to the following reg key and then reboot
Change the ObjectName value from LocalSystem to: NT Authority\NetworkService
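Since step 4 is easy to forget once the server is back on the network, a small check can help. The following Python sketch reads the text of an exported .reg file for the service key and reports which logon account it records. It is only an illustration: the function names are hypothetical, and it assumes the file has already been decoded to text (regedit exports are often UTF-16).

```python
import re
from typing import Optional

# Matches a line such as:  "ObjectName"="NT AUTHORITY\\NetworkService"
_OBJECT_NAME = re.compile(r'^"ObjectName"="(.*)"\s*$', re.MULTILINE)


def object_name(reg_text: str) -> Optional[str]:
    """Return the ObjectName value from exported .reg text, or None if absent."""
    match = _OBJECT_NAME.search(reg_text)
    if match is None:
        return None
    # .reg files escape backslashes and quotes with a leading backslash.
    return re.sub(r"\\(.)", r"\1", match.group(1))


def runs_as_network_service(reg_text: str) -> bool:
    """True if the exported key records Network Service as the logon account."""
    name = object_name(reg_text)
    return name is not None and name.lower() == r"nt authority\networkservice"


if __name__ == "__main__":
    # Hypothetical fragment of an export taken while the workaround is active.
    workaround = '"ObjectName"="LocalSystem"\r\n'
    print(object_name(workaround), runs_as_network_service(workaround))
```

If this returns False after step 4 and the reboot, the service is still using the temporary Local System workaround.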
For more information regarding this security setting, see the article on TechNet: SeImpersonatePrivilege. I have commented on KB 269229 to reflect the requirement for SERVICE to be included in this user right.
Please let me know if you like the format of this post or if you have any questions.
Until next time.
Justin Turner
This posting is provided "AS IS" with no warranties, and confers no rights.
Mr. Turner, great article with tons of good info. Could you increase some of the text size, though?
I appreciate the link, sir. I will be visiting regularly.
Thank you for visiting. Text size will be increased with the next post. Thanks for the suggestion.
Just a quick note to say that they did update KB 269229 with my comment about requiring the SERVICE account.
Yup, that fixed it,
We have successfully recovered both our domain controllers using this fix.
Apparently someone on the development staff had changed the Impersonate privilege to work only for our service account, and not for the rest.
Development for the lose!
Nice to hear that it helped.
Thanks a million, your article has allowed us to get back up and running after a few hours of downtime. Basically all we did was change the logon for the RPC, back to Local System. So we now have network connectivity, Exchange and most importantly, Remote Desktop Connection, so we don't have to be lying on the floor at the local system in the server room :) Now we can look at sorting the policy settings you mentioned, from the comfort of our own desks.
Robert: You're welcome. I'm glad it helped.
Thank you for an informative article. It saved me from having to do a server rebuild, as I had no idea what had gone wrong until I came across this article on Google.
This happened on SBS2003 in my case - and as I'm the only Administrator, I'm at a loss to understand how the users Administrator and SERVICE were ever removed from "Impersonate a client after authentication" as I don't remember doing it!!!
Thanks again. :-)
Thanks for the feedback Tony.
As far as how it happened: since you are the only Administrator and don't remember doing it, I would check for any rogue services or processes running.
You may want to go to http://safety.live.com and do a virus scan. Or maybe the settings got changed by importing one of the "High Security" templates that often get recommended by some of the security sites?
Thank you for spelling out step by step how to fix my 'sick' Domain Controller. I experienced EXACTLY what you outlined in this article and was able to fix it. Thank you!!!!
Thank you Justin,
This problem had plagued our network for a few months. I had only stumbled upon the temporary fix of setting each machine's RPC service to the Local System account, but it was just a bandaid on a gushing wound.
Thank you, Thank you, Thank you.
Just wanted to let you know you saved our bacon with this article. THANKS!
Justin you're the man, you saved my weekend (after foolishly applying a malformed security policy).
Your article is really helpful and important.
I think the title "Cluster service failure after AD lockdown" is a bit misleading, as it doesn't reflect the real context of the problem. It can actually happen on any domain member (SQL Server services also failed).
Thanks for the great tips. Problem solved for me during an Active Directory upgrade from win2k to win2k3.
I remember that the installation of Norton Antivirus Client Server Suite asked me to change the impersonate setting in the domain group policy years ago.
Thanks a lot
Michele Maran from Italy
We had some issues with 2003 SP1 and the Time Service. After a reinstall of SP1 to fix the issue, we had the COM+ and RPC issue as well. In our case, the "Impersonate..." policy was never defined in the DCP. Just performing the final restart now, and I'd just like to take the chance to back up Meir's comment that indeed - Justin - YOU ARE THE MAN!!! :0)