The Case of the Process Startup Delays

The Case of the Process Startup Delays

  • Comments 76

I’ve been extremely busy here at Microsoft and so haven’t had time to blog until now, but plan on getting back to posting regularly. Before I start with a look at a technical problem I ran into recently, I’m pleased to report that the Sysinternals integration is proceeding smoothly and that Bryce and I will unveil an exciting new tool when the site moves to its new home under Microsoft TechNet in late October.

 

I don’t use my laptop much when I’m not traveling, but I occasionally read email on it in the living room. Like most Windows users, I’m frustrated by occasional unexplained delays when I perform routine tasks like start a program or open a web page. Since joining my laptop to an internal Microsoft domain, I began experiencing regular delays when starting processes. With my Sysinternals tools arsenal in hand, I set out to investigate the root cause, suspecting that recently joining the laptop to the Microsoft domain played a role.

 

I began my research by first noticing that, after a delay of a few seconds starting a new process, processes I started within the following 30 seconds launched instantly. I therefore started Process Explorer, waited for 30 seconds, and then executed Notepad from Explorer’s Run dialog. Notepad didn’t appear in Process Explorer’s process tree during the expected delay, which implied the Explorer thread starting Notepad was experiencing the pause, not Notepad’s startup.

 

A look at the stack of the launching Explorer thread might give me a hint about the cause, but I was too impatient to look at each of Explorer’s over a dozen threads and so attached the Windbg debugger from the Microsoft Debugging Tools for Windows to Process Explorer, launched Notepad with Process Explorer’s Run dialog, and broke into the debugger. I opened the thread list in Windbg by selecting Processes and Threads from the View menu, selected the first one displayed, and then revisited the View menu to open the Call Stack dialog:

 

 

 

Stacks display with the most recently called function at the top, so the ZwWaitForSingleObject frame at the top meant that the thread was waiting on some object to become signaled. The stack frames further up the stack are in the RPCRT4 (Remote Procedure Call Runtime Version 4) DLL and the reference to the OpenLpcPort function told me that the thread was trying to initiate a RPC with another process on the same system.  It looked like the wait might be due to the GetMachineAccountSid call highlighted in the screenshot. Just as for domain user accounts, computers belonging to a domain have accounts and GetMachineAccountSid’s name implies that the function returns the Security Identifier (SID) of the computer’s domain account.

 

I set a breakpoint on the return from the call to GetMachineAccountSid in the OpenLpcPort function and after a short pause consistent with the startup delays, the debugger’s command prompt activated. The x86 calling convention is for function return values to be passed in the EAX register so I examined its value:

 

 

After translating the value to decimal with the “?” command I searched for 1789 in the global error definitions file, WinError.h, of the Platform SDK:

 

 

I scoured MSDN documentation and the Web and found essentially no information on the underlying cause for that error code. However, the term “trusted relationship failure” implies that the domain the computer is connecting to doesn’t trust the domain of the computer. But under the circumstances the error didn’t make sense, because I was disconnected from the network and even if the computer was trying to connect a domain, the only one it would connect is the one it belonged to.

 

On a hunch, though, I opened a command prompt and ran PsGetSid to see what error it would get when trying to look up the computer’s domain SID (a computer’s domain account name is the computer name with a “$” appended to it):

 

 

Sure enough, it experienced the same delay, which must be a network timeout, and got the same error. Then I used remote access to connect to the domain and ran the command again:

 

 

 

Further, after connecting to the domain I no longer experienced the startup delays. I disconnected, but continued to have delay-free process launches, even after 30 seconds. After rebooting and not connecting, though, the delays reoccurred.

 

At that point I decided to investigate the internals of GetMachineAccountSid. The stack trace showed that it calls into the Netlogon DLL, which performs its own RPC to a function called NetrLogonGetTrustRid. I knew that the Netlogon service runs inside of the Local Security Authority Subsystem (LSASS):

 

 

I attached Windbg to LSASS and set a breakpoint on NetrLogonGetTrustRid. After launching a new process I hit the breakpoint and saw that if a particular field in a data structure is NULL, Netlogon tries to connect to a domain controller, but if the connection fails for any reason, it blindly returns error 1789. However, when I connected to the domain the call succeeded and the value in the data structure filled in with the SID of the computer account, which persists even after a disconnect. That explained the change in behavior after I connected to the domain.

 

Turning my attention back to GetMachineAccountSid, I found that it caches the results of an error for 30 seconds before asking NetLogon to attempt to connect to a domain controller again. That explained the 30 second quick-start periods. A look through the code flow in the debugger also revealed that OpenLpcPort queries the computer SID as part of a check to see if it matches the SID passed as one of its parameters. If so, OpenLpcPort changes the SID to the SID of the Local System account before calling NtSecureConnectPort, mapping the domain SID to a local one. NtSecureConnectPort takes a SID as a parameter and will only connect to the specified Local Procedure Call (LPC) port if the port was created by the account that matches the SID.

 

I’d answered a number of questions, but the big one remained: why was an RPC happening during a process launch at all? The initial stack trace only went up as far as the NegotiateTransferSyntax frame, but there were obviously other frames that the symbol engine couldn’t determine. The stack display went further when I had hit the breakpoint I set in OpenLpcPort, though:

 

 

Near the bottom you can see the call to ShellExecCmdLine that the CRunDlg class, which is responsible for the Run dialog implementation, calls. That eventually results in what looks like the execution of shell execute hook extensions, and the one that makes the RPC call is implemented by the MpShHook DLL. I didn’t know off hand what that DLL was, but Process Explorer’s DLL view showed that it’s part of Windows Defender:

 

 

I suspected that the hook is part of Windows Defender’s real-time protection, which the Windows Defender team confirmed. Autoruns reports that Windows Defender registers the shell execute hook:

 

 

The mystery was solved!  Putting it all together:

  1. Explorer’s Run dialog calls ShellExecuteCmdLine
  2. ShellExecuteCmdLine calls out to shell execute hooks
  3. Windows Defender’s hook for real-time protection, MpShHook.Dll, calls RPC to communicate with the Windows Defender service, passing the SID of the service as an argument
  4. The RPC library calls GetMachineAccountSid to see if the SID matches the computer’s domain SID, in which case it would map the SID to the local system account SID
  5. GetMachineAccountSid performs an RPC to the Netlogon service to get the computer account’s SID
  6. If the computer account’s SID hasn’t been obtained already, Netlogon tries to connect to a domain controller
  7. If the domain controller connection fails after a timeout (the delay), Netlogon returns a trust-relationship failure error
  8. The Windows Defender RPC proceeds using the unmapped SID
  9. Windows Defender’s service performs real-time checks and then process launches

 

A little more research led me to conclude that the delay only happens under very specific circumstances where:

 

  • The system is running Windows XP 64-bit for x64 or Windows Server 2003 SP1
  • Windows Defender Beta 2 is active
  • The system is domain joined, but has not connected to the domain in the current boot session

32-bit Windows XP doesn’t perform the SID mapping in OpenLpcPort and Windows Defender doesn’t use a shell execute hook on Windows Vista. The Windows Defender team is looking at workarounds for the next release, but now that I understand the delay I can work around it. 

Leave a Comment
  • Please add 1 and 4 and type the answer here:
  • Post
  • Interesting research Mark! I've always suspected that Windows Defender does a whole lot of extra procedure calls, but I have more problems with our CM software, that does similar actions.

    I think this was one of the most technical blogs I've read so far - and by far the most easy to follow...so keep up the good work.

    Will follow this blog with interest from here on!
  • Thanks for the great post! It got me to do a bit of investigative work on my own. I've seen this on a few of other blogs on Technet too.

    The first think I see when I view this page (from Firefox) is popup with the title "Website Certified by an Unknown Authority". Why would it involve SSL was my first thought, it turns out that all the pictures in the post are linked by https instead of http.

    So how come these images are protected by ssl. I have two theories.

    a) It's so people who are using Internet Explorer and are viewing the site https://blogs.technet.com/markrussinovich/ isn't bothered by the message "The page contains both secure and nonsecure items, do you want to display the nonsecure items?"

    b) Mark was logged in to his blog authoring software through ssl and just copied the link to the image which in that state was an ssl link.

    I'm favoring theory b, I don't think that many people are viewing the site through ssl.

    But the mystery itself, why do I get this popup? The certificate for blogs.technet.com chains to a Root CA called "GTE CyberTrust Global Root".

    I open up the certificates in Firefox and it does indeed have a trusted root certificate for GTE CyberTrust Global Root. Strange indeed, so the chaining process is failing for some reason. The AIA information seems to be in order too.

    Before I fire up Wireshark to see what's happening I do a quick google search, and sure enough Larry beat me to it (http://blogs.msdn.com/larryosterman/archive/2004/06/04/148612.aspx) over two years ago. It turns out that Firefox doesn't follow the OID 1.3.6.1.5.5.7.48.2 when doing certificate chaining.

    At the end of Larry's post he said "Now all someone has to do is to file bugs against Mozilla and OpenSSL to get them to fix their certificate validation logic" I guess nobody did, or they ignored it.

    But I'm hoping Mark can verify if theory a or b was the correct one.
  • For looking up error codes, have you checked out DavidChr's err.exe? It's available on the intranet on that one server that I won't name publicly that has all the tools. It's also available on the internet downloadable from Microsoft, albeit probably in a slightly older version than the one posted internally. I use it about as frequently as I use SysInternals tools. That's *often*. My only gripe is that it doesn't include CLR exception codes yet.
  • I thought, they would be using Linux there at Microsoft ;-) So they use Windows themselves...

    SCNR
  • Good point, well made, Jack!!!

    The RMC delays on my PC are at best, annoying.
  • But why does the hook need to pass the SID anyway?
  • Kudos, Mark with you to track down issues like this maybe Microsoft product quality can only go up :)
    As far as the case of the certificate popup appearing in Firefox goes I saw Patricks comment and followed his trail to Larry's blog entry (http://blogs.msdn.com/larryosterman/archive/2004/06/04/148612.aspx). I went through the comments there and sure enough a bug was indeed filed in Bugzilla (https://bugzilla.mozilla.org/show_bug.cgi?id=245609). However, it seems there is some question regarding the RFC involved (http://www.ietf.org/rfc/rfc2459.txt) and its validity.
  • Mark, so NetrLogonGetTrustRid on the client side (PE in this case) performs RPC to a function of the *same* name in LSASS.exe? If so, is this just a convention used by NETLOGON when doing RPC? Or is this how RPC works in general? (eg. client side stub RPC's to real function of the same name on server side).

    thanks,
    Marc
  • Great work, any opinion on this bug as it pertains to the state of windows in general? What I mean is how do you feel about all this code injection surface area that you guys expose in autoruns?

    It has been bugging me more and more how windows has so many places code can be placed to startup with the system which seems to have led to the spyware epidemic and now the anti-spyware causing thier own problems.

    My point is it seems the solutions are poor bandaids (good autruns fighting bad autoruns causing more running process and recource usage over all), when this stuff should be baked into the design so that 1. There is a consistent well known place for any startup code to be placed (perhaps services and startup folder only). 2. If a program trys to install anything there the user is warned explicity. 3. If the user wants to remove the startup they only need look in 2 places(1 would be ideal) and simply disable or delete startup they don't like.

    Now that definetly only helps part of the mainy issues involved in spyware, but its sorta like buffer overuns, autorunners are probably 90% of the problem, and windows defender is like trying to use some C library or code analyzer counter overruns when you should be making a new platform (.net) that inherintly doesnt have those issues.

    Would nice to hear your perspective, as autoruns really brings it home when looking at all the obsucre places startup code can be placed and you guys made it.
  • Does anyone else think it's absolutely *insane* that launching Notepad should be this complex? Granted, when you pick it apart there may be a seemingly sane reason for each step, but when you step back and look at the big picture, the bottom line is that you can't simply launch Notepad without domains, RPCs and SIDs getting involved! It amazes me that by the time all this irrelevant overhead has taken place the system even remembers to actually launch Notepad at all.

    This is a perfect example of Microsoft's tendency to hypercomplexify everything it creates, and one of the reasons I can't stand them or their products. Unfortunately, I fear this is what will happen to the beautifully straightforward and simply functional Sysinternals tools.
  • Why has the old blog vanished? At least move the posts here...
  • Excellent article, truly informative.  Thanks for the great tools and information.  Your knowledge has helped on countless occasions.
  • Mark now has a blog at http://blogs.technet.com/markrussinovich/ and starts out with a typically...
  • It can be a help to get notified when you are logged on with cached credentials. It's a reminder that something is different about this session when problems present with consistency, but intermittently.

    HKLM\...\Winlogon
    "ReportControllerMissing"=dword:00000001

    HKCU\...\Winlogon
    "ReportDC"=dword:00000001

    See
    http://support.microsoft.com/kb/242536/
  • Mark said:  "The initial stack trace only went up as far as the NegotiateTransferSyntax frame, but there were obviously other frames that the symbol engine couldn’t determine. The stack display went further when I had hit the breakpoint I set in OpenLpcPort"

    On my windbg when the stack frames are all there for the walking (ie, nobody used EBP for their own scratchpad reg) then whacking the windbg stack backtrace More button will walk further w/o having to unwind the nested calls first.  Of course this was 64-bit Windows with which I'm far less familiar (what is that, rbp for amd64 / em64t ? or something completely unintelligible for Itanium ...).
Page 2 of 6 (76 items) 12345»