Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
I was in Barcelona a couple of weeks ago speaking at Microsoft’s TechEd/ITForum conference, where I delivered several sessions (two of them, Advanced Malware Cleaning and Windows Vista Kernel Changes, were the #1 and #2 rated breakout sessions for the week; you can see an interview of me at the conference here). The conference was a huge success and Windows Vista, which I had taken on the road for the first time, performed great. However, as I was running through some demos before one of my sessions, I noticed that the file open dialog, which is common to all Windows applications, would often take between 5 and 15 seconds to appear.
I didn’t have time to investigate before my talk, so the delays caused me consternation when they showed up during my Windows Vista Kernel Changes session immediately afterward. The behavior felt uncannily like the one I wrote up a few blog posts ago in The Case of the Process Startup Delays. In that case, Windows Defender’s Remote Procedure Call (RPC) communications during process startup tried to contact a domain controller, which resulted in hangs when the system was disconnected from its domain. I mumbled excuses on behalf of Windows Vista and tried to distract the audience by explaining the subsequent demonstrations.
It wasn’t until the plane ride home that I got a chance to look into it. I followed steps similar to the ones I had followed when I explored the Windows Defender hangs. I launched Notepad from within Debugging Tools for Windows’ Windbg tool, typed Ctrl+O to open the File Open dialog, and, when the hang occurred, broke in and looked at the stack of Notepad’s main thread:
If you haven’t seen a stack before, it’s a history from most recent to least of nested functions called by a thread. You read it from bottom to top, so the stack shows that Notepad had loaded Browseui.Dll and called its CAddressBand::SetNavigationState function. That function called CBreadcrumbBar::SetNavigationState, which called CBreadcrumbBar::SetIDList, and so on.
A look at the function names on the stack immediately told me what was happening: when you access the Open dialog the first time within an application, it navigates to your documents folder. On Windows Vista my folder is C:\Users\Markruss\Documents, but the shell wants to make the path in the dialog’s new “bread crumb” bar pretty by displaying it as “Mark Russinovich\Documents”, and so it calls GetUserNameEx to look up my account’s display name as it’s stored in my User object in Active Directory. I confirmed my theory by verifying that the first parameter SHGetUserDisplayName passes to GetUserNameEx, which is interpreted as an EXTENDED_NAME_FORMAT enumeration value, is 3: NameDisplay.
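For readers who want to decode that parameter themselves, here is a small Python sketch of the EXTENDED_NAME_FORMAT values as I understand them from the Windows SDK headers. The class and helper names are mine, chosen for illustration, and the snippet only maps numbers to names; it does not call any Windows API.

```python
from enum import IntEnum


class ExtendedNameFormat(IntEnum):
    """Values of the Win32 EXTENDED_NAME_FORMAT enumeration (secext.h)."""
    NameUnknown = 0
    NameFullyQualifiedDN = 1
    NameSamCompatible = 2
    NameDisplay = 3
    NameUniqueId = 6
    NameCanonical = 7
    NameUserPrincipal = 8
    NameCanonicalEx = 9
    NameServicePrincipal = 10
    NameDnsDomain = 12
    NameGivenName = 13
    NameSurname = 14


def describe_name_format(raw_value):
    """Translate a raw first argument of GetUserNameEx into its symbolic name."""
    return ExtendedNameFormat(raw_value).name
```

Passing the value 3 seen in the debugger to describe_name_format yields the NameDisplay member, which matches the display-name lookup described above.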
I set a breakpoint on the call’s return and hit it after the delay completed. GetUserNameEx returned the ERROR_NO_SUCH_DOMAIN error code, and stepping through SHGetUserDisplayName revealed that it falls back to calling GetUserName. Instead of looking up the user’s display name, that function just obtains the Security Identifier (SID) of the user from the process token (the kernel data structure that defines the owner of a process) and calls LookupAccountSid to translate the SID to its account name, which in my case is simply “markruss”. Thus, the dialog that appeared looked like this:
As opposed to this, which is what I saw when I got back to the office and connected to the corporate network:
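Setting the screenshots aside, the fallback behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two callables stand in for GetUserNameEx(NameDisplay) and GetUserName, and NameLookupError, display_name_with_fallback and offline_display_lookup are hypothetical names of mine, not shell functions. The only real constant is the Win32 error code.

```python
ERROR_NO_SUCH_DOMAIN = 1355  # Win32 error code, from winerror.h


class NameLookupError(Exception):
    """Hypothetical stand-in for a failed Win32 name lookup."""

    def __init__(self, code):
        super().__init__(f"name lookup failed with Win32 error {code}")
        self.code = code


def display_name_with_fallback(get_display_name, get_account_name):
    """Prefer the directory display name; fall back to the account name."""
    try:
        # Stands in for GetUserNameEx(NameDisplay), which may need a domain controller.
        return get_display_name()
    except NameLookupError:
        # Stands in for GetUserName, which is answered locally from the token's SID.
        return get_account_name()


def offline_display_lookup():
    """Simulates a domain-joined machine that cannot reach its domain."""
    raise NameLookupError(ERROR_NO_SUCH_DOMAIN)
```

With offline_display_lookup as the first callable the result is the plain account name, and with a lookup that succeeds the result is the friendly display name. The sketch leaves out the important part, of course: in the real system the failing call takes seconds to give up.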
I had solved the case, but was curious to know where exactly the delay was taking place and so continued by researching what was happening on the other end of the Secur32!CallSPM call that’s on top of the stack listing. I knew that the Local Security Authority (LSASS) process is responsible for authentication, including interactions with domain controllers and account name translations, so I attached Windbg to the Lsass.exe process (make sure that you detach the debugger from LSASS with the “qd” command before exiting, otherwise LSASS will terminate and the system will begin a 30-second shutdown). I figured that Secur32.Dll acts as both a client and a server and confirmed that it was loaded into LSASS, but I needed to determine the server-side function that corresponds to Secur32!SecpGetUserName. I did so by brute force: I dumped the functions implemented by Secur32.Dll and looked for ones with “name” in them:
I set breakpoints on several of them and when I reproduced the delay I hit the one on SecpGetUserName and stepped through it to eventually get to this stack:
The DsGetDcName function is documented as returning the name of a domain controller in the specified domain. SecpTranslateName obviously needs to find a domain controller to which to send the account display name query. I traced further, and discovered that LSASS caches the result of the lookup for 45 seconds, which explained why I didn’t see the delay if I ran a different application and accessed the File Open dialog immediately after getting a delay. Then I hit a temporary dead-end when Netapi32!DsrGetDcNameEx2 executed an RPC request.
Again, figuring that Netapi32 acts like a client and a server, I dumped its symbols and set breakpoints on functions containing “dc”. I let LSASS continue executing and to my surprise hit the exact same function, Netapi32!DsrGetDcNameEx2. I traced into the call deeper and deeper until the thread finally called into the kernel (Ntdll!KiFastSystemCallRet):
I was close to the end of my investigation. The last question I had was what device driver was Netlogon calling to send a browser datagram? I answered this by looking at the first parameter it passed to NlBrowserDeviceIoControl, which I guessed was a handle to a file object. Then I opened Windbg in Local Kernel Debugging mode (note that on Windows Vista you have to boot in debugging mode to do this), which lets you look at live kernel data structures, and dumped the handle’s information. That showed me the device object that was opened, which told me that the driver is Bowser.sys, the “NT Lan Manager Datagram Receiver Driver”:
I thought my investigation was complete, but when I later tried to reproduce the delays I failed. I retraced my footsteps and found that LsapGetUserNameForLogonSession caches the display name for 30 minutes. Further, an account’s display name is cached with cached credentials so you won’t experience the delays for the first 30 minutes after logging in or disconnecting from the corporate network. I confirmed that by waiting 30 minutes and reproducing the hangs.
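To see why a time-limited cache makes a delay like this hard to reproduce, consider the following minimal Python sketch. It is not how LSASS implements its cache, which I only observed from the outside; TtlCache and its injectable clock parameter are hypothetical, chosen so the expiry can be demonstrated without actually waiting 30 minutes.

```python
import time


class TtlCache:
    """Tiny time-to-live cache: a value is reused until its entry expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be simulated
        self._entries = {}           # key -> (expiry_time, value)

    def get(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]          # still fresh: no slow lookup happens
        value = compute()            # the potentially slow lookup
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 30-minute lifetime, every request inside the window is answered from the cache, so the slow path, and with it the delay, only shows up again once the entry has expired.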
My investigation had come to a close. I had determined that Windows Vista’s File Open dialog tries to look up a user’s display name for the “bread crumb” bar when showing the documents folder and in the process tries to locate a domain controller by sending a Lan Manager datagram via the Bowser.sys device driver. I also knew that there’s no workaround for the delayed dialogs and that anyone who has a domain-joined system that’s not connected to their domain will experience the same delays - at least until Windows Vista Service Pack 1.
Does this delay also occur if your network adapter's media connection is 'disconnected' or only if you're connected to some physical network but without a network path to your domain controller?
Great detective work as always.
It is however a shame no one bothered to disconnect from their domain and test this before RTM. Millions of users will be quite unhappy having to wait for something to open (for longer than usual) on their brand new notebooks.
...not to mention the DevStudio 2005 help viewer which invalidates the entire sidebar every time you click on anything in it. And the CLR.
They've gotten badly off track.
Things will be dismal for a while until competent competitors appear (let us pray that they do...).
IMO, this is part of a systemic design error in the shell that has existed since at least Windows 95: the shell is far too synchronous.
The shell should never be making a blocking network request in the GUI thread. This is only the latest example of the shell being unresponsive because it is blocking on some event that, predictably, can and will take a long time to complete or timeout.
The problem is so big that the kernel team has had to implement a special syscall to cancel a synchronous IO operation in another thread. This is at best a work-around: the correct solution would've been to have the GUI thread not actually block on these events, either directly via asynchronous IO, or with an army of worker threads to do the blocking (with async notification back to the GUI thread).
The last thing that should be happening is for the GUI thread itself to block on a superficial username lookup, network computer enumeration, domain controller lookup, CD-ROM volume names, or any other kind of lengthy IO. This is, unfortunately, the current design of the shell.
CUsersFilesFolder::GetDisplayNameOf shouldn't be calling GetUserNameExW in the GUI thread (unless it somehow knows the answer has been cached and won't block). Instead, it should display the most correct answer it can retrieve immediately, and have a worker thread block on GetUserNameExW. When GetUserNameExW returns, it can inform the GUI thread that a better display name has become available and the GUI thread can update accordingly. I could see the GUI thread waiting for a response for a short time from the worker (say 200ms) to help avoid flicker. This strategy should be applied to all shell display elements that might block for nontrivial amounts of time.
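A rough Python sketch of that strategy, just to make the idea concrete. Everything here is hypothetical: resolve_display_name, its parameters and the callback are illustrative names, and slow_lookup merely stands in for a blocking call such as GetUserNameExW. A real shell would also have to marshal the update back onto the GUI thread instead of running the callback on the worker.

```python
import concurrent.futures


def resolve_display_name(fast_name, slow_lookup, on_update, executor,
                         grace_seconds=0.2):
    """Return a name to show right away; deliver a better one later if found.

    fast_name     -- answer that is available immediately (e.g. account name)
    slow_lookup   -- callable that may block for a long time
    on_update     -- called with the better name if it arrives after the grace period
    grace_seconds -- how long the caller is willing to wait to avoid flicker
    """
    future = executor.submit(slow_lookup)     # blocking work goes to a worker
    try:
        # Short wait: if the answer is cached or fast, show it directly.
        return future.result(timeout=grace_seconds)
    except concurrent.futures.TimeoutError:
        pass                                  # too slow: show the fast name for now
    except Exception:
        return fast_name                      # lookup failed quickly: keep the fast name

    def _notify(done):
        # Runs when the worker finishes (on the worker thread in this sketch).
        if done.exception() is None:
            on_update(done.result())

    future.add_done_callback(_notify)
    return fast_name
```

The caller never waits longer than the grace period, and a failed lookup simply leaves the quick answer on screen.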
I never use the network neighborhood because it WILL block for 10+ seconds, leaving the entire window unresponsive.
I know that Win95 didn't support async IO and couldn't afford to create tons of threads, but they still could've used two (one for GUI, one for background). Besides, I thought Vista was supposed to fix this kind of thing.
Foolhardy: The problem is not just the Windows shell, it is the design of the Win32 API. Loads upon loads of functions that can and will block for long periods of time are available in synchronous versions only, for "ease of use". They're "easy to use" alright, but the result is a crappy experience for the user.
> Does this delay also occur if your network
> adapter's media connection is 'disconnected'
Doesn't seem to, I just tried it. No delay.
CypherBit: "It is however a shame no one bothered to disconnect from their domain and test this before RTM."
Well, bugs happen, and the fact that it caches the information for half an hour makes it even harder to repro.
Even if a fix is written today, it will be sitting on a disk drive somewhere for the next 6 to 12 months until Vista SP1. With guys like Mark on staff, Microsoft can write bugfixes, but their process prevents them from shipping bugfixes in a reasonable time frame. You see, it has to be tested with the Urdu distribution, and we have to ensure no bad interactions with the Zune player, and ...
Dave: Of course bugs happen, but this one does seem a bit obvious. Not the fact what's wrong, but that something is wrong. I'd imagine a lot of testers were disconnected from their DC for quite some time, they might have even traveled and it just surprises me no one noticed this. That's all.
Would it be possible to simply hit a registry value to increase that cache time from 30 minutes to something longer?
Or maybe there's a way to hit a registry value to change this behavior so that populating the breadcrumb bar will skip over the GetUserDisplayName, and substitute the regular UserName (culled from the file-system path) - for example, if Vista is in Workgroup Mode, it's not going to even try to look up the UserDisplayName in Active Directory. So maybe there's a regkey somewhere that will "fool" Vista into thinking it's in Workgroup mode - for times when the box is not connected to the network.
Quite often, Windows developers will embed keys like this, as little undocumented hacks to help them test certain scenarios, and for one reason or another (usually at the behest of a Marketing dweeb) that key remains undocumented or unexposed.
Mark has done a fine job of exposing the behavior of the system, but one does not always have to live with that as the default.
(Of course there are trade-offs - I think both of these possible workarounds would likely be security risks.)
Foolhardy: AMEN. This has frustrated me for years.
It feels like MS are forever extending the same tired 10-plus-year-old design.
I'm sure much of the code has improved beyond recognition, but still it retains this awful clunkiness.
A serious revamp is overdue.
Sean McLeod hit the nail on its head.
A domain-related API should fail quickly when there is no visible domain connection. So Mark's investigation actually gives a strong alibi to the shell, and the culprit is some underlying layer of smartness in the base network that should detect connection to the domain. It may be that this layer or feature does not even exist yet, or hasn't made it to RTM.
Instead, there is a primitive try-and-cache logic.
The computer obviously was connected in some way (to the public internet or whatever) - so the media connection state by itself isn't very useful.
Very interesting post, but on my machine (Vista RTM) I am seeing a delay but it's < 3 seconds on first invocation and no delay on subsequent invocations. What is your Windows Experience Index? Mine is 3.0 for my laptop. Are you running in a VPC by any chance?
Augh! Synchronous shell again! I never cease to be irritated how this allegedly multi-tasking OS is frozen by, say inserting a CDROM. It's rubbish. And monumentally lazy on the part of the planners. Wasn't Vista supposed to be a drains-up rework of XP?
It's reassuring to see you posting this kind of work while employed by Microsoft. I imagine I'm not the only one who wondered if you would no longer be able to post things that are embarrassing to your employer.
It's interesting to see that taking a domain member workstation away from the domain still has problems and delays. I suppose this is a demonstration of the evolution of the operating system. Had it been designed for such it would no doubt have a storage cache for all of the information it might need while offline so that these calls don't waste time.
As mentioned above, it's kind of surprising that there isn't a fast bypass to these checks when there is no attached network. That would make sense, but I suppose that would simply be another hack to get around previously unplanned usage.
Thanks again Mark. Always a good read.