Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
Update: The Active Directory team has released useful guides for troubleshooting slow logon issues:
Emails containing troubleshooting cases keep arriving in my inbox. I’ve received many cases that start with a seemingly unsolvable problem and end a few steps later with a solution or - often just as useful - a workaround. I’ve amassed several hundred such cases that I’ve captured in over 400 PowerPoint slides, giving me great material from which to draw for my blog and the Case of the Unexplained talk series I’ve delivered at a number of major industry conferences
I’m always looking for fresh cases, use of obscure tool features, and unique troubleshooting techniques, so please keep them coming. This time, I’m sharing a fascinating case that highlights two useful techniques: comparing Sysinternals Process Monitor logs from working and problematic systems, and using Sysinternals PsExec to capture activity during a logon.
The case begins when a systems administrator at a large company got multiple end-user complaints that logon was taking over three minutes. The users didn’t encounter any problems once logged on, but the delays were understandably frustrating. Many other users running with the same software configuration weren’t experiencing issues, however. Looking for commonalities, the administrator queried the network configuration database and, sure enough, saw that all the systems with complaints were Dell Precision 670 workstations. He thought he had a major clue until he looked he saw that the systems running without issue included seemingly identical 670 workstations.
Looking for clues more directly, his next step was to try to analyze the logon process of the delayed systems. He used PsExec to run Process Explorer in the Local System account so that it would survive a logoff and be active at the next logon. Because the systems were running Windows XP, the command-line he used was the following (see the end of the post for how to do this on Windows Vista and higher):
psexec –sid c:\sysint\procexp.exe
The “-s” directs PsExec to launch the process in the Local System account, “–i” to connect the process with the interactive desktop so that its windows are visible, and “-d” to return immediately instead of waiting for the process to terminate. Note that if you have Fast User Switching enabled and you are not logged into session 0, do not log out, but instead switch users, login to the problematic account, and then switch back to the session from which you started PsExec.
At the subsequent logon, he noticed that Lisa_client_7.0.0.0.exe, the company’s own system inventory line-of-business (LoB) application, consumed CPU for a short time, went idle for three minutes, then exited, after which the logon process would continue as normal:
David Solomon coined a phrase back before Process Monitor replaced Filemon and Regmon that still applies when updated: “When in doubt, run Process Monitor!” (I follow this advice religiously, even having my daughter run Process Monitor when she comes to me with a homework question). This case is a great example of that philosophy put into practice because it seems unlikely on the surface that Process Monitor would reveal the cause for a process hang, but the administrator turned to the tool nonetheless.
After launching Process Monitor with PsExec and capturing a logon trace, he scrolled to the beginning of the captured data and started his analysis. Because of what he saw in Process Explorer, the Lisa_client process was the obvious suspect, so he right-clicked on its process name in one of the trace lines and selected the Include quick-filter menu item to remove from the display entries related to activity from other processes:
When troubleshooting a hang with Process Monitor, you should first see if there are any gaps in operation time stamps that match the hang duration. You can look for lengthy operations by adding the Duration column to the display and then making sure to filter out operations that commonly don’t immediately complete, like directory change notifications. That can be useful when you don’t see a significant time gap between operations because the process has multiple threads, some of which continue to operate while the one causing the hang is dormant.
To his pleasant surprise, he soon found an event that not only preceded a gap of exactly three minutes, but that had an unusual result code, IO DEVICE ERROR:
It appeared that the Lisa_client process performed a SCSI pass-through command to the disk hosting the C: volume that timed-out after three minutes with a hardware error. Wondering what the result of the command was on one of the 670’s that logged on promptly, he captured a trace from one and saw that the corresponding operation took less than a millisecond and was successful:
The evidence clearly pointed at a hardware issue with the disks installed on a subset of the 670’s, so he gathered disk type data from all the 670’s, correlated them with the reports of slow logons, and found that all of the slow systems had Seagate disks and the others had Fujitsu disks.
His company was obviously not going to replace disks just to avoid an issue being caused by its own LoB application, so he had to figure out a workaround. He notified the Lisa_client development team of the issue, who reported that they could remove the command without loss of functionality, but that it would take at least several days for the update to go through their internal release process. Having a few days where system information wouldn’t be collected for a subset of systems was less important than end-user productivity, so in the meantime he wrote a WMI logon script to query the system disk and launch Lisa_client only if it wasn’t a Seagate model.
Without Process Monitor’s help he would have probably determined that the disks were the key hardware difference, but it’s not clear he would have discovered the root cause and been able to work around it rather than resort to replacing disks. This is yet another case solved with the help of Process Monitor and insightful detective work.
In closing, I mentioned that I would provide steps for configuring an application to survive logoff and logon on Windows Vista, Windows Server 2008 and higher. The PsExec command I supplied for Windows XP won’t work on newer operating systems because Windows Vista introduced Session 0 Isolation, requiring a couple of extra commands to make the launched application accessible after a logon. First, start the utility in session 0 with this PsExec command in an elevated command prompt:
psexec –sd –i 0 c:\sysint\procmon.exe
You’ll see a window titled “Interactive services dialog detection” flash in the taskbar, indicating that a process is running with a window on the hidden session 0 desktop. Click on the taskbar window to restore the notification dialog and then on the Show Me the Message button to switch to that desktop:
The utility you launched will be visible there and you can configure it with desired settings (it’s running in the Local System account so won’t have your own account’s defaults). When done, click on the Return button to get back to the main desktop. You can now logoff and log back on to reproduce the problem you’re investigating. After logging on again, execute the following commands in an elevated command prompt to cause the doorway to the session 0 desktop to reappear:
net stop ui0detect net start ui0detect
Go back to the session 0 desktop to look at the captured information and close the tool.
One last thing I want to leave you with is a reminder that I’ve documented many other troubleshooting cases in this blog and you can find them in the blog index here. You can also watch recordings of my Case of the Unexplained sessions from TechEd here and be sure to come to TechEd US this June in New Orleans, where I’ll be delivering it again with all new cases.
I am a little confused. The problem appears to be related to a domain based logon issue, although this is not stipulated.
Perhaps I have missed something, but if a Windows XP system is "joined" to a windows domain, the ability for Fast User Switching is disabled and not accessible.
Can this be enabled, or is this procedure only applicable to systems configured in a workgroup and not a domain?
Otherwise an excellent article and application of these excellent utilities.
OK so this is good for capturing info when logging ON but I've currently got a user with a problem logging OFF - each time they try to log off or shut down the system gets as far as the "Logging Off" stage and never gets any further. It doesnt crash or freeze, just sits there on Logging Off forever (Windows XP, SP3). It even happens in Safe Mode!
So I'm thinking I can use this trick of running Process Monitor under the Local System account and have it log activity during the log off, to try to get some more information. However I have just tested this on a virtual XP workstation and simulated the problem by just stopping the VM when it gets to the Logging Off stage (as we have to just power the laptop off and reboot it once it has got stuck on Logging Off) but when I boot the VM back up to look at the Process Monitor log file I cant open the file because it says "The file xxxxx.PML was not closed cleanly during capture and is corrupt". Is there any way I can view the captured information or any better way of doing this? One alternative I am considering is using Process Explorer's 'Replace Task Manager' option and then trying to start that when the machine is in the Logging Off state (as I can still start Task Manager at that point).
PS sorry for cluttering up your blog comments with a request for help, it just seemed very related to this topic. I'll post the problem on the Sysinternals forum if people think thats a better idea (already posted it on technet forums a few days ago but not got much help yet)
Thanks
Chris
For anyone that's interested, I resolved our logging off problem (mentioned in the comment above)... by uninstalling .NET Framework 1.0!
I fired up Process Explorer when the machine was stuck in the Logging Off state and that showed me that the winlogon.exe process was tagged as a .NET process (highlighted in yellow) so I thought I would see what happened if I took .NET off completely and voila, the laptop shut down without a problem as soon as I took it off. Luckily the user does not need .NET 1.0 any more so problem solved.
For the folks with slow initial logins, I have two possibly-silly questions: Are Offline Files enabled? And is CSCdll.dll (offline network agent) doing something anyway?
I've seen situations where CSCdll.dll was going into what appeared to be an infinite loop in winlogon.exe even when offline files was disabled. This was flooring a single core of the processor even after logins finished. On multi-core systems it was surprisingly easy to miss, but on single-core you'd get the responsiveness you'd expect from a processor at 100% utilization.
Thanks for interesting article! Seems relevant to my problem of random (but quite often) slow logons between 3 and 3,5 minutes on XP SP3 computers in a domain (No FUS).
However, like Chris and others I don't understand how to make the monitoring tools to survive logoff!
So please tell us exactly; how do we open the Process Explorer / Process Monitor AFTER logoff/new logon with the information from the logogg /new logon process still intact???
A complete step-by-step list may do the trick for the less intelligent among us :o)
Does anybody have an answer for the slow shut down and slow logging off that Chris asked a few messages above?
Process Monitor is killed before my computer starts to hang. It usually starts to hang at "windows is shutting down" and "closing network connections".
Any suggestions??
Oh Mark Russinovich, where art thou..?
Your followers still needs some answer, as you can see.. :o)
Wondering if anyone can provide some insight and recommendations for troubleshooting slow logon experienced on first logon of the day with XP SP2 and XP SP3 laptops.
If the laptops goes offsite and connects via Juniper VPN to the domain, then upon first logon of the day it takes 3-4-5 minutes. If the laptop does not go offsite and stays connected to the domain the first logon of the day is around 2 mintes. I have attributed 30 seconds of that 2 minutes to the fact we have Home folders for each user and when disabled the logon time goes to 1 1/2 minutes. I can live with that portion of the delay.
Wondering if anyone else has experienced this similar issue. Trying to use Procmon, Netmon and debug logs but not tracking down exactly what is happening. Getting frustrated to say the least, this has been going on for a very long time and I haven't been able to pinpoint the root cause. Have been able to identify machine specific items that contribute to 5-6-7 minute delayed logon instances but haven't located the silver bullet affecting all laptop users throughout the domain when they roam offsite and connect through VPN...and I truly believe there is one hiding out their somewhere.
Any recommendations or resolutions with similar issue would be extremely appreciated.
Thank you,
"Mr. Slow Logon" (literally)
Mr. Slow Logon, have you found any conclusive proof that Juniper is the culprit? I'm at the same point as you at the moment and it's frustrating. I've zero'd in on Juniper as the cause of the symptoms also.
Has anyone shared any further information with you offline that you'd be willing to share? Specifically, how to resolve the issue while still using Juniper as a VPN client?
I REALLY appreciate ANY information you may have...
Thanks!
@Juniper's fault?,
Good morning.
We have not found any conclusive proof that Juniper is the culprit but I am willing to consider and focus on troubleshooting anything at this point. Just to be clear, are you saying you are experiencing the same exact issue:
1. Users going off the domain
2. Users then using Juniper to connect remotely
3. User disconnects from Juniper
4. User coming back to a physical domain connection
5. User experiencing extremely slow first logon to the domain
6. Subsequent logons during the same day show a significant improvement
Unfortunately, no one has shared any additional information regarding the specific issue I have posted(ie. described above), you are the first to respond with the same issue.
Can you provide some additional detail as to the exact symptoms you are experiencing and maybe together we can bring this to a resolve or at a minimum zero in on the culprit and determine if it is related to Juniper!
Thank you for the reply...this is now making me think of a whole other subset of potential causes.
Mark, You are a superhero of the IT field. Thank you for answering the Bat-Signals.
For those having trouble running procexp.exe at login...this worked for me
Windows 2003 AD Env./WinXP Clients
Requires two networked comptuers:
1. Install Process Explorer on remote computer (computer experiencing slow login). Install Path=c:\sysint
2. Remote computer status, idle @ CTRL+ALT+DEL screen.
3. From another computer on the network...Run:
psexec \\remotecomputer -sid c:\sysint\procexp.exe
4. Log in to remote computer with any user.
Hope this helps! Cheers, Fearless
@ Mr. Slow Logon,
In our case, the issue begins to occur as soon as the Juniper client is installed. After installing the client, after a few reboots the slowness begins.
It seems to be the first logon, so long as the machine isn't rebooted. From what I can tell, multiple important services don't start as quickly as they used to, such as WMI and certain network related services. (I don't have a complete list- the WMI service alone was enough to cause alarm in our environment.)
Strangely enough, if I uninstall Juniper, make sure the services are all removed, and clean out the registry- the issue doesn't go away. It doesn't START until I install Juniper, but it takes a re-image to make it go away once it starts.
@Juniper's Fault
Sounds like we may be experiencing different issues. As long as our laptops never hit the VPN, the logon times,while not lighting fast, are not significant. Once they connect VPN and come back to the physical domain connection, the first logon of the day is delayed. Subsequent logons during the same day are normal.
To Mark and the rest of the group:
I ran ProcMon and Enabled Boot Logging on a laptop which had been off the physical network for approx. 10 days. During those 10 days, it connected via VPN. The ProcMon capture is approx. 900MB. I have applied a filter as suggested by Stephan on his Friday, January 22, 2010 12:38 AM posting. This takes the capture down to 7.6 MB. I have reviewed the capture over and over but I am not finding where the delay is residing. I was wondering if there is anywhere I can upload the filtered capture to be reviewed or if anyone can provide any suggestions of how to begin eliminating certains events.
My experience:
XP SP3.
psexec –sid SysIntApp.exe
procexp.exe work after logoff
Procmon.exe (1.26, 2.8) closed on logoff
Filemon.exe (6.2) work after logoff
To Mark: Tank you for your job
Disable WEBCLIENT Service first.
Read the following article:
http://support.microsoft.com/kb/832161
Hope it helps,
Lucas.