-
So one of the questions that comes to mind every now and then in technical circles (and outside as well) is "Why doesn't it take so long for my machine to boot?". Just what's going on in there while these friendly, soothing graphics come up on the screen, and I wait and wait for a prompt to login?
Great question. I recently purchased a solid state drive for my laptop and after imaging it with Windows 7 and loading it all up with drivers and whatnot, I had the same question. So I went off an looked to find out what the 'deal' was.
So I went to the Windows Performance Analysis website and downloaded and installed the Windows 7 SDK, which includes the Windows Performance Toolkit (mainly, for this exercise, xbootmgr.exe and xperfview.exe.
(So xbootmgr will tell the kernel to start tracing at boot and tell Windows to restart so it can get on with the trace. So be prepared for the system to reboot you when you type this in and hit enter!)
Anway, after download and installation, I did the following from an elevated command prompt:
xbootmgr -trace boot -traceflags BASE+CSWITCH+DISK_IO_INIT
I did this in a directory where I had room for a couple hundred meg etl trace and it was nice and tidy so I didn't have to hunt for anything.
My system rebooted and as soon as I was presented with a command prompt, I logged in. After the shell came up, I had a window on my screen that basically counted down post boot tracing, for 120 seconds. When I see this I just let it do its thing.
After 120 seconds, it wraps all this data into an etl file named boot_BASE+SWITCH+DISK_IO_INIT.etl in the directory where I ran the command prompt.
After the system is done collecting its data and waiting on the prefetcher info and whatnot, I then go in and do the fun bit, open the ETL file with xperfview.
I immediately am drawn to the wide gap where nothing happens in my services list, from the 22 to 38 second mark. Turns out this is the ambient light sensor for my keyboards backlight. It takes the driver a bit of time to figure out the ambient light where I'm at to make a judgement call on if a backlit keyboard is needed. In Windows 7 there is a handy feature for services called "Automatic (Delayed Start)". I put the service into that state and rebooted and I saved 16 seconds on my boot time. A decent gain I think.
Now, this is a very, very rudimentary explanation of how to review an ETL file, something simple to look for, a beginners example. I highly recommend going further with ETL / WPT, as it is a very insightful glimpse into Windows system performance. To dig further, I've collected some links from a list that is floating around internally...
Performance toolkit (XPERF) log & analysis
The required steps to collect xperf logs on XP / Vista are as follows:
1) Download & Install the toolkit on a Vista/2008/Windows 7 machine.
The latest version of the Windows Performance Toolkit is part of the Windows 7 SDK (which is a huge download). The following blog has the steps to download the ‘bare minimum’ to get the WPT toolkit.
http://blogs.msdn.com/jimmymay/archive/2009/11/24/xperf-install-windows-performance-toolkit-wpt-with-242mb-download-not-2-5gb-windows-7-sdk-part-2.aspx
2) Copy the contents of the “C:\Program Files\Microsoft Windows Performance Toolkit” to a folder on Windows XP (or a USB memory stick).
3) Turn off the “No Execute” or ‘Execute Disable” security option for CPU in the BIOS (or if you cannot find the appropriate BIOS setting, add the following switch to the boot.ini file: noexecute=alwaysoff)
4) Either run xbootmgr from the command line, or use the XPerfUI utility which you can download from our codeplex website: http://xperfui.codeplex.com/
5) Copy the resulting .etl file to the Vista machine to use the xperfview GUI to open & analyze it. If a userenv log is generated under %windir%\debug\usermode, it can also be copied to correlate processes & times.
MSDN documentation link for the Windows Performance Toolkit:
http://msdn.microsoft.com/en-us/library/cc305187.aspx
Windows On/Off Transitions Solutions Guide (Diagrams)
http://www.microsoft.com/whdc/system/pnppwr/powermgmt/OnOffTrans.mspx
On/Off Transition Performance Analysis of Windows Vista (Vulnerabilities)
http://www.microsoft.com/whdc/system/sysperf/On-Off_Transition.mspx
Xperf UI – GUI wrapper for the Xperf command line tool
http://xperfui.codeplex.com/
Also a good blog for more information
http://blogs.msdn.com/pigscanfly/pages/xperf-articles.aspx
Two Minute Drill: Introduction to XPerf
http://blogs.technet.com/askperf/archive/2008/06/27/an-intro-to-xperf.aspx
More notes on xperf:
To show if there are any active loggers
Xperf –loggers
To stop any active loggers
Xperf –stop
To view help on available flags
Xperf –providers i
Xperf –help providers
To trace any process ad hoc including cpu, disk and registry
Xperf –on diageasy+registry
<let the activity happen>
Then stop and merge the wmi / etl data into the log file
Xperf –d mytrace1.etl
To view the traces; (only works on Vista or Svr 2008 or later)
Xperf <logname.etl>
Or
Use xperfview as the GUI
Special thanks to Fatih Colgar and Roger Southgate for comprising the "Performance Toolkit (XPERF) Log & Analysis" links and walk through.
-
http://visualstudiogallery.msdn.microsoft.com/en-us/e8649e35-26b1-4e73-b427-c2886a0705f4
So, check this out. It allows you to collect ETW tracing, dumps, all kinds of stuff. It does not work on Windows XP, but still, a handy little tool nonetheless.
-
In the Vital Signs workshop, we touch upon the tool SPA (Server Performance Advisor). This unsung hero of performance evaluation deserves some love, which is why I'm writing about it over 5 years after its last update was published and made available on the downloads site, here:
http://www.microsoft.com/downloads/details.aspx?FamilyID=61a41d78-e4aa-47b9-901b-cf85da075a73&displaylang=en
So, Clint Huffman, creator of PAL, wrote up this excellent article on how to troubleshoot server performance problems...
So, check it out here:
http://channel9.msdn.com/Wiki/PerformanceWiki/HowToIdentifyBottleneckSPATool/
It's the bomb, and it's free as in beer.
-
I've had several customers complain that the WPT was only available after downloading and installing the Windows 7 SDK (which if you are on Windows XP still is a bit of a problem).
So, I've grabbed the latest release to internet for both x86 and x64 and provided links to them below.
Some introduction to usage and analysis can be found here. Happy troubleshooting!
-
So I'm a PFE now, Premier Field Engineer. It's an interesting gig, sort of like running your own company within Microsoft. I'm doing Platforms now, instead of Exchange. Trying to keep the mind limber and all.
So far I've been doing shadowing of other PFEs as they do things like Active Directory Risk Assessments and what have you.
I'm looking forward to helping our customers proactively instead of being in a constant reactionary state of crit sits and whatnot.
More on this soon.
-
Applies to Exchange 2003, concepts apply to 2007 as well.
I've bumped into a few cases recently where the customer had unexpected transaction log file growth that caused the server to dismount a storage group due to lack of disk space. In this post I'll attempted to explain why this occurs, how to troubleshoot it, etc.
The short of it is transaction log file growth usually occurs because of a repeating transaction. It can be a looping message, a mis-behaving client, or a corrupt message. Looping messages I've seen done by users setting up special things on their Outlook clients. Consider the following example:
A user leaves for the weekend. They are expecting an important email, so they put in a forward rule to forward all email to their mobile phone's email address. They either 1)mis-type that address, or 2)their phone's email box doesn't accept messages above a certain size. In the event of 1), every message sent to the user is going to hit the mail servers of the phone provider and bounce with an invalid address. This NDR will come back and hit the mailbox of the user, where the forward rule will forward the NDR to the phone, which will bounce and come back to the inbox, where it will forward the NDR to the phone...... In the event of 2), any message above the size limit will trigger the loop above (unless the ISP's mail server knows not to append the offending email as an attachment to the NDR).
This is a real world example I've personally run into. Users can and will do all kinds of bizarre things that under the light of day seem obtuse, but in the heat of the moment make sense.
So how do you track this down?
The normal troubleshooting path I take for this type of problem is:
1. Run Exmon. Tell me if a single user is taking something silly like 50% of the servers resources. If you're spooling out transaction logs like it's nobody's business and Exmon shows a user at 50%+ and they are in the same Storage Group as the spooling transaction logs, then chances are you've found your man. If Exmon doesn't point anything out of the ordinary, then proceed to step 2:
2. Go to your Exchange System Manager, drill down to the Storage Group that you're seeing the transaction log growth on. Expand each database and visit the logins area. Add columns for MSG Ops, Folder Ops, Total Ops, and sort by high/low and see if you have one user towering above the rest. Do this for each database. If you've got a single user standing out, again, this is very likely your culprit. Log into their mailbox, see if there is something stuck in the Outbox, or check their active client for any client-side rule that may be at fault. Worse comes to worse, disable the user's mail.
3. User Scott Oseychik's guide on Transaction Log analysis to figure out what the offending message might be:
http://blogs.msdn.com/scottos/archive/2007/07/12/rough-and-tough-guide-to-identifying-patterns-in-ese-transaction-log-files.aspx
This is an excellent guide and needs no further clarification.
4. If this doesn't work out for you at this point, call into support, it could be a problem with a mobile device syncing or an OWA session trying to process a corrupt message (I've seen both scenarios). Only a series of store dumps collected with adplus will tell us that.
I hope this helps in your troubleshooting efforts.
-
Applies to Exchange 2003
Event Type: Error
Event Source: MSExchangeIS
Event Category: General
Event ID: 9646
Description:
Mapi session "/o=<org>/ou=First Administrative
Group/cn=Recipients/cn=<userName>" exceeded the maximum of 32 objects of type "session".
Seen these in your environment? Sometimes they are caused by desktop search engines opening too many MAPI sessions, other times they are because your Exchange server is keeping open connections from the client that the client for whatever reason thinks are closed.
For example, say you have users connecting to Exchange over a poor network connection or VPN. When Outlook connects, it establishes MAPI sessions. If the users drops the VPN connection without closing Outlook first, those connections are going to stay open on the Exchange server for 2 hours. If the user connects again, more connections get added to the Exchange server for that user. See where we're going with this? If you max out, new connections will fail, resulting in an unhappy end user.
So how do you fix this?
Well, one of three ways.
You can try to prevent your clients from doing so many connections by educating your users, making changes in Outlook 2007, correctly configuring Network Accelerators to not keep connections open, etc.
You can tell Windows/Exchange that 2 hours is far too long to keep a session open without activity. To do this you follow the instructions in this document, specifically the TCP KeepAliveTime, set that to 5 minutes.
KB324270 How to harden the TCP/IP stack against denial of service attacks in Windows Server 2003.
http://support.microsoft.com/default.aspx?scid=kb;EN-US;324270
Or finally, and the last resort, you can add additional allowable sessions by following KB842022. Note that this is a last resort, as in many cases you are merely delaying your pain for later. Note the warning: "If you do this, try to determine the minimum value that you can use so that the client program can run without problems. If you raise the limit too high, the client program might affect the performance of the Exchange Server computer."
Event ID 9646 is logged in the application event log of your Exchange Server 2003 computer when a client opens many MAPI sessions.
http://support.microsoft.com/default.aspx?scid=kb;EN-US;842022.
-
Applies to Exchange 2007
So I ran into a case today that was pretty interesting. The symptoms were a fairly generic error in the app log:
Event ID: 4001
Task Category: General
Level: Error
Keywords: Classic
User: N/A
Computer: server name
Description:
A transient failure has occurred. The problem may resolve itself in awhile. The service will retry in 56 seconds. Diagnostic information:
Kerberos test. . . . . . . . . . . : Failed
[FATAL] Cannot lookup package Kerberos.
The error occurred was: (null)
And in the Exchange Management Console, all the databases reported back a status of "Unknown". Also you couldn't run the EXBPA on the server, it came back with a network/registry error:
Error (The network path was not found) opening registry key reg:/servername/HKEY_LOCAL_MACHINE/Software\Microsoft\Windows NT\CurrentVersion, skipping object.
The customer could run the EXBPA against the server remotely, and remotely in the EMC the database status came back Healthy instead of Unknown.
Weird huh?
Turns out that there were corrupt / bad entries for the machine name in the hosts file, and it was causing all three symptoms. #'ing the records and doing an ipconfig /flushdns resolved everything in just a minute or two.
-
Applies to Exchange 2003
I had a case a couple weeks ago I thought I'd write about. What was happening is the Version Store would run out of memory and a 623 error would throw. Version Store buckets allocated would climb from 4 to over 2000 in less than 5 minutes. The store would then rollback its transactions for a bit, recover, run for 10-15 minutes and repeat the whole cycle over.
This is atypical 623 behavior to say the least.
What we ended up doing to fix it was capture an adplus dump, 3 actually, triggered at Version Store buckets allocated crossing 1600. We captured 3 dumps at 1 minute intervals.
The 1st dump caught the problem transaction, the last 2 were both capturing rollbacks, so this was a quick ramp up.
Turns out the problem was being caused by a bad meeting request being processed over and over again. We tried all kinds of ways to delete the message, all of which caused Version Store buckets allocated to climb. A MFCMapi hard delete ended up doing the trick.
-
Dan and I and some other engineers wrote up a blog post you can find here on how to recover from a smashes schema scenario on your Exchange Servers.
It's pretty succinct so I don't have anything to add to it, it's an interesting read though.
-
Applies to: Exchange 2000/2003
This may seem like a basic thing to some people, but for those who don't know, here goes. This can be easily done by running Exmon, available here.
So download Exmon and fire it up on your Exchange Server.
Go to the By Clientmon tab, and in there you'll see a column named "Cached Mode Sessions". If you have something other than 0 in that field, then your user is connecting over Cached mode.
Hope that helps, I've had the question a few times before.
-
Ok, so if you have say, Citrix, or a standard image with Office pre-installed, then someone had to pick an Exchange server to point to for the Outlook profile creation wizard.
So sometimes, in large organizations, teams don't necessarily speak to one another before they make small decisions like which server to point to. The person creating the Office install might pick, say, his home mail server.
So when that mail server, years later, gets decommissioned, this can suddenly cause problems.
How do you fix this?
Simple! Glad you asked.
2 things need to be done.
1. Establish IP connectivity to the old server name. Easy enough, go into DNS and create a new A record for the old/missing Exchange server, with the IP of the server you'd like this task to point to.
2. Go into ADSIEdit, find the computer object for the target server, right click and hit properties. Scroll down to ServicePrincipalName and edit. Add the following type of record:
exchangeRFR/servername
Give that a little time to replicate around and voila, everything goes back to normal.
Why is step 2 necessary? Kerberos security rearing it's ugly head. The target server needs to know it's acting as the old server or it will refuse connections.
Note that this is a possible work around and may cause corrupt MAPI profiles on your clients. The real fix here is to address the install, or clients configured to a server that no longer exists.
-
Applies to Exchange 2000, Exchange 2003, Exchange 2007.
So in Version Store 623 errors, Version Store gets 'clogged', if you will, and will fail to process transactions.
624 errors on the other hand, are caused by a lack of available virtual memory on the server. Sometimes this has no impact and the server corrects itself, but in a memory leak condition, this can be the sign your Exchange server is no longer accepting client connections and is in need of some assistance.
In the particular instance where I have seen this occur, the 624 event comes after a series of errors:
First we throw a MSExchangeDSAccess 2104 event.
Event ID : 2104
Raw Event ID : 2104
Record Nr. : 4802384
Category : None
Source : MSExchangeDSAccess
Type : Error
Generated : 9/7/2008 12:27:27 PM
Written : 9/7/2008 12:27:27 PM
Machine : JAHUMBALABAH
Message : Process STORE.EXE (PID=636). All the DS Servers in domain are not responding.
Shortly thereafter you'll see a MSExchangeDSAccess 2102.
Event ID : 2102
Raw Event ID : 2102
Record Nr. : 4802387
Category : None
Source : MSExchangeDSAccess
Type : Error
Generated : 9/7/2008 12:28:15 PM
Written : 9/7/2008 12:28:15 PM
Machine : JAHUMBALABAH
Message : Process MAD.EXE (PID=2588). All Domain Controller Servers in use are not responding:
JAHUMBALABAH-DC
Then we will see a MSExchangeSA 9152.
Event ID : 9152
Raw Event ID : 9152
Record Nr. : 4802391
Category : None
Source : MSExchangeSA
Type : Error
Generated : 9/7/2008 12:31:15 PM
Written : 9/7/2008 12:31:15 PM
Machine : JAHUMBALABAH
Message : Microsoft Exchange System Attendant reported an error '0x8007000e' in its DS Monitoring thread.
This particular error is an out of memory error. Uh oh.
Then DSAccess has another problem.... a 9154.
Event ID : 9154
Raw Event ID : 9154
Record Nr. : 4802392
Category : None
Source : MSExchangeSA
Type : Error
Generated : 9/7/2008 12:31:20 PM
Written : 9/7/2008 12:31:20 PM
Machine : JAHUMBALABAH
Message : DSACCESS returned an error '0x80004005' on DS notification. Microsoft Exchange System Attendant will re-set DS notification later.
This means a call failed, due to lack of memory...
Then the error you've all been waiting for, a 624 gets thrown by ESE.
Event ID : 624
Raw Event ID : 624
Record Nr. : 4802473
Category : None
Source : ESE
Type : Error
Generated : 9/7/2008 12:32:58 PM
Written : 9/7/2008 12:32:58 PM
Machine : JAHUMBALABAH
Message : Information Store (636) Storage Group 1 (First Storage Group): The version store for this instance (1) cannot grow because it is receiving Out-Of-Memory errors from the OS. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.
Current version store size for this instance: 1Mb
Maximum version store size for this instance: 249Mb
Global memory pre-reserved for all version stores: 1Mb
Possible long-running transaction:
SessionId: 0xBD345AC0
Session-context: 0x00000000
Session-context ThreadId: 0x000015AC
Cleanup: 1
So what can cause this? Check your task manager. Do you see any handle leaks or processes with out of control handles? In the instance I saw for this, it was a mixture of stale messages stuck in the SMTP temp tables and a third-party AV scanner that had an apparent memory leak. Both Inetinfo and Store were over 2 gig and had 32k handles each. Once we resolved the issue Store was around 6k handles and Inetinfo around 3k.
What is happening is a memory leak is consuming all the virtual memory space in Store and Inetinfo, at least in our case here. Yours may differ in what is causing the leak, but I'd bet more than likely its going to be something that ties into Store, such as Anti-Virus, something gumming up IIS and then Epoxy, or something along those lines.
Because you run out of memory, DSAccess starts to fail, then you see the string of errors above.
If you see this, what should you do first and foremost? Give PSS a call so we can help you debug it.
More information on this can be found here:
http://technet.microsoft.com/en-us/library/bb218083(EXCHG.80).aspx
-
Applies to Exchange 2003
So one of the things that can go wrong with Exchange is that it can run out of something called Version Store. Version store is an in-memory list of changes made to the database. Nagesh Mahadev has an awesome post about Version Store on our msexchangeteam.com blog, posted here. To borrow his summary: In simple terms, the Version Store is where transactions are held in memory until they can be written to disk.
So version store running out of memory can be caused by either a long running transaction. This is pretty self explanatory. Say your anti-virus product wants to scan something in VSAPI and locks it and then goes to lunch. Your version store will consume more and more memory until it runs out because it's trying to work around this long running transaction, keeping track of all the rollbacks and whatnot.
The other problem is with I/O. Since we're holding transactions in memory until they can be written to disk, if something prevents us from writing to disk, we can hit version store problems. Sometimes this type of problem can be precipitated by 9791 event log entries in the application event log. If this happens, get ready to do some adplus store dumps when version buckets allocated hits 70%.
What to do to prevent or mitigate this risk?
- Consider increasing transaction log buffers, especially if you are seeing transaction log stalls in your environment. The logic here is that if store can't commit transactions to the log files fast enough, it can cause version store to back up. By default the number of buffers is 500, you can increase this to 9000. This will prevent a single database from needing to write a bunch of TLs at once and backing up version store. I highly recommend using the EXBPA for governance on this, details on the rule for setting this, etc can be found here.
- Watch your PTE resources and treat accordingly. I've seen customers run low on free PTEs and run into version store problems because they don't have the capacity to perform IO operations as fast as the database would like.
- Make sure your online maintenance is completing frequently, at least once a week on each database. Part of online maintenance is defragmenting your database. On a highly fragmented database(s) version store has to keep track unoptimized links and tables and dealing with records that are not on the fewest number of pages possible, in essence bloating version store size with each transaction. For indepth information on Exchange Store Maintenance, go here.
- Keep your message size limits down. Going hand in hand with this is preventing older Outlook clients from accessing your server. Old clients (Older than Outlook 2003 SP2 in cached mode, any version of Outlook 2003 and higher for online mode) ignore your message size limits for submitting messages, so older clients could attach a 100 meg file and submit and store would have to deal with it even though it's over the size limit. This should give you the gist of what I'm talking about here.
Hope this helps with your environment.
-
Applies to: Windows 2000 Server/Advanced Server, Windows 2003 32bit Server, Exchange 2000/2003
PTEs
Ok, so one of the most overlooked resources we run into with performance and availability problems is the availability (or lack thereof) of Free Page Table Entries. What is a PTE? It's basically an I/O partition table, if you will. Wikipedia has an awesome link with 8x10 color glossy photos, with circles and arrows and a paragraph on the back explaining what each one is, so I'll point you there. Cliff Huffman also has an excellent post on PTEs here that specifically talks about Windows.
So anyway, running out of Free Table Entries is bad, because it causes system hangs, sporadic lock ups, general unresponsiveness, etc. These symptoms present themselves in Exchange as general slow performance or service unavailability.
You manage your available PTEs in Windows with the boot.ini and also the SystemPages registry key. Generally speaking for an Exchange Server that is properly configured, you'll see your PTE values somewhere between 8000-16000. A large number of PTEs (50k or so) may be a hint that you're not using the /3GB switch on your server. A lower value generally means there is a problem.
This problem can either be a configuration issue, or if the PTE value is falling, a memory leak.
If you are dealing with a static low value and you've examined all the configuration settings and they all seem fine, but the value is still low (flagging in the EXBPA for example), then add /basevideo to your boot.ini. The new agp/pci-e video drivers consume a lot of PTEs, and who needs the super-duper video card drivers on an Exchange box anyway?
If you are dealing with a leak, update your drivers for everything, NIC, HBA, Video, SCSI controller, you name it, update it. If you've done all that and still haven't gotten the leak addressed, contact PSS to get one of us involved with your case.
Handles
Another resource people don't usually pay much attention to is handle count. Excessive handle consumption can cause all kinds of non-paged kernel pool problems because they reside within that memory space.
If you have the symptoms of a memory leak but don't see what is causing it, check out the handle count in task manager. You can do this by going to the Processes tab and selecting View/Select Columns and selecting Handles. Handle usage varies by application and what it's doing at the time, but if you have an application with 100k handles open and your machine performance isn't the greatest, you may be dealing with a handle leak. If you are, your non-paged pool kernel memory may also be high but not showing anything eating it up in poolmon. This is because the handles don't appear to be taken into account on the poolmon monitor in some cases, so high consumption of handles by a resource don't end up under the process tag.
If you have a process with a high handle count, contact the vendor.
Documents on PTEs:
The effects of 4GT tuning on system Page Table Entries
How to Configure the Paged Address Pool and System Page Table Entry Memory Areas
Documents on Handles:
Well, here you can see the impact of high handle count:
Microsoft KB