Dude where's my PFE?

I was a Premier Field Engineer (PFE) for Microsoft.

  • How I stand up a new MDT environment, Part 1

    There are many MDT environment setups; this one is mine.

     

    Step .1  Pre-Flight

    I use a Windows Server 2008 R2 SP1 server as my base.  I enable the Hyper-V role on it and create 2 VMs.  Half of you reading this are going to say “But we’re a VMware shop, we don’t run Hyper-V.”  OK, so stand it up like this anyway.  Building your reference images on Hyper-V makes life easier: all the drivers are natively available, instead of you having to provide VMware drivers that sysprep is just going to strip out anyway…

    So, 2 VMs, set up as follows:

    VM1 is a Windows 7 x86 SP1 install.

    This VM I name MDT-Console, give it 2 processors and 2 GB of RAM, 50 GB of space and install WAIK and MDT.  I install Office 2010 and download the “Optional - MDT 2010 Update 1 Print-Ready Documentation.zip” to this VM.

    VM2 is a Windows Server 2008 R2 SP1 install.

    This VM I name MDT-Share, give it 2 processors and 2 GB of RAM, 200 GB of space (1 big volume) and create 2 File System Folders and Shares, “MDT-Reference” and “MDT-Deployment”.

     

    Step 1.  Establish our shares.

    To make things simple, I domain join my 2 VMs.  I grant the account I am using on the Windows 7 VM (VM1) full rights on the VM2 shares.  I then open my MDT console (I pin mine to the taskbar in this VM) and create the reference and deployment shares by clicking on “Deployment Shares” in the console and selecting “New Deployment Share”, which invokes a wizard.

    image

    When I do a \\MDT-Share\ it autofills…

    image

    I then select my Reference Share, because I want to establish it first.

    I name it (logically) MDT Reference Share in my console…

    image

    The next 3 screens go like this.  I leave the “Capture” checkbox checked, because it’s a reference share, where we are going to do a lot of capturing. <next>

    I leave the Administrator question checkbox blank/default. <next>

    I leave the product key blank/default. <next>

    <next>

    image

    Awww yeah, now we’re cooking with gas.

    image

    To create the Deployment Share, I basically do the EXACT same thing, except I select and name it Deployment instead of Reference, and I uncheck the box to “Ask if an image should be captured”…

    Which leaves the console looking like this:

    image

  • How to collect a netmon 3.4 and xperf kernel trace and stop when a problem occurs.

     <updated 7.29.2013 for updated NETSH and WPR commands, thanks Carl Luberti!>

    So at a customer location the following question was posed to me:

    “In our VDI environment, how do we capture trace information?  How can we set up a network capture and an xperf trace that the user can start (or a logon script can start), and then give them a link to click when a random non-reproducible problem occurs?”

    So what do we want, data wise?  A netmon 3.4 consumable trace, and an xperf trace including stackwalking.

    So I want to use the following commands:

    netsh trace start scenario=LAN,RPC capture=yes report=yes tracefile=<path to file\netmon.etl> maxSize=512 fileMode=circular overwrite=yes

    or

    netsh trace start scenario=LAN,NDIS,WFP-IPsec capture=yes report=yes tracefile=<path\file.etl> maxSize=512 fileMode=circular overwrite=yes

    (This collects a network trace in ETL format, which Netmon 3.4 can read, generates an HTML report, and traces to a circular-logging ETL file in a directory you specify that cannot grow larger than 512 MB.  It will overwrite an existing log file to create a new one if need be.)

    and

    xperf -on disk_io+dispatcher+latency -f <path to file\xperftrace.etl> -MinBuffers 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular -stackwalk cswitch+readythread+threadcreate+profile

    or

    wpr -start GeneralProfile

    (this will collect an xperf trace to the path specified, buffering a bit of memory, restricting the file size of the xperf trace to 1GB and again it is a circular log, with stackwalking enabled)

    We place these two in a batch file that runs as administrator when opened.  Either the user is educated to double-click it at logon, or we place it in the logon script, etc.  Delivery method doesn’t really matter as long as it happens before the user starts working.

    When/if the problem reproduces, we run a second batch file that also runs elevated that has the following commands:

    netsh trace stop

    xperf -d <path to file\merged.etl> (or wpr -stop <path\filename>)

    Note:  There is a caveat.  If you are tracing (stackwalking in particular) on 64 bit Windows, you must set “DisablePagingExecutive” in the registry to 1.  This command will do that:

    reg add "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" /v DisablePagingExecutive /d 0x1 /t REG_DWORD /f

    or

    WPR -disablepagingexecutive on

    Either command requires a reboot to take effect though!

    Then we can have the user notify us that a repro of the issue occurred, collect the files that were logged, and analyze :).
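
    The start and stop batch files described above could also be driven from a small wrapper script.  This is only a sketch, not what the customer ran: the function names and the C:\traces directory are placeholders, and the flags are copied from the netsh and xperf commands earlier in this post.  It only builds and prints the command lines; actually running them needs an elevated prompt on Windows.

```python
from pathlib import PureWindowsPath


def start_commands(trace_dir):
    """Build the two 'start tracing' command lines as argument lists."""
    d = PureWindowsPath(trace_dir)
    netsh = [
        "netsh", "trace", "start",
        "scenario=LAN,RPC", "capture=yes", "report=yes",
        f"tracefile={d / 'netmon.etl'}",
        "maxSize=512", "fileMode=circular", "overwrite=yes",
    ]
    xperf = [
        "xperf", "-on", "disk_io+dispatcher+latency",
        "-f", str(d / "xperftrace.etl"),
        "-MinBuffers", "1024", "-MaxBuffers", "1024",
        "-MaxFile", "1024", "-FileMode", "Circular",
        "-stackwalk", "cswitch+readythread+threadcreate+profile",
    ]
    return [netsh, xperf]


def stop_commands(trace_dir):
    """Build the two 'stop tracing' command lines as argument lists."""
    d = PureWindowsPath(trace_dir)
    return [
        ["netsh", "trace", "stop"],
        ["xperf", "-d", str(d / "merged.etl")],
    ]


if __name__ == "__main__":
    # C:\traces is a placeholder directory, not a path from the original post.
    for cmd in start_commands(r"C:\traces") + stop_commands(r"C:\traces"):
        print(" ".join(cmd))
```

    Keeping the commands as argument lists means the same definitions can be printed into a .cmd file or handed to a process launcher without re-quoting the paths.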

  • What do you do when you want to run Windows XP on a Solid State Drive?

    1.  #1  THE BIG UNO!  Properly align your partition with diskpar/diskpart and THEN install your XP build into that aligned partition.

    This link has a great explanation and how-to so I'm not going to re-invent the wheel. 

    I’m not certain I agree with the rest of the proposals there; I’m thinking more along the lines of enterprise-level customers than gamers here.  Though the RAMDRIVE idea is interesting :).

    Also!  Very important, use AHCI or SATA RAID or IRRT rather than IDE Compatibility mode in BIOS.  This is huge.

    2.  In Windows, disable the following, some to improve the functional life of the drive in terms of wear fatigue and some for performance:

    Windows XP Prefetcher

    Any Windows Defragmentation

    System Restore

    8.3 Filenames (unless you have a legacy app that requires them)

    File “Last Accessed” time

    Set the system to use “Large System Cache”

    3.  Unlike many posts on the internet, leave Windows Indexing Service alone, quoting some smart folks:

    ”It doesn’t turn off and shouldn’t be turned off specifically on SSD’s.   If the user doesn’t want the feature (it can be great and several features like explorer’s search and OL search utilize the index), then they can disable it.  It shouldn’t be in this recommendation this way.”

    4.  XP does NOT support TRIM.  So run the Garbage Collector provided by the OEM of your drive on a periodic basis (if recommended by the vendor, most likely YES).

    5.  Keep your Intel/AMD SATA controller driver up to date.  Keep BIOS up to date.  Keep firmware on the SSD up to date.

    6.  Use this to verify you are track aligned if you wish to verify your process works:  http://www.techpowerup.com/articles/other/157
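
    The alignment check in item 1 boils down to simple arithmetic: a partition is aligned when its starting byte offset divides evenly by the boundary you care about.  A minimal sketch, assuming 512-byte sectors and a 4096-byte boundary (check your SSD vendor's guidance for the real figure; the function name is made up for illustration):

```python
SECTOR_BYTES = 512  # assumed logical sector size


def is_aligned(start_sector, boundary_bytes=4096):
    """Return True if a partition starting at start_sector sits on the boundary."""
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


# Default XP setup starts the first partition at sector 63 (offset 32,256 bytes).
print(is_aligned(63))    # False
# A partition created at a 1 MiB offset starts at sector 2048.
print(is_aligned(2048))  # True
```

    This is why the stock XP layout needs fixing with diskpar/diskpart before the install rather than after.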

    Hope this helps in your SSD efforts.  Bottom line though, strongly consider deploying Windows 7 on SSD based systems.  There are performance gains in 7, and it natively detects non-rotating media, disabling the scheduled defragmentation task and Prefetch for you…

  • High Privileged Mode CPU on 1 of 64 cores–quick fix

    I was contacted recently about a server with 64 cores that, with no work load, had one core, ordinal 34, running very high on CPU.  Looking closer, it was all being used by process ‘System (4)’.

    So that’s a fun one.  I had them collect an xperf trace using the following command:

    xperf -on base+cswitch+disk_io_init+latency+dispatcher -stackwalk threadcreate+readythread+cswitch+profile -f kernel.etl -buffersize 1024 -maxbuffers 1024 -maxfile 1024 -filemode circular

    and then stopped the xperf trace after collecting the high CPU at idle for a few minutes.
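
    One thing worth knowing before handing that command to a customer: -buffersize is in KB and -maxbuffers is a count, so the two multiply into the memory xperf may set aside for trace buffers.  A quick sketch of the arithmetic (the helper name is just for illustration):

```python
def xperf_buffer_mb(buffer_size_kb, max_buffers):
    """Upper bound on trace buffer memory, in MB."""
    return buffer_size_kb * max_buffers / 1024


# The command above: -buffersize 1024 -maxbuffers 1024
print(xperf_buffer_mb(1024, 1024))  # 1024.0, i.e. up to about 1 GB of buffers
```

    On a 64-core server that is no big deal, but it is worth scaling down on a small desktop or VDI guest.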

    I then opened the kernel.etl file in xperfview.exe and got to work.

    I was looking for what System was doing on core 34, so I had some hints to get started.

    I went and verified in CPU Sampling by CPU that CPU was in fact hitting 100% on core 34:

    image

    I then drilled down in “CPU Sampling by Thread” to see which thread it was:

     

    image

    Turns out it was thread 372.

    This thread was busy in WmiApSrv.exe and wmiprov.dll, but also a lot of work being done in ntoskrnl and storport:

    image

    I looked at their version of storport.sys; it was stock for 2008 R2 (not SP1).  So I suggested they apply http://support.microsoft.com/kb/981208 to address known storport issues.

    When that was applied, the problem went away and they were back in business.

  • 33 Seconds lost at boot…

    Imagine you are an enterprise customer who has several thousand desktops that boot slow.  Inordinately slow…slow enough that your user base resists requests to reboot them for patch cycle or power savings efforts or diagnostics.

    So imagine you do the right thing, you keep the BIOS and firmware up to date, you update drivers, you do everything you are supposed to be doing, and they still boot slow.  So you call us in.  And we find (amongst other things mind you) a bug in either the firmware or driver of your storage controller card in all your desktops.

    The delay looks like this:

    image

    CPU sampling is at 100% for a core, then at 15 seconds, flops to another core, and finally at the 34 second mark, it releases whatever pent up frustrations it has and things get back to normal…

    image

    So at 33.949 seconds, we successfully enumerate a PnP ID

    image

    I look up the PnP ID here to get my offender; it’s in the properties of the ETL trace…

    image

    Note above, storport.sys!RaDriverPnpIrp starts at .747 seconds and ends at 32.696

    At 32.696, we start the storport.sys!RaidAdapterStartMiniport function…

    This is a perfect example of Windows falling victim to third party code.  We cannot proceed at boot until this driver/firmware initializes, as SYSTEMROOT resides on it…

    More details to follow; the customer has escalated to the hardware vendors (two known OEMs so far have this component problem).

  • Disk in-Depth

    I started an article on Disk performance and characteristics for the PFE Performance Wiki a while back.  I had actually forgotten about it (those who know me know my memory is Swiss Cheese sometimes).  Anyway, here is a link to the article:

    http://social.technet.microsoft.com/wiki/contents/articles/disk-in-depth-pfe-performance-guide.aspx

    If you are a disk expert, feel free to critique and/or update :)

    Cheers,

    jeff

  • DPC triangle…

    So once upon a time, a PFE had a problem with his video game stuttering.  The game would be playing fine and then suddenly, the sound would skip until he rebooted his PC.  Nothing in the event log, device manager reports all good…

    So we collected an xperf trace using the following flags:

    xperf -on Latency+DISPATCHER -f kernel.etl -stackwalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

    Reproduced the problem…  (-FileMode Circular, you see, makes xperf collect a black box trace, auto-overwriting itself so the oldest data drops off…)

    Then post repro, run

    xperf -d results.etl

    Easy as pie!

    image

     

    DPCs are way high here.  It’s a 4 processor box with very high DPCs on one processor and spikes in other CPUs…

    (DPCs are discussed a bit here and here.)

    So I was interested…

    image

     

    So here we have a gap of time where CPUs were ‘busy’ doing a DPC operation, interrupt, or processing a stack, for something like 0.8 seconds: long enough to be a sound gap noticeable to the human ear…

    An analysis of the stacks revealed we were grinding in DPC\USBPort.sys, but in the stack it was referencing USBAudio.sys.

    A quick search found he was running an old version of USBAudio.sys, and this hotfix was available:

    The audio applications stop responding in Windows 7 or in Windows Server 2008 R2 after you resume the computer from the S3 sleep mode

    http://support.microsoft.com/kb/2122063

    The symptoms didn’t match exactly, but once he updated that driver, the problem vanished…

  • BPAs, not just for Exchange anymore…

    So for those who don’t know, I used to be mainly an Exchange dork…ESE flowed in the veins don’t cha know?

    Anyway, one of the tools that rocked (and still does) is the ExBPA and the family of Exchange Analyzers.  But did you know in Windows Server you get analyzers as well?  And some are even built into the OS?

    Active Directory

    image

    The ADBPA is easily accessible in the Server Manager console in Windows Server 2008 R2.  It is not, I repeat, NOT, an ADRAP, by any stretch.  But, it’s free…

    DNS

    has its own as well…

    image

    What else has a built-in BPA? 

    It’s a secret…just kidding, I’m going to list them right here for you…

    · Active Directory Certificate Services

    · Active Directory Domain Services

    · Domain Name System

    · Internet Information Services

    · Remote Desktop Services

    (Above links stolen from WSiX blog here)

    Basically, some of the core infrastructure parts of Server now have health monitors built in, and you can powershell them!  Why not run them once a week and dump the xml results to a directory archive?

    But wait, what else has a BPA?  You might ask…

    GPOBPA

    http://blogs.technet.com/b/askds/archive/2008/04/11/group-policy-best-practice-analyzer.aspx

    Lync 2010

    http://blogs.technet.com/b/ucedsg/archive/2011/02/18/lync-2010-best-practice-analyzer-is-now-available.aspx

    SharePoint

    http://www.microsoft.com/downloads/en/details.aspx?familyid=cb944b27-9d6b-4a1f-b3e1-778efda07df8&displaylang=en

    SBS

    http://support.microsoft.com/kb/940439

    Hyper-V

    http://support.microsoft.com/kb/977238

    TMG

    http://www.microsoft.com/downloads/en/details.aspx?FamilyID=8AA01CB0-DA96-46D9-A50A-B245E47E6B8B&displaylang=en

    SQL 2000

    http://www.microsoft.com/downloads/en/details.aspx?FamilyId=B352EB1F-D3CA-44EE-893E-9E07339C1F22&displaylang=en

    SQL 2005

    http://www.microsoft.com/downloads/en/details.aspx?FamilyID=DA0531E4-E94C-4991-82FA-F0E3FBD05E63

    SQL 2008 R2

    http://www.microsoft.com/downloads/en/details.aspx?FamilyID=0FD439D7-4BFF-4DF7-A52F-9A1BE8725591

     

    All of this is free, as in beer, so check it out.  Script the collections of your core infrastructure services (or other services for that matter) and store that output.  It’s a monitor for undocumented changes :)
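
    To make the “monitor for undocumented changes” idea concrete, here is a rough sketch of comparing two archived scans.  It assumes you have already exported each week's results and boiled them down to (model, finding title, severity) tuples; that layout and the sample rows below are invented for illustration and are not the BPA's own XML schema.

```python
def diff_scans(previous, current):
    """Return findings that appeared and findings that cleared between two scans."""
    prev, curr = set(previous), set(current)
    return {
        "new": sorted(curr - prev),
        "resolved": sorted(prev - curr),
    }


# Hypothetical sample data, not real BPA output.
last_week = [
    ("DNS", "Forwarder not responding", "Warning"),
    ("AD DS", "PDC emulator time source not configured", "Warning"),
]
this_week = [
    ("AD DS", "PDC emulator time source not configured", "Warning"),
    ("AD DS", "Domain controller not backed up", "Error"),
]

changes = diff_scans(last_week, this_week)
print(changes["new"])
print(changes["resolved"])
```

    Anything that shows up under "new" without a matching change ticket is worth a conversation.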

  • RunAsRadio.com

    I just realized I forgot to link to my xperf intro talk on RunAsRadio.com back in October.  Boy is my face red…

    http://www.runasradio.com/default.aspx?showNum=182

    Check out their podcasts, it’s a great repository of technical information!

  • Yes, it matters

    Often at customer engagements when I encourage them to use things like SCCM and SCUP along with HP or Dell SCUP integration to keep drivers (and firmware!) up to date, I’m told it isn’t worth it.  If the drivers from 2006 work, it doesn’t matter if an update is out.

    More often than not, the reason I’m there in the first place is to analyze and solve slow boot/client performance….

    These two statements above are connected, I promise.

    Let’s take a walk through my Dell e6500’s life cycle for example.  When I started in PFE I was issued a Dell e6500 laptop with 8 gig of RAM and a 7200 RPM drive.  BIOS rev A08.

    Let’s look JUST at BIOS as an example:

    image

    Line item 4.  Slow boot performance (a user isn’t going to understand it’s a PXE issue, they just complain it boots slow)

    But wait, there’s more…

     

    image

    Line 1 is interesting.  If you were rolling out Windows 7 to this machine, it MIGHT work with previous BIOS versions, but wouldn’t it be cool to be in a supported configuration from the company who made the hardware?

    And look, Line 3, updated the Nvidia BIOS, either we’re fixing something or making it faster…

    And hey, Line 6, access speed for PCMCIA cards.

    image

    Hey look, Ambient Light Sensor “improved support” for Windows 7.  Wonder if that fixes my slow boot issue I blogged about previously on ALSSVC64.exe adding 20 seconds to my boot time…

    image

    Ah, we remove, REMOVE, AAM on line 2.  Remember I blogged about this, the feature to slow your hard drive so it boots your system slow but doesn’t make noise…

    But I digress.

    image

    Hey look, line 4.  Nvidia BIOS update again.  Fixes problems or improves performance (or why was it written?).

    Am I picking on Dell here?  NO!  Does Dell make bad hardware and this is why there are so many fixes?  NO!  Every vendor with a brain makes stuff that can be upgraded.  Does anyone recall back when the old Pentiums had the FDIV (floating-point divide) bug?  And hardware was replaced/RMA’d?  Yeah.  Updates = good.  It means the vendor is servicing the product line, taking feedback and aggregating service call data and improving their product for you, the end user/company.

    Note that none of these BIOS improvements cost anything except the time to download and apply them.  Free performance gains.

    My laptop took a nose dive off a desk last week, so I am lacking in good solid pictures to prove the gains, but they are appreciable.

     

    Ok great, let’s look at something you paid for, Anti-Virus!

    I am NOT naming this product, it’s a picture example of what a simple update from one version to another can do to the disk IO at boot.  Note that AV engine updates are usually pretty simple to roll out in an enterprise.

    PRE update:

    image

    POST update:

    image

     

    Tell me, which system would you rather be on?  Given we’re looking at the disk activity from 0 seconds to 230 seconds, and more disk activity most likely means less responsiveness, I’ll go with POST myself. :)

    Ok, what was the point of this rant?

    Glad you asked.  Computers are like anything else.  Our bodies, our cars, our homes, our loved ones, all require maintenance and care.  Give your computer some love today: go to your vendor’s site, see if any updates are out there, and apply them if they are appropriate, if they make sense, you know?

    SCUP and System Center are a great way to keep things current, with an approval process, in a large environment, and I think they are worth investigating.  Or why not, when rolling out that new image, include a BIOS update as part of the task sequence?

    XPERF, from the WPT, is the way to analyze the impact, like in the screen scrapes above.  The ONLY change made was an AV engine update.  Easy to look at this and say “Yeah, that’s an improvement”.  If it’s such a subtle thing in a test that you can’t tell, chances are it’s not worth rolling it out in the enterprise you manage.

    Food for thought.

    Jeff

  • Boot delay in MOM 2005 Agent - notes from the field

    The MOM 2005 agent for FCS v1 can slow boot times by around 20 seconds (typically).  I did a WDRAP recently where part of their slow boot experience was related to FCS.  It's not FCS's fault though; it rides on top of the old MOM 2005 health agent.  Oddly enough, the MOM 2005 health agent, under the registry key:

     

    HKEY_LOCAL_MACHINE\Software\Mission Critical Software\OnePoint

     Has a value....

     BootStartupDelay: X

     

    Where, at my customer's site, the X was 60 (value is in seconds by the way).

     

    Flipping that value to 0 cut their overall boot time by about 18-20 seconds.  Not a BIG deal, but a deal nonetheless.  I've done some research on this and no impact has been seen from setting this to 0 so far at any customer....

     

    Another fix would be to check dependencies of the MOM 2005 Agent with the Service Control Manager, à la http://msdn.microsoft.com/en-us/library/ms681957(VS.85).aspx, as the basic problem is that MOM blocks the Service Control Manager while it runs through its boot delay.

     

    But, I found it easier to flip a bit in the registry from 60 to 0 myself...chances are you will too.

  • The David Solomon TechNet Spotlight Talks are online once again!

    As some would say…Mission Accomplished.  It’s a long story, but I use these videos, particularly a section of part 3, in my class to teach how Windows talks to physical RAM.  The students just about universally dig it, and now they are rescued from the Akamai cache and posted once again….

    http://vimeo.com/15890452 
    http://vimeo.com/15888263
    http://vimeo.com/15889595

    As the Bristomatique would say, “Share and Enjoy”

  • Troubleshooting slow boot times, Part Deux

    In my last session, I covered a rudimentary usage of XPERF to analyze my slow booting Dell with an SSD in it.  The fix was simple and the problem stood out like a sore thumb.  Resolution was as simple as setting the Ambient Light Service from Dell to the Automatic (Delayed Start) state.

    But what about situations where things aren’t so simple?  What about slow booting machines in the enterprise?  Or even in the small business?  Where third party apps and malware and mis-configured anti-virus products take their toll on an otherwise stellar piece of hardware?

    We have some free tools available to us to help troubleshoot slow logon times.

    First, we have UserEnv logging.  This is alive and well in Windows 7 by the way, the KB just doesn’t reflect that fact.  I’d go over this in-depth, but why re-invent the wheel when it’s already buried in TechNet?  So go here and check it out, a wealth of information is at your fingertips to troubleshoot UserEnv logs.

    Going hand in hand with this is GPLogView, a good tutorial can be found here on it.

    Of course we have Xperf, though it comes with a learning curve.

  • The effects of Acoustic Management on rotational media disks.

    So one of the trends I’ve been seeing in WDRAPs I’ve performed is that companies are making use of older hardware for newer tasks on a much more frequent basis than before.  Budgets seem to mandate a 4-5 year (or longer) pc recycle timeframe and the net result of this is companies are running their new image of Windows 7 on hardware that in some cases is over 7 years old (personal experience talking here, no statistics to back it up sorry, though that might be interesting).

    So when I go into a company to do a WDRAP I am often evaluating the security and performance of an older chassis.  Something I’m frequently running into is that some models of desktop have Automatic Acoustic Management (AAM) enabled by default at a value of 128 (quiet).  Sometimes the BIOS is actually set to ‘Bypass’, which at first blush might make the user or administrator think the BIOS has this feature disabled.  Incorrect in my experience!  Bypass actually seems to let the disk decide, so if the manufacturer of a disk set the disk to prefer quiet mode, Bypass will let the disk slow its head movement to keep the seeks quiet.

    This increases the seek time noticeably, and with it the overall transfer time.  (Much as you cover more blocks in a minute spinning at 7200 RPM than at, say, 5400 RPM, you get to your data sooner when the heads seek at full speed than when AAM throttles them.)

    Setting the BIOS to Performance (forcing the drive to run at the 254 level of performance instead of 128/quiet) has caused some boot times of older XP images to speed up by over 100 seconds in the field.

    So really, check out this setting.  You might also note that some hardware vendors disable this setting in later/modern models and sell it as a performance gain, rightfully so.  Most drives are fairly quiet these days anyway, so much so that on most models of hardware I’ve changed this on, the end user doesn’t notice the difference in noise levels, only performance.

    Of course your mileage will vary by model of drive, motherboard, and BIOS.

    Additional links that you might find interesting on the topic are listed here.

  • To Hyper-Thread or not, that is the question….

    So there was some discussion on an internal alias recently, like, well, this morning at 6 A.M….anyway, where we were talking about if having HT enabled was good or bad.  Some of our performance aliases tend to equate a HT processor as 20-30% of a normal processor.  It does give some benefit, but it also can cause some performance degradation.  This is counter-intuitive to the average bear until you start looking at the impact to L1/L2/L3 cache hits with and without HT processors.  Also, some code is optimized for a certain number of processors and adding additional processors is basically a waste of money….

    For example, review the chart on recommended CPU configurations for Exchange Server 2010, found here:  http://technet.microsoft.com/en-us/library/dd346699.aspx
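
    As a back-of-the-envelope illustration of that 20-30% rule of thumb (a sketch only; the function is made up for this post, and real gains depend entirely on workload):

```python
def effective_cores(physical_cores, ht_factor):
    """Estimate capacity in 'physical core equivalents' with HT enabled.

    ht_factor is the fraction of a physical core each logical HT sibling
    is assumed to contribute (0.2 to 0.3 per the rule of thumb).
    """
    return physical_cores * (1 + ht_factor)


for factor in (0.2, 0.3):
    print(8, "physical cores with HT ~", round(effective_cores(8, factor), 1))
```

    So an 8-core box with HT shows 16 logical processors to the OS but, by this estimate, behaves more like 9.6 to 10.4 cores.  Software that tunes itself by counting processors sees 16, which is the kind of skew the BizTalk guidance quoted further down warns about.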

    Anyway, there is no central list of products and their HT recommendations that I am aware of, so I’m making one right now:

    Exchange Server 2010:  Do not HT your processors

    “Hyperthreading causes capacity planning and monitoring challenges, and as a result, the expected gain in CPU overhead is likely not justified. Hyperthreading should be disabled by default for production Exchange servers and only enabled if absolutely necessary as a temporary measure to increase CPU capacity until additional hardware can be obtained.”

    BizTalk 2004/2006/2006 R2:  Do not HT your processors

    “It is critical that hyperthreading be turned off for BizTalk Servers. This is a BIOS setting, usually found in the Processor section of the BIOS setup. Hyperthreading makes the server appear to have more processors/processor cores than it actually does; hyperthread processors typically provide between 20 and 30% of the performance of a physical processor/processor core. BizTalk Server counts the (apparent) number of processors and adjusts its self-tuning algorithms accordingly; the “false” processors cause these adjustments to be skewed and are actually detrimental to performance.”

    SQL 2005:  It depends

    “The experiment confirms the theory. So does it mean you have to disable HT when using SQL Server? The answer is it really depends on the load and hardware you are using.

    You have to test your application with HT on and off under heavy loads to understand HT's implications.

    Keep in mind that not only lazywriter thread can cause slowdown but any thread that performs large memory scan - for example a worker thread that scans large amount of data might be a culprit as well.

    For some customer applications when disabling HT we saw 10% increase in performance. So make sure that you do your home work before you decide to hyper on not to hyper :-)”

     

    More to follow….

  • Anti-Virus Exclusions and You!

    So there is some amount of confusion on what exclusions are needed for various Microsoft products.  This blog is not necessarily meant to be a definitive list, but is a compilation, a list, of KB articles that point to the various products and their individual guidance on AV exclusions.

    A special shout-out to Aaron Ellison for compiling this list internally!  Go team PFE!

    (The social wiki has a dynamic list that may be more up to date.)


    Enterprise Configuration Recommendations:

    http://support.microsoft.com/kb/822158

    Forefront Configuration:

    http://support.microsoft.com/kb/943556

    Forefront:

    http://support.microsoft.com/kb/943620

    http://technet.microsoft.com/en-us/library/cc707727.aspx

    Windows / Active Directory: 

    http://support.microsoft.com/kb/822158

    http://support.microsoft.com/kb/837932

    FRS: 

    http://support.microsoft.com/kb/815263

    SQL:
    http://support.microsoft.com/kb/309422

    IIS:
    http://support.microsoft.com/kb/821749
    http://support.microsoft.com/kb/817442

    DHCP:
    http://support.microsoft.com/kb/927059

    SCOM / MOM:

    http://support.microsoft.com/kb/975931

    Hyper-V:

    http://support.microsoft.com/default.aspx/kb/961804

    http://support.microsoft.com/kb/2628135

    Exchange:

    http://support.microsoft.com/kb/328841

    http://support.microsoft.com/kb/823166

    http://support.microsoft.com/kb/245822

    http://technet.microsoft.com/en-us/library/bb332342(EXCHG.80).aspx

    http://technet.microsoft.com/en-us/library/bb332342.aspx

    Cluster:

    http://support.microsoft.com/kb/250355

    SharePoint:

    http://support.microsoft.com/kb/320111
    http://support.microsoft.com/kb/322941

    SMS:

    http://support.microsoft.com/kb/327453

    ISA:

    http://support.microsoft.com/kb/887311

    WSUS:

    http://support.microsoft.com/kb/900638

    SBS:

    http://support.microsoft.com/kb/885685

    DPM:

    technet.microsoft.com/.../bb808691.aspx

    Dynamics CRM:

    http://community.dynamics.com/product/crm/crmtechnical/b/crminthefield/archive/2011/01/24/anti-virus-exclusions-for-microsoft-dynamics-crm.aspx

    Hope this helps with your configurations!

     

    Cheers,
    Jeff

  • My first MDT 2010 post

    So, I’ve been doing MDT 2010 work for various customers for about six months or so, and I finally found something that struck me as sort of odd and blog-worthy.

    So I created this big long involved task sequence for a customer and they attempted to lay it down over some older server installs in their lab and ran into errors.  The errors were generic 80004005 errors as seen below, along with DiskPart errors:

    image

    Since the drive hasn’t been setup, I frankly wasn’t sure where to look for logging information to be honest.  No MININT directory when the drive isn’t formatted you know?

    So, I sat and thought for a moment.  What could make my C: not present?  Something in the diskpart command.  But what?  As I sat pondering it, I went back over my task sequence in my head (I didn’t have access to the console at the time).

    One thing we had done was specify larger drives for C: (they were moving from 2003 images to 2008 R2, and 2008 R2 requires a larger footprint on the disk).  The disks for the old system were likely set up in the SCSI RAID controller for the local machine, which means that from WinPE’s view, it’s a drive, right?  So I looked in diskpart after hitting F8 here and look what I saw:

    image

    Sure enough, disk 0 is 15 gig, while my task sequence is configured to format the 1st disk with a 50 gig C: partition and then carve out the rest for D:.

    Disks re-configured in the SCSI controller to one big fat disk and voilà, everything works.
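
    In hindsight, a pre-flight check of the target disk against what the task sequence wants would have caught this before DiskPart did.  A minimal sketch of the comparison, using the numbers from this incident; the function name is hypothetical and this is not something MDT ships:

```python
def disk_fits(disk_gb, partitions_gb):
    """Return (ok, shortfall_gb) for fixed-size partitions on a single disk."""
    required = sum(partitions_gb)
    shortfall = max(0, required - disk_gb)
    return shortfall == 0, shortfall


# Disk 0 as WinPE saw it (15 GB) versus the 50 GB C: the task sequence asks for.
print(disk_fits(15, [50]))   # (False, 35)
# After rebuilding the array as one big disk (size here is just an example).
print(disk_fits(136, [50]))  # (True, 0)
```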

  • Today I was a WSUS/IIS Engineer

    And it wasn't half bad.  At the customer site where I am currently building MDT 2010 deployments for a Windows Server 2008 R2 rollout, WSUS was breaking for the desktop deployment folks.

    WSUS was enabled on a Windows Server 2008 R2 server.  The website couldn't be accessed, returning an HTTP 500 error.  When I looked in the Application and System event logs, two things stood out to me.

    The first thing that caught my eye was in the System event log.  A 2025, from SRV stating that the MDT reference machine in a VM on the 2008 R2 host was doing a possible Denial of Service attack against the 2008 R2 server and the connection was closed.

    Odd.

    Second was that in the logs for WSUS there was a 13042: could not self update.  Strange.  I started messing around with it, and long story short, the account that the Application Pools in IIS were running under did not have any rights to the IIS folders.  Restoring rights to the IIS folders resolved the issue and WSUS happily patched the MDT Reference image.

  • Why do I have long boot times? Pt 1

    So one of the questions that comes to mind every now and then in technical circles (and outside as well) is "Why does it take so long for my machine to boot?".  Just what's going on in there while these friendly, soothing graphics come up on the screen, and I wait and wait for a prompt to log in?

     

    Great question.  I recently purchased a solid state drive for my laptop and after imaging it with Windows 7 and loading it all up with drivers and whatnot, I had the same question.  So I went off and looked to find out what the 'deal' was.

     

    So I went to the Windows Performance Analysis website and downloaded and installed the Windows 7 SDK, which includes the Windows Performance Toolkit (mainly, for this exercise, xbootmgr.exe and xperfview.exe).

     

    (So xbootmgr will tell the kernel to start tracing at boot and tell Windows to restart so it can get on with the trace.  So be prepared for the system to reboot you when you type this in and hit enter!)

     

    Anyway, after download and installation, I did the following from an elevated command prompt:

     

    xbootmgr -trace boot -traceflags BASE+CSWITCH+DISK_IO_INIT

     

    I did this in a directory where I had room for a couple hundred meg etl trace and it was nice and tidy so I didn't have to hunt for anything.

     

    My system rebooted and as soon as I was presented with a logon prompt, I logged in.  After the shell came up, I had a window on my screen that basically counted down post-boot tracing for 120 seconds.  When I see this I just let it do its thing.

     

    After 120 seconds, it wraps all this data into an etl file named boot_BASE+CSWITCH+DISK_IO_INIT.etl in the directory where I ran the command.
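
    If you end up doing this more than once, the command and the hunt for the resulting file can be scripted.  A rough Python sketch, assuming only the flags shown above; the two helper names are made up for illustration, and the search pattern boot_*.etl is an assumption based on the file name mentioned in this post:

```python
import glob
import os

def build_boot_trace_command(flags):
    """Build the xbootmgr argument list for a boot trace with the given kernel flags."""
    return ["xbootmgr", "-trace", "boot", "-traceflags", "+".join(flags)]

def find_boot_traces(directory):
    """List boot trace files (boot_*.etl) found in the directory, sorted by name."""
    return sorted(glob.glob(os.path.join(directory, "boot_*.etl")))

cmd = build_boot_trace_command(["BASE", "CSWITCH", "DISK_IO_INIT"])
print(" ".join(cmd))
# On a Windows machine with the toolkit installed, the trace could be started with
# subprocess.run(cmd, check=True) from an elevated prompt. That call reboots the system.
```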

     

    After the system is done collecting its data and waiting on the prefetcher info and whatnot, I then go in and do the fun bit, open the ETL file with xperfview.

     

     

    I am immediately drawn to the wide gap where nothing happens in my services list, from the 22 to the 38 second mark.  Turns out this is the ambient light sensor for my keyboard's backlight.  It takes the driver a bit of time to figure out the ambient light where I'm at to make a judgement call on whether a backlit keyboard is needed.  In Windows 7 there is a handy feature for services called "Automatic (Delayed Start)".  I put the service into that state, rebooted, and saved 16 seconds on my boot time.  A decent gain I think.
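
    Spotting that kind of gap by eye works fine in xperfview, but the idea is easy to express in code too.  A minimal Python sketch, assuming the service start and end times have already been pulled out of the trace somehow; the interval list below is invented to resemble the trace described above, and the function name is hypothetical:

```python
def widest_gap(intervals):
    """Return (gap_start, gap_end) of the longest idle stretch between activity intervals.

    intervals: iterable of (start_seconds, end_seconds) tuples, in any order.
    Returns None if the intervals leave no gap.
    """
    best = None
    covered_until = None
    for start, end in sorted(intervals):
        if covered_until is not None and start > covered_until:
            if best is None or start - covered_until > best[1] - best[0]:
                best = (covered_until, start)
        covered_until = end if covered_until is None else max(covered_until, end)
    return best

# Hypothetical service start/end times in seconds.
services = [(0, 5), (4, 12), (10, 22), (38, 41), (40, 47)]
print(widest_gap(services))  # (22, 38)
```

    This only finds the empty stretch; working out which service or driver is holding things up still means reading the trace.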

     

    Now, this is a very, very rudimentary explanation of how to review an ETL file, something simple to look for, a beginner's example.  I highly recommend going further with ETL / WPT, as it is a very insightful glimpse into Windows system performance.  To dig further, I've collected some links from a list that is floating around internally...

    Performance toolkit (XPERF) log & analysis

    The required steps to collect xperf logs on XP / Vista are as follows:

    1) Download & Install the toolkit on a Vista/2008/Windows 7 machine.
       The latest version of the Windows Performance Toolkit is part of the Windows 7 SDK (which is a huge download). The following blog has the steps to download the ‘bare minimum’ to get the WPT toolkit.

        http://blogs.msdn.com/jimmymay/archive/2009/11/24/xperf-install-windows-performance-toolkit-wpt-with-242mb-download-not-2-5gb-windows-7-sdk-part-2.aspx

    2) Copy the contents of the “C:\Program Files\Microsoft Windows Performance Toolkit”  to a folder on Windows XP (or a USB memory stick).

    3) Turn off the “No Execute” or “Execute Disable” security option for the CPU in the BIOS (or, if you cannot find the appropriate BIOS setting, add the following switch to the boot.ini file: /noexecute=alwaysoff)

    4) Either run xbootmgr from the command line, or use the XPerfUI utility which you can download from our codeplex website: http://xperfui.codeplex.com/

    5) Copy the resulting .etl file to the Vista machine to use the xperfview GUI to open & analyze it. If a userenv log is generated under %windir%\debug\usermode, it can also be copied to correlate processes & times.

     

    MSDN documentation link for the Windows Performance Toolkit:

    http://msdn.microsoft.com/en-us/library/cc305187.aspx

    Windows On/Off Transitions Solutions Guide  (Diagrams)

    http://www.microsoft.com/whdc/system/pnppwr/powermgmt/OnOffTrans.mspx

     

    On/Off Transition Performance Analysis of Windows Vista (Vulnerabilities)

    http://www.microsoft.com/whdc/system/sysperf/On-Off_Transition.mspx 


    Xperf UI – GUI wrapper for the Xperf command line tool
    http://xperfui.codeplex.com/

    Also a good blog for more information

    http://blogs.msdn.com/pigscanfly/pages/xperf-articles.aspx


    Two Minute Drill: Introduction to XPerf

    http://blogs.technet.com/askperf/archive/2008/06/27/an-intro-to-xperf.aspx

     


    More notes on xperf:

    To show if there are any active loggers

    xperf -loggers

     

    To stop any active loggers

    xperf -stop

     

    To view help on available flags

    xperf -providers i

    xperf -help providers

     

    To trace any process ad hoc, including CPU, disk and registry

    xperf -on diageasy+registry

    <let the activity happen>

    Then stop and merge the wmi / etl data into the log file

    xperf -d mytrace1.etl

     

    To view the traces (only works on Vista, Server 2008, or later)

    xperf <logname.etl>

    Or

    Use xperfview as the GUI

     

    Special thanks to Fatih Colgar and Roger Southgate for compiling the "Performance Toolkit (XPERF) Log & Analysis" links and walkthrough.

  • GUI Tool to collect ETW tracing, dumps, etc

    http://visualstudiogallery.msdn.microsoft.com/en-us/e8649e35-26b1-4e73-b427-c2886a0705f4

     

    So, check this out.  It allows you to collect ETW tracing, dumps, all kinds of stuff.  It does not work on Windows XP, but still, a handy little tool nonetheless.

  • SPA, not your typical freeware

    In the Vital Signs workshop, we touch upon the tool SPA (Server Performance Advisor).  This unsung hero of performance evaluation deserves some love, which is why I'm writing about it over 5 years after its last update was published and made available on the downloads site, here:

     http://www.microsoft.com/downloads/details.aspx?FamilyID=61a41d78-e4aa-47b9-901b-cf85da075a73&displaylang=en

    So, Clint Huffman, creator of PAL, wrote up this excellent article on how to troubleshoot server performance problems...

    So, check it out here:

     http://channel9.msdn.com/Wiki/PerformanceWiki/HowToIdentifyBottleneckSPATool/

     It's the bomb, and it's free as in beer.

  • On the road again, I just can't wait to get on the road again....

    So I'm a PFE now, Premier Field Engineer.  It's an interesting gig, sort of like running your own company within Microsoft.  I'm doing Platforms now, instead of Exchange.  Trying to keep the mind limber and all.

    So far I've been doing shadowing of other PFEs as they do things like Active Directory Risk Assessments and what have you.

     

    I'm looking forward to helping our customers proactively instead of being in a constant reactionary state of crit sits and whatnot.

     

    More on this soon.

  • Defeated by Unexpected Transaction Log File growth

    Applies to Exchange 2003, concepts apply to 2007 as well. 

    I've bumped into a few cases recently where the customer had unexpected transaction log file growth that caused the server to dismount a storage group due to lack of disk space.  In this post I'll attempt to explain why this occurs, how to troubleshoot it, etc.

    The short of it is that transaction log file growth usually occurs because of a repeating transaction.  It can be a looping message, a misbehaving client, or a corrupt message.  I've seen looping messages caused by users setting up special things on their Outlook clients.  Consider the following example:

    A user leaves for the weekend.  They are expecting an important email, so they put in a forward rule to forward all email to their mobile phone's email address.  Either 1) they mistype that address, or 2) their phone's email box doesn't accept messages above a certain size.  In the event of 1), every message sent to the user is going to hit the mail servers of the phone provider and bounce with an invalid address.  This NDR will come back and hit the mailbox of the user, where the forward rule will forward the NDR to the phone, which will bounce and come back to the inbox, where it will forward the NDR to the phone......  In the event of 2), any message above the size limit will trigger the loop above (unless the ISP's mail server knows not to append the offending email as an attachment to the NDR).
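
    To make the loop concrete, here is a toy Python simulation.  It is only a sketch of the scenario above and models nothing about Exchange itself; the function name and the max_hops cap are made up, and the cap exists only because the real loop has no natural end:

```python
def simulate_forward_loop(forward_bounces, max_hops):
    """Count messages that land in the mailbox before the simulation stops.

    forward_bounces: True if every forwarded copy is rejected and an NDR comes back.
    max_hops: safety cap on the number of deliveries simulated.
    """
    delivered = 0
    pending = 1  # the one original message sent to the user
    while pending and delivered < max_hops:
        pending -= 1
        delivered += 1       # message (or NDR) lands in the mailbox and is logged
        if forward_bounces:
            pending += 1     # the rule forwards it, the phone provider bounces it
    return delivered

print(simulate_forward_loop(False, 1000))  # 1
print(simulate_forward_loop(True, 1000))   # 1000
```

    One message with a working forward address produces one delivery.  With a bouncing address, the count is limited only by the cap, and every one of those deliveries is a transaction written to the logs.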

    This is a real world example I've personally run into.  Users can and will do all kinds of bizarre things that under the light of day seem obtuse, but in the heat of the moment make sense.

    So how do you track this down?

    The normal troubleshooting path I take for this type of problem is:

    1.  Run Exmon.  It will tell you if a single user is taking something silly like 50% of the server's resources.  If you're spooling out transaction logs like it's nobody's business and Exmon shows a user at 50%+ and they are in the same Storage Group as the spooling transaction logs, then chances are you've found your man.  If Exmon doesn't point out anything out of the ordinary, then proceed to step 2:

    2.  Go to your Exchange System Manager, drill down to the Storage Group that you're seeing the transaction log growth on.  Expand each database and visit the logons area.  Add columns for MSG Ops, Folder Ops, Total Ops, and sort by high/low and see if you have one user towering above the rest.  Do this for each database.  If you've got a single user standing out, again, this is very likely your culprit.  Log into their mailbox, see if there is something stuck in the Outbox, or check their active client for any client-side rule that may be at fault.  If worse comes to worst, disable the user's mail.

    3.  Use Scott Oseychik's guide on Transaction Log analysis to figure out what the offending message might be:

    http://blogs.msdn.com/scottos/archive/2007/07/12/rough-and-tough-guide-to-identifying-patterns-in-ese-transaction-log-files.aspx

    This is an excellent guide and needs no further clarification.

    4.  If this doesn't work out for you at this point, call into support, it could be a problem with a mobile device syncing or an OWA session trying to process a corrupt message (I've seen both scenarios).  Only a series of store dumps collected with adplus will tell us that.
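
    The "one user towering above the rest" check from steps 1 and 2 can be sketched in a few lines of Python if you export the numbers.  This assumes you have copied the Total Ops column out by hand into a dictionary; the user names and counts are invented, and applying the 50% figure to Total Ops is my own shorthand, not something Exmon or ESM does for you:

```python
def find_top_talker(ops_by_user, share_threshold=0.5):
    """Return (user, share) if one user accounts for at least share_threshold of all ops, else None."""
    total = sum(ops_by_user.values())
    if total == 0:
        return None
    user = max(ops_by_user, key=ops_by_user.get)
    share = ops_by_user[user] / total
    return (user, share) if share >= share_threshold else None

# Hypothetical Total Ops figures for one database.
logons = {"alice": 1200, "bob": 900, "carol": 48000, "dave": 1500}
print(find_top_talker(logons))
```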

    I hope this helps in your troubleshooting efforts.

  • How to get rid of 9646 events

    Applies to Exchange 2003

    Event Type: Error
    Event Source: MSExchangeIS
    Event Category: General
    Event ID: 9646
    Description:
    Mapi session "/o=<org>/ou=First Administrative
    Group/cn=Recipients/cn=<userName>" exceeded the maximum of 32 objects of type "session".

    Seen these in your environment?  Sometimes they are caused by desktop search engines opening too many MAPI sessions, other times they are because your Exchange server is keeping open connections from the client that the client for whatever reason thinks are closed.

    For example, say you have users connecting to Exchange over a poor network connection or VPN.  When Outlook connects, it establishes MAPI sessions.  If the user drops the VPN connection without closing Outlook first, those connections are going to stay open on the Exchange server for 2 hours.  If the user connects again, more connections get added to the Exchange server for that user.  See where we're going with this?  If you max out, new connections will fail, resulting in an unhappy end user.

     So how do you fix this?

    Well, one of three ways.

    You can try to prevent your clients from doing so many connections by educating your users, making changes in Outlook 2007, correctly configuring Network Accelerators to not keep connections open, etc.

    You can tell Windows/Exchange that 2 hours is far too long to keep a session open without activity.  To do this, follow the instructions in the document below, specifically the TCP KeepAliveTime setting, and set it to 5 minutes (300,000 milliseconds).

    KB324270 How to harden the TCP/IP stack against denial of service attacks in Windows Server 2003.
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;324270
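
    To see why the shorter timeout helps, here is a simplified Python model of orphaned sessions piling up.  It is a sketch under loud assumptions: the reconnect interval and the number of sessions per connection are invented, every connection is assumed to be dropped without a clean close, and the function name is hypothetical.  Only the 2 hour and 5 minute timeouts and the limit of 32 come from this post:

```python
def peak_orphaned_sessions(reconnect_every_min, sessions_per_connect, idle_timeout_min, duration_min):
    """Peak number of server-side sessions when every connection is dropped without a clean close."""
    open_sessions = []  # expiry time (in minutes) of each orphaned session
    peak = 0
    for now in range(0, duration_min, reconnect_every_min):
        # Drop sessions whose idle timeout has passed, then add the new connection's sessions.
        open_sessions = [expiry for expiry in open_sessions if expiry > now]
        open_sessions.extend([now + idle_timeout_min] * sessions_per_connect)
        peak = max(peak, len(open_sessions))
    return peak

SESSION_LIMIT = 32  # from the 9646 event text above
for timeout in (120, 5):
    peak = peak_orphaned_sessions(reconnect_every_min=15, sessions_per_connect=5,
                                  idle_timeout_min=timeout, duration_min=480)
    print(timeout, peak, peak > SESSION_LIMIT)
```

    With these made-up numbers the 2 hour timeout peaks at 40 sessions, over the limit, while the 5 minute timeout never holds more than the 5 sessions from the current connection.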

    Or finally, you can add additional allowable sessions by following KB842022.  This is a last resort, as in many cases you are merely delaying your pain for later.  Note the warning:  "If you do this, try to determine the minimum value that you can use so that the client program can run without problems. If you raise the limit too high, the client program might affect the performance of the Exchange Server computer."

    Event ID 9646 is logged in the application event log of your Exchange Server 2003 computer when a client opens many MAPI sessions.
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;842022

  • Database Status Unknown in 2k7 EMC? This might be why....

    Applies to Exchange 2007 

    So I ran into a case today that was pretty interesting.  The symptoms were a fairly generic error in the app log:

    Event ID: 4001
    Task Category: General
    Level: Error
    Keywords: Classic
    User: N/A
    Computer: server name
    Description:
    A transient failure has occurred. The problem may resolve itself in awhile. The service will retry in 56 seconds. Diagnostic information:
    Kerberos test. . . . . . . . . . . : Failed
    [FATAL] Cannot lookup package Kerberos.
    The error occurred was: (null)

    And in the Exchange Management Console, all the databases reported back a status of "Unknown".  Also you couldn't run the EXBPA on the server, it came back with a network/registry error:

    Error (The network path was not found) opening registry key reg:/servername/HKEY_LOCAL_MACHINE/Software\Microsoft\Windows NT\CurrentVersion, skipping object.

    The customer could run the EXBPA against the server remotely, and remotely in the EMC the database status came back Healthy instead of Unknown.

     

    Weird, huh?

    Turns out that there were corrupt / bad entries for the machine name in the hosts file, and they were causing all three symptoms.  Commenting out those records with a # and doing an ipconfig /flushdns resolved everything in just a minute or two.
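
    If you want to check a hosts file for this quickly, something like the following Python sketch would do.  It assumes you have read the file contents into a string (on Windows the file lives at %SystemRoot%\System32\drivers\etc\hosts); the function name, the server name exch01 and the addresses are all made up for illustration:

```python
def hosts_entries_for(hosts_text, hostname):
    """Return (line_number, ip) for every active hosts-file line that maps hostname."""
    matches = []
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        active = line.split("#", 1)[0].split()  # ignore anything after a '#'
        if len(active) < 2:
            continue
        ip, names = active[0], active[1:]
        if hostname.lower() in (name.lower() for name in names):
            matches.append((number, ip))
    return matches

# Hypothetical hosts file contents.
sample = """\
127.0.0.1   localhost
10.0.0.99   exch01 exch01.contoso.local
# 10.0.0.5  exch01
"""
print(hosts_entries_for(sample, "EXCH01"))  # [(2, '10.0.0.99')]
```

    Any line it reports for the server's own name is worth comparing against what DNS returns.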