
Ask the Core Team

Microsoft Enterprise Support Windows Server Core Team

News

  • Disclaimer: All postings are provided "AS IS" with no warranties, and confer no rights. This weblog does not represent the thoughts, intentions, plans or strategies of Microsoft. Because a weblog is intended to provide a semi-permanent point-in-time snapshot, you should not consider out of date posts to reflect current thoughts and opinions.

MDT 2010: Incorrect wimgapi.dll version causing WIM mounting issues

Today’s blog will cover an issue we have seen with Microsoft Deployment Toolkit 2010.  You may see one or more of the following error messages when generating or updating boot images and other actions in MDT that involve the Lite Touch Images:

  • Unable to mount the WIM, so the update process cannot continue.
  • Unable to load DLL 'wimgapi.dll'
  • Mount did not succeed

You can also run into errors when Windows System Image Manager tries to catalog an image.  This issue can occur because an incorrect version of WIMGAPI.DLL is found first in the path.

Steps to Resolve

At a command prompt, run the following command:

Where wimgapi.dll

For the first location listed in the output, verify that the DLL version is correct.  If MDT 2010 is installed on Windows 7 or Windows Server 2008 R2, you should see wimgapi.dll in two locations because it ships with the operating system and with the Windows Automated Installation Kit (WAIK):

C:\windows\system32\wimgapi.dll
C:\program files\windows imaging\wimgapi.dll

Both versions should be 6.1.7600.16385.  The version could be later than this if updates that replace this file have shipped since.
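The version comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names `parse_version` and `is_acceptable` are hypothetical, and the sample version strings are made up; in practice they would be read from the file properties of each copy that the command reports.

```python
# Baseline build named in this post; later builds are also acceptable.
BASELINE = "6.1.7600.16385"

def parse_version(text):
    """Turn a dotted version string such as '6.1.7600.16385' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def is_acceptable(found, baseline=BASELINE):
    """Return True when 'found' is the baseline version or a later one."""
    return parse_version(found) >= parse_version(baseline)

# Hypothetical versions, in the order the copies were found on the path.
found_versions = ["6.0.6001.18000", "6.1.7600.16385"]

# Only the first copy on the path matters, because that is the one that gets loaded.
first_is_ok = is_acceptable(found_versions[0])
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating build "10000" as older than build "7600".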

Notes: 

  • This issue can occur if you installed a beta version of the WAIK or System Center Configuration Manager (SCCM).
  • You may find other versions in locations like Program Files (x86) or other directories.
  • If MDT is installed on Windows XP or Windows Server 2003, wimgapi.dll should only be found in C:\Program Files\Windows Imaging.
  • The version of WIMGAPI.DLL may be updated in later releases of the Windows AIK or updates to Windows.

Scott McArthur
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Unable to select an attached VHD as a Shadow Copy Storage location

You may notice that you cannot choose to store Shadow Copies on an attached VHD, and that when configuring Shadow Copy protection on an attached VHD, there are no other locations available to store the copies on, other than the protected VHD volume.

This behavior is by design. Illustrated below is the behavior you may see while working with attached VHDs and Shadow Copies. I’ll provide some additional information at the end of this post that will explain why this behavior is occurring.

To get into a configuration where we’ll see the behavior I’m describing:

We create our VHD in Fig.1

(Fig.1)

clip_image001

Remember that the destination volume for shadow copies must be at least 300 MB in size before you can store a single shadow copy on the volume.

(Fig.2)

clip_image002

Here, in Fig.2, I’m making it 350 MB so that there won’t be any concern about the volume being too small and thereby omitted from the list of possible locations.

Below in Fig.3, we’re going through the steps of initializing the new disk

(Fig.3)

clip_image003

clip_image004

Now, in Fig.4, we’re putting a simple volume on it

(Fig.4)

clip_image005

(Fig.5) Enabling Shadow Copies on the C:\ volume

clip_image006

In Fig.5 above, I’ve pulled up the properties of the C:\ in preparation to enable Volume Shadow Copies. You can see here, that Disk 3 (F:\) is present in the list of volumes that we can enable the feature on. F:\ is the attached VHD we created in Fig.2.

In Fig.6, let’s enable Shadow Copies on C:\ and then choose to put the copies on our F:\ drive…

(Fig.6)

clip_image007

Whoa… where’d the F:\ drive go?

As you can see from the list, the attached F: (VHD file) is not listed as a valid target for Shadow Copy Storage.

Also, while you can configure the attached VHD to be protected with Shadow Copies, you cannot store the Shadow Copies on any other volume but itself.

clip_image008

Why is this happening?

A virtual volume cannot be used as the target volume for the snapshot of another volume, and the volume can only hold shadow copies associated with its own snapshots. VSS will constrain the shadow copy storage area for attached VHDs to only allow the volume to host its own shadow copies. This behavior will ensure that the volume is self-consistent and therefore maintains its portability. Portability is one of the bigger design goals behind the ability to create and mount VHDs under Windows 7 and Windows Server 2008 R2. The end result of this behavior is that snapshots of the volume will travel with the volume as it is deployed and moved around your environment, so there’s never a need to worry about a volume missing when it comes time to restore a previous version of a file.

For additional information, be sure to visit these links:

Frequently Asked Questions: Virtual Hard Disks in Windows 7

http://technet.microsoft.com/en-us/library/dd440865(WS.10).aspx#one

Understanding Virtual Hard Disks with Native Boot

http://technet.microsoft.com/en-us/library/dd799282(WS.10).aspx

Thanks for taking the time to read this. I hope it’s helped you understand one of the caveats that can be seen when using Native VHD support!

Sean Dwyer
Support Escalation Engineer
Microsoft Enterprise Platforms Support

 

The Four Stages of NTFS File Growth

In my quest to better understand the inner workings of how NTFS stores information on disk, I have been researching what happens to a file as it grows in size and complexity.  The reason I’m after this knowledge is so I can better troubleshoot certain storage issues.

Recently, I realized that I’d stuffed my head with enough information to make a pretty good blog.  Read along as I explain what I call ‘the four stages of file growth’.

Before we can address file growth, we need to first look at how NTFS works under the covers. 

Let’s start out with some basics.

When NTFS stores a file, it starts by creating a small 1KB file record segment that we will call the base record.  Every file starts like this, including the special hidden files such as $MFT, $LOGFILE, $VOLUME and so on.  In fact when we refer to the MFT (master file table), what we are talking about is the entire list of base record segments and child record segments (explained later) for all files in the volume.

For today, we are just going to talk about some simple text files.  You will see it getting complex enough without us doing anything fancy.  Here are three base records for three text files.

clip_image002

Before going any farther, it is important to clear up a common misconception on what a file really is.  We tend to think of the data in our file as the file itself.

clip_image004

The truth is that data is just one attribute of a file.

clip_image006

Every file record starts with a header, and then has various attributes, each attribute having its own header.  For small files, it is common to find the data attribute last.

Do not confuse these attributes with file attributes like Read-only, Hidden, or System (which are actually just flags).  Think of attributes as structures within the file that define things about the file.  Common attributes are $STANDARD_INFORMATION, $FILE_NAME, and of course $DATA.

clip_image008

Any space left over in the 1KB record is unused until one of the attributes needs it or a new attribute is added.

Now let’s watch our file grow….

Stage one – Completely resident

I created a small text file with just one line of text in it.  This file was so small that it was able to fit all parts of the file into its base record.  We call this being resident, as the data for the file resides in the base record segment.  This also means that the entire file exists in the MFT.  No need to look elsewhere.  Everything we need is in that 1KB record.

The diagram shows our 1KB base record segment for the file File1.txt.  Inside you can see the data attribute and the file data within it.  The file data, also known as the stream for this attribute, is what we as computer users tend to think of when we think about a file.  We don’t think about all the structures involved in storing the stream.

clip_image010

Along with the data that we put in the file, you can also see that we have lots of room still left in the 1KB base record segment. 

To make the file grow, I just pasted the same line of text into it a few more times.  Soon I had the file looking like this....

clip_image012

This was about as big as I could get the file before it was too big to fit into the 1KB range of the base record segment.  Any bigger and we go to stage two.

Stage two – Nonresident Data

Once the data starts to push out toward the end of the 1KB base record segment, the data will be shipped outside and stored elsewhere on the disk.  To keep track of where it is, we maintain a mapping pair that tells us the location and length of the now nonresident data.  The new location is outside the MFT and is simply an allocated range of clusters.

NOTE:  At this stage the file data is nonresident, but the attribute record is still in the base record segment.

clip_image014

As the file continues to grow we will either increase the length defined by the mapping pair, or if we can’t store the data contiguously, we create more mapping pairs. 
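The grow-or-add behavior of mapping pairs can be sketched with a small Python model. This is a simplified illustration only: the function name `add_extent` and the cluster numbers are hypothetical, and the list of `[start_cluster, length]` entries stands in for the idea of a mapping pair rather than reproducing how NTFS encodes it on disk.

```python
def add_extent(mapping_pairs, start_cluster, length):
    """Record a newly allocated run of clusters.

    mapping_pairs is a list of [start_cluster, length] entries. If the new run
    begins exactly where the last one ends, the last entry is lengthened;
    otherwise a new entry is appended.
    """
    if mapping_pairs:
        last = mapping_pairs[-1]
        if last[0] + last[1] == start_cluster:
            last[1] += length
            return mapping_pairs
    mapping_pairs.append([start_cluster, length])
    return mapping_pairs

pairs = []
add_extent(pairs, 1000, 8)   # first allocation
add_extent(pairs, 1008, 4)   # contiguous: the existing pair grows
add_extent(pairs, 5000, 16)  # not contiguous: a second pair is created
```

In this model a fragmented file ends up with many entries, which is what eventually pushes it into the later stages.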

Eventually, the file starts to look like this….

clip_image016

Stage three – Nonresident Attribute

When an attribute grows to the point that the list of mapping pairs no longer fits into the base record segment, it too is shipped out but this time it is housed in a new child record.  To keep track of this child record a new structure is created in the base record.  We call this new structure an attribute list or $ATTRIBUTE_LIST.

clip_image018

Each entry in the attribute list points to the file record where each attribute instance can be found.  There will be an attribute list entry for nearly every attribute that the file has.  The exception being that there isn’t a list entry for the attribute list.  For the attributes still resident (like the $FILE_NAME attribute), their respective list entry will simply point back to the base record segment.  The diagram above shows only the one entry that corresponds to the $DATA attribute.  The other entries are left out of the diagram to keep it readable.

After even more data is stuffed in the file it branches out and creates more child records as needed.  Each child record has an entry in the attribute list that points to it.

 clip_image020

This is somewhat different than what we did when we moved file data outside the base record segment.  When the file data was moved, the new location on disk contained no attribute information.  It just had data.  If viewed in a sector editor, it would just show lines and lines of file data.

The child records are just that, records.  They contain elements common to those found in a base record segment.  Each will have a MULTI_SECTOR_HEADER and one or more attribute records, along with some mapping pairs.  The pairs themselves will point to the allocated clusters that contain actual file data.

The information dumped out of the file gets more complex at each stage.  But fear not, it’s almost over.

Stage four – Nonresident Attribute List

The final stage of file growth occurs when the attribute list contains so many entries that the list itself no longer fits in the base record segment.  When we reach that point, the attribute list is shipped outside the record into an allocated cluster range and an attribute list record is left behind to track the location of said cluster range.  The new location of the attribute list is outside the MFT and is similar to how we are storing the chunks of data that make up the $DATA:”” stream (shown in the red boxes) in that it is not an actual child record.

The dotted line shows the entire stream as it would be virtually.  Logically these chunks of data will be found all over the storage device. 

clip_image022

Unlike the child records and the data instances, a file can only have one attribute list and the $ATTRIBUTE_LIST record must reside in the base record even though the list is nonresident.
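The progression through the four stages can be summarized as a toy decision model in Python. Everything numeric here is an assumption made for illustration: the constants `RECORD_SPACE`, `BYTES_PER_PAIR` and `BYTES_PER_LIST_ENTRY` and the function `growth_stage` are hypothetical and do not reflect real NTFS limits, which depend on what else is stored in the base record.

```python
# All capacities below are made-up illustration values, not real NTFS limits.
RECORD_SPACE = 700         # bytes available in the 1KB base record (hypothetical)
BYTES_PER_PAIR = 8         # space one mapping pair takes (hypothetical)
BYTES_PER_LIST_ENTRY = 32  # space one attribute list entry takes (hypothetical)

def growth_stage(data_bytes, mapping_pairs, list_entries):
    """Return which of the four stages a file is in under this toy model."""
    if data_bytes <= RECORD_SPACE:
        return 1  # completely resident: the data fits in the base record
    if mapping_pairs * BYTES_PER_PAIR <= RECORD_SPACE:
        return 2  # nonresident data: the mapping pairs still fit in the base record
    if list_entries * BYTES_PER_LIST_ENTRY <= RECORD_SPACE:
        return 3  # nonresident attribute: the attribute list still fits
    return 4      # nonresident attribute list

stages = [
    growth_stage(100, 0, 0),
    growth_stage(50_000, 3, 0),
    growth_stage(10**9, 500, 5),
    growth_stage(10**12, 100_000, 400),
]
```

The point of the sketch is the ordering of the checks: each stage is reached only when the structure tracked by the previous stage no longer fits in the base record.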

In review

Stage one – Completely resident

A file starts out simple, storing file data locally.

clip_image023

Stage two – Nonresident data

When the data will no longer fit in the 1KB range, it is moved to another part of the disk.

clip_image024

This process can result in multiple mapping pairs.

clip_image025

Stage three – Nonresident attribute

When the mapping pairs are too numerous, they are moved out to form their own child record. 

clip_image026

An attribute list entry is created for each child record.  Multiple attribute list entries will mean multiple child records.

clip_image027

Stage four – Nonresident attribute list

Lastly, when the list of attribute entries is too large to be stored inside the base record segment, the attribute list itself becomes nonresident and moves outside the MFT.

clip_image028

The greater the complexity used to store the file, the greater the performance hit will be to your computer when retrieving and storing the file.  Things like compression, file size, number of files, and fragmentation all can greatly affect this complexity and therefore affect your computer experience. 

Robert Mitchell
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

KMS Host Client Count not Increasing Due to Duplicate CMID'S

Hello, my name is Scott McArthur.  I am a Senior Support Escalation Engineer in the Windows group and today’s blog will cover an issue involving KMS activation and deployment of images.  This issue seems to be more prevalent today due to the various tools used to clone images, create images based on templates, convert Physical to Virtual (P2V), and so on.

Generally the symptom you will see is that the count on your KMS host will not show the correct number of clients or it will not increase.  There are a number of reasons why this can occur but a common reason is that sysprep was not used when preparing images for deployment. 

Any time you use imaging as your deployment method it is required that you run sysprep to prepare the image for deployment. 

This policy is outlined in the following KB article:  http://support.microsoft.com/default.aspx?scid=kb;EN-US;162001

To determine if you are encountering this, you can use the Key Management Service log.  Do the following:

1. On your KMS host open Event Viewer
2. Right click the Key Management Service Log and choose “Save all events as”
3. Change the Save as type to Text(Tab Delimited)(*.txt)
4. Save the file as KMS.TXT
5. Close out of the Event Viewer completely
6. Open Excel
7. Click File, Open, and browse to KMS.TXT
8. You should see the Text Import Wizard. Choose the following options
    Delimited
    Start Import at Row: 8
    Delimiters: Comma
9. When complete the data may look all messed up. Don’t worry we will correct that
10. Click the upper left of the spreadsheet to select the entire spreadsheet
11. Click Data, Sort, In the Sort By selection choose “Column D”
12. When complete you should see the data sorted in columns.

The Client Machine ID (CMID) is how we uniquely identify a KMS client.  When sysprep is run, one of its jobs is to generalize this GUID so that when the image is deployed every machine has a unique CMID.  Here is an example of the output:

Column C - Computername      Column D - CMID
TEST-03.contoso.com          01eb9985-230c-49ad-a8c2-c24914da4739
TEST-04.contoso.com          01eb9985-230c-49ad-a8c2-c24914da4739
TEST-02.contoso.com          01eb9985-230c-49ad-a8c2-c24914da4739
TEST-01.contoso.com          01eb9985-230c-49ad-a8c2-c24914da4739

From this output you can see that multiple computer names have the same CMID. Each computer should have a unique CMID. This means that sysprep /generalize was not used to prepare these computers for deployment, so to KMS those 4 machines appear as one. That would be why the count is not increasing or does not reflect the true number of machines deployed.
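For a large log, the same check can be done without sorting by eye. The following Python sketch groups computer names by CMID; the function name `find_duplicate_cmids` is hypothetical, the rows are the example values shown in this post, and reading them out of the exported KMS.TXT is left out because the exact column layout depends on the export.

```python
from collections import defaultdict

# (computer name, CMID) rows, taken from the example output in this post.
rows = [
    ("TEST-03.contoso.com", "01eb9985-230c-49ad-a8c2-c24914da4739"),
    ("TEST-04.contoso.com", "01eb9985-230c-49ad-a8c2-c24914da4739"),
    ("TEST-02.contoso.com", "01eb9985-230c-49ad-a8c2-c24914da4739"),
    ("TEST-01.contoso.com", "01eb9985-230c-49ad-a8c2-c24914da4739"),
]

def find_duplicate_cmids(rows):
    """Map each CMID reported by more than one computer to the sorted names using it."""
    names_by_cmid = defaultdict(set)
    for name, cmid in rows:
        names_by_cmid[cmid.lower()].add(name)
    return {
        cmid: sorted(names)
        for cmid, names in names_by_cmid.items()
        if len(names) > 1
    }

duplicates = find_duplicate_cmids(rows)
```

Names are collected in a set so that one machine appearing many times in the log (for example, on each renewal) is not mistaken for a duplicate.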

While it is possible to run slmgr.vbs /rearm to reset the machine’s CMID, that does not leave the machine in a supported state. Images deployed without using Sysprep to prepare the image are not supported by Microsoft. Sysprep executes ~30 sysprep providers. These providers are written to correct issues with various components when you duplicate the installation. By not running sysprep it is unknown what types of issues you could encounter, and many components will be in a broken state. The supported solution is to rebuild the image using the Sysprep /generalize switch and redeploy the systems.

Thanks for your time. Stay tuned to our blog for more Activation and Deployment Topics.

Scott McArthur
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

How to run a Sysprep and Capture Task Sequence From MDT 2010

Hello, my name is Kevin Ledman. I am a Support Escalation Engineer in the Windows group and today’s blog will cover how to run the new Sysprep and Capture Task Sequence included with MDT 2010.

If you choose to deploy an operating system manually or need to make customizations outside of the MDT task sequencer, you can still use MDT to automatically sysprep and capture the image for future use.

To configure the task sequence, launch the MDT 2010 deployment workbench and create a new task sequence using the sysprep and capture template.  Answer the remaining wizard items, making sure to choose an OS source that matches the OS you are going to be capturing.

  clip_image002

Update your deployment points and switch to the reference computer to start the task sequence.  **A common mistake at this point is to boot the reference computer from your LiteTouch image and start this task sequence.  The sysprep and capture task sequence is designed to be run from the desktop of the reference machine, similar to a post OS installation task sequence.  To launch this, you will need to establish connectivity to the deployment share and launch LiteTouch.WSF manually.  Because you are logged in to the reference machine as a local administrator and the machine is not joined to a domain, be sure to establish the session under the same security context that will be used for the task sequence:

net use * \\mdtserver\DeploymentShare$ /user:domain\username

Once the connection is established, execute LiteTouch.WSF:

cscript \\mdtserver\DeploymentShare$\Scripts\LiteTouch.WSF

 

clip_image004

The MDT Wizard Screens will launch and prompt for the information required to complete this task sequence.  **Note – we will still process customsettings.ini for this task sequence.  If you have modified customsettings.ini to skip wizard screens, those settings will be honored with this task sequence as well.

clip_image006

Choose the task sequence you have created.

clip_image008

Choose the capture option and supply the location and file name.

clip_image010

Supply the credentials that LiteTouch will use to connect to the deployment share.

clip_image012

View the summary and click ‘Begin’ to start the task sequence.  If you receive error “A connection to the distribution share could not be made” see the following blog: 

http://blogs.technet.com/msdeployment/archive/2009/09/18/fix-for-multiple-connections-to-a-server-or-shared-resource-by-the-same-user-using-more-than-one-user-name-are-not-allowed-problem-with-mdt-2010.aspx

clip_image014

MDT will copy the necessary files to the reference computer, launch sysprep, apply the LiteTouch Image and reboot the machine.

 

clip_image016

LiteTouch boots and begins the capture of the image.  Depending on the size of the installation, this may take a significant amount of time.

 

Once the capture has completed, you can now import the captured image as a custom image file in MDT and use it for future task sequences.

clip_image018

Add new operating system and choose custom image file.

clip_image020

Point to the “Captures” path and move it to the deployment share.

clip_image022

Include the setup files for the OS which you are importing and complete the wizards.

clip_image024

The operating system is now ready for use with new task sequences.

Kevin Ledman
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Invalid Product Key Error Specifying MAK key in unattend.xml

Hello, my name is Scott McArthur.  I am a Senior Support Escalation Engineer in the Windows group and today’s blog will cover an issue involving specifying MAK Product Keys during setup of Windows 7 and Windows Server 2008 R2. 

When deploying a Volume License (VL) version of Windows 7 or Windows Server 2008 R2 you may encounter the following error message:

The unattend answer file contains an invalid product key.  Either remove the invalid key or provide a valid product key in the unattend answer file to proceed with Windows Installation

The setuperr.log will log the following

2009-10-01 12:52:38, Error      [0x060551] IBS    Callback_Productkey_Validate: EditionID for product key was NULL.
2009-10-01 12:52:38, Error      [0x060554] IBS    Callback_Productkey_Validate: An error occurred writing the product key data to the blackboard.
2009-10-01 12:52:38, Error      [0x06011a] IBS    Callback_Productkey_Validate_Unattend:Product key did not successfully validate.[gle=0x00000490]
2009-10-01 12:52:38, Error      [0x0603c7] IBS    Callback_Productkey_Validate_Unattend:Did not pass validation; halting Setup.[gle=0x00000490]
2009-10-01 12:52:38, Error      [0x060120] IBS    Callback_Productkey_Validate_Unattend: An error occurred preventing setup from being able to validate the product key; hr = 0x80300006[gle=0x00000490]

This error can occur if you have specified a Multiple Activation Key (MAK) in your answer file in the WindowsPE phase of setup.  VL versions do not prompt for a ProductKey so they do not need a ProductKey during the WindowsPE phase of setup.  The ProductKey can be specified by clicking “Change Product Key”, SLMGR.VBS /IPK, or specifying it in the answer file. 

The ProductKey entry is found in 2 places in Windows System Image Manager:

Microsoft-Windows-Setup in WindowsPE phase

image

The above entry would be used when using retail media. 

Microsoft-Windows-Shell-Setup in Specialize phase

image

To resolve this issue use the Microsoft-Windows-Shell-Setup component in the Specialize phase.  In order to get the ProductKey entry you must right click the Microsoft-Windows-Shell-Setup component under Windows Image and add it to the Specialize phase. 

image

ProductKey will not show up under the component (on the Answer File pane) until you actually add it.  The unattend.xml will look like this:

<?xml version="1.0" encoding="utf-8"?>
<unattend xmlns="urn:schemas-microsoft-com:unattend">
    <settings pass="specialize">
        <component name="Microsoft-Windows-Shell-Setup" processorArchitecture="x86" publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS" xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <ProductKey>xxxxx-xxxxx-xxxxx-xxxxx-xxxxx</ProductKey>
        </component>
    </settings>
    <cpi:offlineImage cpi:source="catalog://server/catalogs/win7/enterprise/x86/install_windows 7 enterprise.clg" xmlns:cpi="urn:schemas-microsoft-com:cpi" />
</unattend>
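To confirm where an existing answer file specifies the key, a short Python sketch can list every configuration pass and component that contains a ProductKey element. The function name `product_key_locations` is hypothetical, and the embedded sample is a trimmed copy of the answer file above; the sketch only reports locations and does not validate the key itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by unattend answer files, as seen in the file above.
NS = {"u": "urn:schemas-microsoft-com:unattend"}

def product_key_locations(xml_text):
    """Return (pass name, component name) for each component holding a ProductKey element."""
    root = ET.fromstring(xml_text)
    found = []
    for settings in root.findall("u:settings", NS):
        for component in settings.findall("u:component", NS):
            # Search at any depth below the component.
            if component.find(".//u:ProductKey", NS) is not None:
                found.append((settings.get("pass"), component.get("name")))
    return found

# Trimmed copy of the answer file shown above.
sample = """<unattend xmlns="urn:schemas-microsoft-com:unattend">
  <settings pass="specialize">
    <component name="Microsoft-Windows-Shell-Setup">
      <ProductKey>xxxxx-xxxxx-xxxxx-xxxxx-xxxxx</ProductKey>
    </component>
  </settings>
</unattend>"""

locations = product_key_locations(sample)
```

If the result includes an entry whose pass is windowsPE, that is the placement this post identifies as the cause of the error when a MAK is used with Volume License media.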

Note:  If you are using Microsoft Deployment Toolkit 2010 for your deployments you can enter the MAK key during the ProductKey prompt during the Lite Touch Wizard.  Hope this helps with your deployments and keep an eye on our blog for other activation issues. 

Scott McArthur
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Failover Cluster Validation Firewall Error in Windows Server 2008 R2

An issue involving a firewall configuration error in the cluster validation process just surfaced here in Microsoft Support so I thought I would post a quick blog in an effort to not only inform our readership, but to ‘nip this in the bud’ before we start seeing more. 

After running a Windows Server 2008 R2 Failover Cluster validation report, you may see the following error:

“An error occurred while executing the test.  There was an error verifying the firewall configuration.  An item with the same key has already been added”

The error, as is, does not provide a clear direction to take when trying to troubleshoot.  Thanks to the efforts of the Cluster Product Group, the source of the issue was identified and a quick data collection process can be executed to help determine the ‘root’ cause.

The firewall configuration error is reported if any of the network adapters across the cluster nodes being validated have the same Globally Unique Identifier (GUID).  This can be determined by running the following WMI query on each node in the cluster and comparing the results.  I chose to run the query inside PowerShell to display sample data in a formatted list:

Get-WmiObject Win32_NetworkAdapter | fl Name,GUID

clip_image002

The sample output above shows the information associated with the three physical network adapters that exist in one of the nodes in my cluster.  After the data is gathered from each node in the cluster, you just need to compare it and identify the duplicate GUID information.
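With more than a couple of nodes, the comparison is easier to do programmatically. The Python sketch below takes the Name and GUID values collected from each node and reports any GUID that appears more than once. The node names, adapter names and GUIDs are made up for illustration, and the function name `find_shared_guids` is hypothetical.

```python
from collections import defaultdict

def find_shared_guids(adapters_by_node):
    """Return {guid: [(node, adapter name), ...]} for every GUID seen more than once."""
    seen = defaultdict(list)
    for node, adapters in adapters_by_node.items():
        for name, guid in adapters:
            seen[guid.upper()].append((node, name))
    return {guid: owners for guid, owners in seen.items() if len(owners) > 1}

# Hypothetical data gathered from two nodes.
adapters_by_node = {
    "NODE1": [
        ("Public", "{11111111-1111-1111-1111-111111111111}"),
        ("Private", "{22222222-2222-2222-2222-222222222222}"),
    ],
    "NODE2": [
        ("Public", "{11111111-1111-1111-1111-111111111111}"),
        ("Private", "{33333333-3333-3333-3333-333333333333}"),
    ],
}

shared = find_shared_guids(adapters_by_node)
```

GUIDs are compared after converting to upper case so that differences in letter case between nodes do not hide a match.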

The next logical question is, “How does one find themselves in this predicament?”  In the cases we have encountered thus far, the cluster nodes were being deployed in an unsupported manner.  In each case an ‘image’ was being used to deploy the nodes.  We discovered that the operating system image was not properly prepared before being deployed by, for example, running sysprep.

Hopefully this information will be useful and will help avoid further occurrences of this issue.  Thanks again and please come back.

Additional References:

Failover Cluster Step-by-Step Guide: Validating hardware for a Failover Cluster

KB 943984:  The Microsoft Policy for Windows Server 2008 Failover Clusters

Deployment Tools Technical Reference

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q4)

It is time for the final installment of a year-long segment on the top issues in Hyper-V.  It is appropriate since Windows Server 2008 R2 has finally released, and we can look forward to tracking\reporting any issues we may find in the new version of Hyper-V.  As always, the issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed.  I think you will notice that the issues for Q4 have not changed much from Q1\Q2\Q3.  Hopefully, the more people read our updates, the fewer occurrences we will see for some of these and eventually they will disappear altogether (if you have been following this blog series, you will notice some already have).   Additionally, we continue to highly recommend the installation of Windows Server 2008 Service Pack 2 on all servers running the Hyper-V Role.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation.

Resolution:  Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs.  Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

After the Hyper-V role is installed, a customer creates a virtual machine, but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable Hardware virtualization or DEP in the BIOS. In some cases, the server needs to be physically shut down in order for the new BIOS settings to take effect.

Issue #2

A customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Issue #3

After the latest updates from Windows Update are installed or KB950050 is installed, virtual machines fail to start with one of the following error messages:

An error occurred while attempting to chance the state of the virtual machine vmname .
‘vmname’ failed to initialize.
Failed to read or update VM configuration.

or

An error occurred while attempting to change the state of virtual machine vmname .
" VMName " failed to initialize
An attempt to read or update the virtual machine configuration failed.
" VMName " failed to read or update the virtual machine configuration: Unspecified error (0x80040005).

Cause: This issue occurs because virtual machine configurations that were created in the beta version of Hyper-V are incompatible with later versions of Hyper-V.

Resolution: Perform the steps documented in KB949222.

Virtual Devices or Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components needed to be installed.

Resolution: Install Integration Components (IC) package in the VM.

Issue #2

Corrupted virtual hard disk (VHD) file.

Cause: The most common cause was a power outage or the server not being shut down properly.

Resolution: Restore the VHD file from backup.

Issue #3

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: This issue can occur if a Hyper-V virtual machine is configured with a SCSI controller but no disks are attached (driver issue - Storvsp.sys).

Resolution: Perform the steps documented in KB969266.

Issue #4

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys

Resolution: Install hotfix KB957967 to address this issue.

Snapshots

Issue #1

Snapshots were deleted

Cause: The most common cause is that a customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

For more information on Snapshots, please refer to the Snapshot FAQ: http://technet.microsoft.com/en-us/library/dd560637.aspx.

Issue #2

Snapshots were lost

Cause:  Parent VHD was expanded (not supported).  If snapshots are associated with a virtual hard disk, the parent VHD file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution:  Restore data from backup.

Issue #3

Snapshots fail to merge with error 0x80070070

Cause: Low disk space.

Resolution: Free disk space to allow the merge to complete or move the .VHD and .AVHD file(s) to a volume with sufficient disk space and manually merge the snapshots.

Integration Components

Issue #1

On Windows Server 2008, when you attempt to install the Integration Components in a Hyper-V virtual machine running Windows Vista Service Pack 2, the installation may fail with the following error:

An error has occurred: One of the update processes returned error code 1.

Cause: This issue occurs if the management operating system (parent partition) that has the Hyper-V role installed does not have Service Pack 2 installed. If you have a virtual machine that’s running Windows Vista Service Pack 2, you need to use the Vmguest.iso from Service Pack 2 to install the Integration Components.

Resolution: Perform the steps documented in KB974503.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause:  Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution:  Install SP2 in the Server 2003 VM before installing the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

  • The requested operation cannot be performed on a file with a user-mapped section open. ( 0x800704C8 )
  • ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID {7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).
  • The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause:  This issue can occur when antivirus software is installed in the parent partition and its real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution: Perform the steps documented in KB961804.

Issue #2

Customer has multiple Hyper-V servers and virtual machines are getting duplicate MAC addresses.

Resolution: Configure the Hyper-V servers to use unique MAC address ranges by modifying the MinimumMacAddress and MaximumMacAddress registry values on each Hyper-V server. This issue is documented on TechNet: http://technet.microsoft.com/en-us/library/dd582198(WS.10).aspx. On Server 2008 R2, the MAC address ranges can be configured in the UI.
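
When planning unique ranges for several hosts, it can help to confirm that no two ranges intersect. The following is a minimal Python sketch, assuming you have already collected each host's MinimumMacAddress and MaximumMacAddress values by hand; the host names and ranges shown are made up for illustration.

```python
def mac_to_int(mac: str) -> int:
    """Convert a MAC such as '00-15-5D-01-00-00' or '00155D010000' to an integer."""
    digits = mac.replace("-", "").replace(":", "")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return int(digits, 16)


def find_overlaps(ranges: dict) -> list:
    """Return pairs of host names whose [minimum, maximum] MAC ranges intersect."""
    spans = sorted(
        (mac_to_int(low), mac_to_int(high), host)
        for host, (low, high) in ranges.items()
    )
    overlaps = []
    for i, (_, high_a, host_a) in enumerate(spans):
        for low_b, _, host_b in spans[i + 1:]:
            if low_b > high_a:
                break  # spans are sorted by minimum, so later ranges cannot overlap
            overlaps.append((host_a, host_b))
    return overlaps


# Hypothetical hosts and ranges, for illustration only.
hosts = {
    "HV01": ("00-15-5D-01-00-00", "00-15-5D-01-00-FF"),
    "HV02": ("00-15-5D-01-00-80", "00-15-5D-01-01-7F"),
    "HV03": ("00-15-5D-02-00-00", "00-15-5D-02-00-FF"),
}
print(find_overlaps(hosts))
```

Any pair reported by find_overlaps shares at least one address, so one of the two hosts would need a different range.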

Issue #3

Virtual machines have a state of "Paused-Critical"

Cause: Lack of free disk space on the volume hosting the .vhd or .avhd files.

Resolution: Free up disk space on the volume hosting the .vhd or .avhd files.

High Availability (Failover Clustering)

Issue #1

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node in the cluster.

Cause:  The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution:  We have a KB article (KB 2000016) which discusses this issue for Windows 2008. On Windows 2008 R2, the experience has improved. If the virtual machine settings are modified within the Failover Cluster Management console, changes that are made to the VM will be saved to the Cluster (i.e. synchronized across all nodes in the cluster). If you make changes to the VM using the Hyper-V Manager Console, you must select the refresh virtual machine configuration option before the VM is moved to another node. This issue is documented in the Windows Server 2008 R2 help file. There is also a blog that discusses this.

Issue #2

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is available which covers how to configure Hyper-V on a Failover Cluster.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to back up a Hyper-V virtual machine:

  • If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

  • The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution:  An update (KB959962) is available to address issues with backing up and restoring Hyper-V virtual machines.

Issue #2

How to back up virtual machines using Windows Server Backup

Resolution: Perform the steps documented in KB958662.

Virtual Network Manager

Issue #1

Virtual machines are unable to access the external network.

Cause: The virtual network was configured to use the wrong physical NIC.

Resolution: Configure the external network to use the correct NIC.

Issue #2

After the customer configured a virtual machine to use a VLAN ID, the virtual machine is unable to access the network.

Cause: The VLAN ID used by the virtual machine didn’t match the VLAN ID configured on the network switch.

Resolution: How to configure a virtual machine to use a VLAN is covered in the Hyper-V Planning and Deployment guide.

Issue #3

How to configure a virtual machine to use a VLAN.

Resolution: How to configure a virtual machine to use a VLAN is covered in the Hyper-V Planning and Deployment guide.

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution:  The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

Miscellaneous

Issue #1

You may experience one of the following issues on a Windows Server 2003 virtual machine:

  • An Event ID 1054 is logged to the Application Event log:

Event ID: 1054
Source: Userenv
Type: Error
Description:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted). Group Policy processing aborted.

  • A negative ping time is displayed when you use the ping command.

  • Perfmon shows high disk queue lengths.

Cause: This problem occurs when the time-stamp counters (TSC) for different processor cores are not synchronized.

Resolution: Perform the steps documented in KB938448.

As always, we hope this has been informative for you.

BTW – Did I mention we are strongly recommending installing Windows Server 2008 SP2 on all Hyper-V servers?  Have a good one!

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Update to KB968912

Hello, my name is Scott McArthur. I am a Senior Support Escalation Engineer in the Windows group and today’s blog will cover a number of issues we have encountered here in support with the following update. These updates will eventually be incorporated into the Knowledge Base article but we wanted to get this information out as soon as possible.

968912 An update is available that allows KMS to provide activation for Windows 7 and for Windows Server 2008 R2: http://support.microsoft.com/default.aspx?scid=kb;EN-US;968912

Issue #1:

Windows Service Pack 2 is a requirement to install this update. If you do not have it installed you receive the following message:

"The update does not apply to your system"

We are correcting the article to include this information.

Issue #2:

Even after installing this update, you may encounter the same error message you received prior to installing the update:

Error: 0xc004f050 The Software Licensing service reported that the product key is invalid

This is a known issue we are working on. The workaround is to run the following commands in an elevated command prompt:

Net Stop SLSVC

Net Start SLSVC

Note: A 2nd reboot will also correct this.

The issue is that the reboot of the update occurs before additional licenses get loaded so the service must be restarted to recognize them. We are investigating if we can address this in the update or if the workaround will need to be documented as part of the article.

Issue #3:

One additional caveat with this update is that if you try to install your KMS host key you may receive this error message

0xc004f015: The Software Licensing Service reported that the license is not installed.

SL_E_PRODUCT_SKU_NOT_INSTALLED

This can occur if you are trying to install a KMS host key on the incorrect KMS Host SKU. For example

  • Installing a Client KMS host key on Windows Server 2008
  • Installing a Group B KMS host key on Windows Server 2008 Datacenter KMS host
  • Installing a Group A KMS host key on Windows Server 2008 Standard, Enterprise, or Datacenter KMS host

This is by design. The following lists the type of KMS host key and the operating systems it can be installed on to setup a KMS host.

Group Definitions:

Client: Windows Vista/Windows 7 VL Editions (Business, Enterprise, Professional)

Group A: Windows Server 2008 Web, Windows Server 2008 R2 Web, Windows Server HPC 2008, Windows Server HPC 2008 R2

Group B: Windows Server 2008/2008R2 Enterprise, Windows Server 2008/2008R2 Standard

Group C: Windows Server 2008/2008R2 Datacenter, Windows Server 2008/2008R2 for Itanium Editions

Group | Can be installed on | Can Count
Windows 7 Client | Client | Client
Windows Server 2008R2 KMS_A | A | Client and A
Windows Server 2008R2 KMS_B | A or B | Client, A, and B
Windows Server 2008R2 KMS_C | A or B or C | Client, A, B, and C
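
The rules in the table above can be written down as a small lookup. This is only a Python sketch of the table as published here, not an interface to the licensing service; the function names are hypothetical.

```python
# Each row of the table: host key -> (host groups it installs on, groups it can count).
KMS_HOST_KEYS = {
    "Windows 7 Client": ({"Client"}, {"Client"}),
    "Windows Server 2008R2 KMS_A": ({"A"}, {"Client", "A"}),
    "Windows Server 2008R2 KMS_B": ({"A", "B"}, {"Client", "A", "B"}),
    "Windows Server 2008R2 KMS_C": ({"A", "B", "C"}, {"Client", "A", "B", "C"}),
}


def can_install(host_key: str, host_group: str) -> bool:
    """True if the table lists host_group under 'Can be installed on' for host_key."""
    installable_on, _ = KMS_HOST_KEYS[host_key]
    return host_group in installable_on


def can_count(host_key: str, client_group: str) -> bool:
    """True if a KMS host using host_key can count systems in client_group."""
    _, countable = KMS_HOST_KEYS[host_key]
    return client_group in countable


# A Group B key on a Datacenter (Group C) host is one of the failing cases listed earlier.
print(can_install("Windows Server 2008R2 KMS_B", "C"))
```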

Note: KMS 1.2 for Windows Server 2003 can accept any KMS key, and can count the appropriate editions provided you have the following update installed.

968915 An update is available that installs Key Management Service (KMS) 1.2 for Windows Server 2003 Service Pack 2 (SP2) and for later versions of Windows Server 2003: http://support.microsoft.com/default.aspx?scid=kb;EN-US;968915

For more information on Activation, please see the Volume Activation Portal on Technet.

http://technet.microsoft.com/en-us/windows/dd197314.aspx

Hopefully this helps with your deployments. Continue to watch our blog for more information on activation.

Scott McArthur
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Working on an Application Compatibility Issue? Let us Help!

Windows 7 is only a few weeks away!  The buzz is building!  However, if your applications aren’t quite ready for Windows 7 (or even Vista) and you’re having issues, then maybe you’re not quite as excited as I am.  But – there is good news - we may be able to help you out!

Last Monday we launched a new pilot program in our Advisory Services space.  Advisory Services is a consultative support option that provides support beyond standard break-fix issues.  The new program involves remote, phone-based support for issues such as product migration, code review or new program development.  The service is intended for Developers and IT Professionals for shorter engagements that don’t require traditional onsite consulting or sustained account management services available via other Microsoft support options.

For the Application Compatibility engagements, we’ll start off with some basic scoping questions such as whether the application is 16-, 32-, or 64-bit.  Is it a client-server application?  What compatibility issues are you experiencing?  Slow Performance?  Hang or Crash?  Installation problems?  The support engineers will be using tools such as the Application Compatibility Toolkit, the Standard User Analyzer Wizard, and the Setup Analysis Tool.

There’s much more to the program than I can do justice to in a blog post.  The KB Article referenced below has more details about the program and how to engage us.  So, if you’re working on a pesky Windows Vista or Windows 7 Application Compatibility issue, give us a call – we can help!

Additional Resources:

DPM 2007 - Troubleshooting protection for Hyper-V

This post is about servers running Windows Server 2008 with the Hyper-V role installed that are being protected by System Center Data Protection Manager 2007.  There may be one or many Virtual Machines on each Host/Parent Partition, and they may be running Windows 2003 and/or Windows 2008.  If the DPM Agent is installed only on the Host/Parent partition of the Hyper-V server, you may find that DPM jobs fail intermittently on the 2003 VM’s, while jobs on the 2008 VM’s complete successfully.  The following error may be encountered:

Type: Recovery point
Status: Failed
Description: DPM encountered a retryable VSS error. (ID 30112 Details:
Unknown error (0x800423f3) (0x800423F3))
End time: 4/23/2009 3:37:22 PM
Start time: 4/23/2009 3:36:38 PM
Time elapsed: 00:00:44
Data transferred: 0 MB
Cluster node -
Recovery Point Type Express Full
Source details: \Backup Using Child Partition Snapshot\%ServerName%
Protection group: %ProtectionGroupName%

We found these jobs fail when the Volume Shadow Copy Service (VSS service) on the guest VM is in a “Stopping” state and the only way to get the service in a good condition is to kill the process or reboot the VM.  If the VSS service is in this “Stopping” state the next DPM job will fail.  But if you first verify the VSS service is in a correct state (running or stopped) the DPM job will run successfully.  However, once the DPM job is done you may see the VSS service stuck in the “Stopping” state. This service should automatically stop after 3 minutes of idle time but intermittently it may not stop.  We experienced this behavior across several Hosts and almost all VM’s in a particular environment.  The behavior is random but a few VM’s experience the problem more frequently than others.  We also noticed if the VM is rebooted it will likely work without issues for a few days before the problem re-occurs.
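
One way to verify the service state before a job runs is to parse the output of `sc query VSS` from the guest. Below is a minimal Python sketch, assuming English-language `sc` output in which the “Stopping” state appears as STOP_PENDING; the helper names are hypothetical and the sample text is illustrative.

```python
import re
import subprocess
from typing import Optional


def parse_service_state(sc_output: str) -> str:
    """Extract the state name (RUNNING, STOPPED, STOP_PENDING, ...) from `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    if match is None:
        raise ValueError("no STATE line found in sc output")
    return match.group(1)


def vss_ready_for_backup(sc_output: str) -> bool:
    """The article treats running or stopped as good states; anything else is suspect."""
    return parse_service_state(sc_output) in {"RUNNING", "STOPPED"}


def query_vss(computer: Optional[str] = None) -> str:
    """Run `sc [\\\\computer] query VSS` and return its text output (Windows only)."""
    cmd = ["sc"] + ([rf"\\{computer}"] if computer else []) + ["query", "VSS"]
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


# Illustrative output for a service stuck while stopping.
SAMPLE = """
SERVICE_NAME: VSS
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 3  STOP_PENDING
                                (NOT_STOPPABLE, NOT_PAUSABLE, IGNORES_SHUTDOWN)
"""
print(parse_service_state(SAMPLE), vss_ready_for_backup(SAMPLE))
```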


When running the vssadmin list writers command (see http://technet.microsoft.com/en-us/library/cc754968(WS.10).aspx), the “Microsoft Hyper-V VSS Writer” on the host appeared in a “Failed” state with a “Retryable error” last error whenever the job failed.  Ordinarily the writer shows a “Stable” state and “No error”, as follows.

image

When the jobs fail, the above command will return:

Writer name: 'Microsoft Hyper-V VSS Writer'
Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
Writer Instance Id: {59f449f9-2413-494d-b679-965bc56129fd}
State: [8] Failed
Last error: Retryable error
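
If you capture this output regularly, a short script can flag writers that are not healthy. The Python sketch below assumes the text layout shown above and treats any writer whose state is not “Stable” or whose last error is not “No error” as unhealthy; the second writer in the sample is illustrative.

```python
import re


def parse_writers(vssadmin_output: str) -> list:
    """Parse `vssadmin list writers` text into one dict per writer."""
    writers = []
    for raw_line in vssadmin_output.splitlines():
        line = raw_line.strip()
        name = re.match(r"Writer name: '(.*)'$", line)
        if name:
            writers.append({"name": name.group(1)})
        elif writers and ": " in line:
            key, value = line.split(": ", 1)
            writers[-1][key] = value
    return writers


def unhealthy_writers(vssadmin_output: str) -> list:
    """Names of writers that are not Stable or that report a last error."""
    return [
        w["name"]
        for w in parse_writers(vssadmin_output)
        if not w.get("State", "").endswith("Stable")
        or w.get("Last error") != "No error"
    ]


SAMPLE = """
Writer name: 'System Writer'
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Hyper-V VSS Writer'
   Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
   Writer Instance Id: {59f449f9-2413-494d-b679-965bc56129fd}
   State: [8] Failed
   Last error: Retryable error
"""
print(unhealthy_writers(SAMPLE))
```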

After installing Service Pack 2 for Windows 2008, or hotfix KB967560 (see Resolution section below) and running another DPM job on a different VM that has the VSS service in a non-stopping state, the job will run successfully and place the Hyper-V VSS Writer back into a “Stable” state.

Quick reference for different possible scenarios:


VM in good state + Host in good state = Good backup
VM in bad state + Host in good state = Failed backup
VM in bad state + Host in bad state = Failed job
VM in good state + Host in bad state = Good job

Another possible symptom, when a DPM job is running you may notice on the Hyper-V Management screen next to the VM it displays this message:   “Creating VSS Snapshot Set…”.  This will continue to be displayed and when looking at the Volume Shadow Copy Service inside the W2K3 VM you may notice the service is stopped.  Additionally, when running “vssadmin list writers” on the Host a message is displayed but no writers are visible:

“Waiting for responses. These may be delayed if a shadow copy is being prepared.”

When this condition occurs you may not be able to cancel the DPM job.  When trying to cancel the DPM job you may get the following in the detailed pane:


Type: Recovery point
Status: Attempting to cancel
End time: -
Start time: 4/28/2009 10:00:21 AM
Time elapsed: 01:01:22
Data transferred: -
Cluster node -
Recovery Point Type Express Full
Source details: \Backup Using Child Partition Snapshot\%Servername%
Protection group: %ProtectionGroupName%


Looking in the System Event log on the W2K3 VM during the times when the DPM jobs failed it may be clean. But the Application event log may be filled with the VSS errors below:


Event Type: Error
Event Source: VSS
Event Category: None
Event ID: 8193
Date: 2/24/2009
Time: 8:46:34 AM
User: N/A
Computer: %SystemName%
Description:
Volume Shadow Copy Service error: Unexpected error calling routine IEventSystem::Store. hr = 0x80040206.


Event Type: Error
Event Source: VSS
Event Category: None
Event ID: 12302
Date: 2/24/2009
Time: 5:58:49 AM
User: N/A
Computer: %SystemName%
Description:
Volume Shadow Copy Service error: An internal inconsistency was detected in trying to contact shadow copy service writers. Please check to see that the Event Service and Volume Shadow Copy Service are operating properly.


When viewing the System and Application event logs on the DPM server neither have any entries for the same times as the job failures. But the DPM event log may have the following entry:


Event Type: Error
Event Source: DPM-EM
Event Category:None
Event ID: 2
Date: 2/19/2009
Time: 9:20:20 AM
User: N/A
Computer: %DPMServername%
Description:
Creation of recovery points for Backup Using Child Partition Snapshot\%ProtectedServerName-VM% on %HOSTName% have failed. The last recovery point creation failed for the following reason: (ID: 3159) DPM encountered a retryable VSS error. (ID: 30112)
DPM ID: 2^|^%DPMServername%^|^Recovery point creation failures^|^DPM^|^Backup^|^%HOSTName% ^|^a48c6c91-f4ae-4ed3-b5da-a3c22d980a48

 

RESOLUTION

The Hyper-V issue seems to be the result of the underlying state of VSS.  VSS is hung in the "stopping" state because the registry writer is hung attempting to unregister a COM+ event subscription.  This is a subscription for listening for COM messages from other VSS components.  Analysis of the logs captured during the problem showed that the unsubscribe function had been waiting eight minutes when the trace ended (and still had not completed).

It could be that the machine is having COM issues. The VSS service is not going to be successful with processing subsequent jobs until this unsubscribe completes.  If you experience any of the symptoms mentioned above, you should perform all of the action items noted below.


Action Item #1:


Verify all the Prerequisites are met for protecting Hyper-V with DPM:

Prerequisites and Known Issues with Hyper-V Protection
http://technet.microsoft.com/en-us/library/dd347840.aspx

Action Item #2:


Online backups are not possible if any of the following conditions are not met.  Verify that all the W2K3 VM’s meet these requirements:

1.  Hyper-V Integration components are installed and are running the latest version

NOTE: On the Host/Parent partition, check that VMMS.exe is version 6.0.6001.22352 (or newer), and in the guest, check that vmbus.sys is version 6.0.6001.22334 (or newer).

2. No Dynamic disks inside the guest.

3. All volumes are NTFS

4. All NTFS volumes must be >1GB and have >300MB free space.

5. Shadow copies within the VM are on the same volume or are Disabled

6. VM is in running state.

NOTE: Offline Backups of Windows 2000 Guest VMs fail. Cause: A synthetic SCSI Controller was configured for the VM with no drives attached. Windows 2000 Guests do not support the SCSI Controller, so it is not needed.
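
When checking the file versions mentioned in item 1, compare the dotted version numbers part by part rather than as plain text. Here is a minimal Python sketch, assuming you have already read the installed versions from the file properties; the function names are hypothetical.

```python
# Minimum versions quoted in the note under item 1.
MINIMUM_VERSIONS = {
    "VMMS.exe": "6.0.6001.22352",
    "vmbus.sys": "6.0.6001.22334",
}


def parse_version(text: str) -> tuple:
    """Turn '6.0.6001.22352' into (6, 0, 6001, 22352) so parts compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(file_name: str, installed_version: str) -> bool:
    """True if installed_version is the listed minimum for file_name or newer."""
    minimum = MINIMUM_VERSIONS[file_name]
    return parse_version(installed_version) >= parse_version(minimum)


print(meets_minimum("VMMS.exe", "6.0.6001.9999"))
```

A plain string comparison would rank "6.0.6001.9999" above "6.0.6001.22352", which is why the sketch converts each part to an integer first.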

Action Item #3:


The root cause of the symptoms noted in the Problem section appears to be COM related. After verifying the action items above, install the following COM updates:


KB934016 "Availability of Windows Server 2003 Post-Service Pack 2 COM+ 1.5 Hotfix Rollup Package 12"

http://support.microsoft.com/default.aspx?scid=kb;EN-US;934016

KB965230 "FIX: The COM+ Event System does not deliver timely or reliable statistics to subscribers of the IComTrackingInfoEvents event interface in Windows Server 2003"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;965230

KB968447 "The COM+ Event System stops processing the query for matching subscriptions when it detects a corrupted subscription on a Windows Server 2003-based computer"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;968447

Action Item #4:

Install the following two W2K3 VSS updates on the W2K3 virtual machines:


KB940349 “Availability of a Volume Shadow Copy Service (VSS) update rollup package for Windows Server 2003 to resolve some VSS snapshot issues”

http://support.microsoft.com/default.aspx?scid=kb;EN-US;940349


KB969219 “RPC 0x800706ba and 0x800706bf errors occur when backup software tries to create VSS shadow copies on a computer that is running Windows Server 2003 SP2”
http://support.microsoft.com/default.aspx?scid=kb;EN-US;969219


Install the latest VSS/Volsnap update on the W2K3 VM’s. If the Host is also running W2K3 it will be a good idea to also install:
KB967551 “Rollup update for the volsnap.sys driver in Windows Server 2003”
http://support.microsoft.com/default.aspx?scid=kb;EN-US;967551


Action Item #5:


If possible, install W2K8 SP2 since it will include the most recent Hyper-V writer updates. But, there are situations where installing SP2 will not be an option. As an alternative you can install KB967560 and KB971394 on the Windows Server 2008 Host machine.

The KB967560 update is more recent than KB959978, which addresses a known issue when you run a Windows Server 2003-based virtual machine on a Windows Server 2008 Hyper-V-based computer:


KB967560 “A backup operation fails on a two-node failover cluster that is running Windows Server 2008 after one of the disk resources is moved”

http://support.microsoft.com/default.aspx?scid=kb;EN-US;967560

KB971394 "A backup of virtual machines fails when you use the Hyper-V VSS writer to back up virtual machines concurrently on a computer that is running Windows Server 2008"
http://support.microsoft.com/default.aspx?scid=kb;EN-US;971394


How to obtain the latest service pack for Windows Server 2008
http://support.microsoft.com/kb/968849


ADDITIONAL INFORMATION:


Virtualization with Hyper-V: Supported Guest Operating Systems
http://www.microsoft.com/windowsserver2008/en/us/hyperv-supported-guest-os.aspx

Author:
Tom O’Malley
Microsoft Enterprise Support
Sr. Support Escalation Engineer

 

 

 

Adding New Timezones to Windows XP/Windows Server 2003 Sysprep.inf deployments

Hello, my name is Scott McArthur. I am a Senior Support Escalation Engineer in the Windows group and today’s blog will cover specifying new and updated timezones in sysprep.inf for Windows XP and Windows Server 2003.

When deploying Windows XP or Windows Server 2003 with a sysprep image you must specify the timezone entry. For example

[GuiUnattended]

Timezone=035

The deploy.chm that ships in the deploy.cab has a listing of timezones but there have been changes since those helpfiles were created. For example there is a new time zone called “Morocco Standard Time”. To determine the entry to add do the following

1. Install Windows XP/Windows Server 2003 and install all updates including the latest Daylight Savings Time (DST) update. See http://support.microsoft.com/gp/cp_dst for more information

2. Open regedit.exe

3. Browse to HKLM\Software\Microsoft\Windows NT\CurrentVersion\Time Zones\

4. Choose the timezone you are looking for

5. Under the timezone, click the Index registry value and note the decimal number in parentheses

image

6. In your sysprep.inf add the following

[GuiUnattended]

Timezone=-2147483725

Note: The minus in front of the number needs to be included

If you have the hex value for Index you can also convert it using calc.exe using the following steps

1. Open calc.exe

2. Click Hex

3. Input the hex value. For example 0x8000004d

4. Click Dec

5. You will get the decimal value, 2147483725
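
The same conversion can be scripted. This is a minimal Python sketch that follows the steps above: it converts the hex Index to decimal and then prefixes the minus sign as the note above requires. It covers only new-style Index values like the example, and the function names are hypothetical.

```python
def index_to_decimal(hex_index: str) -> int:
    """Convert a hex Index value such as '0x8000004d' to its decimal form."""
    return int(hex_index, 16)


def sysprep_timezone_line(hex_index: str) -> str:
    """Build the sysprep.inf entry, adding the leading minus described in the note."""
    return f"Timezone=-{index_to_decimal(hex_index)}"


print(sysprep_timezone_line("0x8000004d"))
```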

Using this process you can add any new timezones to your sysprep.inf. Note that Windows Vista and later use a different syntax for timezones, so this issue only applies to Windows XP and Windows Server 2003.

Scott McArthur
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

How to Use the MigrateDatasourceDataFromDPM.ps1 DPM PowerShell Script to Move Data

The MigrateDatasourceDataFromDPM.ps1 DPM PowerShell script is included in Service Pack 1 of Data Protection Manager 2007.

The MigrateDatasourceDataFromDPM is a command-line script that lets you migrate DPM data for individual “data source(s)” or all Replica volumes and recovery point volumes to different physical disks. Such a migration might be necessary when your disk is full and cannot be expanded, your disk is due for replacement, or disk errors show up.

Depending on how you have configured your environment, this could mean one or more of the following scenarios for moving data source data:

· DPM Physical disk to another DPM Physical disk

· DPM Data source to different DPM Physical disk

· DPM Data source to Custom volume.

The MigrateDatasourceDataFromDPM script moves all data for a data source or disk to the new volume or physical disk. After migration is complete, the original disk from which the data was migrated is not chosen for hosting any NEW backups. However, the recovery points located on the source disk can be used for restores until they expire.

Note: You must retain your old disks until all recovery points on them expire. After the recovery points expire, DPM automatically de-allocates the replicas and recovery point volumes on these disks.

All backup schedules continue to apply and protection of the data source continues as before, but will use the new disk.

After migrating the replica of a data source that has secondary protection enabled, you must start the Modify Protection Group wizard on the secondary DPM server, select the same data source, and complete the wizard. This reconfigures secondary backups to run from the new replica volume on the primary DPM server.

I will walk you through the steps for migrating a data source (disk and data) to help you understand the required commands and the results you should see once each command has completed successfully.

In this first scenario we are going to use the MigrateDatasourceDataFromDPM to conduct a DPM disk to DPM disk migration from start to finish.

In the example below you can see in Disk Manager that Disk 1 and Disk 2 are used for the DPM storage pool, and the replica and recovery volumes are spread across both disks.

clip_image002

From within the DPM UI Protection Group Tab you will see that we have four protection groups with a number of different data sources (Share, SQL, Volume, etc.)

clip_image004

Within the DPM UI Management Tab under Disks you see that we have Disk 1 and Disk 2 allocated to the DPM storage pool

clip_image006

Now we have added two new physical disks to the DPM server, which is running Data Protection Manager 2007 SP1. As you will note, Disk 3 (4.88GB) and Disk 4 (146.48GB) are listed in Disk Manager as unallocated basic disks.

clip_image008

After walking through the process of adding Disk 4 as an additional disk to the DPM Storage Pool, you will see that it is now listed in the DPM UI and shows up as 100% unallocated space.

Adding Disks to the Storage Pool

http://technet.microsoft.com/en-us/library/bb795901.aspx

clip_image010

We will now open the DPM command shell and run a command (Get-DPMDisk -DPMServerName <DPM Server Name>) to display the disks.

Get-DPMDisk -DPMServerName RKW2K3-DPM

In order to use the migration PowerShell command you must use a variable to hold the array of returned items. In the example below, we have used the variable $disk to hold the Get-DPMDisk -DPMServerName <DPM Server Name> output.

$disk = Get-DPMDisk -DPMServerName RKW2K3-DPM

After running the command you will notice that there are four disks listed, and they are not necessarily arranged in the order that Disk Management lists them. The NTDiskID is the physical disk number (zero based) that Disk Management shows in the GUI. Note that the NTDiskID values are not in numeric order and that disk 0 (the Windows operating system disk) is not included in the output.

clip_image012

We are now going to use the MigrateDatasourceDataFromDPM.ps1 script to migrate the DPM Physical Disk 1 to Physical Disk 4. ( $disk array element [2] to array element [1] )

(./MigrateDatasourceDataFromDPM.ps1 -DPMServerName <DPM Server Name> -Source $disk[n] -Destination $disk[n])

When using this command, the number inside the brackets of $disk[n] is not the NTDiskId; it is the element number in the array held in the $disk variable. This number is always zero based, meaning the first element, $disk[0], is physical disk 3 in the above screenshot.

In the output of the $disk command, DPM Physical Disk 1 is the third element in the list. Because the list starts at 0, Physical Disk 1 = [2] and Physical Disk 4 = [1], so our command will be as follows:

./MigrateDatasourceDataFromDPM.ps1 -DPMServerName RKW2K3-DPM -Source $disk[2] -Destination $disk[1]
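
Because the array element number and the NTDiskId are easy to confuse, the lookup can be illustrated outside of DPM. The Python sketch below only models the reasoning; it does not call any DPM cmdlets. The first three list entries come from the walkthrough, and placing disk 2 in the last position is an assumption.

```python
# NTDiskId values in the order the walkthrough says Get-DPMDisk returned them.
# Elements [0] to [2] are from the text; disk 2 in the last slot is assumed.
nt_disk_ids = [3, 4, 1, 2]


def element_for(nt_disk_id: int, ids: list) -> int:
    """Return the zero-based array element that holds the given NTDiskId."""
    return ids.index(nt_disk_id)


def migration_command(server: str, source_id: int, destination_id: int, ids: list) -> str:
    """Build the script invocation using array elements, not NTDiskId values."""
    return (
        f"./MigrateDatasourceDataFromDPM.ps1 -DPMServerName {server} "
        f"-Source $disk[{element_for(source_id, ids)}] "
        f"-Destination $disk[{element_for(destination_id, ids)}]"
    )


print(migration_command("RKW2K3-DPM", 1, 4, nt_disk_ids))
```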

clip_image013

The command may take some time depending on the number and size of the volumes on the source disk and once completed you will be back at the DPM Shell prompt.

clip_image015

You will now notice in Disk Management that the DPM replica and recovery point volume information that was located on Disk 1 and Disk 2 has been migrated to Disk 4. Any new recovery points for the respective data sources will now be located on the new volumes on the new disk. The original volume data on Disk 1 and Disk 2 will still need to be maintained until the recovery points on them expire. Once all recovery points expire on the old disk(s), they will appear as unallocated free space in Disk Management, and can then be removed from Windows or be reused.

The MigrateDatasourceDataFromDPM script moves all data for a data source or disk to the new disk or volume. After migration is complete, the original disk from where the data was migrated is not chosen for hosting any new backups. You must retain your old disks until all recovery points on them expire. After the recovery points expire, DPM automatically de-allocates the replicas and recovery point volumes on these disks.

clip_image017

Also, since we did a disk migration of Disk 1 to Disk 4, Disk 1 no longer shows up in the DPM UI and will not be used any further for the DPM Storage Pool. This is normal and expected.

clip_image019

After completing the disk to disk migration you will also notice that all of the Protection Groups which used Physical Disk 1 for either or both volumes (replica and Recovery Point) will now show up in DPM as Replica is inconsistent. This is normal and expected because changes have been made to the volumes, which will need to be re-synchronized by running a synchronization job with consistency check.

clip_image021

After we have completed the Synchronization job with consistency check, all of the Protection Groups are consistent and up to date and have a Protection Status of OK.

That concludes the Disk to Disk migration. In my next blog we will walk through the process of conducting a Data Source to Disk migration and see how this will help in minimizing the number of volumes a data source uses.

 

 

Author:
Robert Kierzek
Senior Support Engineer
Microsoft Corporation

 

Why is my 2008 Failover Clustering node blue screening with a Stop 0x0000009E?

John Marlin here from the Windows Cluster Support Team again and today I want to talk about the Stop 0x0000009E and hang detection in Windows Server 2008 Failover Clustering. Just to set some expectations for the blog, I am not going to tell you exactly what the problem is. Instead, I am going to show you what you will see depending on the settings you have in place and what the ramifications of those settings are. Some would see this as a flaw or a problem caused by Failover Clustering, but I want to put you at ease that the blue screen is not because of Failover Clustering. We are just reacting to a hanging or degraded condition that Windows is experiencing.

First, a brief explanation on the hang detection we have for Failover Clustering. The Clustering Service incorporates a detection mechanism that may detect unresponsiveness in user-mode components. This detection is a big deal in the high availability market that no one else incorporates. The Cluster Network Driver monitors the health of the Cluster based on periodic communication between its user-mode and kernel-mode components. This periodic communication between user-mode and kernel-mode is a heartbeat. We track these heartbeats through what is called a watchdog timer. This “watchdog” keeps counting from a set number down to zero. If the event it is monitoring occurs before it reaches zero, it resets to the starting number and starts counting down again. If the timer reaches zero, it performs some action that has been predefined or configured.
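
The countdown-and-reset behavior of a watchdog timer can be shown with a short model. The Python sketch below is a generic illustration driven by a simulated clock; it is not how the Cluster Network Driver is implemented, and the class and method names are hypothetical.

```python
class WatchdogTimer:
    """Counts down from `timeout`; a heartbeat resets it, reaching zero fires the action."""

    def __init__(self, timeout: int, on_expire):
        self.timeout = timeout
        self.remaining = timeout
        self.on_expire = on_expire
        self.expired = False

    def heartbeat(self) -> None:
        """The monitored event occurred, so start counting down from the top again."""
        if not self.expired:
            self.remaining = self.timeout

    def tick(self, seconds: int = 1) -> None:
        """Advance the simulated clock and run the configured action at zero."""
        if self.expired:
            return
        self.remaining -= seconds
        if self.remaining <= 0:
            self.expired = True
            self.on_expire()


actions = []
watchdog = WatchdogTimer(timeout=60, on_expire=lambda: actions.append("recovery action"))
watchdog.tick(59)      # almost expired
watchdog.heartbeat()   # heartbeat arrives in time, countdown restarts at 60
watchdog.tick(59)      # still one second left
watchdog.tick(1)       # no heartbeat arrived, so the action runs
print(actions)
```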

From a Windows perspective, watchdog timers can detect that basic kernel or user services are not executing. Resource starvation issues (including memory leaks, lock contention, and scheduling priority misconfiguration) can block critical user-mode components without blocking deferred procedure calls (DPCs) or draining the non-paged memory pool.

Kernel components can extend watchdog timer functionality to user mode by periodically monitoring critical applications. This bug check indicates that a user-mode health check failed in a way that prevents graceful shutdown. This bug check restores critical services by restarting or enabling application failover to other servers.

To see what your current Failover Clustering settings for these are, you can run the command:

cluster /cluster:clustername /prop

The Failover Clustering service has two properties that control this behavior:

ClusSvcHangTimeout

This property controls how long we wait between heartbeats before determining that the Cluster Service has stopped responding. The default for the ClusSvcHangTimeout is 60 seconds. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop ClusSvcHangTimeout=x

* where x is in seconds <<-- default is 60 seconds

HangRecoveryAction

This property controls the action to take if the user-mode processes have stopped responding. For the HangRecoveryAction, we actually have 4 different settings with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism. 
1 = Logs an event in the system log of the Event Viewer. 
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node.  <<-- default for 2008

If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=x

* where x is the action to take
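If you script these changes, it can help to validate the value before handing it to the cluster command. The following Python sketch builds the argument list for the two commands shown above and does nothing else. It does not run anything. The helper name `build_prop_command` and the rule that ClusSvcHangTimeout must be a positive integer are assumptions of this example, not documented limits.

```python
from enum import IntEnum


class HangRecoveryAction(IntEnum):
    DISABLED = 0    # disables the heartbeat and monitoring mechanism
    LOG_EVENT = 1   # logs an event in the System event log
    TERMINATE = 2   # terminates the Cluster Service
    BUGCHECK = 3    # causes a Stop error on the node (default for 2008)


def build_prop_command(cluster_name, prop, value):
    """Return the argument list for: cluster /cluster:<name> /prop <prop>=<value>"""
    if prop == "HangRecoveryAction":
        # Raises ValueError for anything other than 0, 1, 2 or 3.
        value = int(HangRecoveryAction(value))
    elif prop == "ClusSvcHangTimeout":
        value = int(value)
        if value <= 0:
            # Assumption for this sketch: only positive second counts make sense.
            raise ValueError("ClusSvcHangTimeout must be a positive number of seconds")
    else:
        raise ValueError(f"unsupported property: {prop}")
    return ["cluster", f"/cluster:{cluster_name}", "/prop", f"{prop}={value}"]


if __name__ == "__main__":
    print(" ".join(build_prop_command("clustername", "HangRecoveryAction",
                                      HangRecoveryAction.TERMINATE)))
    print(" ".join(build_prop_command("clustername", "ClusSvcHangTimeout", 60)))
```

The returned list could be passed to something like `subprocess.run` on a cluster node, but that step is deliberately left out here.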

Since HangRecoveryAction=3 (bugcheck the box) is the default, I will start with this one. This setting calls into Windows to bugcheck the machine and create a dump file (MEMORY.DMP). The type of dump file created is based on the settings in Windows (Kernel Dump by default). On one hand, you may ask, why would I want to blue screen my box and cause a brief production outage? On the other hand, if the node is in a hung or degraded state, forcefully powering the machine off may be your only recourse for moving the services over to another node. When hangs occur, connectivity and/or productivity can be severely impacted.

Keep in mind the following scenario of a hung machine. If Failover Clustering detects this problem in, say, one minute and forces a failover that takes another 2 minutes to bring everything online, you have been down 3 minutes. If this was not in place, it may take users several minutes to notice there is some sort of problem. They may wait several more minutes before calling the helpdesk to report the problem. Then the helpdesk takes several minutes to log the problem. On it goes before someone can eventually get to the machine to see what is going on. Say they go ahead and hard power off the machine to get your services back into production. What if this took 45 minutes? In a company that values high availability, this additional 42 minutes could have cost you thousands or even millions of dollars!
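The arithmetic in this scenario is simple enough to write down, which makes it easy to plug in your own numbers. The sketch below uses the figures from the paragraph above (detection in 60 seconds, which matches the default ClusSvcHangTimeout, a 2-minute failover, and a 45-minute manual recovery). The function names are made up for the example.

```python
def automatic_outage_seconds(hang_timeout_s=60, failover_s=120):
    """Downtime when hang detection forces a failover: detection time plus failover time."""
    return hang_timeout_s + failover_s


def extra_downtime_minutes(manual_outage_min, hang_timeout_s=60, failover_s=120):
    """Additional minutes of downtime when recovery is manual instead of automatic."""
    return manual_outage_min - automatic_outage_seconds(hang_timeout_s, failover_s) / 60


if __name__ == "__main__":
    print(automatic_outage_seconds() / 60, "minutes with hang detection")
    print(extra_downtime_minutes(45), "additional minutes with a manual power-off")
```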

What if it was determined that you needed to get Microsoft involved at this point? What data can you provide? In most cases of hung or degraded machines, the engineer would want the following:

  • System Event Log
  • Application Event Log
  • Performance Log (if any)
  • Pool Monitor Log (if any)
  • Dump file (if any)

Without this setting in place, you would be left with only the event logs. If nothing there points to anything concrete, which seems to be the case most of the time, you would need to configure the system to capture more data and wait for this to happen again. With the Failover Clustering HangRecoveryAction setting in place, you would have a dump file (a snapshot in time) to go through that could point to the cause of the hang, which you can then correct right away.

So, say you have this problem. It will bugcheck only the box having this issue and reboot it. Because a reboot occurred, all resources that were on this node move to another node and come online to get you back into production. After this node reboots, you would see the following event in the System Event Log:

Event Type:  Information
Event ID:  1001
Source:  BugCheck
Description:  The computer has rebooted from a bugcheck.  The bugcheck was 0x0000009E (process id, timeout value, reserved, reserved).

The Stop Error values (in parentheses) will vary. These are the meanings of the entries:

process id  =  Process that failed to satisfy a health check within the configured timeout
timeout value  =  Health monitoring timeout (seconds)
reserved  =  will always be zeroes
reserved  =  will always be zeroes

So now we see the event, let's take a look at a dump file. The dump file I am using is from a 64-bit machine.

0: kd> .bugcheck
Bugcheck code 0000009E
Arguments fffffa80`0fdef7e0 00000000`0000003c 00000000`00000000 00000000`00000000

Looking at the Process above, we can see that it is the Cluster Service.

0: kd> !process fffffa800fdef7e0 0
PROCESS fffffa800fdef7e0
    SessionId: 0  Cid: 0a40    Peb: 7fffffd8000  ParentCid: 02e8
    DirBase: 2355da000  ObjectTable: fffff880089cb830  HandleCount: 4288.
    Image: clussvc.exe
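The second bugcheck argument in the .bugcheck output above is the health monitoring timeout, printed in hexadecimal. A small Python sketch can decode the line. The function name `parse_bugcheck_args` is an assumption of this example, and the input is the Arguments line copied from the dump above.

```python
def parse_bugcheck_args(line):
    """Parse a debugger 'Arguments ...' line into four integers."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "Arguments":
        raise ValueError("expected 'Arguments' followed by four values")
    # 64-bit values are printed as high`low; drop the backtick before parsing.
    return [int(field.replace("`", ""), 16) for field in fields[1:]]


args = parse_bugcheck_args(
    "Arguments fffffa80`0fdef7e0 00000000`0000003c "
    "00000000`00000000 00000000`00000000"
)
process_addr, timeout_s, reserved1, reserved2 = args

if __name__ == "__main__":
    print(f"process (value passed to !process): {process_addr:#x}")
    print(f"timeout: {timeout_s} seconds")
```

Here 0x3c decodes to 60 seconds, which matches the default ClusSvcHangTimeout.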

Looking at the thread that called the bugcheck, we see this:

0: kd> !thread
THREAD fffff80001dc4b80  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap                 fffff880000061c0
Owning Process            fffff80001dc50c0       Image:         Idle
Attached Process          fffffa80072d4110       Image:         System
Wait Start TickCount      0              Ticks: 108665 (0:00:28:15.184)
Context Switch Count      5054015            
UserTime                  00:00:00.000
KernelTime                00:20:09.319
Win32 Start Address nt!KiIdleLoop (0xfffff80001caab00)
Stack Init fffff80004331db0 Current fffff80004331d40
Base fffff80004332000 Limit fffff8000432c000 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 0
Child-SP          RetAddr           : Args to Child             : Call Site
fffff800`04331a18 fffffa60`011d63c8 : *** removed for space *** : nt!KeBugCheckEx
fffff800`04331a20 fffff800`01ca88b3 : *** removed for space *** : netft!NetftWatchdogTimerDpc+0xb8
fffff800`04331a70 fffff800`01ca9238 : *** removed for space *** : nt!KiTimerListExpire+0x333
fffff800`04331ca0 fffff800`01ca9a9f : *** removed for space *** : nt!KiTimerExpiration+0x1d8
fffff800`04331d10 fffff800`01caab62 : *** removed for space *** : nt!KiRetireDpcList+0x1df
fffff800`04331d80 fffff800`01e785c0 : *** removed for space *** : nt!KiIdleLoop+0x62
fffff800`04331db0 00000000`fffff800 : *** removed for space *** : nt!zzz_AsmCodeRange_End+0x4
fffff800`0432b0b0 00000000`00000000 : *** removed for space *** : 0xfffff800

From a debugging perspective, all we see is that the Cluster Service timed out its health monitoring, so netft!NetftWatchdogTimerDpc called into KeBugCheckEx. One point I want to stress again is that even though the dump was created because of the Cluster Service, this is not the cause or the focus of your problem resolution steps moving forward. There was something bad occurring with the system that we detected and reacted to. While it may appear extreme, it is one of the better options to ensure availability and faster recovery.

In dumps such as these, you would not want to focus on the Cluster Service and what it was doing, but rather approach it from a generic hang stance. Something in user mode caused the Cluster Service to become unresponsive, so user-mode processes and general hang debugging are your focus. For this blog, I am not going to go into debugging hang dumps. For more information on debugging hang dumps, you should visit our NTDebugging Blog site for steps, tricks, and tips. Something else to consider: since we create a dump based on the Windows crash settings, the default kernel dump may or may not show you the exact cause, because user-mode space is not kept. The crash setting of Complete Dump may need to be set for any future stop errors.

Let’s look at what happens if you change the HangRecoveryAction to terminate the Cluster Service. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=2

If we get a hang that we detect and need to react to, we would see the following in the System Event Log.

Event ID:  4870
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the Cluster Server process with a process ID '%1', for '%2' seconds. Recovery action will be taken.

* where %1 is the Process ID you would see in Task Manager
* where %2 is the value of ClusSvcHangTimeout

Event ID:  7031
Source:  Service Control Manager
Description:  The Cluster Service service terminated unexpectedly.

If you generate a Cluster Log, you would see the below:

processid:threadid GMT-time [ERR] Watchdog timer timeout for the client process (ID x) and it will terminate the client process.

* where x is the Process ID you would see in Task Manager

At that point, we are going to attempt to terminate the Cluster Service in order to attempt to move everything over to another node so that you can get back to production. When we are terminating the Cluster Service, taking resources offline, sending out notifications, etc, we are going to use user mode space to accomplish some of these tasks. If you have a hang in user mode, we may not be able to complete it. The reality is that the machine is in this degraded/hung state. We are going to try and gracefully recover from this state, and if we cannot, you may be looking at having to hard power the machine off in order to get things properly moved over anyway.

Troubleshooting this may be more difficult, as all you would have to look through would be the Event Logs and a Cluster Log (if generated). The Cluster Log only shows you what is going on with the Cluster, so it will most likely be of no use unless there were actual resource failures prior to the termination. An example would be a File Server resource failure with an Error 1130 (not enough server storage). You would then need to review the System Event Log for any performance-type errors (2019 nonpaged pool, 2020 paged pool, etc.) or check whether any other services failed shortly beforehand. But even then, you are not going to find the root cause of it. If you want to keep this setting, you would want to:

1. Use Task Manager to work with applications or services consuming large amounts of memory
2. Generate a System Diagnostics Report (perfmon /report)
3. Start Resource Monitor (perfmon /res)
4. Open Event Viewer and view events related to failover clustering
5. Run Performance Monitor over a longer period of time and look for anything there
6. Use any other hang-monitoring utilities you may have

Now, let’s look at what happens if you change the HangRecoveryAction to simply log an event. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=1

If we get a hang that we detect and need to react to, we would only see the following in the System Event Log.

Event ID: 4869
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the 'C:\Windows\Cluster\clussvc.exe' process with a process ID '%1', for '%2' seconds. Please use Performance Monitor to evaluate the health of the system and determine which process may be negatively impacting the system.

* where %1 is the Process ID you would see in Task Manager
* where %2 is the value of ClusSvcHangTimeout

This is all we are going to do. If a hanging condition persists over a long period of time, you could see this event repeat every 60 seconds (or whatever value you have set for ClusSvcHangTimeout). Since we do not react in any other way, we are basically at the mercy of Windows and how it reacts. If it hangs, then we may or may not be able to fail anything over. If it is not affecting the Cluster Service or any resources, we just run along as if nothing is going on. We could also see problems that do affect the resources and get inadvertent failovers due to loss of communication between the nodes, resource failures, etc. Just like the prior action, you would need to:

1. Use Task Manager to work with applications or services consuming large amounts of memory
2. Generate a System Diagnostics Report (perfmon /report)
3. Start Resource Monitor (perfmon /res)
4. Open Event Viewer and view events related to failover clustering
5. Run Performance Monitor over a longer period of time and look for anything there
6. Use any other hang-monitoring utilities you may have

The last action we have is to disable the health monitor checking. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=0

If we get a hang, then we do nothing, as we will detect nothing. As with action 1, we only react if the hang actually causes communication issues between the nodes or causes resources to fail. We will react to that, but that would be it.
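To pull the four settings together, here is a rough Python sketch that looks at System Event Log entries and reports which HangRecoveryAction behavior they are consistent with. It only encodes the event sources and IDs quoted in this post. The function name and the tuple layout `(source, event_id, description)` are assumptions of the example, and a result of None cannot distinguish a setting of 0 from a node that simply never hung.

```python
FAILOVER_CLUSTERING = "Microsoft-Windows-FailoverClustering"


def infer_hang_action(events):
    """events: iterable of (source, event_id, description) tuples from the System log.

    Returns 3, 2 or 1 when the matching evidence is present, otherwise None.
    """
    events = list(events)
    # Action 3: the node rebooted from a Stop 0x0000009E (BugCheck event 1001).
    for source, event_id, description in events:
        if (source == "BugCheck" and event_id == 1001
                and "0x0000009e" in description.lower()):
            return 3
    seen = {(source, event_id) for source, event_id, _ in events}
    # Action 2: event 4870 is logged before the Cluster Service is terminated.
    if (FAILOVER_CLUSTERING, 4870) in seen:
        return 2
    # Action 1: only event 4869 is logged.
    if (FAILOVER_CLUSTERING, 4869) in seen:
        return 1
    return None


if __name__ == "__main__":
    sample = [
        (FAILOVER_CLUSTERING, 4870, "User mode health monitoring has detected ..."),
        ("Service Control Manager", 7031,
         "The Cluster Service service terminated unexpectedly."),
    ]
    print(infer_hang_action(sample))
```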

I hope that this gives you a better knowledge and understanding of this feature. Remember, just because we create a dump or terminate the service, does not mean that Failover Clustering actually caused the issue or the downtime. On the contrary, Failover Clustering just reacted based on what the hang detection settings are and gets you back up into production quicker with the benefit of additional data that can be reviewed to assist getting a resolution of the true problem. Look at this from a performance perspective and treat it as you would any other stand-alone system that has sluggishness, hangs, etc.

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Running Hyper-V in a lab? Use Snapshots? Check this out!

The Hyper-V Snapshot feature (Checkpoint in SCVMM) is a very useful feature for Support Engineers. It allows us to revert the VM to a previous state irrespective of the local* changes made after the snapshot was taken. Working with customers on a daily basis necessitates having a system on which you can mirror the customer’s setup.

However, one frustrating issue you will experience eventually, if you haven’t already, is that after applying some snapshots, you’re no longer able to log into the domain. Disjoining and rejoining isn’t something you want to do when you need to test something quickly. To briefly explain what happens here, assume that a VM has its machine account password set to A. This is stored both locally and in the machine account in Active Directory. You take a snapshot of this VM and forget about it. The VM, as it chugs along, determines that it’s time to change its machine account password and goes ahead and does this. The VM sets its password to B both locally and in Active Directory. Now, you’ve decided to do some testing on this VM and Ka-boom! You’ve blown it to bits (though only locally, as stated before). You suddenly remember that you’ve got a snapshot. Lucky you! You apply it and believe everything’s going to be okay. And then you can’t log into the domain. Why? Because the VM is attempting to contact a domain controller using password A, which is no longer valid. The authenticating domain controller expects password B, but the VM is sending it A. That is pretty much all there is to it.

Enter DisablePasswordChange. This registry setting, which can be set using Group Policy, prevents the system from changing its machine account password with the domain controller every 30 days (by default).

At this stage, you’re probably thinking that preventing regular password change isn’t a good thing security-wise. You’re correct, it isn’t. However, in an isolated test environment (where all systems, domain controllers and domain members are VMs), the tradeoff is acceptable.
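The password A / password B story can be replayed as a toy model. The Python sketch below follows the simplified description in this post only and is not how Netlogon or Active Directory are implemented. Every class and method name is invented for the illustration, and the machine name LABVM is hypothetical.

```python
class DomainController:
    """Holds the machine account passwords, standing in for Active Directory."""

    def __init__(self):
        self.machine_passwords = {}

    def authenticate(self, name, password):
        return self.machine_passwords.get(name) == password


class VirtualMachine:
    def __init__(self, name, dc, disable_password_change=False):
        self.name = name
        self.dc = dc
        self.disable_password_change = disable_password_change
        self.local_password = "A"
        dc.machine_passwords[name] = "A"

    def rotate_password(self, new_password):
        """Periodic machine account password change (every 30 days by default)."""
        if self.disable_password_change:
            return  # the setting suppresses the change entirely
        self.local_password = new_password
        self.dc.machine_passwords[self.name] = new_password

    def snapshot(self):
        # Only local state is captured; the domain controller's copy is not.
        return {"local_password": self.local_password}

    def apply_snapshot(self, snap):
        self.local_password = snap["local_password"]

    def can_log_on_to_domain(self):
        return self.dc.authenticate(self.name, self.local_password)


def run_scenario(disable_password_change):
    dc = DomainController()
    vm = VirtualMachine("LABVM", dc, disable_password_change)
    snap = vm.snapshot()       # password is A everywhere
    vm.rotate_password("B")    # time passes and the VM changes its password
    vm.apply_snapshot(snap)    # revert: the VM is back to A locally
    return vm.can_log_on_to_domain()


if __name__ == "__main__":
    print("without DisablePasswordChange:", run_scenario(False))
    print("with DisablePasswordChange:   ", run_scenario(True))
```

Without the setting, the reverted VM presents A while the domain controller expects B. With the setting, the password never changes, so the snapshot stays valid.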

Here’s what you need to do to set this up on all systems in your VM Domain:

1. Create a new GPO on the VM Domain (so that it applies to all member systems in the Domain) and name it, say, Disable Machine Account Password Changes, so that it is easy to locate.

2. Edit it and make the following setting:

[Image: screenshot of the Group Policy setting]

3. This GPO setting will percolate to all the domain members (if there are no Group Policy errors) and take effect.

Snapshots that are taken after this setting is effective will have a much longer shelf life than those taken before and you can apply essentially any snapshot!

* Local changes mean only those which are completely local to the system. For example, a domain join or disjoin is not a completely local change since the machine account is created on a domain controller. Deleting all printers on a print server is an example of a local change.

Note: Snapshots should never be used for domain controllers, as domain controllers contain common information (that is, Active Directory) that is replicated between them. There are a variety of issues that you can run into, such as a USN rollback.

Richard Spitz
Support Engineer
Microsoft Enterprise Platforms Support
