Ask the Core Team

Microsoft Enterprise Support Windows Server Core Team

How to Use the MigrateDatasourceDataFromDPM.ps1 DPM PowerShell Script to Move Data

The MigrateDatasourceDataFromDPM.ps1 DPM PowerShell script is included in Service Pack 1 of Data Protection Manager 2007.

MigrateDatasourceDataFromDPM is a command-line script that lets you migrate DPM data, either for individual data sources or for all replica volumes and recovery point volumes on a disk, to different physical disks. Such a migration might be necessary when your disk is full and cannot be expanded, when your disk is due for replacement, or when disk errors show up.

Depending on how you have configured your environment, this could mean one or more of the following scenarios for moving data source data:

· DPM Physical disk to another DPM Physical disk

· DPM Data source to different DPM Physical disk

· DPM Data source to Custom volume.

The MigrateDatasourceDataFromDPM script moves all data for a data source or disk to the new volume or physical disk. After migration is complete, the original disk from which the data was migrated is not chosen for hosting any new backups. However, the recovery points located on the source disk can still be used for restores until they expire.

Note: You must retain your old disks until all recovery points on them expire. After the recovery points expire, DPM automatically de-allocates the replicas and recovery point volumes on these disks.
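
To make the retention rule concrete, here is a small Python sketch (an illustration only, not something DPM runs). The retention range and the recovery point dates are made-up values, and treating expiry as creation date plus retention range is a simplifying assumption about how DPM ages out recovery points.

```python
from datetime import date, timedelta

retention_range = timedelta(days=14)   # hypothetical retention range
recovery_points_on_old_disk = [        # hypothetical creation dates
    date(2009, 6, 1),
    date(2009, 6, 2),
    date(2009, 6, 3),
]

# The newest recovery point on the old disk is the last one to expire,
# so the disk has to be kept at least until that date.
last_expiry = max(recovery_points_on_old_disk) + retention_range
print(f"Keep the old disk until {last_expiry.isoformat()}")
```

With these hypothetical values the old disk would have to stay attached until 17 June 2009.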

All backup schedules continue to apply and protection of the data source continues as before, but will use the new disk.

After migrating the replica of a data source that has secondary protection enabled, you must start the Modify Protection Group wizard on the secondary DPM server, select the same data source, and complete the wizard. This reconfigures secondary backups to run from the new replica volume on the primary DPM server.

I will walk you through the steps for migrating a data source (disk and data) to help you understand the required commands and the results you should see once each command has completed successfully.

In this first scenario we are going to use MigrateDatasourceDataFromDPM to conduct a DPM disk to DPM disk migration from start to finish.

In the example below you can see in Disk Management that Disk 1 and Disk 2 are used for the DPM storage pool and that the replica and recovery point volumes are spread across both disks.

clip_image002

From within the DPM UI Protection Group Tab you will see that we have four protection groups with a number of different data sources (Share, SQL, Volume, etc.)

clip_image004

Within the DPM UI Management Tab under Disks you see that we have Disk 1 and Disk 2 allocated to the DPM storage pool

clip_image006

We have now added two new physical disks to the DPM server, which is running Data Protection Manager 2007 SP1. Note that Disk 3 (4.88 GB) and Disk 4 (146.48 GB) are listed in Disk Management as unallocated basic disks.

clip_image008

After walking through the process of adding Disk 4 as an additional disk to the DPM Storage Pool, you will see that it is now listed in the DPM UI and shows up as 100% unallocated space.

Adding Disks to the Storage Pool

http://technet.microsoft.com/en-us/library/bb795901.aspx

clip_image010

We will now open the DPM command shell and run a command (Get-DPMDisk -DPMServerName <DPM Server Name>) to display the disks.

Get-DPMDisk -DPMServerName RKW2K3-DPM

In order to use the migration PowerShell command you must use a variable to hold the array of returned items. In the example below, we have used the variable $disk to hold the Get-DPMDisk -DPMServerName <DPM Server Name> output.

$disk = Get-DPMDisk -DPMServerName RKW2K3-DPM

After running the command you will notice that four disks are listed, and they are not necessarily arranged in the order in which Disk Management lists them. The NTDiskID is the physical disk number (zero-based) that Disk Management shows in the GUI. Note that the NTDiskID values are not in numeric order and that Disk 0 (the Windows operating system disk) is not included in the output.

clip_image012

We are now going to use the MigrateDatasourceDataFromDPM.ps1 script to migrate the DPM Physical Disk 1 to Physical Disk 4. ( $disk array element [2] to array element [1] )

(./MigrateDatasourceDataFromDPM.ps1 -DPMServerName <DPM Server Name> -Source $disk[n] -Destination $disk[n])

When using this command, the number inside the brackets of $disk[n] is not the NTDiskID but the element number in the array held in the $disk variable. This number is always zero-based, meaning the first element, $disk[0], is Physical Disk 3 in the above screenshot.

Looking at the output of the $disk command, DPM Physical Disk 1 is the third element in the list. Counting from 0, this makes Physical Disk 1 = [2] and Physical Disk 4 = [1], so our command will be as follows:

./MigrateDatasourceDataFromDPM.ps1 -DPMServerName RKW2K3-DPM -Source $disk[2] -Destination $disk[1]
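
The difference between the array index and the NTDiskID is easy to get wrong, so here is a minimal Python sketch of the lookup. It is an illustration only and not part of DPM. The list order mirrors the screenshot described above (element 0 is physical disk 3, element 1 is disk 4, element 2 is disk 1), the position of disk 2 as the last element is inferred by elimination, and the helper name index_for_nt_disk_id is hypothetical.

```python
# Order in which Get-DPMDisk returned the disks in the walkthrough.
# The last entry (disk 2) is inferred; the post only states the first three.
nt_disk_ids = [3, 4, 1, 2]


def index_for_nt_disk_id(ids, nt_disk_id):
    """Return the zero-based array position of a physical disk number."""
    try:
        return ids.index(nt_disk_id)
    except ValueError:
        raise ValueError(f"disk {nt_disk_id} is not in the DPM disk list") from None


source = index_for_nt_disk_id(nt_disk_ids, 1)       # Physical Disk 1
destination = index_for_nt_disk_id(nt_disk_ids, 4)  # Physical Disk 4
print(f"-Source $disk[{source}] -Destination $disk[{destination}]")
```

The sketch arrives at the same indexes used in the command above: [2] for the source and [1] for the destination. Asking it for disk 0 raises an error, because the operating system disk never appears in the DPM disk list.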

clip_image013

The command may take some time depending on the number and size of the volumes on the source disk. Once it completes, you will be back at the DPM Shell prompt.

clip_image015

You will now notice in Disk Management that the DPM replica and recovery point volume information that was located on Disk 1 and Disk 2 has been migrated to Disk 4. Any new recovery points for the respective data sources will now be located on the new volumes on the new disk. The original volume data on Disk 1 and Disk 2 must still be maintained until the recovery points on them expire. Once all recovery points on the old disk(s) expire, the disks will appear as entirely unallocated free space in Disk Management and can then be removed from Windows or reused.

The MigrateDatasourceDataFromDPM script moves all data for a data source or disk to the new disk or volume. After migration is complete, the original disk from where the data was migrated is not chosen for hosting any new backups. You must retain your old disks until all recovery points on them expire. After the recovery points expire, DPM automatically de-allocates the replicas and recovery point volumes on these disks.

clip_image017

Also, since we migrated Disk 1 to Disk 4, Disk 1 no longer shows up in the DPM UI and will not be used any further for the DPM storage pool. This is normal and expected.

clip_image019

After completing the disk to disk migration you will also notice that all of the protection groups that used Physical Disk 1 for either or both volumes (replica and recovery point) now show up in DPM as "Replica is inconsistent". This is normal and expected because changes have been made to the volumes. They need to be re-synchronized by running a synchronization job with consistency check.

clip_image021

After we have completed the synchronization job with consistency check, all of the protection groups are consistent and up to date and have a Protection Status of OK.

That concludes the disk to disk migration. In my next blog we will walk through the process of conducting a data source to disk migration and see how this helps minimize the number of volumes a data source uses.

 

 

Author:
Robert Kierzek
Senior Support Engineer
Microsoft Corporation

 

Why is my 2008 Failover Clustering node blue screening with a Stop 0x0000009E?

John Marlin here from the Windows Cluster Support Team again. Today I want to talk about the Stop 0x0000009E and hang detection in Windows Server 2008 Failover Clustering. Just to set some expectations for the blog: I am not going to tell you exactly what the problem is. Instead, I am going to show you what you will see depending on the settings you have in place and what the ramifications of those settings are. Some would see this as a flaw or a problem caused by Failover Clustering, but I want to put you at ease: the blue screen is not caused by Failover Clustering. We are just reacting to a hung or degraded condition that Windows is experiencing.

First, a brief explanation of the hang detection we have in Failover Clustering. The Cluster Service incorporates a detection mechanism that can detect unresponsiveness in user-mode components. This detection is a big deal in the high availability market that no one else incorporates. The Cluster Network Driver monitors the health of the Cluster based on periodic communication between its user-mode and kernel-mode components. This periodic communication between user mode and kernel mode is a heartbeat. We track these heartbeats with what is called a watchdog timer. This “watchdog” keeps counting from a set number down to zero. If the event it is monitoring occurs before it reaches zero, it resets to the starting number and starts counting down again. If the timer reaches zero, it performs some action that has been predefined or configured.
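
The countdown-and-reset behavior described above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not how the Cluster Network Driver is implemented: time advances in abstract ticks, and the class and method names are hypothetical.

```python
class WatchdogTimer:
    """Countdown timer that fires an action if it is not reset in time."""

    def __init__(self, start, on_expire):
        self.start = start          # value the counter is reset to
        self.remaining = start      # current countdown value
        self.on_expire = on_expire  # predefined action to run at zero
        self.expired = False

    def heartbeat(self):
        """The monitored event occurred: restart the countdown."""
        if not self.expired:
            self.remaining = self.start

    def tick(self):
        """Advance time by one unit; fire the action when zero is reached."""
        if self.expired:
            return
        self.remaining -= 1
        if self.remaining == 0:
            self.expired = True
            self.on_expire()


actions = []
watchdog = WatchdogTimer(start=3, on_expire=lambda: actions.append("recover"))

# Healthy phase: a heartbeat arrives every two ticks, so zero is never reached.
for _ in range(4):
    watchdog.tick()
    watchdog.tick()
    watchdog.heartbeat()
healthy_actions = list(actions)

# Hung phase: heartbeats stop and the counter runs down to zero.
for _ in range(3):
    watchdog.tick()
```

While heartbeats keep arriving, nothing happens. Once they stop, the counter reaches zero and the recovery action runs exactly once.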

From a Windows perspective, watchdog timers can detect that basic kernel or user services are not executing. Resource starvation issues (including memory leaks, lock contention, and scheduling priority misconfiguration) can block critical user-mode components without blocking deferred procedure calls (DPCs) or draining the non-paged memory pool.

Kernel components can extend watchdog timer functionality to user mode by periodically monitoring critical applications. This bug check indicates that a user-mode health check failed in a way that prevents graceful shutdown. This bug check restores critical services by restarting or enabling application failover to other servers.

To see what your current Failover Clustering settings for these are, you can run the command:

cluster /cluster:clustername /prop

The Failover Clustering service has two properties that control this behavior:

ClusSvcHangTimeout

This property controls how long we wait between heartbeats before determining that the Cluster Service has stopped responding. The default for the ClusSvcHangTimeout is 60 seconds. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop ClusSvcHangTimeout=x

* where x is in seconds <<-- default is 60 seconds

HangRecoveryAction

This property controls the action to take if the user-mode processes have stopped responding. For the HangRecoveryAction, we actually have 4 different settings with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism. 
1 = Logs an event in the system log of the Event Viewer. 
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node.  <<-- default for 2008

If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=x

* where x is the action to take
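
If you script these changes, it can help to validate the values before building the command line. The Python sketch below is one possible way to do that. It only formats the cluster.exe syntax shown above and does not run anything. The enum member names and the build_prop_command helper are hypothetical, and the check that the timeout is positive is an assumption, since the valid range is not stated here.

```python
from enum import IntEnum


class HangRecoveryAction(IntEnum):
    DISABLED = 0           # heartbeat and monitoring mechanism disabled
    LOG_EVENT = 1          # log an event in the System event log
    TERMINATE_SERVICE = 2  # terminate the Cluster Service
    BUGCHECK = 3           # Stop error on the node (default for 2008)


def build_prop_command(cluster_name, prop, value):
    """Format a cluster.exe property command using the syntax shown above."""
    if prop == "HangRecoveryAction":
        value = int(HangRecoveryAction(value))  # raises ValueError if not 0-3
    elif prop == "ClusSvcHangTimeout":
        value = int(value)
        if value <= 0:  # assumed lower bound, not documented in this post
            raise ValueError("timeout must be a positive number of seconds")
    else:
        raise ValueError(f"unsupported property: {prop}")
    return f"cluster /cluster:{cluster_name} /prop {prop}={value}"


print(build_prop_command("clustername", "HangRecoveryAction",
                         HangRecoveryAction.TERMINATE_SERVICE))
print(build_prop_command("clustername", "ClusSvcHangTimeout", 60))
```

Passing a HangRecoveryAction outside 0 to 3 raises a ValueError instead of producing a command.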

Since HangRecoveryAction=3 (bugcheck the box) is the default, I will start with this one. This setting will actually call into Windows to bugcheck the machine and create a dump file (MEMORY.DMP). The dump file created will be based on the settings in Windows (Kernel Dump as a default). On one hand, you may ask why you would want to blue screen your box and cause a brief production outage. On the other hand, if the node is in a hung or degraded state, powering the machine off forcefully may be your only recourse in order to move the services over to another node. When hangs occur, connectivity and/or productivity can be severely impacted.

Keep in mind the following scenario of a hung machine. If Failover Clustering detects this problem in, say, one minute and forces a failover that takes another 2 minutes to bring everything online, you have been down 3 minutes. If this was not in place and this occurred, it may take users several minutes to notice there is some sort of problem. They may wait several more minutes before calling helpdesk to report the problem. Then the helpdesk takes several minutes to log the problem. On it goes before someone can eventually get to the machine to see what is going on. Say they go ahead and hard power off the machine to get your services back into production. What if this took 45 minutes? In a company that values high availability, this additional 42 minutes could have cost you thousands or even millions of dollars!!!

What if it was determined that you needed to get Microsoft involved at this point? What data can you provide? In most cases of hung or degraded machines, the engineer would want the following:

  • System Event Log
  • Application Event Log
  • Performance Log (if any)
  • Pool Monitor Log (if any)
  • Dump file (if any)

If we did not have this setting in place, you would be left with only the event logs. If nothing there points to anything concrete, which seems to be the case most of the time, you would need to configure the system to capture more data and wait for the problem to happen again. With the Failover Clustering HangRecoveryAction setting in place, you would have a dump file (a snapshot in time) to go through that could point out the cause of the hang, which you can then correct right away.

So, say you have this problem, what is going to happen is it will bugcheck only the box having this issue and reboot. Because a reboot occurred, all resources that were present on this node are going to move to another and come online to get you back into production. On the reboot of this node, you would see the following event in the System Event Log:

Event Type:  Information
Event ID:  1001
Source:  BugCheck
Description:  The computer has rebooted from a bugcheck.  The bugcheck was 0x0000009E (process id, timeout value, reserved, reserved).

The Stop Error values (in parentheses) will vary. These are the meanings of the entries:

process id  =  Process that failed to satisfy a health check within the configured timeout
timeout value  =  Health monitoring timeout (seconds)
reserved  =  will always be zeroes
reserved  =  will always be zeroes

So now we see the event, let's take a look at a dump file. The dump file I am using is from a 64-bit machine.

0: kd> .bugcheck
Bugcheck code 0000009E
Arguments fffffa80`0fdef7e0 00000000`0000003c 00000000`00000000 00000000`00000000

Looking at the Process above, we can see that it is the Cluster Service.

0: kd> !process fffffa800fdef7e0 0
PROCESS fffffa800fdef7e0
    SessionId: 0  Cid: 0a40    Peb: 7fffffd8000  ParentCid: 02e8
    DirBase: 2355da000  ObjectTable: fffff880089cb830  HandleCount: 4288.
    Image: clussvc.exe
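
The second bugcheck argument is the health monitoring timeout in hexadecimal. As a quick sanity check, this Python sketch (an illustration, not a debugger feature) parses the Arguments line shown above. The function name is hypothetical.

```python
bugcheck_line = (
    "Arguments fffffa80`0fdef7e0 00000000`0000003c "
    "00000000`00000000 00000000`00000000"
)


def parse_bugcheck_arguments(line):
    """Convert the four hex arguments of a .bugcheck line to integers."""
    fields = line.split()[1:]  # drop the leading "Arguments" label
    # The debugger separates the high and low 32 bits with a backtick.
    return [int(field.replace("`", ""), 16) for field in fields]


process, timeout_seconds, reserved1, reserved2 = parse_bugcheck_arguments(bugcheck_line)
print(f"process: {process:#x}")
print(f"timeout: {timeout_seconds} seconds")
```

Here 0x3c works out to 60 seconds, which matches the default ClusSvcHangTimeout, and the first argument is the value passed to !process above.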

Looking at the thread that called the bugcheck, we see this:

0: kd> !thread
THREAD fffff80001dc4b80  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap                 fffff880000061c0
Owning Process            fffff80001dc50c0       Image:         Idle
Attached Process          fffffa80072d4110       Image:         System
Wait Start TickCount      0              Ticks: 108665 (0:00:28:15.184)
Context Switch Count      5054015            
UserTime                  00:00:00.000
KernelTime                00:20:09.319
Win32 Start Address nt!KiIdleLoop (0xfffff80001caab00)
Stack Init fffff80004331db0 Current fffff80004331d40
Base fffff80004332000 Limit fffff8000432c000 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 0
Child-SP          RetAddr           : Args to Child             : Call Site
fffff800`04331a18 fffffa60`011d63c8 : *** removed for space *** : nt!KeBugCheckEx
fffff800`04331a20 fffff800`01ca88b3 : *** removed for space *** : netft!NetftWatchdogTimerDpc+0xb8
fffff800`04331a70 fffff800`01ca9238 : *** removed for space *** : nt!KiTimerListExpire+0x333
fffff800`04331ca0 fffff800`01ca9a9f : *** removed for space *** : nt!KiTimerExpiration+0x1d8
fffff800`04331d10 fffff800`01caab62 : *** removed for space *** : nt!KiRetireDpcList+0x1df
fffff800`04331d80 fffff800`01e785c0 : *** removed for space *** : nt!KiIdleLoop+0x62
fffff800`04331db0 00000000`fffff800 : *** removed for space *** : nt!zzz_AsmCodeRange_End+0x4
fffff800`0432b0b0 00000000`00000000 : *** removed for space *** : 0xfffff800

From a debugging perspective, all we see is that the Cluster Service timed out its health monitoring, so it called into KeBugCheckEx. One point I want to stress again is that even though the Cluster Service created the dump, this is not the cause or focus of your problem resolution steps moving forward. There was something bad occurring with the system that we detected and reacted to. While it may appear extreme, it is one of the better options to ensure availability and faster recovery.

In dumps such as these, you would not want to focus on the Cluster Service and what it was doing, but more from a generic hanging stance. Something in User Mode caused the Failover Clustering Service to become unresponsive, so User Mode processes and general hang debugging is your focus. For this blog, I am not going to go into debugging hang dumps. For more information on debugging hang dumps, you should visit our NTDebugging Blog site for steps, tricks, and tips. Something else to consider is that since we create a dump based on the Windows Crash Settings, the default of kernel dump may or may not show you the exact cause since User Mode Space is not kept. The Crash Setting of Complete Dump may need to be set for any future stop errors.

Let’s look at what happens if you change the HangRecoveryAction to terminate the Cluster Service. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=2

If we get a hang that we detect and need to react to, we would see the following in the System Event Log.

Event ID:  4870
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the Cluster Server process with a process ID '%1', for '%2' seconds. Recovery action will be taken.

* where %1 is the Process ID you would see in Task Manager
* where %2 is the value of ClusSvcHangTimeout

Event ID:  7031
Source:  Service Control Manager
Description:  The Cluster Service service terminated unexpectedly.

If you generate a Cluster Log, you would see the below:

processid:threadid GMT-time [ERR] Watchdog timer timeout for the client process (ID x) and it will terminate the client process.

* where x is the Process ID you would see in Task Manager

At that point, we are going to attempt to terminate the Cluster Service in order to attempt to move everything over to another node so that you can get back to production. When we are terminating the Cluster Service, taking resources offline, sending out notifications, etc, we are going to use user mode space to accomplish some of these tasks. If you have a hang in user mode, we may not be able to complete it. The reality is that the machine is in this degraded/hung state. We are going to try and gracefully recover from this state, and if we cannot, you may be looking at having to hard power the machine off in order to get things properly moved over anyway.

Troubleshooting this may be more difficult, as all you would have to look through would be the Event Logs and a Cluster Log (if generated). The Cluster Log only shows you what is going on with the Cluster, so it will most likely be of no use unless there were actual resource failures prior to the termination. An example would be a File Server resource failure with an Error 1130 (not enough server storage). You would then need to review the System Event Log for any performance-type errors (2019 nonpaged pool, 2020 paged pool, etc.) or check whether any other services failed shortly beforehand. But even then, you are not going to find the root cause of it. If you want to keep this setting, you would want to look at the following:

1. Use Task Manager to work with applications or services consuming large amounts of memory
2. Generate a System Diagnostics Report (perfmon /report)
3. Start Resource Monitor (perfmon /res)
4. Open Event Viewer and view events related to failover clustering
5. Run Performance Monitor over a longer period of time and look for anything there
6. Use any other hang-monitoring utilities you may have

Now, let’s look at what happens if you change the HangRecoveryAction to simply log an event. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=1

If we get a hang that we detect and need to react to, we would only see the following in the System Event Log.

Event ID: 4869
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the 'C:\Windows\Cluster\clussvc.exe' process with a process ID '%1', for '%2' seconds. Please use Performance Monitor to evaluate the health of the system and determine which process may be negatively impacting the system.

* where %1 is the Process ID you would see in Task Manager
* where %2 is the value of ClusSvcHangTimeout

This is all we are going to do. If a hanging condition persists over a long period of time, you could see this event repeat every 60 seconds (or whatever value you have set for ClusSvcHangTimeout). Since we do not react in any other way, we are basically at the mercy of Windows and how it reacts. If it hangs, then we may or may not be able to fail anything over. If it is not affecting the Cluster Service or any resources, we would just run along like nothing is going on. We could also see problems that do affect the resources and get inadvertent failovers due to loss of communication between the nodes, resource failures, etc. Just like the prior action, you would need to:

1. Use Task Manager to work with applications or services consuming large amounts of memory
2. Generate a System Diagnostics Report (perfmon /report)
3. Start Resource Monitor (perfmon /res)
4. Open Event Viewer and view events related to failover clustering
5. Run Performance Monitor over a longer period of time and look for anything there
6. Use any other hang-monitoring utilities you may have

The last action we have is to disable the health monitor checking. If you want to change the setting, you would issue the command:

cluster /cluster:clustername /prop HangRecoveryAction=0

If we get a hang, then we do nothing as we will detect nothing. Like the action of 1, we are only going to do anything if it actually causes us communication issues between the nodes or causes resources to actually fail. We will react to that, but that would be it.

I hope that this gives you a better knowledge and understanding of this feature. Remember, just because we create a dump or terminate the service, does not mean that Failover Clustering actually caused the issue or the downtime. On the contrary, Failover Clustering just reacted based on what the hang detection settings are and gets you back up into production quicker with the benefit of additional data that can be reviewed to assist getting a resolution of the true problem. Look at this from a performance perspective and treat it as you would any other stand-alone system that has sluggishness, hangs, etc.

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Running Hyper-V in a lab? Use Snapshots? Check this out!

The Hyper-V Snapshot feature (Checkpoint in SCVMM) is a very useful feature for Support Engineers. It allows us to revert the VM to a previous state irrespective of the local* changes you’ve made after the snapshot was taken. Working with customers on a daily basis necessitates having a system on which you can mirror the customer’s setup.

However, one frustrating issue you will experience eventually, if you haven’t already, is that after applying some snapshots, you’re no longer able to log into the domain. Disjoining and rejoining isn’t something you want to do when you need to test something quickly. To briefly explain what happens here, assume that a VM has its machine account password set to A. This is stored both locally and in the machine account in Active Directory. You take a snapshot of this VM and forget about it. The VM, as it chugs along, determines that it’s time to change its machine account password and goes ahead and does this. The VM sets its password to B both locally and in Active Directory. Now, you’ve decided to do some testing on this VM and Ka-boom! You’ve blown it to bits (though only locally, as stated before). You suddenly remember that you’ve got a snapshot. Lucky you! You apply it and believe everything’s going to be okay. And then you can’t log into the domain. Why? Because the VM is attempting to contact a domain controller using password A, which is no longer valid. The authenticating domain controller expects password B, but the VM is sending it A. That is pretty much all there is to it.
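
The sequence of events can be modeled in a few lines of Python. This is a deliberately simplified toy model that follows the explanation above, not a description of how Netlogon or Active Directory actually store secrets. All class names and the machine name LABVM are hypothetical.

```python
import copy


class DomainController:
    def __init__(self):
        self.machine_passwords = {}  # machine account name -> password

    def authenticate(self, name, password):
        return self.machine_passwords.get(name) == password


class VirtualMachine:
    def __init__(self, name, password, dc):
        self.name = name
        self.state = {"machine_password": password}  # local state only
        dc.machine_passwords[name] = password

    def take_snapshot(self):
        return copy.deepcopy(self.state)

    def apply_snapshot(self, snapshot):
        # Reverts local state; the copy held by the domain is untouched.
        self.state = copy.deepcopy(snapshot)

    def change_machine_password(self, new_password, dc):
        self.state["machine_password"] = new_password
        dc.machine_passwords[self.name] = new_password

    def can_log_on(self, dc):
        return dc.authenticate(self.name, self.state["machine_password"])


dc = DomainController()
vm = VirtualMachine("LABVM", "A", dc)
snapshot = vm.take_snapshot()
before_change = vm.can_log_on(dc)   # both sides hold A
vm.change_machine_password("B", dc)
after_change = vm.can_log_on(dc)    # both sides hold B
vm.apply_snapshot(snapshot)
after_revert = vm.can_log_on(dc)    # VM sends A, domain expects B
```

Only the last check fails. Applying the snapshot rolled back the VM's copy of the password but not the domain's copy.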

Enter DisablePasswordChange. This registry setting, which can be set using Group Policy, prevents the system from changing its machine account password with the domain controller every 30 days (the default).

At this stage, you’re probably thinking that preventing regular password changes isn’t a good thing security-wise. You’re correct, it isn’t. However, in an isolated test environment (where all systems, domain controllers and domain members are VMs), the tradeoff is acceptable.

Here’s what you need to do to set this up on all systems in your VM Domain:

1. Create a new GPO on the VM Domain (so that it applies to all domain member systems in the domain) and name it, say, Disable Machine Account Password Changes so that it is easy to locate.

2. Edit it and make the following setting:

clip_image001

3. This GPO setting will percolate to all the domain members (if there are no Group Policy errors) and take effect.

Snapshots that are taken after this setting is effective will have a much longer shelf life than those taken before and you can apply essentially any snapshot!

* Local changes mean only those which are completely local to the system. For example, a domain join or disjoin is not a completely local change since the machine account is created on a domain controller. Deleting all printers on a print server is an example of a local change.

Note: Snapshots should never be used for domain controllers, as domain controllers contain common information (that is, Active Directory) that is replicated between them. There are a variety of issues that you can run into, such as a USN Rollback.

Richard Spitz
Support Engineer
Microsoft Enterprise Platforms Support

Expected Snapshot Merge Behavior for a Highly Available VM

Hello my name is Sean Dwyer, and I'm a Support Escalation Engineer working in the Windows CORE team here at Microsoft.

While working with snapshots that are attached to a highly available VM in a Cluster, you may notice after deleting a snapshot, it does not merge as expected. You'll notice the merge process almost immediately ends and the VM begins to restart.

Let's explore a scenario I ran into myself the other day, and then I'll explain why you may see this behavior.

Fig. 1.

clip_image001

In Fig. 1, I have configured an HA VM running XP SP2, and it is online and happy within the Cluster.

Fig. 2.

clip_image002

Switching to the Hyper-V Management MMC in Fig. 2, I've decided that my testing is over with this VM, and I need to clean up the snapshots. The 'Now' state is where I want the VM to remain, so I'm going to issue a Delete Snapshot Subtree command.

Fig. 3.

clip_image003

I choose 'Yes', and the snapshots are removed from the VM in Fig. 3.

Next, I choose to shut down the VM in Fig. 4, as I want to commit the merge now, and not wait until later.

Fig. 4.

clip_image004

The Merge process starts, but immediately finishes and the VM is restarting as seen in Fig. 5 and Fig. 6.

Fig. 5.

clip_image005

Fig. 6.

clip_image006

The merge didn't complete! What just happened?

Failover Cluster is doing its job!

When deleting a snapshot for a VM that is configured as a Highly Available Resource, you're taking action against the VM outside of the Cluster Management UI, and therefore, the Cluster service considers any shutdown events, or reboots, a failure of the resource and takes appropriate action.

In my example, the default behavior is restarting the VM as shown in Fig.7.

Fig. 7.

clip_image007

The end result is that the snapshot merge process has NOT taken place and I've still got AVHDs out there shown in Fig. 8.

Fig. 8.

clip_image008

So, what do we do, to get the Snapshots merged successfully? Easy!

We need to configure the Clustered VM Resource to allow the Snapshot Merge process to complete without being interrupted.

Open up the Properties of the VM, and then select the Offline Actions Tab.

You'll note, in Fig. 9, the default behavior is to Save the VM.

Fig. 9.

clip_image009

In Fig. 10, let's change this default behavior temporarily to Shutdown.

Fig. 10.

clip_image010 

Once the change has been made, you can then choose to shut down the VM either through Hyper-V Manager or through the Clustering UI, and the Snapshot Merge process will begin.

Once the Merge completes, reset the behavior of the VM in the Offline Actions Tab for the VM back to Save as shown in Fig. 11.

Fig. 11.

clip_image009 

Once you start the VM, you'll be back in business!

I hope this blog post will help you continue to use our Clustering and Virtualization products successfully!

Sean Dwyer
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Top Issues for Microsoft Support for Windows Server 2008 Hyper-V (Q3)

It is time to update everyone on the issues our support engineers have been seeing for Hyper-V for the past quarter.  The issues are categorized below with the top issue(s) in each category listed with possible resolutions and additional comments as needed.  I think you will notice that the issues for Q3 have not changed much from Q1\Q2.  Hopefully, the more people read our updates, the fewer occurrences we will see for some of these and eventually they will disappear altogether.  There will probably be one more blog for the Q4 results.  Additionally, I would like to mention that we are highly recommending the installation of Windows Server 2008 Service Pack 2 on all servers running the Hyper-V Role.

Deployment\Planning

Issue #1

Customers looking for Hyper-V documentation.

Resolution:  Information is provided on the Hyper-V TechNet Library which includes links to several Product Team blogs.  Additionally, the Microsoft Virtualization site contains information that can be used to get a Hyper-V based solution up and running quickly.

Installation Issues

Issue #1

A customer was experiencing an issue on a pre-release version of Hyper-V.

Resolution: Upgrade to the release version (KB950050) of Hyper-V.

Issue #2

After the latest updates off Windows Update are installed or KB950050 is installed, virtual machines fail to start with one of the following error messages:

An error occurred while attempting to chance the state of the virtual machine vmname .
'vmname' failed to initialize.
Failed to read or update VM configuration.

or

An error occurred while attempting to change the state of virtual machine vmname .
" VMName " failed to initialize
An attempt to read or update the virtual machine configuration failed.
" VMName " failed to read or update the virtual machine configuration: Unspecified error (0x80040005).

Cause: This issue occurs because virtual machine configurations that were created in the beta version of Hyper-V are incompatible with later versions of Hyper-V.

Resolution: Perform the steps documented in KB949222.

Issue #3

After the Hyper-V role is installed, a customer creates a virtual machine but it fails to start with the following error:

The virtual machine could not be started because the hypervisor is not running

Cause: Hardware virtualization or DEP was disabled in the BIOS.

Resolution: Enable hardware virtualization and DEP in the BIOS. In some cases, the server may need to be physically shut down (powered off) in order for the new BIOS settings to take effect.

Virtual Devices\Drivers

Issue #1

Synthetic NIC was listed as an unknown device in device manager.

Cause: Integration Components needed to be installed.

Resolution: Install Integration Components (IC) package in the VM.

Issue #2

Stop 0x00000050 on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: This issue can occur if a Hyper-V virtual machine is configured with a SCSI controller but no disks are attached.

Resolution: Perform the steps documented in KB969266.

Issue #3

Stop 0x0000001A on a Microsoft Hyper-V Server 2008 or Server 2008 system with the Hyper-V role installed.

Cause: Vid.sys

Resolution: Install hotfix KB957967 to address this issue.

Snapshots

Issue #1

Snapshots fail to merge with error 0x80070070

Cause: Low disk space.

Resolution: Free up disk space to allow the merge to complete.
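The error code itself points at the cause. Below is a minimal Python sketch, assuming the standard HRESULT bit layout (severity in bit 31, facility in bits 16 to 26, code in the low 16 bits). The function name decode_hresult is only illustrative. For 0x80070070 it yields facility 7 (Win32) and code 112, which is the Win32 error ERROR_DISK_FULL.

```python
def decode_hresult(hresult):
    """Split a 32-bit HRESULT into (failed, facility, code)."""
    failed = bool(hresult & 0x80000000)   # bit 31: severity (1 = failure)
    facility = (hresult >> 16) & 0x7FF    # bits 16-26: facility (7 = Win32)
    code = hresult & 0xFFFF               # bits 0-15: underlying error code
    return failed, facility, code


failed, facility, code = decode_hresult(0x80070070)
print(failed, facility, code)
```

The same decoding applies to other codes in this post, for example 0x80070005 (Win32 error 5, access denied) and 0x80070057 (Win32 error 87, invalid parameter).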

Issue #2

Snapshots were deleted

Cause: The most common cause is that a customer deleted the .avhd files to reclaim disk space (not realizing that the .avhd files were the snapshots).

Resolution: Restore data from backup.

For more information on Snapshots, please refer to the Snapshot FAQ: http://technet.microsoft.com/en-us/library/dd560637.aspx.
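Before cleaning up a VM storage folder, it can help to list any .avhd files so they are not mistaken for disposable files. The following is a small Python sketch. The function name find_snapshot_disks is hypothetical, and the folder you pass in is whatever location holds your virtual hard disks.

```python
from pathlib import Path


def find_snapshot_disks(folder):
    """Return sorted paths of .avhd files found anywhere under folder."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".avhd"  # extension match is case-insensitive
    )
```

Any path this returns belongs to a snapshot and should not be deleted by hand.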

Issue #3

Snapshots were lost

Cause:  Parent VHD was expanded (not supported).  If snapshots are associated with a virtual hard disk, the parent vhd file should never be expanded. This is documented in the Edit Disk wizard:

clip_image002

Resolution:  Restore data from backup.

Integration Components

Issue #1

A Windows 2000 (SP4) virtual machine with the Integration Components installed may shut down slowly.

Cause:  This problem is caused by a bug in the Windows Software Trace Pre-Processor (WPP) tracing macro (outside of Hyper-V).

Resolution:  KB959781 documents the workarounds for this issue on Server 2008.

Issue #2

Attempting to install the Integration Components on a Server 2003 virtual machine fails with the following error:

Unsupported Guest OS

An error has occurred:  The specified program requires a newer version of Windows.

Cause:  Service Pack 2 for Server 2003 wasn’t installed in the virtual machine.

Resolution:  Install SP2 in the Server 2003 VM before installing the integration components.

Virtual machine State and Settings

Issue #1

You may experience one of the following issues on a Windows Server 2008 system with the Hyper-V role installed or Microsoft Hyper-V Server 2008:

When you attempt to create or start a virtual machine, you receive one of the following errors:

·         The requested operation cannot be performed on a file with a user-mapped section open. ( 0x800704C8 )

·         ‘VMName’ Microsoft Synthetic Ethernet Port (Instance ID

{7E0DA81A-A7B4-4DFD-869F-37002C36D816}): Failed to Power On with Error 'The specified network resource or device is no longer available.' (0x80070037).

·         The I/O operation has been aborted because of either a thread exit or an application request. (0x800703E3)

Virtual machines disappear from the Hyper-V Management Console.

Cause:  This issue can be caused by antivirus software that is installed in the parent partition and the real-time scanning component is configured to monitor the Hyper-V virtual machine files.

Resolution:  Perform the steps documented in KB961804.

Issue #2

Creating or starting a virtual machine fails with the following error:

'General access denied error' (0x80070005).

Cause:  This issue can be caused by the Intel IPMI driver.

Resolution:  Perform the steps documented in KB969556.

Issue #3

Virtual machines have a state of "Paused-Critical"

Cause: Lack of free disk space on the volume hosting the .vhd or .avhd files.

Resolution: Free up disk space on the volume hosting the .vhd or .avhd files.
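A simple free-space check on the volume that hosts the .vhd and .avhd files can give early warning. This Python sketch uses the standard library's shutil.disk_usage. The threshold is an assumption you choose yourself, since this post does not state the exact point at which Hyper-V pauses a virtual machine.

```python
import shutil


def free_space_report(path, min_free_bytes):
    """Return (free_bytes, ok), where ok is True if free space meets the threshold."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free (in bytes)
    return usage.free, usage.free >= min_free_bytes
```

For example, free_space_report("D:\\", 20 * 1024**3) would report whether a hypothetical D: volume still has at least 20 GB free.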

High Availability (Failover Clustering)

Issue #1

How to configure Hyper-V on a Failover Cluster.

Resolution: A step-by-step guide is now available which covers how to configure Hyper-V on a Failover Cluster.

Issue #2

Virtual machine settings that are changed on one node in a Failover Cluster are not present when the VM is moved to another node in the cluster.

Cause:  The "Refresh virtual machine configuration" option was not used before attempting a failover.

Resolution:  When virtual machine settings are changed on a VM that’s on a Failover Cluster, you must select the ‘Refresh virtual machine configuration’ option before the VM is moved to another node.  There is a blog that discusses this.

Backup (Hyper-V VSS Writer)

Issue #1

You may experience one of the following symptoms if you try to back up a Hyper-V virtual machine:

·         If you back up a Hyper-V virtual machine that has multiple volumes, the backup may fail. If you check the VMMS event log after the backup failure occurs, the following event is logged:

Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin

Source: Microsoft-Windows-Hyper-V-VMMS

Event ID: 10104

Level: Error

Description:

Failed to revert to VSS snapshot on one or more virtual hard disks of the virtual machine '%1'. (Virtual machine ID %2)

·         The Microsoft Hyper-V VSS Writer may enter an unstable state if a backup of the Hyper-V virtual machine fails. If you run the vssadmin list writers command, the Microsoft Hyper-V VSS Writer is not listed. To return the Microsoft Hyper-V VSS Writer to a stable state, the Hyper-V Virtual Machine Management service must be restarted.

Resolution:  An update (KB959962) is now available to address issues with backing up and restoring Hyper-V virtual machines.
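The second symptom can be checked from a script by capturing the output of vssadmin list writers and looking for the Hyper-V writer. This is a minimal Python sketch. The function name is hypothetical, and the sample text is an abbreviated illustration of the "Writer name:" lines rather than real captured output.

```python
def hyperv_writer_listed(vssadmin_output):
    """Return True if the Microsoft Hyper-V VSS Writer appears in the output text."""
    target = "microsoft hyper-v vss writer"
    for line in vssadmin_output.splitlines():
        line = line.strip().lower()
        if line.startswith("writer name:") and target in line:
            return True
    return False


# Illustrative sample only; real output has more fields for each writer.
SAMPLE_OUTPUT = """\
Writer name: 'System Writer'
Writer name: 'Microsoft Hyper-V VSS Writer'
"""
print(hyperv_writer_listed(SAMPLE_OUTPUT))
```

If the function returns False on a Hyper-V server, the writer is missing from the list, which matches the unstable state described above.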

Issue #2

How to back up virtual machines using Windows Server Backup

Resolution: Perform the steps documented in KB958662.

Virtual Network Manager

Issue #1

Virtual machines are unable to access the external network.

Cause: The virtual network was configured to use the wrong physical NIC.

Resolution: Configure the external network to use the correct NIC.

Issue #2

Network connectivity issues

Cause: NIC teaming software

Resolution: Remove the NIC teaming software. Our support policy for NIC Teaming with Hyper-V is now documented in KB968703.

Issue #3

Customers inquiring if Hyper-V supports NIC Teaming.

Resolution: Our support policy for NIC Teaming with Hyper-V is now documented in KB968703.

Hyper-V Management Console

Issue #1

How to manage Hyper-V remotely.

Resolution:  The steps to configure remote administration of Hyper-V are covered in a TechNet article. John Howard also has a very thorough blog on remote administration.

Import/Export

Issue #1

Importing a virtual machine may fail with the following error:

A Server error occurred while attempting to import the virtual machine. Failed to import the virtual machine from import directory <Directory Path>. Error: One or more arguments are invalid (0x80070057).

Resolution: Perform the steps documented in KB968968.

Miscellaneous

Issue #1

You may experience one of the following issues on a Windows Server 2003 virtual machine:

·         An Event ID 1054 is logged to the Application Event log:

Event ID: 1054
Source: Userenv
Type: Error
Description:
Windows cannot obtain the domain controller name for your computer network. (The specified domain either does not exist or could not be contacted). Group Policy processing aborted.

·         A negative ping time is displayed when you use the ping command.

·         Perfmon shows high disk queue lengths

Cause: This problem occurs when the time-stamp counters (TSC) for different processor cores are not synchronized.

Resolution: Perform the steps documented in KB938448.
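To see how unsynchronized counters can produce a negative ping time, consider the arithmetic in this Python sketch. All numbers are made up for illustration. It assumes the start reading is taken on one core and the end reading on another core whose counter lags behind.

```python
def elapsed_ticks(start_reading, end_reading):
    """Elapsed time as a program would compute it from two counter readings."""
    return end_reading - start_reading


# Hypothetical values: core 1's counter lags core 0's by 5,000 ticks.
core0_at_start = 1_000_000
real_elapsed = 2_000
core1_offset = -5_000
core1_at_end = core0_at_start + real_elapsed + core1_offset

print(elapsed_ticks(core0_at_start, core1_at_end))
```

Although 2,000 ticks really passed, the computed interval is negative because the lag between the two counters is larger than the interval being measured.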

As always, we hope this has been informative for you.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Active Route Gets Removed on Windows 2008 Failover Cluster IP Address Offline

We have received calls about adding static routes on Windows Server 2008 Failover Clustering nodes and wanted to pass along some important information. The issue is that when you add a static persistent route to a network adapter on a Windows Server 2008 Failover Cluster node and then take a Clustered IP Address offline (or move it to another node), the “Active” route is removed and no connections can be made using this route, even though it still shows as persistent. Once you bring the Clustered IP Address back online, the active route returns.

First, I want to mention that the networking architecture in 2008 Failover Clustering has been rewritten from the ground up, and we now have our own internal route listings as well as our own adapter (Microsoft Failover Cluster Virtual Adapter). I will not go into the specifics here, but you can read more about it in the blog “What is a Microsoft Failover Cluster Adapter anyway?”.

On to the problem and resolution.  As a little setup, here is the configuration that I want to discuss.

ClusterNode1

Physical IP Address: 10.44.60.4

Physical Subnet Mask: 255.255.0.0

Default Gateway: 10.44.60.1

 

ClusterNode2

Physical IP Address: 10.44.60.3

Physical Subnet Mask: 255.255.0.0

Default Gateway: 10.44.60.1

Failover Cluster Virtual IP Address

IP Address: 10.44.60.6

I also have a backup server that I use to create backups using an IP Address of 10.51.0.1 and subnet mask 255.255.0.0 that will use the same default gateway above.  Most Network Administrators would use the following ROUTE.EXE command to add a persistent static route to the local tables so that a connection can be made.

route -p add 10.51.0.0 mask 255.255.0.0 10.44.60.1
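As a quick sanity check of the addressing in this setup, the following Python sketch uses the standard ipaddress module to confirm that the backup server falls inside the 10.51.0.0 route being added and outside the node's own on-link network. The variable names are only illustrative.

```python
import ipaddress

# Network of ClusterNode1, derived from its address and subnet mask.
node_network = ipaddress.ip_network("10.44.60.4/255.255.0.0", strict=False)
# Destination of the static route, and the backup server it should reach.
static_route = ipaddress.ip_network("10.51.0.0/255.255.0.0")
backup_server = ipaddress.ip_address("10.51.0.1")

print(node_network)                   # the node's on-link network
print(backup_server in node_network)  # is the backup server on-link?
print(backup_server in static_route)  # is it covered by the static route?
```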

So with everything online (including the Failover Cluster Virtual IP Address) on ClusterNode1, I can do a ROUTE PRINT command to display my IP Address version 4 table and see this.  As a side note, I am just pulling the necessary information from the Route Table.

C:\>route print -4
IPv4 Route Table
===========================================================================
Active Routes:
Network Destination  Netmask          Gateway       Interface     Metric
10.44.0.0            255.255.0.0      On-link       10.44.60.4    276    <<---
10.44.60.4           255.255.255.255  On-link       10.44.60.4    276    <<--- Physical Node IP Address
10.44.60.6           255.255.255.255  On-link       10.44.60.4    276    <<--- Clustered IP Address
10.44.255.255        255.255.255.255  On-link       10.44.60.4    276    <<---
10.51.0.0            255.255.0.0      10.44.60.1    10.44.60.4     21    <<--- Static Route added
224.0.0.0            240.0.0.0        On-link       10.44.60.4    276    <<---
255.255.255.255      255.255.255.255  On-link       10.44.60.4    276    <<---
===========================================================================
Persistent Routes:
Network Address      Netmask          Gateway Address   Metric
10.51.0.0            255.255.0.0      10.44.60.1        1      <<--- Persistent Route added
===========================================================================

As long as the Clustered IP Address of 10.44.60.6 is online on this node, all is well.  However, if I were to take the 10.44.60.6 IP Address offline, things change.

C:\>cluster res "IP Address 10.44.60.6" /offline
Taking resource ''IP Address 10.44.60.6'' offline...
Resource                 Group         Node             Status
--------------------     ----------    ---------------  ------
IP Address 10.44.60.6    Data Group    ClusterNode1     Offline

C:\>route print -4
IPv4 Route Table
===========================================================================
Active Routes:
Network Destination  Netmask          Gateway       Interface     Metric
10.44.0.0            255.255.0.0      On-link       10.44.60.4    276    <<---
10.44.60.4           255.255.255.255  On-link       10.44.60.4    276    <<--- Physical Node IP Address
10.44.255.255        255.255.255.255  On-link       10.44.60.4    276    <<---
224.0.0.0            240.0.0.0        On-link       10.44.60.4    276    <<---
255.255.255.255      255.255.255.255  On-link       10.44.60.4    276    <<---
===========================================================================
Persistent Routes:
Network Address      Netmask          Gateway Address   Metric
10.51.0.0            255.255.0.0      10.44.60.1        1      <<--- Persistent Route added
===========================================================================

Notice here that both the Clustered IP Address 10.44.60.6 route and the 10.51.0.0 “Active” route have been removed. Because the 10.51.0.0 route is gone, connectivity to the backup server is lost. If you bring the Clustered IP Address 10.44.60.6 online again, the “Active” routes are repopulated and connectivity to the backup server is restored.

C:\>cluster res "IP Address 10.44.60.6" /online
Bringing resource ''IP Address 10.44.60.6'' online...
Resource                 Group         Node             Status
--------------------     ----------    ---------------  ------
IP Address 10.44.60.6    Data Group    ClusterNode1     Online

 

C:\>route print -4
IPv4 Route Table
===========================================================================
Active Routes:
Network Destination  Netmask          Gateway       Interface     Metric
10.44.0.0            255.255.0.0      On-link       10.44.60.4    276    <<---
10.44.60.4           255.255.255.255  On-link       10.44.60.4    276    <<--- Physical Node IP Address
10.44.60.6           255.255.255.255  On-link       10.44.60.4    276    <<--- Clustered IP Address
10.44.255.255        255.255.255.255  On-link       10.44.60.4    276    <<---
10.51.0.0            255.255.0.0      10.44.60.1    10.44.60.4     21    <<--- Static Route added
224.0.0.0            240.0.0.0        On-link       10.44.60.4    276    <<---
255.255.255.255      255.255.255.255  On-link       10.44.60.4    276    <<---
===========================================================================
Persistent Routes:
Network Address      Netmask          Gateway Address   Metric
10.51.0.0            255.255.0.0      10.44.60.1        1      <<--- Persistent Route added
===========================================================================
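The symptom in the outputs above (a route that is listed under Persistent Routes but missing from Active Routes) can also be detected from a script. The following is a minimal Python sketch that parses captured ROUTE PRINT text. The function names are hypothetical, and the sample is a trimmed copy of the offline output shown earlier.

```python
import ipaddress


def _is_ipv4(token):
    """Return True if token parses as a dotted IPv4 address or mask."""
    try:
        ipaddress.IPv4Address(token)
        return True
    except ValueError:
        return False


def parse_route_print(text):
    """Return (active, persistent) sets of (destination, netmask) pairs."""
    active, persistent = set(), set()
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Active Routes:"):
            section = active
        elif stripped.startswith("Persistent Routes:"):
            section = persistent
        elif stripped.startswith("==="):
            section = None  # a separator line ends the current section
        elif section is not None:
            fields = stripped.split()
            # Route rows start with two IPv4-looking tokens; header rows do not.
            if len(fields) >= 2 and _is_ipv4(fields[0]) and _is_ipv4(fields[1]):
                section.add((fields[0], fields[1]))
    return active, persistent


def inactive_persistent_routes(text):
    """Persistent routes that have no matching entry under Active Routes."""
    active, persistent = parse_route_print(text)
    return sorted(persistent - active)


SAMPLE = """\
Active Routes:
Network Destination  Netmask          Gateway       Interface     Metric
10.44.0.0            255.255.0.0      On-link       10.44.60.4    276
10.44.60.4           255.255.255.255  On-link       10.44.60.4    276
===========================================================================
Persistent Routes:
Network Address      Netmask          Gateway Address   Metric
10.51.0.0            255.255.0.0      10.44.60.1        1
===========================================================================
"""
print(inactive_persistent_routes(SAMPLE))
```

An empty result means every persistent route is currently active. Any entry returned is a persistent route that is not in effect right now.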

According to our Networking Development Groups, the recommendation actually is that on-link routes should be added with a 0.0.0.0 entry for the next hop, not with the local address (particularly because the local address might be deleted) and with the interface specified.

The ROUTE.EXE command has additional METRIC and IF (interface) parameters. Specifying the interface binds the route to the network card itself.

C:\>route /?

Manipulates network routing tables.

ROUTE [-f] [-p] [command [destination]

                  [MASK netmask]  [gateway] [METRIC metric]  [IF interface]

  interface    the interface number for the specified route.

  METRIC       specifies the metric, ie. cost for the destination.

So the first step is to determine the interface number so that the route can be bound to it. ROUTE PRINT lists the interfaces at the top of its output, and NETSH can list them as well. The output looks similar to this:

C:\>route print

IPv4 Route Table

===========================================================================

Interface List

23 ...00 15 5d 4a ac 06 ...... Local Gigabit Controller

19 ...00 15 5d 4a ac 01 ...... Local Gigabit Controller #2

18 ...00 15 5d 4a ac 00 ...... Local Gigabit Controller #3

===========================================================================

-or-

C:\>netsh int ipv4 show int

Idx  Met   MTU       State        Name

---  ---  -----      -----------  -------------------

18   50 4294967295  connected    Local Gigabit Controller #3

19    5   1500      connected    Local Gigabit Controller #2

23    5   1500      connected    Local Gigabit Controller

If I have to, I can go into the Network and Sharing Center to see which card is on this network. In my particular case, the “Local Gigabit Controller #3” is the one I want to use. So to get my persistent route to stay even when the Clustered IP Address goes offline, I would use the command below. Note that the METRIC parameter is optional.

route -p add 10.51.0.0 mask 255.255.0.0 0.0.0.0 metric 276 if 18

Now the “Active” route will stay and you will have connectivity regardless of whether a Clustered IP Address is online or offline. A good rule of thumb going forward: if you are adding a static persistent route, specify 0.0.0.0 and the interface, as this is the proper, supported form of the command from a networking perspective. It works correctly whether or not Failover Clustering is configured.
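The steps above (look up the interface index, then build the route command with 0.0.0.0 and the interface) can be sketched in Python as follows. The function names are hypothetical, the sample text is copied from the NETSH output shown earlier, and the sketch only builds the command line without running it.

```python
def interface_index(netsh_output, name):
    """Find the Idx value for an interface name in 'netsh int ipv4 show int' output."""
    for line in netsh_output.splitlines():
        fields = line.split(None, 4)  # Idx, Met, MTU, State, Name (name may contain spaces)
        if len(fields) == 5 and fields[0].isdigit() and fields[4].strip() == name:
            return int(fields[0])
    return None


def build_route_add(destination, netmask, if_index, metric=None):
    """Build 'route -p add' arguments using 0.0.0.0 as next hop and an explicit interface."""
    args = ["route", "-p", "add", destination, "mask", netmask, "0.0.0.0"]
    if metric is not None:  # METRIC is optional
        args += ["metric", str(metric)]
    args += ["if", str(if_index)]
    return args


NETSH_SAMPLE = """\
Idx  Met   MTU       State        Name
---  ---  -----      -----------  -------------------
18   50 4294967295  connected    Local Gigabit Controller #3
19    5   1500      connected    Local Gigabit Controller #2
23    5   1500      connected    Local Gigabit Controller
"""

idx = interface_index(NETSH_SAMPLE, "Local Gigabit Controller #3")
print(" ".join(build_route_add("10.51.0.0", "255.255.0.0", idx, metric=276)))
```

Returning the arguments as a list keeps them ready for something like subprocess.run if you choose to execute the command from a script.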

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Introduction to SharePoint Search Indexes for DPM Administrators

This blog introduces the basics of SharePoint Search capabilities to DPM Administrators. At first, it may seem odd that a DPM blog would explain how to set up and configure another technology’s software. However, to protect this data using DPM, it is necessary to understand how the protected application is installed and configured. This is especially true for Search Index database protection.

By the conclusion of this short series of blogs, you will have a greater understanding of why knowing a bit about the applications that DPM protects helps in their protection, recovery, and maintenance.

Why 2 versions of Search

Since DPM offers the capability to protect both WSS Search and MOSS Search, let’s take a moment to understand the difference before we dive into the set up and configuration aspects.

First, let’s discuss the basics of what a Search Index server is used for. It works on the same principle as Internet search. Websites are “crawled” (queried for their contents) and a reference to their content is then stored in a database. Think of this as you would the index of a book. From a SharePoint perspective, this allows users to locate content in the farm without knowing where the content actually resides.
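The book-index analogy can be made concrete with a toy inverted index. This Python sketch is only an illustration of the general idea and not how SharePoint implements its index. The URLs and page text are made up.

```python
from collections import defaultdict


def build_index(pages):
    """'Crawl' a dict of {url: text} and map each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Return the sorted URLs whose content contains the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical site content.
pages = {
    "http://farm/sites/hr": "vacation policy and benefits",
    "http://farm/sites/it": "backup policy for servers",
}
index = build_index(pages)
print(search(index, "policy"))
```

The query never scans the pages themselves. It only consults the index, which is why a user can find content without knowing where it lives.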

“Office SharePoint Server 2007” provides two search services: “Office SharePoint Server Search” and “Windows SharePoint Services Help Search”. Each of these services can be used to crawl, index, and query content, and each service uses a separate index.

The “Office SharePoint Server Search” service is based on the search service that is provided with earlier versions of “SharePoint Products & Technologies”, but with many improvements. You should use the “Office SharePoint Server Search” service to crawl and index all content that you want to be searchable (other than the Help system).

The “Windows SharePoint Services Help Search” service is the same service provided by “Windows SharePoint Services 3.0”, although in “Windows SharePoint Services 3.0” it is called the “Windows SharePoint Services Search” service. “Windows SharePoint Services 3.0” uses this service to index site content, index “Help” content, and service queries.

Installing a Search Index Server

The installation of a server for Search is the same as with any other SharePoint server. Use the “Advanced” option during installation and add it to an existing farm. The piece that defines the server as a Search server is found in the “Central Administration” website under “Operations” and “Services on Server”.

An example of this screen is shown below for your convenience.

clip_image002

After clicking on the “Start” action, a wizard is displayed where you will need to specify some basic configuration information. In this wizard, you will select which WFEs (Web Front-End servers) will be used to crawl the content of sites in the SharePoint farm and add their content references to the index.

As part of the configuration, you will also need to create a database for the index references to be stored. This is also done from within the Central Administration website.

Below is a screenshot of the information that must be specified when creating the database. When creating the database, be sure to give it a name that makes it easy to identify its purpose.

clip_image004

With the database created, we need to look at the creation of a Shared Services Provider.

Shared Services Provider (SSP) Configuration

Once the services have been configured on the Search Index server and are running, you are down to the last step. An SSP (Shared Services Provider) must be created as this is the SharePoint object that has the crawling functionality in it.

We will start by creating the Shared Services Provider by clicking on the Shared Services Administration link on the left side from within the Central Administration website’s main page.

clip_image006

Begin by clicking on the New SSP link as seen below by the pointer.

clip_image008

This opens a new page where you will specify the configuration information needed for SharePoint to create the provider. A database is also needed for this so make sure that the database name is clear as to its purpose. Once the necessary information has been provided, click on the “OK” button at the bottom to allow SharePoint to create the provider.

clip_image010

Once complete, you should see a page like the following indicating that you have successfully created a Shared Services Provider.

Now, we need to figure out how to crawl the sites in the farm.

SSP Crawl Configuration

When the Shared Services Administration page opens for this SSP, click on the Search administration link under the Search section to begin the crawl configuration.

clip_image012

When the Search Administration window appears, notice that the Items in index value is ‘0’ which is expected as this SSP was just set up and has not yet been configured. Under the Crawling section on the left, click on the Content sources link to start the configuration.

This will bring up the Manage Content Sources window.

clip_image014

After the Manage Content Sources window appears, there should only be a single entry in the list titled Local Office SharePoint Server sites. Click on this link as this is what will need to be configured in order to make this SSP useful.

clip_image016

At the top of the configuration page titled Edit Content Source, you have the option of changing the name to something more suitable. This is an optional entry.

If there are other site addresses that you would like this SSP to crawl, enter them in the box as seen below. If you are unsure of the URLs of the SharePoint sites, there is an alternate location that we will discuss in a few steps that will set this up for you. You will only need to select radio buttons based on a list of sites.

Here, the settings begin to get more impactful. In the Crawl Settings section, choose one of the two radio buttons which essentially allow you to filter on whether the content is SharePoint in nature or not. Choosing the Crawl everything… radio button allows the crawl to capture the maximum amount of data but also opens up the searches to additional content that you may not be expecting. For a test environment, this is the best solution as long as none of your SharePoint sites contain external site references on the Internet as the crawl will catalog the Internet data as well.

clip_image018

The last settings to make are for scheduling when the crawls will take place on the sites selected. Selecting the Create schedule links under the Full Crawl and Incremental Crawl sections will allow for a modest configuration schedule.

clip_image020

Each Create schedule link will open the following window offering multiple types with multiple settings for each. In a lab-style environment, these can be quite lenient with longer intervals. If you are testing or working in a rapidly updating environment, then a shorter interval is recommended.

Click on OK to accept the schedule you have configured and then fill in the Start full crawl of this content source checkbox so that when you click OK, the initial crawl will begin.

Depending upon how much configuration and modification you have done to the sites specified in the crawl list, the crawl should run in just a few minutes. The Status field should show Crawling Full when the page is initially drawn. You can monitor the page or you can leave it and continue with your SharePoint work. Once the initial crawl is complete, however, a Search site should be set up, if one has not already been added, so that you can test the index information for accuracy.

clip_image022

Creating Crawl Rules

A short configuration step to take is to include the paths and authentication accounts for crawling. To reopen the crawl rules, click on the Crawl rules link under the Crawling section on the left side of the SSP’s Search Administration page.

clip_image014[1]

Since this is a newly created SSP, there will be no rules. Click on the New Crawl Rule link to create rules for this SSP.

clip_image024

There are 3 sections to be configured here and the following shows a test configuration that works. For simplicity, use these settings and click on OK to complete the configuration.

clip_image026

clip_image028

The Manage Crawl Rules page should return pretty quickly to let you know what configuration has been made. The crawl that you ran should be done by now as well.

clip_image030

Working under the premise that the crawl has been completed, a return to the Search Administration site should display entries like the following. Observe that the Items in index field now has a non-zero value indicating that there are entries in the index.

clip_image032

To supplement the high-level overview of the steps covered here, a video has been provided that will guide you from start to finish.

Video

Vic Reavis
Support Escalation Engineer
Microsoft Enterprise Platforms Support

SharePoint Content DB Mirroring for DPM Administrators

In this blog post, we will cover the mirroring of an existing SharePoint content database using SQL Server 2005. We will also discuss some basics of Search Index Servers in a SharePoint farm. The objective is to provide a lightweight, cursory knowledge of how to set up and configure these two aspects of the SharePoint farm. This will not be a detailed blog covering endless situations but a DPM Administrator-oriented post in preparation for protecting these data sources using DPM 2007 SP1.

Subsequent blogs in this series will cover protecting mirrored SharePoint databases, handling failovers of mirrored SharePoint databases, and recovering mirrored SharePoint databases, followed by a couple of posts on protecting and recovering Search Index databases.

Periodically, there will also be short video demonstrations provided to round out the blog topic. Additional information and greater detail about the various ways to mirror SharePoint databases can be found online at http://technet.microsoft.com.

Mirroring a SharePoint Content Database

In our demonstration, we will be mirroring a single content database that is in a farm with several other content databases. There is no requirement that all content databases in a farm be mirrored. You are free to choose which content databases to mirror and which to leave un-mirrored. We will take an existing content database and go through the steps of mirroring it across a SQL 2005 Failover cluster and a stand-alone SQL Server.

We created the content database “WSS_Content_8100” for a new “team docs” site we were going to build. You see it highlighted below in the list of databases on this cluster from within “SQL Management Studio” console.

clip_image002

Having seen the database that we are going to mirror, let’s go back to the Central Administration web site for SharePoint and prepare to mirror the content database. Under the Application tab, choose the link Content Database so that the Manage Content Databases page appears as shown below.

clip_image004

Under the Database Name field is a link to the Manage Content Database Settings page for the database whose name appears there. In our example, the WSS_Content_8100 database name appears there and when we click on that link, it will bring us to the screen we see below.

In the Database Information section at the top, note the Database status is likely set to Ready as this is the default setting. Before we mirror this content database, we need to change this to Offline so that no new site collections can be added to this content database while we are setting up the mirror.

clip_image006

This is the only change we are going to make so scroll down to the bottom of the page and click OK to commit the change. When it completes, the Database Status should now appear as Stopped.

clip_image008

Now that the SharePoint side of the configuration has been made, we will proceed to the SQL Management Studio side to begin the process of mirroring the database.

If you have not already seen the blog “SQL Database Mirroring for DPM Administrators”, now would be a good time to jump over and review it. Since the steps for mirroring a SharePoint content database are the same as for mirroring any other database from this perspective, they will not be duplicated here.

Once the database has been mirrored and failover has been confirmed to work, you can now return the Database Status to Ready as shown above.

When mirroring has been set up on the database, you will see a configuration like the following in SQL Management Studio.

clip_image010

As long as the Principal server is the same as the server on which the content database originated, you and your users should be able to see the content in any sites within the Web Applications that use that content database.

Testing Failover of the Mirrored Database

Now that the database has been mirrored, it is time to test failover of the mirror to confirm that the data is still accessible to SharePoint users.

From SQL Management Studio, fail over the database from the Mirroring page in the database properties. Refresh the SQL Management Studio view to confirm that the Principal role is now held by the other server in the mirroring partnership.

Now navigate to a well-known page in a site whose information is stored in the newly mirrored content database. Refresh the page and the following error should be returned. The URL doesn’t mention anything about the SQL2k5-CLU SQL instance so there is no clue there. The only indication that we get is at the end of the message where it states “…make sure the database server is running.”

It indicates a database issue.

clip_image012

Note: This configuration change coming up only needs to be performed on a single WFE (Web Front-End) server and not every WFE in the farm. Anytime a mirrored content database is failed over, make this change on one WFE server to restore user connectivity to the content.

To correct our user access issue so that they are able to view the previously accessible content, we will need to go back to the Manage Content Databases page and click on the Database Name object for the database that we just failed over.

clip_image014

Scroll straight to the bottom, check the Remove content database checkbox, and commit the change.

When you click on the Remove content database checkbox, the following message is displayed. Consider this: users are not able to access the content anyway because the database is not accessible on the SQL instance where it is expected to be. Taking this action is required in order to restore user access.

clip_image016

Click on OK knowing that this is the right thing to do.

Tip: At the top of the page is the database name and the SQL server name. Copy the database name to the clipboard so you avoid any typing mistakes. You may want to paste both the database name and the database server name into Notepad for future reference.

Now click on OK at the bottom of the page and allow SharePoint to remove the entry.

Now that the Manage Content Databases page is empty, we can add the entry back in. Click on the Add a content database link near the top after confirming that the Web Application in the list on the right is the correct one.

clip_image018

When the Add Content Database page appears, enter the name of the SQL Server where the current Principal role for the database mirror resides. Now tab down and enter in the content database name for the mirrored database. Once these are done, scroll down and click on OK to commit these entries.

clip_image020

The WSS_Content_8100 database should now appear as seen in the screen shot below with a Started status. This is what you need to do from the SharePoint side to restore user connectivity. Now it is time to verify user access to the content.

clip_image022

Return to the IE session at this point to verify that the web content is accessible.

In some cases, the addition of the content database will error out. DO NOT PANIC!! The link just above the Error will redirect you back to the Manage Content Databases page. Click on this and confirm that the WSS_Content_8100 entry is still there. Also, check to see that the database is using the SQL2k5-MirrorA server (based on our example) and not the original. If these things check out and the page is accessible by users, this Error can be ignored.

clip_image024

Congratulations! You have successfully mirrored a SharePoint content database using SQL 2005. Having this understanding of a technology that DPM protects, even to such a basic level, will be a great help when working these types of issues.

Summary

This blog covered the steps necessary to take a functional SharePoint content database that resides on a SQL 2005 installation and mirror it using SQL Database Mirroring. It also discussed how to stabilize a SharePoint site when its content database is failed over from one server to the other.

The video included with this blog demonstrates the steps outlined here. This is provided to help reinforce the steps discussed and provide quick reference for future configurations.


Victor Reavis
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Recovery of a Mirrored SQL Database

Recovering a mirrored database, whether it is to a SQL Server 2005 or SQL Server 2008 installation, requires that the existing mirror be broken. For simplicity, you may want to consider deleting both the Principal and Mirror databases and using the latest recovery point that DPM has in order to restore the data.

This blog post is geared towards a simple database mirroring recovery scenario and doesn’t discuss the replaying of transaction logs as part of the recovery process. As with any blog post, if there is additional information you would like to see provided or questions you would like to see answered, please pass those comments and questions along so that we can consider those for future posts.

This blog also includes a video demonstration of the recovery process involved in recovering a mirrored SQL database.

In our scenario, we will work under the premise that a bad transaction has been posted to the database and replicated without any means of backing it out cleanly. Now we have to restore the database to the servers and establish mirroring again. We will not consider the restoration complete until DPM is able to create a new recovery point after the restoration. This is a standard measure used to define when a recovery has been completed successfully.

Considering the scenario we have described, let’s begin.

In the screenshot below, we see a corrupted database that needs to be recovered.

clip_image002

We begin by opening the properties of the Principal database and going to the Mirroring page. On the right side of the page is a button titled “Remove Mirroring”. Click on this button and confirm the selection by clicking on the “Yes” button in the following dialog to remove mirroring. After mirroring has been removed, you will be able to begin the restoration process.

If mirroring is left enabled, then DPM will fail on the restore with a detailed error message indicating that mirroring is still enabled and must be removed before the restore can be completed successfully.

clip_image004

With mirroring broken, delete both copies of the database on the Principal and Mirror servers. During the restore, you will decide which will be the Principal and which will be the Mirror based on how the restore to each server is performed.

With each database deleted, begin the recovery process by navigating to the Recovery tab in the DPM Admin console and selecting the recovery point you wish to restore. When the “Specify Database State” page of the Recovery Wizard appears, you will have the option here to choose whether this server will be the Principal or the Mirror.

In the screen shot below, you will note that the “Leave database operational” radio button has been selected which indicates that the server selected will become the Principal server.

clip_image006

Before you start the recovery to the Mirror server, you must make sure that you are recovering using the same recovery point that was used for the Principal server’s recovery. If not, you will receive errors when attempting to establish mirroring between the two copies of the database.

When the recovery has progressed far enough along that you can begin another recovery, you can start the Mirror database recovery. In the “Specify Database State” page of the Recovery Wizard, make sure that you choose the radio button for “Leave database non-operational but able to restore additional transaction logs”. This will restore the database with the “Restore with NoRecovery” option enabled. As you recall from the blog on establishing a database mirror, this is a requirement when seeding the Mirror server.

This radio button helps to minimize the number of steps involved in the recovery process.

clip_image008

Once both of the restores have been completed and the Principal and Mirror database servers have the same copy of the database restored, you should verify that the necessary data has been restored to the server. If the expected data is still missing, you may need to consider restoring from a different recovery point.

If there are additional transactions that need to be replayed, consider creating the mirror and then replaying the transactions so that SQL will replicate them on the fly to the mirror.

With the data restored to both servers, it is time to run the “Database Mirroring Wizard” from within SQL Server on the Principal server and set up mirroring on the database. Once mirroring has been established, you have completed the portion of the data restore that your users are concerned about.

You have not completed the restore from a DPM perspective, however. There is still an additional step to consider.

Since the database was recovered to its original location, DPM will not be able to create any additional Recovery Points until a consistency check has been run on the database. Once this consistency check has completed successfully, a new recovery point will be created.

From the DPM perspective, this newly created recovery point confirms the successful completion of the mirrored database recovery.

Summary

In recovering a mirrored database, the mirror must be broken first. After the mirror has been broken, the same recovery point must be used when populating the Principal and Mirror servers.

After the restore has been completed and the data verified, the mirror can then be recreated. Before DPM can continue to protect the mirrored database, a consistency check will need to be run. Once complete, a new recovery point will have been created and DPM will be able to continue protecting the mirrored database moving forward.
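The two rules that matter most in this recovery (same recovery point on both servers, and the right database state on each) can be captured in a small pre-flight check. This is a minimal Python sketch of the checklist, not anything DPM itself exposes; the Restore class, the function name, and the server names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Restore:
    server: str
    recovery_point: datetime
    leave_operational: bool  # True = "Leave database operational"


def validate_restore_pair(principal: Restore, mirror: Restore) -> List[str]:
    """Return the problems that would block re-establishing the mirror."""
    problems = []
    if principal.recovery_point != mirror.recovery_point:
        problems.append("Principal and Mirror must use the same recovery point")
    if not principal.leave_operational:
        problems.append("Principal restore should leave the database operational")
    if mirror.leave_operational:
        problems.append("Mirror restore should leave the database non-operational")
    return problems


point = datetime(2009, 1, 15, 8, 0)  # hypothetical recovery point
plan = validate_restore_pair(
    Restore("SQL-A", point, leave_operational=True),    # hypothetical Principal
    Restore("SQL-B", point, leave_operational=False),   # hypothetical Mirror
)
print(plan or "Restore plan is consistent")
```

An empty list means the two restores line up with the steps described above. Anything else is worth fixing before you run the Database Mirroring Wizard.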


Vic Reavis
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Recovering a Deleted Cluster Name Object (CNO) in a Windows Server 2008 Failover Cluster

Greetings once again from the support trenches here on the CORE team.  I want to talk a bit about a Windows Server 2008 Failover Cluster issue that appears to be on the rise.  What we are seeing is the Computer Object for the Cluster Name (a.k.a. the Cluster Name Object, or CNO) being removed from Active Directory, resulting in the Cluster Name no longer being able to function properly.  This does not happen automatically.  It requires some sort of human interaction, either by consciously going into AD and deleting the object or by running some script (process) that deletes it.  However this is being done, it appears to us that the implications are not fully understood, and there is no quick recovery from this.  In this blog, I hope to provide information that will help prevent this scenario from happening within your organization.  Along the way, I want to provide some 'value-add' information by discussing how the cluster computer objects relate to each other.

The first step to preventing this from happening in your organization is to be sure there is a clear understanding of the cluster security model in Windows Server 2008.  Rather than spend a whole lot of time and space here rehashing what is already publicly available, I refer you to the following:

KB 947049: Description of the Failover Cluster Security Model in Windows Server 2008.

Failover Cluster Step-by-Step Guide:  Configuring Accounts in Active Directory

After reviewing the materials, you should have an understanding of how security works in Windows Server 2008 Failover Clusters and an appreciation for the importance of not removing (or disabling) the Computer Objects created in Active Directory by the cluster.  By default, the Computer Objects created by the cluster are all placed in the Computers container.  These can be relocated to another OU, or even pre-staged in an OU before the cluster is created.  If pre-staging, be sure to review the requirements in the Step-by-step Guide already mentioned. As an example (Figure 1),  I created a Cluster OU and moved the cluster nodes and their associated objects into the OU. 

clip_image002

Figure 1

You may want to consider implementing a similar practice in your organization as it groups the cluster objects together thereby reinforcing the idea that this grouping of objects is 'special' in some way. 

Before moving forward and discussing the actual recovery process, I want to spend a little time reviewing the cluster 'family tree' to help you gain an understanding of how cluster objects are related.  To illustrate, I will use a cluster named W2K8-CLUS (Figure 2) in the CONTOSO domain.

clip_image004

Figure 2


 

This cluster is located in the Cluster OU shown in Figure 1.  Using Regedit.exe, I open the cluster registry hive and inspect the properties for the cluster.  I can see the name of the cluster and the resource GUID for the Cluster Name.

clip_image006

Figure 3

Expanding the Resource GUID corresponding to the Cluster Name, I inspect additional properties for the resource.  Selecting the Parameters entry displays the ObjectGUID for the cluster Computer Object in Active Directory (Figure 4).

clip_image008

Figure 4


 

In Figure 5, we see the attribute in Active Directory (Advanced Features must be enabled before the Attribute Editor tab is visible).  You can also use ADSIEdit to view the same information.

clip_image010

Figure 5

The Cluster Name Object (CNO) functions as the primary security context for the cluster.  The CNO is responsible for creating any additional Computer Objects (Virtual Computer Objects (VCO)) associated with the cluster.  These Computer Objects represent Network Name resources in a cluster.  A Network Name resource is created as part of a Client Access Point (CAP).  Each Computer Object created by a cluster CNO contains an Access Control Entry (ACE) for the CNO on the Access Control List (ACL) for the object.  The CNO is also responsible for synchronizing the password for each VCO in the domain.  The VCOs associated with a particular CNO can be determined either by manually inspecting the ACL for each VCO in AD, or the information can be obtained in the cluster registry. 


 

Opening the cluster registry hive and inspecting the properties of the Cluster Name resource, we can see an entry called ObjectGUIDS.  This is a listing for each Computer Object created by the CNO in Active Directory.  In Figure 6, I have four Computer Objects in Active Directory associated with this cluster.

clip_image012

Figure 6

One of them is a Computer Object (VCO) associated with the CAP representing a highly available Print Server (CONTOSO-PS1) in this cluster (Figure 7).

clip_image014

Figure 7

Well, there you have it…the cluster family tree.

So, what happens if the Cluster Name Object is deleted from Active Directory?  A few important things –

·         The Cluster Name, if Online, will stay Online but will fail to come Online again if the resource is cycled (it will be placed in a Failed state).  This prevents remote connections to the cluster for administration.

·         The security context for the cluster is lost.  This prevents the passwords for all associated VCOs from being synchronized within the domain.  Also, any user, service or other process needing permission to access cluster objects will fail to be authenticated.

·         No more CAPs can be created in the cluster.

Besides the items listed above, there are other indications of problems.  The Cluster Name resource in the Cluster Core Resources group will be in a Failed state.  Attempts to bring the resource Online will generate a pop-up error (Figure 8).

clip_image016

Figure 8

A FailoverClustering error (Event ID 1207) will be registered in the System Log (Figure 9).

clip_image018

Figure 9

The cluster log will report a failure to locate the CNO Computer Object in Active Directory (Figure 10).

clip_image020

Figure 10

It is, therefore, very important that the CNO’s Computer Object in the domain not be deleted.

How does one recover from this?  The supported ways to recover an Active Directory object that has been accidentally, or intentionally, deleted are described in the following articles and will not be covered in detail here:

KB840001: How to restore deleted user accounts and their group memberships in Active Directory

TechNet Content - Recovering Active Directory Domain Services

Additionally, there are third-party solutions that can be used to protect Active Directory objects and/or recover them if deleted. Finally, as a last-ditch effort, and when there is no other alternative, there is a free utility called ADRestore (32-bit only) that can be used to recover the Computer Object associated with the CNO.  Please review the following information before deciding to use this utility:

Microsoft Supportability Newsletter – Using ADRestore tool to restore deleted objects

Any of these methods can be used, but they may end up being time consuming, expensive or both.

Once the Computer Object has been recovered from Active Directory, the Repair Active Directory object action can be used to restore functionality in the cluster (Figure 11).

clip_image022

Figure 11

Note:  The logged on user that will perform the Repair action must have rights to administer the cluster and must have the right to Reset Passwords in the domain.

I personally believe ‘an ounce of prevention is worth a pound of cure.’ To that end, my top recommendation is to implement the steps outlined in the section Preventing unwanted deletions in the TechNet Content already mentioned above.  Beginning with Windows Server 2008, objects in Active Directory, such as the Computer Object shown here (Figure 12), can be protected from accidental deletion by simply checking a box – Protect object from accidental deletion.

clip_image024

Figure 12

With this ‘guard’ in place, when an object is selected for deletion, the first pop-up is presented (Figure 13).

clip_image026

Figure 13

If Yes is selected, the next error is presented to the user (Figure 14) thus preventing deletion.

clip_image028

Figure 14

If this isn’t enough, there is more help coming in Windows Server 2008 R2.  Domain Services in Windows Server 2008 R2 will include an optional feature called Active Directory Recycle Bin.  This feature is not enabled by default and must be added.  Details about the feature can be found on TechNet:

TechNet Content – Active Directory Recycle Bin Step-by-Step Guide

That about wraps it up for this installment.  As usual, we hope this information is useful.  Come back and visit.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Protecting Mirrored Databases with DPM 2007 SP1

Considering the vast amount of mission-critical data stored on SQL servers globally, the high availability of database mirroring combined with the frequent snapshots of DPM makes protection and recovery scenarios much less daunting for administrators entrusted with protecting SQL data.

Database mirroring protection can be thought of as a blend between SQL Failover Clusters and stand-alone SQL Servers. As with SQL Clusters, DPM is able to follow the database as it is failed over from one node to another. At the same time, DPM still provides the same level of protection scheduling options and recoverability as it does with stand-alone SQL databases. DPM is also capable of protecting a mirrored database when it is encrypted, via the use of certificates, or if the mirroring configuration spans domains or even forests.

Since we have a fundamental understanding of database mirroring based on the blog “SQL Database Mirroring for DPM Administrators”, let’s dive in and discuss what is needed to protect a mirrored database.

There are videos available as supplements to this blog. One demonstrates the protection of a mirrored database whose Principal side is a Windows Failover Cluster and whose Mirror server is a stand-alone SQL server. Another covers a basic database mirror across a pair of stand-alone SQL Servers, while the third demonstrates what happens when a protected database is mirrored after protection has been running for some time.

Establishing Protection

When establishing protection of a SQL Server 2005 or SQL Server 2008 mirrored database, you will need to confirm that the DPM Agent has been installed on both the Principal and Mirror servers. This is a requirement in order to implement protection. If one side of the mirror is a Windows Failover Cluster, then every node of the cluster and the SQL Mirror server must all have the DPM agent installed.

In the following example, we will protect the AdventureWorks database which has been mirrored across a stand-alone server and a 2-node failover cluster. All 3 servers involved have the DPM agent installed already.

clip_image002

As we see in the Create Protection Group wizard, the datasources for the cluster as well as the stand-alone server have been displayed. Notice that text has been appended to the name of the AdventureWorks database which tells which physical or virtual server the database mirror extends to.

clip_image004 clip_image006

Continue through the wizard setting up the disk and tape protection, the retention range, synchronization frequency, etc., until the wizard completes. At that point, the initial replica of the protected data will be taken if you have chosen the appropriate option in the Choose Replica Creation Method page of the wizard.

After a period of time, the Protection Status in the Protection tab will show OK for the newly created protection group. DPM is now able to protect the mirrored database you have selected. As with all things however, the database will not remain on the current Principal server for the remainder of eternity. So what happens when there is a failover of the mirror and the mirroring roles are switched?

Immediately after the failover, there will be no indication that DPM has a problem protecting the database. If the mirrored database fails back to the original node before the next recovery point or sync is run, then DPM will continue protecting the data source as if nothing happened. If, however, the database is on a different server than DPM expects when that job runs, DPM will fail to create the recovery point.

The error will be specific in its detail and will read like the following.

Error 32015: DPM is unable to continue protecting the selected database because DPM detected a mirroring session failover for this database.

Recommended action: Run a synchronization job with consistency check.

Simply running a consistency check will take care of the problem and future recovery points should complete successfully. If the failover occurs and there is no one able to get to the server to force a consistency check within 30 minutes after the failure, DPM will automatically run one. This can be seen in the Monitoring tab of the DPM Admin Console under the Jobs tab. Look for the failed Recovery Point job that was run after a mirrored database failover. Look for a Consistency Check (CC) job to run 30 minutes later on the same data source. When that CC runs to completion, a Recovery Point will be created as well. This can be verified by reviewing the available Recovery Point in the Recovery tab for that data source.
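When you are scanning the Jobs tab, it helps to know roughly where to look for the automatic Consistency Check. The Python sketch below simply adds the 30-minute delay described above to the time of the failed Recovery Point job. The function name and the sample timestamp are hypothetical, and the actual job time in your environment may differ slightly.

```python
from datetime import datetime, timedelta

# Delay before DPM runs the automatic consistency check, as described in this post.
AUTO_CC_DELAY = timedelta(minutes=30)


def expected_consistency_check(failed_job_time: datetime) -> datetime:
    """Estimate when the automatic Consistency Check job should appear."""
    return failed_job_time + AUTO_CC_DELAY


failed = datetime(2009, 3, 2, 14, 15)  # hypothetical failed Recovery Point job
print(expected_consistency_check(failed))  # prints 2009-03-02 14:45:00
```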

Changes in State

In situations where a protected database’s mirrored status changes, DPM will not be able to create a Recovery Point until you stabilize the data source because DPM will detect the change in state. Anytime a database goes from non-mirrored to mirrored or vice versa, the data source will need to be removed and added to the protection group again.

The reason is that during the creation of the protection group, DPM scans the server on which the database resides to see if it is mirrored or clustered with other servers. When mirroring is enabled or disabled on the database, DPM needs to add or remove the dependent servers for the database in question.

Once the SQL datasource has been updated and added to the protection group, the Initial Replication will run and a new recovery point will be created. A word of caution: when reviewing the recovery points for the database after a change in mirroring state has been made, you will see two entries for the same database and each will have recovery points available for restore.

DPM does not distinguish in the Recovery tab between the recovery point copies that were and were not mirrored.


Summary

In this blog, we discussed how to protect a mirrored database and how to stabilize DPM’s protection of the mirrored data source when failover occurs. We also discussed what happens when mirroring is enabled or disabled on an already protected database. Take a look at the included video content for a demonstration of the actions covered here.

Vic Reavis
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Issues after moving Virtual Machines from one Hyper-V parent (host) to another

 

If you move one or more of your virtual machines (VM) from one Hyper-V parent to another and the VM begins experiencing issues, it could be because the version level of Hyper-V between the two parents is different, and thus the version of the Integration Services may be different. This can result in some strange and unwanted behavior.

I recently experienced an issue in which I moved a VM from one parent that I was planning on rebuilding to another. The VM started and ran fine on the new parent system, but after 30 minutes or so, the virtual network adapter lost connection to the network and eventually was changed to a status of Disabled in Device Manager. Uninstalling and re-installing the adapter did not resolve the problem.

I eventually checked the version of the driver for the virtual network adapter and it displayed a strange version of 21.x.x.x. (I don’t recall the exact version), which is not at all close to what it should be. Since this driver is provided by the Integration Services, I uninstalled them from the VM, restarted it, and then installed them again. This resolved the problem. The network driver version is now 6.0.6001.18010, which is the current version.

The version of Integration Services is closely tied to the version of Hyper-V that is installed. Other issues may occur if these are mismatched; the issue with the network adapter was simply the first one I experienced. If you move VMs from a parent that is running the beta or RC version of Hyper-V, remember to uninstall the Integration Services (listed as Hyper-V Guest Components in Add or Remove Programs), restart the VM, and then install the version from the new parent.
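If you have a number of VMs to check, comparing driver versions by eye gets tedious. Here is a minimal Python sketch of the comparison, assuming you have already collected each VM's virtual network adapter driver version as a dotted string. The expected value is the version quoted in this post; the function names and the mismatched sample value are hypothetical.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version string into a tuple of integers for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


# Driver version reported in this post after reinstalling Integration Services.
EXPECTED = parse_version("6.0.6001.18010")


def integration_services_mismatch(driver_version: str) -> bool:
    """Return True when the driver version differs from the expected one."""
    return parse_version(driver_version) != EXPECTED


print(integration_services_mismatch("6.0.6001.18010"))  # prints False
print(integration_services_mismatch("6.0.6001.17000"))  # hypothetical older build, prints True
```

Any VM that reports a mismatch is a candidate for the uninstall, restart, and reinstall sequence described above.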

 

Author:  Kevin McNiel
Senior Support Engineer
Microsoft Corporation

Windows 2008 Failover Cluster Validation Fails on ‘Validate SCSI-3 Persistent Reservation’

We’ve been seeing a lot of calls lately from customers who are running the validation that’s required prior to installing and configuring failover clustering, and the validation fails in the ‘Storage’ portion of the tests. The specific error seen in the validation report is:

image

If you click on the ‘Validate SCSI-3 Persistent Reservation’ link in the report, it will take you to the detail section.

Validate SCSI-3 Persistent Reservation

    Validate that storage supports the SCSI-3 Persistent Reservation commands.
    Validating Cluster Disk 0 for Persistent Reservation support
    Registering PR key for cluster disk 0 from node node1.cluster.com
    Failed to Register PR key for cluster disk 0 from node node1.cluster.com status 1
    Cluster Disk 0 does not support Persistent Reservation

If you dig a little deeper, you can also look at the ValidateStorage.txt file that’s located in the Windows\Cluster\Reports directory.

00000fd4.00000fd8::15:56:45.857 CprepDiskPRUnRegister: Enter CprepDiskPRUnRegister: ulSignature 0xd0426bb2
00000fd4.00000fd8::15:56:45.857 CprepDiskFind: found disk with signature 0xd0426bb2
00000fd4.00000fd8::15:56:49.977 CprepDiskPRUnRegister: Failed to unregister PR key, status 1117
00000fd4.00000fd8::15:56:49.977 CprepDiskPRUnRegister: Exit CprepDiskPRUnRegister: hr 0x8007045d
00000fd4.00000fd8::15:56:54.097 CprepDiskFind: found disk with signature 0xd0426bb2
00000fd4.00000fd8::15:56:54.097 CprepDiskIsPRPresent: Failed to read PR reservations, status 0
00000fd4.00000fd8::15:56:54.097 CprepDiskIsPRPresent: Exit CprepDiskIsPRPresent hr 0x0, Present 0
00000fd4.00000fd8::15:56:54.097 CprepDiskFind: found disk with signature 0xd0426bb2
00000fd4.00000fd8::15:56:54.097 DoIoctlAndAlloc: ControlCode 0x70050, retCode 1, status 122
00000fd4.00000fd8::15:56:54.097 CprepDiskGetArbSectors: Exit CprepDiskGetArbSectors: hr 0x0, SectorX 11 SectorY 12
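The "hr" values in this log are HRESULTs, and when the facility is Win32 the low 16 bits carry an ordinary Win32 error code. The Python sketch below decodes the hr value from the CprepDiskPRUnRegister exit line shown above. The function name is hypothetical; the bit layout is the standard HRESULT format.

```python
import re

FACILITY_WIN32 = 7  # facility value used when an HRESULT wraps a Win32 error


def win32_code_from_hresult(hresult: int):
    """Return the wrapped Win32 error code, or None for other facilities."""
    facility = (hresult >> 16) & 0x7FF  # facility occupies bits 16-26
    if facility != FACILITY_WIN32:
        return None
    return hresult & 0xFFFF  # error code occupies the low 16 bits


line = ("00000fd4.00000fd8::15:56:49.977 CprepDiskPRUnRegister: "
        "Exit CprepDiskPRUnRegister: hr 0x8007045d")
match = re.search(r"hr (0x[0-9a-fA-F]+)", line)
hresult = int(match.group(1), 16)
print(win32_code_from_hresult(hresult))  # prints 1117
```

The result, 1117, matches the "status 1117" on the preceding log line. That is ERROR_IO_DEVICE, which fits storage that rejected the persistent reservation command.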

So what is a “Persistent Reservation” (PR) and why should you care? A PR is a SCSI command that clustering uses to protect LUNs. When a LUN is reserved, no computers on the SAN can access the disk except the ones the cluster controls. This is important to prevent other machines from accessing the disk and corrupting the data on it.

Validate is a functional test tool that verifies that your storage supports all the necessary SCSI commands that clustering requires. It is critical that the Validate tests pass for your cluster to work correctly. The Storage tests are by far the most important, and they should not be dismissed!

If you are reading this blog, then the bad news is that Validate has probably identified that your storage does not support Persistent Reservations, and is not compatible with Windows Server 2008 Failover Clustering. The good news is that it most likely will work; you just have to do a few things! All storage vendors, and almost all currently shipping models, support Win2008 Failover Clustering, but many require firmware updates or configuration settings. Microsoft has been working closely with partners such as HP, EMC, IBM, NetApp, HDS, Fujitsu, Lefthand, Equallogic, Xiotech, NEC, LSI, Infortrend, 3PAR, Intransa, FalconStor, Nexsan, and even more… and they all work!

First things first, call your storage vendor and ask them if your storage is compatible AND configured for use with Windows Server 2008 Failover Clustering.

There are two things to verify:

  1. Correct firmware version
  2. Correct configuration settings

The storage vendor is the right source for how to correctly configure their arrays to work with Failover Clustering. We can’t post the specific steps for each vendor, but as we become aware of publicly available documentation from the SAN vendors, we’ll add it to this post.

HP has a publicly available document detailing the steps needed to get SCSI-3 PRs to work with their hardware (pages 16-18):

Implementing Microsoft® Windows® Server 2008 Service Pack 2 beta on HP ProLiant servers

Although these vendor links may contain steps to resolve the PR problem, we still strongly recommend engaging the vendor directly to verify that these storage configuration changes are current, appropriate for your environment and hardware, and non-destructive to your data. Microsoft makes no guarantees on any of the third-party links we are providing. They are solely intended to give you information to discuss with your particular vendor.

Jeff Hughes
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

SQL Database Mirroring for DPM Administrators

This blog focuses on providing a very light-weight overview of how to implement SQL database mirroring and it is not intended to be a complete how-to reference. DPM administrators benefit by understanding, even at a basic level, how to install and configure the applications DPM will be protecting. This blog is intended to provide some of that knowledge as a starting point.

A video demonstration is provided to help guide you through the database mirroring scenario.

Need to Know

Support for SQL Database mirroring was introduced with SQL 2005 SP1. Before that, it was not a supported feature in SQL Server 2005.

Before implementing database mirroring, you must confirm that the Recovery Model of the database is set to FULL. Full is the only recovery model that mirroring supports. The Full recovery model provides the normal database maintenance model for databases where durability of transactions is necessary. Log backups are also required. This model fully logs all transactions and retains the transaction log records until after they are backed up. The full recovery model allows a database to be recovered to the point of failure, assuming that the tail of the log can be backed up after the failure. The full recovery model also supports restoring individual data pages. For more information, see Backup Under the Full Recovery Model.

SQL Server credentials must match across the servers involved in the mirror or credentials must be provided during the configuration of the mirrored database. In the example which follows, we will work under the premise that the SQL Server service accounts on all servers involved are using the same credentials.

If the mirror will span multiple domains, then Certificates will need to be implemented. Since the purpose of this blog is to provide DPM administrators with a cursory understanding of how to implement mirroring, the use of certificates will not be covered here.

When seeding the mirrored database, the recovery of the database must be done using the RESTORE WITH NORECOVERY option. This option leaves the database in a restoring state, where it is not accessible to users, but allows the Principal SQL Server to still apply transactions to the mirrored copy. SQL Administrators will not be able to pull up the properties of the mirrored database while in this configuration.

Witness servers are a feature that allows SQL to detect failures and automatically fail over a database. Without a witness server (which is a 3rd SQL server distinct from the Principal and Mirror servers), manual failover is the only way to switch the mirror roles.

SQL Server 2008 introduced a new feature called ‘FileStream’. SQL Database Mirroring does not support this feature, and if it is configured, a detailed error will appear during the creation of the mirror.

Mirroring a Database

The mirroring process in SQL 2005 and SQL 2008 overlaps enough that covering one of them gives a sound understanding of the other. For this reason, we will cover the SQL 2005 database mirroring process; a SQL Server 2008 database can be mirrored using the same steps. Further details can be found in the SQL Server section of the TechNet web site at http://technet.microsoft.com/en-us/library/bb545450.aspx.

Mirroring AdventureWorks

As with any database that will be mirrored, start by opening the properties and setting the Recovery model to “Full”, if it is not already.

clip_image002

Once this has been set, make a backup of the database. Make sure it is a Full backup and, for convenience, make sure that you are not using the Append option so that the BAK file size is minimized and the restore is less confusing.

clip_image004

On the mirror server, you will need to restore the backup of the database that you just made as part of the seeding process.

clip_image006

When you perform the restore, make sure that you choose the middle radio button as shown below. If the restore is performed without the “RESTORE WITH NORECOVERY” option, SQL will not be able to set up mirroring between the two servers.

clip_image008
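
The backup and restore steps above can also be expressed as T-SQL. The following Python sketch only assembles the statement text; it is an illustration under stated assumptions, not the method used in the screenshots. The backup path is hypothetical and must be reachable from both servers (or the file copied across), WITH INIT is used here as the scripted counterpart of not appending to an existing backup set, and the function names are made up for this example.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value):
    """Render a Unicode string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def seeding_statements(database, backup_path):
    """Return the full backup (principal) and NORECOVERY restore (mirror) statements."""
    db = quote_ident(database)
    disk = quote_literal(backup_path)
    return {
        # WITH INIT overwrites existing backup sets in the file instead of appending.
        "principal": f"BACKUP DATABASE {db} TO DISK = {disk} WITH INIT;",
        # WITH NORECOVERY leaves the copy in a restoring state, ready for mirroring.
        "mirror": f"RESTORE DATABASE {db} FROM DISK = {disk} WITH NORECOVERY;",
    }


if __name__ == "__main__":
    # Hypothetical backup location used only for illustration.
    for server, statement in seeding_statements(
        "AdventureWorks", r"C:\Backup\AdventureWorks.bak"
    ).items():
        print(f"-- run on the {server} server")
        print(statement)
```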

Now that the mirror server has a copy of the database restored to it, go to the Principal server and begin the mirror setup. When the Mirroring page of the database properties appears, click on the Configure Security button in the upper right corner of the window as shown below.

clip_image010

Decide whether you will be using a 3rd SQL Server as a Witness server and then click on Next.

If you have not set up mirroring previously, you will need to specify the port for mirroring to use and give the endpoint a name. This endpoint will be created on each server and can be viewed from SQL Management Studio.

Configure each server and confirm connectivity as well as the security credentials. Once connectivity has been confirmed and the SQL Service account credential information has been specified, click on Finish to allow the mirroring to be established between the two servers.

Mirroring does not start automatically once it is configured, so you will have to click on the Start Mirroring button as shown below to begin sending transactions from the Principal to the Mirror server. If all goes well, you will see the databases appear like the following.

clip_image012
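
For reference, the endpoint and partnership that the wizard sets up can be sketched in T-SQL as well. The Python below only builds the statement text and rests on several assumptions: the endpoint name "Mirroring" and port 5022 are illustrative defaults, both servers use the same port, no witness is involved, and the host names are hypothetical. It also assumes matching service account credentials, as in the walkthrough above, so no logins or certificates are created.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def mirroring_statements(database, principal_host, mirror_host,
                         port=5022, endpoint="Mirroring"):
    """Return (server_role, T-SQL) pairs in the order they should be run."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    for host in (principal_host, mirror_host):
        if "'" in host:
            raise ValueError(f"unexpected quote in host name: {host}")
    create_endpoint = (
        f"CREATE ENDPOINT {quote_ident(endpoint)} STATE = STARTED "
        f"AS TCP (LISTENER_PORT = {port}) "
        "FOR DATABASE_MIRRORING (ROLE = PARTNER);"
    )
    db = quote_ident(database)
    return [
        ("principal", create_endpoint),
        ("mirror", create_endpoint),
        # The mirror is pointed at the principal first, then the principal at the mirror.
        ("mirror", f"ALTER DATABASE {db} SET PARTNER = 'TCP://{principal_host}:{port}';"),
        ("principal", f"ALTER DATABASE {db} SET PARTNER = 'TCP://{mirror_host}:{port}';"),
    ]


if __name__ == "__main__":
    # Hypothetical server names used only for illustration.
    for role, statement in mirroring_statements(
        "AdventureWorks", "sql1.contoso.local", "sql2.contoso.local"
    ):
        print(f"-- run on the {role} server")
        print(statement)
```

The order matters in this sketch: the partner is set on the mirror server before it is set on the principal.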

If you want to try to fail over the database so that Principal and Mirror roles are switched, click on the Failover button. This will cause a warning dialog to appear confirming your decision. Choose Yes to allow the failover to occur.

clip_image014
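
The Failover button has a T-SQL counterpart. As a rough sketch, the Python below builds a query that reports the current mirroring role and state, followed by the manual failover statement, which is run on the Principal server. The function name and dictionary keys are made up for this illustration, and nothing here connects to SQL Server.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value):
    """Render a Unicode string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def failover_statements(database):
    """Return a state check and the manual failover statement for a mirrored database."""
    return {
        # Shows whether this server is PRINCIPAL or MIRROR and the mirroring state.
        "check_state": (
            "SELECT mirroring_role_desc, mirroring_state_desc "
            "FROM sys.database_mirroring "
            f"WHERE database_id = DB_ID({quote_literal(database)});"
        ),
        # Run on the current principal to swap the roles.
        "failover": f"ALTER DATABASE {quote_ident(database)} SET PARTNER FAILOVER;",
    }


if __name__ == "__main__":
    for label, statement in failover_statements("AdventureWorks").items():
        print(f"-- {label}")
        print(statement)
```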

Basic Troubleshooting

As a DPM administrator, there is often not a lot of time allocated to troubleshoot setup and installation issues and these types of issues are common when working with unfamiliar technologies. Here are a few of the most common database mirroring configuration issues that arise. Take a look at http://TechNet.Microsoft.com for more detailed information about other failures that you encounter.

If the database is not in Full Recovery mode, the following is displayed. If the restored copy of the database did not have Full Recovery mode set when it was backed up, then you will need to delete it and create a new backup after making this setting change.

clip_image016

If the mirror copy was not restored using the “RESTORE WITH NORECOVERY” option, the following error will be displayed. Simply remove the restored copy of the database and restore it using this option to work around this issue.

clip_image018

In a busy environment, transactions may be committed on the Principal after the full backup was taken, so they are missing from the mirror copy. If this is the case, the following error may be displayed. You may need to place the Principal database in single-user mode and back up the transaction log. Once the log backup has been restored to the mirror server (again using the “RESTORE WITH NORECOVERY” option), try to configure mirroring again.

clip_image020
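
The log backup and restore described above might look like the following in T-SQL. This Python sketch only builds the statement text; the backup path is hypothetical, the function name is made up for this example, and the optional single-user step is left out.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(value):
    """Render a Unicode string literal, doubling any single quote."""
    return "N'" + value.replace("'", "''") + "'"


def log_catch_up_statements(database, log_backup_path):
    """Return the log backup (principal) and NORECOVERY log restore (mirror) statements."""
    db = quote_ident(database)
    disk = quote_literal(log_backup_path)
    return {
        "principal": f"BACKUP LOG {db} TO DISK = {disk} WITH INIT;",
        # NORECOVERY keeps the mirror copy in the restoring state.
        "mirror": f"RESTORE LOG {db} FROM DISK = {disk} WITH NORECOVERY;",
    }


if __name__ == "__main__":
    # Hypothetical log backup location used only for illustration.
    for server, statement in log_catch_up_statements(
        "AdventureWorks", r"C:\Backup\AdventureWorks_log.trn"
    ).items():
        print(f"-- run on the {server} server")
        print(statement)
```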

Summary

Database mirroring can be established where one side of the mirror is on a Failover Cluster. You can even have the same SQL server acting in a mirroring partnership with 3 or 4 or more other SQL Servers. This is not a recommended practice, however, because this configuration can create confusion and, when it comes to DPM protection, all servers participating in a mirroring partnership must have the DPM agent installed. If the agent is not installed on all servers, DPM will not be able to protect the mirrored database.

In an upcoming blog, we will cover the protection of a mirrored database using DPM 2007 SP1. Subsequent blogs will also discuss the recovery of a mirrored database to its original location. When reading the blog on recovering mirrored databases, you may need to refer back to this blog as the recovery process requires the database mirror to be broken and re-established in order to complete the restore.

Vic Reavis
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Error Message: Windows cannot access the required file d:\sources\install.wim when replacing install.wim with custom install.wim

My name is Scott McArthur, a Senior Support Escalation Engineer who focuses on deployment issues. My topic for today’s blog entry involves an issue where a customer was deploying Windows Server 2008 by replacing the default install.wim on the Windows Server 2008 media with a custom install.wim. 

When booting from the DVD, they encountered the following error:

Windows cannot access the required file d:\sources\install.wim.  Make sure all files that are required for installation are available, and restart the installation.

The x:\windows\panther\setuperr.log contained the following:

SetWindowsImageInfoOnBB:Failed while updating EditionID for volume PID.[gle=0x00000057]

CallBack_SetImageInfoOnBB:Failed to read and cache the Windows image's metadata; GLE is [0x0]

CallBack_SetImageInfoOnBB:An error occurred while trying to read and cache the images' metadata;

This error can occur if the default install.wim located in the sources folder is replaced with a custom install.wim that was not captured by using the /flags switch.

When you deploy a custom install.wim, Image Based Setup (IBS) must know the edition of the image.

You can verify this by running the imagex /info command against the install.wim.  For example:

imagex /info d:\sources\install.wim

The image metadata must contain an entry similar to this.  If you do not see the following metadata entry, then you know the image was not captured by using /flags.

<FLAGS>SERVERENTERPRISE</FLAGS>
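
If you have many images to check, the search for the FLAGS entry can be automated. The following Python sketch scans text saved from imagex /info and reports the FLAGS value for each image, or None when the entry is missing. It assumes the output contains an XML fragment with IMAGE elements that carry an INDEX attribute; the sample text is a trimmed, hypothetical stand-in for real output, and the function name is made up for this example.

```python
import re


def image_flags(info_output):
    """Map each image index found in the text to its FLAGS value (None if absent)."""
    result = {}
    pattern = r'<IMAGE INDEX="(\d+)">(.*?)</IMAGE>'
    for match in re.finditer(pattern, info_output, re.DOTALL):
        index = int(match.group(1))
        flags = re.search(r"<FLAGS>(.*?)</FLAGS>", match.group(2), re.DOTALL)
        result[index] = flags.group(1).strip() if flags else None
    return result


if __name__ == "__main__":
    # Hypothetical, trimmed sample; real output contains many more elements.
    sample = """
    <WIM>
      <IMAGE INDEX="1"><NAME>custom</NAME></IMAGE>
      <IMAGE INDEX="2"><NAME>stock</NAME><FLAGS>SERVERENTERPRISE</FLAGS></IMAGE>
    </WIM>
    """
    for index, flags in image_flags(sample).items():
        status = flags if flags else "missing /flags metadata"
        print(f"image {index}: {status}")
```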

When capturing the image, you must use the following syntax:

Imagex /capture c: z:\data.wim "drive c" /flags "EditionId"

Where EditionID is:

HomeBasic, HomePremium, Starter, Ultimate, Business, Enterprise, ServerDatacenter, ServerEnterprise, ServerStandard
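
A mistyped EditionID is easy to miss, so here is a small Python sketch that checks the value against the list above before building the capture command as an argument list. The case-insensitive match is an assumption made for convenience, the function names are made up for this example, and the drive, file, and image names are the ones from the syntax line above.

```python
EDITION_IDS = (
    "HomeBasic", "HomePremium", "Starter", "Ultimate", "Business",
    "Enterprise", "ServerDatacenter", "ServerEnterprise", "ServerStandard",
)


def canonical_edition(edition_id):
    """Return the listed spelling of an EditionID, or raise if it is not in the list."""
    for known in EDITION_IDS:
        if known.lower() == edition_id.lower():
            return known
    raise ValueError(f"unknown EditionID: {edition_id!r}")


def capture_command(source_drive, wim_path, image_name, edition_id):
    """Build the imagex capture command as an argument list."""
    return [
        "imagex", "/capture", source_drive, wim_path, image_name,
        "/flags", canonical_edition(edition_id),
    ]


if __name__ == "__main__":
    print(capture_command("c:", r"z:\data.wim", "drive c", "ServerEnterprise"))
```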

To fix this without re-capturing the image, do the following on a computer that has the WAIK installed:

1.  Click Start, All Programs, Microsoft Windows AIK, and select “Windows PE Tools Command Prompt”

2.  Copy the entire contents of the DVD to c:\flat

3.  Run the following command to add the EditionID to the metadata

Imagex /info c:\flat\sources\install.wim 1 "image-name" "description" /flags "EditionID"

4.  Run the following command to re-create the ISO

oscdimg.exe -betfsboot.com -n -h -m c:\flat c:\2008.iso
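
Steps 3 and 4 can be wrapped in a short script if you repair images often. The Python sketch below only builds and prints the two commands from the steps above; it does not run them. The function names are made up for this example, the image name and description are placeholders, and the commands are assumed to be run from the Windows PE Tools Command Prompt so that imagex, oscdimg.exe, and etfsboot.com can be found.

```python
import subprocess


def set_flags_command(wim_path, index, name, description, edition_id):
    """Build the imagex command that writes the EditionID into the image metadata."""
    return [
        "imagex", "/info", wim_path, str(index), name, description,
        "/flags", edition_id,
    ]


def rebuild_iso_command(flat_dir, iso_path, boot_sector="etfsboot.com"):
    """Build the oscdimg command that re-creates the bootable ISO."""
    # oscdimg takes the boot sector file directly after -b, with no space.
    return ["oscdimg.exe", f"-b{boot_sector}", "-n", "-h", "-m", flat_dir, iso_path]


if __name__ == "__main__":
    commands = [
        set_flags_command(
            r"c:\flat\sources\install.wim", 1,
            "image-name", "description", "ServerEnterprise",
        ),
        rebuild_iso_command(r"c:\flat", r"c:\2008.iso"),
    ]
    for command in commands:
        # list2cmdline applies Windows quoting rules to arguments containing spaces.
        print(subprocess.list2cmdline(command))
```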

Note: If you use Microsoft Deployment Toolkit to create and capture your images, it automatically adds the /flags switch.  Microsoft Deployment Toolkit is the preferred tool to use for deploying Windows.

Additional information on the /flags switch can be found in the Windows AIK help files.  Specifically, see the imagex command line options in WAIK.CHM.

Scott McArthur
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
