Microsoft Enterprise Platforms Support: Windows Server Core Team
One of the most common issues with Windows Failover Clustering involves the storage attached to the cluster. This post is meant to give a very high-level overview of disk troubleshooting methodologies and how to narrow down the source of problems even before any logs or diagnostic data are gathered. It deals specifically with troubleshooting issues where a 'Physical Disk' resource is in a 'Failed' state. Issues with adding new disks to an existing cluster or migrating disks between clusters will be covered in future posts.
** Note: If you are using third-party disk resources like 'Veritas Volume Manager Disk Group (vxres.dll)' or 'IBM ServeRAID Logical Disk (ipsha.dll)', engage that vendor directly, as this article only pertains to cluster disk resources of type 'Physical Disk (clusres.dll)'.
The first step in troubleshooting disk failures is determining the extent of the problem. A few short tests on the cluster can narrow down where to begin more in-depth troubleshooting. Those tests are what this post covers. In the following paragraphs, I'll describe the possible symptom scenarios and give each scenario a potential root cause.
It's a good idea when troubleshooting disk resource failures to set the disk resource in question to 'do not restart'. This will keep the group the disk is located in from failing back and forth between nodes. Once the resolution has been found, don't forget to set the disk resource back to the default 'restart/affect the group' settings.
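The restart policy can also be changed from a command prompt with cluster.exe, through the resource's RestartAction common property (0 = do not restart, 1 = restart without affecting the group, 2 = restart and affect the group). Here is a minimal Python sketch that builds those command lines. The resource name 'Disk Q:' and the helper names are hypothetical, and the sketch defaults to a dry run so nothing is changed unless you ask for it.

```python
import subprocess

# RestartAction values for a cluster resource (common property).
DO_NOT_RESTART = 0          # leave the resource failed; the group stays put
RESTART_NO_FAILOVER = 1     # restart, but do not affect the group
RESTART_AFFECT_GROUP = 2    # default: restart and affect the group


def restart_action_command(resource_name, action):
    """Build the cluster.exe argument list that sets RestartAction."""
    if action not in (DO_NOT_RESTART, RESTART_NO_FAILOVER, RESTART_AFFECT_GROUP):
        raise ValueError("RestartAction must be 0, 1, or 2")
    return ["cluster", "resource", resource_name, "/prop",
            "RestartAction={}".format(action)]


def apply(command, dry_run=True):
    """Return the command unchanged on a dry run; otherwise execute it."""
    if dry_run:
        return command
    subprocess.run(command, check=True)  # must run on a cluster node
    return command


if __name__ == "__main__":
    # 'Disk Q:' is a hypothetical resource name; substitute your own.
    print(apply(restart_action_command("Disk Q:", DO_NOT_RESTART)))
    # After the fix, put the default policy back.
    print(apply(restart_action_command("Disk Q:", RESTART_AFFECT_GROUP)))
```

Keeping the 'restore' call next to the 'disable' call is deliberate: it is easy to forget to set the resource back once the disk is online again.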
Here are the four possible combinations of symptoms that could help you narrow down where the potential disk problem lies. By failing over the group containing the failed disk resource, you’ll determine whether the issue is specific to that disk or is a broader problem in the disk subsystem. By also failing over groups that contain disks that are working properly, you’ll get a good idea of what is working and can eliminate those areas from your focus.
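The two failover tests give two yes/no answers, which is where the four combinations come from. The sketch below is one way to write that down as a lookup table. The function name and the wording of each scope are my own reading of the two tests, not an official list of root causes, so treat the output as a starting point for where to look.

```python
# Keys: (failed disk comes online on another node,
#        healthy disks stay online when their groups are moved)
# Values: an assumed description of where to focus first.
SCOPES = {
    (True, True): "one disk on one node: that node's view of that disk",
    (True, False): "one node: that node's path to the storage",
    (False, True): "one disk on every node: the disk itself",
    (False, False): "every disk on every node: the shared disk subsystem",
}


def narrow_scope(failed_disk_works_elsewhere, healthy_disks_survive_failover):
    """Map the results of the two failover tests to a likely problem scope."""
    key = (bool(failed_disk_works_elsewhere), bool(healthy_disks_survive_failover))
    return SCOPES[key]


if __name__ == "__main__":
    # Example: the failed disk comes online on the other node, and the
    # healthy disks move around without trouble.
    print(narrow_scope(True, True))
```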
** Disabling the cluster disk driver.
In Device Manager, click View, then Show hidden devices.
On the right, you should now see an icon for the cluster disk driver.
Right-click that driver and choose Properties. On the [Driver] tab, set the startup type to 'Demand'.
Now, set the cluster service to 'disabled' and reboot. Once the server comes up, you will be operating without any cluster components in place. Never do this process on more than one node in a cluster at a time. To reverse this process, set the cluster disk driver to 'system', start the driver, set cluster service to 'automatic' and start the service.
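The same changes can be scripted with sc.exe and net.exe instead of Device Manager. This is a rough Python sketch that only builds the command lists. It assumes the cluster disk driver is registered under the service name 'ClusDisk' and the cluster service under 'ClusSvc'; neither name appears in the steps above, so verify them on your system first. The reboot is left out on purpose.

```python
# Assumed service names; confirm with "sc query" before relying on them.
CLUSTER_DISK_DRIVER = "ClusDisk"
CLUSTER_SERVICE = "ClusSvc"


def disable_commands():
    """Commands to run on ONE node before rebooting without cluster components."""
    return [
        # sc.exe expects "start=" and its value as separate tokens.
        ["sc", "config", CLUSTER_DISK_DRIVER, "start=", "demand"],
        ["sc", "config", CLUSTER_SERVICE, "start=", "disabled"],
    ]


def restore_commands():
    """Commands that reverse the change: driver first, then the service."""
    return [
        ["sc", "config", CLUSTER_DISK_DRIVER, "start=", "system"],
        ["net", "start", CLUSTER_DISK_DRIVER],
        ["sc", "config", CLUSTER_SERVICE, "start=", "auto"],
        ["net", "start", CLUSTER_SERVICE],
    ]


if __name__ == "__main__":
    # Print the commands for review rather than executing them.
    for cmd in disable_commands() + restore_commands():
        print(" ".join(cmd))
```

Printing the commands for review, rather than running them, fits the warning above: this should only ever be done on one node at a time, so a human should decide where each command runs.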
Author: Jeff Hughes
Microsoft Enterprise Platforms Support
Support Escalation Engineer
- Disk resource fails on one node but works on the other node(s) - another possible reason:
For a Microsoft iSCSI architecture, if there are no network connectivity problems and SAN integrity has been checked, verify that the iSCSI Initiator is up to date in Device Manager -> SCSI and RAID Controllers. If not, install the latest applicable version.
The installation package includes:
- iSCSI Port Driver (iscsiprt)
- Initiator Service (iscsiexe.exe)
- Software Initiator (msiscsi.sys)
This is the kernel-mode iSCSI software initiator driver; it connects to iSCSI devices through the Windows TCP/IP stack using NICs.
- Microsoft MPIO Multipathing Support for iSCSI