Kevin Remde's IT Pro Weblog
Okay.. I feel like sharing this because it’s pretty stupid, but in a geeky-sort-of-way the solution was interesting enough to share. Think Chicken & Egg. (or “Catch-22”).
As the title of this post suggests, the subject is Windows Failover Clustering. For those of you who are not familiar with it, Windows Failover Clustering is a built-in feature available in Windows Server 2008 R2 Enterprise and Datacenter editions. Along with shared storage (for which we used the free iSCSI Software Target from Microsoft to implement), it provides a very easy-to-configure and use cluster for serving up highly available services. In our case, this would be virtual machines running on two clustered virtualization hosts.
As a training platform, but primarily for use as a demonstration platform for our presentations (and certainly more real-world than one laptop alone can demonstrate), our team received budget to acquire several Dell servers. We found a partner (Thank you Groupware Technology!) who was willing to house the servers for us. The idea was that we, the 12 IT Pro Evangelists (ITEs) in the US would travel to San Jose in groups of 3-4 and do the installation of a solid private cloud platform, using Microsoft’s current set of products (Windows Server 2008 R2 and System Center). This past week I was fortunate enough to be a member of the first wave, along with my good buddies Harold Wong, Chris Henley, and John Weston. The goal was to build it, document it, and then hand if off to the next groups to use our documentation and start from scratch, eventually leaving us with great documentation, and a platform to do demonstrations of Microsoft’s current and future management suites.
We all arrived in San Jose Monday morning, and installed all 5 server operating systems in the afternoon. We installed them again Tuesday morning.
It’s a long story involving how Dell had configured the storage we ordered. We needed to swap some drives between machines and set up RAID and partitioning in a way that was more workable to our goals. I’ll leave that discussion for one of my teammates to blog about.
Anyway, once we had the servers up, I installed and configured the Microsoft iSCSI software target on our “VMSTORAGE” server, and configured two other servers as Hyper-V Hosts in a host cluster, with Windows Failover Clustering and CSV storage. By the end of the week we had overcome hardware, networking, missed-BIOS-checkmarks (did you know that Hyper-V will install, but you can’t actually use it if you somehow miss enabling Virtualization support on the CPU on one of the host cluster machines? Who’da thunk it?!) , we had 5 physical and a half-dozen virtual servers installed and running, with Live Migration enabled for the VMs in the cluster. Our domain had two domain controllers; one as a clustered, highly-available VM, and the other as a VM that was not-clustered, but still living in the CSV volume; C:\ClusterStorage\Volume1 in our case. (That’s a hint, by the way. Do you see the problem yet?)
One of the many hurdles we had to overcome early on was an inadequate network switch for our storage network. 100Mbps wasn’t going to cut it, so until our Gig-Ethernet switch arrived on Friday, Harold used his personal switch that he carries with him. On Friday before we left for the airport, we shut down the servers and let the folks there install the new switch. Harold need his switch back at home.
But in restarting the servers, here’s the catch: Windows Failover Clustering requires Active Directory. The storage mount-point (C:\ClusterStorage\Volume1) on our cluster nodes requires the Failover Clustering. And remember where I said our domain controllers were?
“Um.. So… Your DCs couldn’t start, because their location wasn’t available. And their location wasn’t available, because the DC’s hadn’t started. And your DC’s couldn’t start, because their storage location wasn’t available, and… !!”
Bingo. Exactly. Chicken, meet Egg. It was our, “Oh shoot!” moment. (Not exactly what I said, but you get the idea.)
“So how did you fix it?”
I’ll tell you…
Our KVM was a Belkin unit that supports IP connections and access to the machines through a browser. We configured it to be externally accessible. So I was able to use that to get in to the physical servers and try to solve this “missing DCs” puzzle; though to make matters much more difficult, the web interface for that KVM is really, REALLY horrible. The mouse didn’t track to my mouse directly, no ALT+ key support, TAB key didn’t work.. I ended up doing a lot of the work from a command-line simply because it was easier than trying to line up and click on things! Perhaps in a future blog post I will give Belkin a piece of my mind regarding this piece-of-“shoot” device…
So, my solution was based on two important facts:
“Ah ha! So on the storage machine, you mounted the .VHD file that was your cluster storage disk, and you copied out the .VHD file from one of the domain controller VMs!”
Yeah.. that’s basically it. Though I did have one problem. The .VHD file was in-use; probably by the iSCSI Software Target service. So when I tried to attach it, the OS wouldn’t let me.
Fortunately I found that by stopping that “Microsoft iSCSI Software Target” service on the storage server (I also stopped the “Cluster Service” on the two Hyper-V cluster nodes), I was able to attach to the .VHD, navigate into it, and copy out the .VHD disk for the needed Domain Controller. (Actually, I also removed the .VHD from its original location. I didn’t want the DC to come alive again when the storage came back online, if the identical DC was already awake and functioning.)
So after that, it was as simple (?) as this:
Everything came back to life almost immediately; including the Remote Desktop Gateway that we had configured so that we could remotely connect to the machines in a more meaningful, functional way.
So the moral of the story is: When you’re building your own test lab, or even considering where to put your DCs in your production environment, make sure you have at least one DC that comes online without depending upon other services (such as high-availability solutions) that, in turn, require a DC to be functioning.
All-in-all, it was a great week.
Do you have any similar stories? Share them with us in the comments. We’d love to hear ‘em!
Good post, Kevin. Sounds like you earned an MCSE the hard way. =-) If you play "Top this story", you'll probably win. That said, you probably can give the Belkin folks a LOT of good information about how to improve their product to be more usable in extreme situations.