Wednesday's event in London was fun, and we managed not to talk too much about VMware's mishap: I joked at one point that the VMware folks would be making T-shirts in a few months with "I Survived August 12th, 2008". (Buried in the back of my mind was an old post of KC's about a "Bedlam event".) The thought came up afterwards that maybe we should get some done with "I was running Hyper-V" on the back.
We took the same "Scrapheap challenge" approach to building a virtualization cluster that we used for Virtual Server a year ago: scavenge what you need from round the office and build a cluster, with the message: "You can do this at home folks - just don't do it in production". Seriously. For our two cluster nodes, the management network, heartbeat network, shared storage network and the network for serving clients were all the same 100Mbit desktop hub. It works; use it to learn the product, but don't even think about using it in production. By the way, these days we have "Compute Clusters" and "Network Load Balanced clusters", so we try to make sure we refer to traditional clustering as Failover Clustering.
One of our computers was a Dell Inspiron 5160 which is 4 or 5 years old. It ran as a domain controller (both cluster nodes were joined to the domain), hosted DNS to support this, and gave us our shared storage using a "hybrid" form of Windows Storage Server - basically the internal iSCSI target bits on Server 2008. I think machines of that age have 4200 RPM disks in them, Steve thinks it's 5400; either way, with the network we had for iSCSI it was no speed demon (again this was intentional - we didn't want to show things on hardware so exotic that no-one could replicate it when they got home).
We set up two iSCSI targets on the 5160, each with a single LUN attached to it: one small one to be our cluster quorum, one big one to hold a VM. In rehearsal we'd connected to these from one of our cluster nodes, brought the disks on-line (from the disk management part of Server Manager), formatted them and copied a VHD for a virtual machine to the large one. I've found that once the iSCSI initiator (client) is turned on from the cluster nodes, the iSCSI target (server) detects its presence and the initiators can be given permissions to access the target.
Our two cluster nodes were called Wallace and Gromit. They're both Dell Latitude D820s, although Wallace is 6 months older with a slightly slower CPU and a slightly different NIC. Try to avoid clusters with different CPU steppings, and mixing Intel and AMD processors in the same cluster can be expected to fail. Both were joined to the domain and both had static IP addresses. Both had the standard patches from Windows Update, including - crucially - KB950050, which is the update to the release version of Hyper-V. We didn't install the optional enhancements for clustering Hyper-V. On each one, in the iSCSI initiator control panel applet, we added the 5160 as a "Target Portal" (i.e. a server with multiple targets), and then on the targets page we added the two targets and checked the box saying automatically restore connections. The plan was to disconnect the iSCSI disks on Wallace, but they were left connected at the end of rehearsal.
Gromit had Hyper-V and Failover Clustering installed, but we wanted to show installing both on Wallace, so we installed Hyper-V first: in Server Manager, add a role, select Hyper-V and keep clicking next. On these machines it takes about 7 minutes with 2 reboots to complete the process. One important thing if you are clustering Hyper-V: the network names must be the same on all the nodes of the cluster. It is usually best NOT to assign networks to Hyper-V in the install wizard, and to do it in Hyper-V's network manager (or from a script) instead, to make sure the names match.
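As a rough illustration of the "names must match" check, here is a minimal Python sketch. It assumes you have already collected the virtual network names from each node by whatever means you prefer; the function name, the dictionary layout and the network names are all hypothetical, and the comparison is a plain exact string match, which is an assumption rather than a statement of how the clustering tools compare names.

```python
def find_network_name_mismatches(networks_by_node):
    """Return {network_name: [nodes that lack it]} for every name
    that is not defined on all nodes. An empty dict means all match.

    networks_by_node is a hypothetical structure: node name -> list of
    virtual network names gathered from that node beforehand.
    """
    # Collect every network name seen on any node.
    all_names = set()
    for names in networks_by_node.values():
        all_names.update(names)

    mismatches = {}
    for name in sorted(all_names):
        # Exact string comparison (assumption: names must be identical).
        missing = sorted(
            node for node, names in networks_by_node.items()
            if name not in names
        )
        if missing:
            mismatches[name] = missing
    return mismatches


if __name__ == "__main__":
    # Illustrative data only, not the configuration used at the event.
    nodes = {
        "Wallace": ["External Network", "Internal"],
        "Gromit": ["External", "Internal"],
    }
    for name, missing in find_network_name_mismatches(nodes).items():
        print(f"'{name}' is missing on: {', '.join(missing)}")
```

Running a check like this before creating the cluster makes a mismatch show up as a short list rather than as a VM that won't move between nodes.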
Then we installed Failover Clustering from the features part of Server Manager, no reboot required. We went straight into the Failover Clustering MMC (in the admin tools part of the start menu) and chose Create a cluster; it only needed 3 pieces of information.
At the end of the process we had a report to review - you can validate a cluster configuration and check the report without actually creating it. In the disk manager part of Server Manager we saw that the state of the iSCSI disks had changed to reserved on both nodes, and one node will see the disks as available - in our case this was Wallace. We found that the cluster set-up wizard made the big disk the cluster quorum and left the small one for applications; to fix this we right-clicked the cluster in the Failover Clustering MMC and, from the "more actions" menu, went through cluster settings/Quorum settings and changed it.
The next step was to build a VM, and we just went through the New Virtual Machine Wizard in the Hyper-V MMC on Wallace. The important part was to say that the configuration was not in the default location but on the shared clustered drive. We didn't connect the demo machine to a network (we hadn't configured matching external networks on the two nodes), and picked a pre-existing virtual hard disk (VHD) file on the same disk. We left the VM un-started, and we should have set the machine shutdown settings for the VM - by default if the server is shut down the VM will go into a saved state, which is not what you want on a cluster (if you follow the link to the clustering information from the Hyper-V MMC it explains this).
Finally, back in the Failover Clustering MMC, we chose add clustered application/service and selected Virtual Machine from the list of possible services; the clustering management tools discover which VMs exist on the nodes and are candidates for clustering. We selected our VM and clicked through the wizard. In clustering parlance we brought the service on-line - or as most people would say, we started the VM. Steve showed the VM, which was running Windows Server Core. We don't bother to activate demo VMs, and although this one had passed the end of its grace period it was still running [Server 2008 doesn't shut you out when it decides it's expired]. I killed the power on Wallace and switched the screen to Gromit to see that the virtual machine was in the middle of booting back into life. From starting the presentation to this point had taken 35 minutes.
We showed "quick migration" - which is simply moving the cluster service from one node to another. With quick migration we put the VM into a suspended state on one node, switch the disks over to the other node and restore the VM. How quick this is depends on how much memory the VM has and how fast the disk is. We were using the slowest disk we could and it took around 30 seconds. If total availability is critical then the service in the VM should be clustered, but if it isn't, there's a short period where the service is off-line. Matt chipped in and showed his monster server back in Reading doing a failover and it was very quick - round the one second mark - but each of his disk controllers cost more than our entire setup.
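To put rough numbers on "depends on memory and disk speed", here is a back-of-envelope sketch in Python. It assumes a deliberately simple model that is mine, not anything Hyper-V documents: the VM's memory is written out to the shared disk once, the disks are switched over, and the memory is read back once on the other node. The function name, parameters and figures are all hypothetical and are not measurements from our demo.

```python
def estimate_quick_migration_seconds(vm_memory_mb, write_mb_per_s,
                                     read_mb_per_s, switch_overhead_s=0.0):
    """Rough downtime estimate for a save / switch disks / restore cycle.

    Simplifying assumption: all of the VM's memory is written to shared
    storage once and read back once, at steady throughput.
    """
    if write_mb_per_s <= 0 or read_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    save_s = vm_memory_mb / write_mb_per_s      # suspend: memory to disk
    restore_s = vm_memory_mb / read_mb_per_s    # resume: disk to memory
    return save_s + switch_overhead_s + restore_s


if __name__ == "__main__":
    # Illustrative figures only.
    slow = estimate_quick_migration_seconds(512, 16, 16, switch_overhead_s=2.0)
    fast = estimate_quick_migration_seconds(512, 1024, 1024)
    print(f"slow shared disk: about {slow:.0f} s")
    print(f"fast shared disk: about {fast:.0f} s")
```

Even with such a crude model the point stands: halving the VM's memory or doubling the storage throughput roughly halves the time the service is off-line.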
I'm going to try to capture a video of what we did and post it next week. Watch this space as they say.
woah, this event soundz awsome. ha ha, u guys killd poor wallace! anychance of getting sum footage streamed on2 second life or somethng?