We're two dates into our roadshow and I've twice been asked to do a comparison of VMware and Microsoft in the high availability area.
So let's go back to basics for a second. Microsoft is involved in lots of areas of software, covering: Operating Systems, several different kinds of Virtualization, server Applications and Management software (a lot of customers are keen to manage VMware with System Center Virtual Machine Manager - which is probably worth its own post). We've got a history with high availability. Back in OS/2 LanManager days we had domains where any one of several machines could validate a logon. When we introduced WINS and DHCP in NT 3.5 we supported multiple servers being able to deliver the same service to the same client. We have Network Load Balancing - Office Communications Server is designed to leverage it, and IIS in Server 2008 is designed to play better with it. We introduced failover clustering 10 years or so ago, and we're up to our 4th generation of it with Server 2008. Exchange, SQL, file shares and virtual machines can all be clustered. Clustering at the application level is THE only way to provide high availability over a wide range of problems. If the hardware fails, if the OS running the server application fails, if the application itself fails... application level clustering saves the day. If an application is critical of itself and can be clustered there is no excuse for not clustering it.
We see the main task of Hyper-V Servers as running a reasonably static collection of Server workloads. That's not to say workloads never move between servers, but they tend to stay put. It's not to say we never run client workloads using Virtualization, but usually Terminal Services is a better way to run many identical "virtual desktops". Running many clients as VMs has a much bigger disk, memory and CPU overhead, but in some cases it is still the best way to go. Companies who can sell you the same solution based on either Terminal Services or Client OS virtualization (ourselves or Citrix) will tend to go the TS route: patching and application deployment are simpler that way too. VMware don't offer that choice.
I talked about applications which are critical of themselves: over on the virtualization blog Jeff talked about consolidating applications which aren't critical individually, but move 5, 10, 20 such apps onto one server and that server becomes critical. If it fails unexpectedly your job's on the line. So, to allow VMs to live on shared storage and be failed over to another machine, VMware have their "HA" option and we use the clustering of Enterprise/Datacenter builds of Windows. A by-product of clustering is the ability to migrate VMs from one box to another - this is quick but not "live": it involves a brief interruption of service.
This is the area where VMware have their major differentiator, VMotion. We know that some customers want to be able to move machines around with no downtime, and we've talked about it for a future version. I want to avoid getting into any criticism of the feature itself - with Microsoft not having it today that would have the tang of sour grapes to it. I don't think it is controversial to say VMware's software costs substantially more than Microsoft's nearest equivalent; to stay in business they need to offer features which justify that cost. VMotion is just such a feature; the problem is that VMotion is touted as the cure for all ills, which it isn't. It lets you say "Move this machine", it copies the machine's memory to another host and switches over in under a second. But VMotion doesn't help with unplanned downtime (Jeff gives chapter and verse on VMware's HA document here). So VMotion helps with planned downtime - patching or upgrading the host. As Jeff points out in a third post, we think most customers - even the ones who have a live migration solution - still warn people the system will go down and do the upgrade during off hours. If both host and guest are running Windows there is the possibility to patch the guests and take them down, patch the host, and then bring everything back up together.
One other thing about VMware's approach is that they make a feature of "sweating" the hardware to a higher level than we do - whether the workloads are client or server ones (see the argument about over-committing memory). This means dynamically allocating resources and being able to move VMs from an overloaded box to an underloaded one. It's really a kind of "grid" computing where the workloads (VMs) float from host to host: cost makes it necessary and VMotion makes it possible. In the Microsoft world we tend to say spend the money you save from cheaper software on more hardware, so you don't have to sweat it as much, and workloads don't need to hop from box to box as frequently.
The problem I have with Microsoft's High Availability/Clustering solution is that it requires a standby server for every physical machine that you want to be backed up for failover.
Whereas VMware can have one standby server for x number of actual servers.
The other annoying thing is that the MS solution needs two LUNs for each of the servers, one for Quorum and one for Storage. VMware shares a single LUN between up to 16 physical servers. So you could have 14 Active and 2 Standby servers for failover protection.
With Hyper-V, one would need 28 servers and 56 LUNs.
Just a correction on the above, one would need 28 servers and 28 LUNs, not 56.
Also, I don't know anybody who uses the overcommit feature in ESX for servers; maybe for VDI, definitely not for servers.
If you include the cost of the extra physical servers that are required for Hyper-V, the total will be about the same as the cost of the VMware software. Except that with VMware you will save on electricity and cooling.
Well, not many people are overcommitting indeed, because of the smart page sharing mechanism VMware created. No need for overcommitting.
And with ESX 3.5 it's 32 Servers in a cluster and/or 32 Servers attached to a single LUN. So make that 32 Active ESX Servers and no standby, because failover uses the hardware you already run. The MS score would be 32 active and 32 standby with 32 LUNs. Well, that would give you a nice consolidation rate I guess, and really reduce the energy costs.
You can have more than 2 nodes in a cluster. So you'd build 2x8 node clusters with 7 active and 1 passive.
Incidentally you don't need a quorum per VM. It is possible to have multiple VMs per LUN, but the UI doesn't facilitate it and it's not a supported configuration.
@Depping you could also have 8 all-active nodes and achieve the same thing. I think we only go to 8, so you would have to have each one running at 7/8th capacity. VMware could run at 31/32, against our 28/32.
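The fractions quoted above are easy to check. What follows is a minimal sketch, assuming an N+1 design in which each cluster keeps one node's worth of capacity spare (either as a passive node or spread across all-active nodes). The function names are illustrative and do not come from any product.

```python
from fractions import Fraction


def usable_fraction(nodes_per_cluster: int, spare_nodes: int = 1) -> Fraction:
    """Share of a cluster's capacity that can carry load while still
    being able to absorb the loss of `spare_nodes` hosts."""
    if not 0 <= spare_nodes < nodes_per_cluster:
        raise ValueError("spare_nodes must be in [0, nodes_per_cluster)")
    return Fraction(nodes_per_cluster - spare_nodes, nodes_per_cluster)


def active_equivalents(total_hosts: int, nodes_per_cluster: int,
                       spare_nodes: int = 1) -> Fraction:
    """Number of hosts' worth of usable capacity across all clusters,
    assuming total_hosts is split into equal-sized clusters."""
    return usable_fraction(nodes_per_cluster, spare_nodes) * total_hosts


if __name__ == "__main__":
    # 32 hosts built as 8-node clusters versus one 32-node cluster.
    print(usable_fraction(8), active_equivalents(32, 8))
    print(usable_fraction(32), active_equivalents(32, 32))
```

Under these assumptions, four 8-node clusters give 28 hosts' worth of usable capacity out of 32, and a single 32-node cluster gives 31, which is the 28/32 versus 31/32 comparison in the comment.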
Thanks James for clarifying that. I wasn't aware of such a setup. Could you please provide a link to a setup guide or documentation on setting up an 8-server cluster for Hyper-V, such that 7 are active and one passive, with 1 Quorum and 1 Storage LUN? Thanks.
With Failover clustering we support up to 16 nodes on 64-bit, and you don't need to have a quorum disk: you can use something called node majority. In this configuration you get the votes from the nodes instead of having a dedicated quorum disk.
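To illustrate the voting idea, here is a simplified model of a node-majority quorum, assuming every node holds exactly one vote and the cluster stays up only while more than half of the votes are present. It is a sketch of the arithmetic, not the cluster service's actual algorithm, and the function names are made up for the example.

```python
def votes_needed(total_nodes: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, nodes_online: int) -> bool:
    """True while the online nodes still form a majority."""
    return nodes_online >= votes_needed(total_nodes)


def tolerated_failures(total_nodes: int) -> int:
    """How many nodes can be lost before the majority is gone."""
    return total_nodes - votes_needed(total_nodes)


if __name__ == "__main__":
    for n in (2, 3, 8, 16):
        print(n, votes_needed(n), tolerated_failures(n))
```

One consequence of the model is that an even number of nodes tolerates no more failures than the odd number just below it: a two-node cluster cannot lose either node and keep a majority.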
Thanks James for the additional info! Sounds a lot better than the first outlines that were sketched.
James, thank you for blogging--your posts are a must-read for me.
I'm a little confused about HA/migration and Microsoft vs. VMware. I get the impression that on the Microsoft side, HA/migration requires creating a relationship--e.g., in a two-node cluster, you create a VM node on one physical machine, and create the second VM node on another physical machine. I think for a variety of practical reasons, the hardware is going to be identical, and I don't really care what physical box carries the failover. In other words, if I understand the Microsoft approach correctly, this is seemingly not a very flexible way to do it.
One other point is that, in many cases, it would be nice to set up a cold standby node to save on licensing costs (SA gives you the right to have a cold standby server without paying for an additional license). Likewise, it doesn't matter much to me what server fires up the cold standby; whatever has the lightest load would be ideal. Is this an option?
Talking about failover, one thing I'd love to see from Microsoft (and I think VMware does not yet have this; Microsoft would also be in a much better position to offer it from a Windows perspective) is the ability to utilize VM technology to do live patching of servers. It seems not far-fetched to do something like this:
- identify the target running server
- snap-shot it
- make a copy of the VM image
- fire it up in a secondary off-network VM
- patch it, reboot
- test it
- copy the updated app data from the online server
- transfer the network connection from the unpatched server to the patched server
I think this would require some app-level support, but that could be vendor-certified. I would think Microsoft apps like Exchange, SQL, etc., could be supported fairly easily. Now wrap this whole sequence up so that SCVMM handles the low-level magic, and I think that would be a huge, huge hit....
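The sequence above can be sketched as an ordered pipeline that stops at the first failed step. This is only an illustration of the control flow: the step callables are hypothetical placeholders, and nothing here corresponds to a real SCVMM or Hyper-V interface.

```python
from typing import Callable, List, Tuple

# Each step is (name, action); the action returns True on success.
# The actions are hypothetical stand-ins for the real work.
Step = Tuple[str, Callable[[], bool]]


def run_patch_sequence(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run the steps in order and stop at the first one that fails.

    Returns (succeeded, names_of_completed_steps). Because the network
    cut-over is the last step, a failure anywhere earlier leaves the
    unpatched server in service.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    demo_steps: List[Step] = [
        ("snapshot target", lambda: True),
        ("copy VM image", lambda: True),
        ("start copy off-network", lambda: True),
        ("patch and reboot", lambda: True),
        ("test", lambda: False),  # simulate a failed test
        ("copy updated app data", lambda: True),
        ("transfer network connection", lambda: True),
    ]
    print(run_patch_sequence(demo_steps))
```

In the demo the simulated test failure stops the run after "patch and reboot", so the cut-over is never attempted.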
@rhelmer. Thanks for the compliment.
A clustered service (File Server, Exchange Virtual Server, or VM) can go to any node of the cluster, and you can build clusters of up to 16 nodes. The service has an order of precedence for which node it wants to go to, and each node knows enough about the service which is to be failed over to it that it can bring it on-line. The problem here is not inflexibility but having too many possible configurations.
You can't have a cold standby in a cluster, because it would appear to be off-line. So either you have one passive node in the cluster or you spread one node's worth of spare capacity around all the nodes.
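A rough sketch of the "order of precedence" idea, assuming each service carries an ordered list of nodes it prefers and fails over to the first one that is still online. This is a simplified model with made-up names, not the cluster service's real placement logic.

```python
from typing import Iterable, Optional, Sequence


def pick_failover_target(preference_order: Sequence[str],
                         online_nodes: Iterable[str],
                         failed_node: str) -> Optional[str]:
    """Return the first preferred node that is online and is not the
    node that just failed, or None if no such node exists."""
    candidates = set(online_nodes) - {failed_node}
    for node in preference_order:
        if node in candidates:
            return node
    return None


if __name__ == "__main__":
    order = ["node1", "node2", "node3"]
    # node1 has failed; node2 is next in the service's preference order.
    print(pick_failover_target(order, ["node2", "node3"], "node1"))
```

With 16 nodes and a separate preference list per service, the number of possible orderings grows very quickly, which is the "too many possible configurations" problem.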
Yes you can do the kind of "live patch" that you describe. Actually that's pretty much what you do when patching clustered Exchange or Clustered SQL. I'd say if an app was going to support that it could go the whole way and support clustering.
Thanks for the interesting questions :-)
Thank you for the clarification; I was getting worried that I was missing something after reading some of the other posts. My company is very small, so I was planning on starting with two nodes each using about half the physical memory (less overhead for the host) so I could do an active/active.
I was hoping to do it on Win 2008 from the start, but budget constraints and failing hardware made me go live on Win 2003 & Virtual Server 2005 R2. I'm using a SAS array for my shared storage and the only way I see to do it is one disk per VM and they each need a drive letter mapped, so I'm limited to ~22 VMs because of this. Am I missing something in my setup? Is there a different option in Win 2008 clustering?
@talking about overcommitment in VMware
I was thinking this was kinda one of the purposes behind Microsoft's "hot add" technology (is that cancelled or just pushed back?)... you could say "This server runs best with 2G of RAM and two processors, but it CAN run with 1G and one processor if it has to"... then when you're failing over to a loaded up server, less performance critical apps could have the resources of their VMs scaled back (then back up when their home node comes online again).
You say "We have Network Load Balancing - Office Communications Server is designed to leverage it" - just to be clear, using NLB with OCS is absolutely NOT supported - you need a hardware load balancer.
It's only a small point but I thought I should highlight it in case anyone's thinking of deploying it.
However, SharePoint does fully support Microsoft NLB.
@Chris. You don't have to assign a letter to the drives. I need to look up how you write the path without \\.\guid or some such thing; I'll make that another post :-)
You can't really reduce the resources of a running machine. We've got support for hot add CPUs and RAM in the OS, but hot remove means that RAM that's holding something vanishes. We had hot-add in the alphas of Hyper-V (I showed it) but that's been pushed back - I posted about the reasons.
@Phil, yes, my typing got ahead of my brain on that one. When I read your comment I thought "I didn't say *that*, did I?" and then "DOH".
Nice article, although I don't buy your overcommitment argument. For some reason I see it repeated in several Microsoft blogs :-).
I'll bet money that in the next version of Hyper-V we'll see something similar to VMware's VMotion, despite current claims that customers don't really need it.