Two notable advantages of running applications in Windows Azure are high availability and fault tolerance, achieved through the so-called fault domain and upgrade domain. These two terms represent important strategies adopted by Windows Azure for deploying and upgrading applications. Note that in this post and in all my articles, Windows Azure and Fabric Controller (FC) are used interchangeably when discussing Windows Azure applications, both representing the cloud OS in the Windows Azure Platform, unless otherwise stated. In the context of cloud computing, an application and a service are considered the same, since user applications are generally delivered as services. Free training on Microsoft technologies, including Windows Azure, is also available at http://aka.ms/va.
Fault Domain
A fault domain is the scope of a physical unit failure; in essence, it is a single point of failure. The purpose of identifying and organizing fault domains is to keep a single point of failure from taking down a whole service. In its simplest form, a computer connected to a power outlet is a fault domain: if the connection between the computer and its outlet is cut, the computer is down. Likewise, a rack of computers in a datacenter can be a fault domain, since a power outage at the rack takes out all the hardware in it, similar to what is shown in the picture here. Note that how a fault domain is formed has much to do with how the hardware is arranged, and a single computer or a rack of computers is not automatically a fault domain. In Windows Azure, however, a rack of computers is indeed identified as a fault domain. The allocation of fault domains is determined by Windows Azure at deployment time. A service owner cannot control this allocation, but can programmatically find out which fault domain an instance of the service is running in.
Specifically, the Windows Azure Compute SLA guarantees the level of connectivity uptime for a deployed service only if the service owner specifies at least two instances of each role of the service in the service configuration, i.e. the cscfg file. Under this assumption, Windows Azure by default deploys the role instances of an application into at least two fault domains, which ensures fault tolerance and allows the application to remain available even if a server hosting one of its role instances fails.
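To make the two-instance requirement concrete, here is a minimal Python sketch of why spreading instances over two fault domains keeps a role available. It is a toy model, not Azure code: the functions assign_fault_domains and survivors are hypothetical, the round-robin placement is an assumption (the Fabric Controller's actual placement logic is not described in this post), and the instance names are only illustrative.

```python
def assign_fault_domains(instance_count, fault_domain_count=2, role="WebRole"):
    """Place role instances round-robin across fault domains (assumed policy).

    Returns a dict mapping an illustrative instance name to a fault domain index.
    """
    return {
        f"{role}_IN_{i}": i % fault_domain_count
        for i in range(instance_count)
    }


def survivors(placement, failed_domain):
    """Return the instances still running after one fault domain fails."""
    return sorted(name for name, fd in placement.items() if fd != failed_domain)


# Two instances land in two different fault domains, so losing one
# domain (for example, a rack power outage) leaves one instance serving.
two = assign_fault_domains(instance_count=2)
print(two)
print("after fault domain 0 fails:", survivors(two, failed_domain=0))

# A single instance has nowhere else to run, so the role goes down.
one = assign_fault_domains(instance_count=1)
print("single instance, after fault domain 0 fails:", survivors(one, failed_domain=0))
```

With one instance the list of survivors is empty, which is the situation the SLA condition excludes.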
Upgrade Domain
An upgrade domain, on the other hand, is a strategy to ensure that an application stays up and running, i.e. highly available, while it is being updated. Windows Azure distributes the role instances of an application, evenly when possible, into multiple upgrade domains, with each upgrade domain serving as a logical unit of the application’s deployment. An upgrade is then carried out one upgrade domain at a time. The steps are: stop the instances of the intended role running in the first upgrade domain, upgrade the application, bring the role instances back online, and repeat these steps in the next upgrade domain. The upgrade is complete when all upgrade domains have been processed. By stopping only the instances running within one upgrade domain at a time, Windows Azure ensures that an upgrade takes place with the least possible impact on the running service. A service owner can optionally control how many upgrade domains are used with the upgradeDomainCount attribute in the service definition, i.e. the csdef file of an application. The attribute is documented on MSDN as part of the service definition schema. It is not possible, however, to specify which role instance is allocated to which upgrade domain.
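The upgrade steps described above can be illustrated with a small Python simulation. This is a sketch under stated assumptions rather than how the Fabric Controller is implemented: the functions assign_upgrade_domains and rolling_upgrade are hypothetical, the round-robin assignment stands in for "evenly when possible", and the version labels "v1" and "v2" are made up for the example.

```python
def assign_upgrade_domains(instance_count, upgrade_domain_count):
    """Spread instances evenly across upgrade domains (assumed round-robin)."""
    return [i % upgrade_domain_count for i in range(instance_count)]


def rolling_upgrade(instance_count, upgrade_domain_count, new_version="v2"):
    """Upgrade one upgrade domain at a time.

    Returns the final version of every instance and the smallest number of
    instances that were online at any point during the upgrade.
    """
    domains = assign_upgrade_domains(instance_count, upgrade_domain_count)
    versions = ["v1"] * instance_count
    min_online = instance_count

    for ud in range(upgrade_domain_count):
        # Stop only the instances that belong to the current upgrade domain.
        offline = [i for i, d in enumerate(domains) if d == ud]
        min_online = min(min_online, instance_count - len(offline))

        # Upgrade the stopped instances, then bring them back online.
        for i in offline:
            versions[i] = new_version

    return versions, min_online


versions, min_online = rolling_upgrade(instance_count=6, upgrade_domain_count=3)
print(versions)     # every instance ends up on the new version
print(min_online)   # 4 of the 6 instances stay online throughout
```

In this model, more upgrade domains mean fewer instances offline at each step but more steps to finish the upgrade. A role with a single instance still drops to zero online instances while its domain is processed.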
Observations
Within a single fault domain, there is no concept of fault tolerance. Fault tolerance applies only when more than one fault domain is managed as a whole. In addition to fault domains and upgrade domains, Windows Azure has network redundancy built into its routers, switches, and load balancers to ensure fault tolerance and high availability. FC also sets checkpoints and stores state data across fault domains to ensure reliability and recoverability.