David Bills, Chief Reliability Strategist, Microsoft

In a recent post, I shared a short list of my favorite books and articles related to reliability. Each one has influenced my thinking about how to create a high-performing IT organization, even though not all of these publications are IT-centric. In this post, I’m going to take a closer look at “Antifragile”, the 2012 book written by Nassim Nicholas Taleb, and describe why I think the concept of antifragility is particularly applicable to cloud computing.

Antifragile is the term meant to describe the exact opposite of fragile. It’s not the same as robust or resilient, two terms often conflated with the notion of antifragility, and two terms I’ve used to describe desirable attributes often associated with well-designed and well-managed online services. When customers say the cloud service they’re reliant upon is “robust” or “resilient”, we, as the IT professionals responsible for that service, can be justifiably proud of our efforts.

The term antifragile is meant to describe objects that actually benefit from experiencing some form of failure or stress. In the context of IT, we’d probably say systems or services. You might be familiar with the notion of self-healing IT systems, but in this context, I’m talking about something a little different – a truly antifragile system will not only heal; it will strengthen itself against future stresses of a similar nature.

The resilience engineering practices we promote at Microsoft are anchored in the fundamental premise that service failure is inevitable. If failure is a persistent state, then those practices must help online service engineering teams understand how to:

  1. evaluate their systems with an eye toward identifying failures that represent the greatest risk to the customer experience,
  2. devise the appropriate remediation mechanism, and
  3. regularly exercise those mechanisms to verify their effectiveness.

You can begin to see how the notion of antifragility would be extremely attractive to an IT professional interested in producing an online service that is not only capable of full recovery, but actually ends up in a better state than it was before the failure.

As IT professionals, our objective has historically been to design, build and operate systems to solve a particular computing problem. Often there are requirements describing specific operational characteristics meant to deliver a consistent experience to the customer (e.g. an availability metric, a latency metric or a transaction success rate metric), and it’s not uncommon for these requirements to be predicated on the assumption that the production environment is in a perpetually pristine state. Tell me, how many IT professionals have you encountered recently who claim to be managing an online service that’s operating in one of these so-called pristine environments?

As an industry, we need to acknowledge that the typical production environment most online services operate in is anything but pristine! More likely, it’s an extremely fluid environment, full of the so-called stressors Taleb references in his book, where these challenging conditions need to be dealt with in a highly automated manner.

That’s the primary point I’m emphasizing here: those stressors are, in fact, always present. So rather than focus exclusively on restoring the service, is there an opportunity to learn from the telemetry being collected and strengthen the behavior of the online service going forward, without the need for human intervention? Said differently, if the production environment is continually having to detect, contain and recover from a myriad of failures, doesn’t it make sense to adopt the principle of antifragility as a fundamental design goal when considering how to architect a contemporary online service?
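To make the distinction between recovering and strengthening a little more concrete, here is a minimal sketch in Python. It is purely illustrative and does not describe any Microsoft system: the `AdaptiveRouter` class, its `penalty` factor and the backend names are all hypothetical. The idea it shows is that a failure not only triggers a reroute, it also changes how the component behaves the next time.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveRouter:
    """Hypothetical router that shifts traffic away from backends that fail."""

    backends: list[str]
    penalty: float = 0.5  # assumed factor applied to a backend's weight per failure
    weights: dict[str, float] = field(default_factory=dict, init=False)

    def __post_init__(self) -> None:
        # Every backend starts out fully trusted.
        self.weights = {name: 1.0 for name in self.backends}

    def record_failure(self, backend: str) -> None:
        # Recovery alone would simply retry; here the failure also
        # lowers the backend's weight, which changes future routing.
        self.weights[backend] *= self.penalty

    def record_success(self, backend: str) -> None:
        # Restore trust gradually, capped at the starting weight.
        self.weights[backend] = min(1.0, self.weights[backend] * 1.1)

    def pick(self) -> str:
        # Deterministic choice: the highest weight wins, and ties go
        # to the backend listed first.
        return max(self.backends, key=lambda name: self.weights[name])


router = AdaptiveRouter(["primary", "secondary"])
router.record_failure("primary")  # telemetry reports a failure
chosen = router.pick()            # later requests prefer "secondary"
```

A real service would draw on far richer telemetry than a single weight per backend, but the shape is the same: the signal produced by a stressor feeds back into the system’s future behavior without a human in the loop.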

I’m convinced this evolution in thinking represents a significant opportunity for IT professionals to advance the state of the art, helping online services move beyond the traditional boundaries of resilience or robustness and begin to exhibit characteristics more closely associated with the principle of antifragility.

I welcome your thoughts about how your high-performance IT organization is pursuing this new and exciting design goal.