By David Bills, chief reliability strategist, Microsoft

When I speak with customers, they often ask how they can successfully change the culture of their IT organization when deciding to implement a resilience engineering practice. Over the past decade I’ve collected a number of books and articles that I have found helpful in this regard, and I often recommend them to customers. I’ve included my favorites below, in no particular order, with a short explanation of why I’m recommending each.

1.    The Visible Ops Handbook by Kevin Behr, Gene Kim and George Spafford
One of the first steps to take before you implement a resilience engineering practice in your IT organization is to identify the current sources of reliability-related pain. For example, where are the pleas for help emanating from inside the IT organization? Do you have any suspicions about why those pleas are being voiced? Are there signs of general malaise? Is it confined to Engineering? Operations? Customer Support? IT leadership? … Even your customers? Once you have identified the pain, this book provides practical advice on how to implement a continual improvement process inspired by the Information Technology Infrastructure Library (ITIL).

2.    Eliminating the Mean Time to Innocence by Jim Metzler and Steve Taylor
One of the key influences on the success of any organizational change management program is the existing culture. Are people generally collaborative when triaging incidents, or is there a ‘…better prove it’s not my fault!’ mentality? Jim Metzler coined the phrase “mean time to [declare] innocence” a number of years ago in this article he wrote for Network World, and I believe this is the number one undesirable behavior that any IT organization undertaking a resilience improvement effort needs to eradicate.

3.    Six Thinking Hats by Edward de Bono
I have found Edward de Bono’s Six Thinking Hats to be an extremely powerful methodology for ferreting out the trouble spots in your organization’s collaborative problem-solving. Be diligent about wearing all six “hats”, and wear them in the order they’re meant to be worn. The approach de Bono has devised compels a team to examine a problem from multiple perspectives, ranging from exploring what’s at risk, to what benefits could be had by solving the problem, to simply making sure everyone is equipped with all of the facts.

4.    The Goal by Dr. Eliyahu M. Goldratt or The Phoenix Project by Kevin Behr, Gene Kim and George Spafford
I’ve had the privilege of spending time with both Gene Kim and Kevin Behr during my career, and I regard their books as essential reading for any IT professional. The Goal was my introduction to the theory of constraints, while The Phoenix Project “made it real” for me because of the IT-centric nature of the plot. The concepts Goldratt wove into the gripping narrative of The Goal have had a profound influence on me, as well as on countless other leaders facing tight deadlines, demand that seems to perpetually exceed supply, and an absolute need to deliver high-quality products (or services) on a continuous basis. The trio responsible for writing The Phoenix Project built on those concepts in an IT-specific context, which made it extremely relevant for me. If manufacturing plays an important role in your professional life, pick up The Goal. If you’re leading an IT organization, or are an IT professional interested in improving the performance of your organization, pick up The Phoenix Project.

5.    Managing the Unexpected by Karl Weick and Kathleen Sutcliffe
In this book, Weick and Sutcliffe discuss the five basic concepts of high reliability organizations: 1) Preoccupation with failure, 2) Reluctance to simplify interpretations, 3) Sensitivity to operations, 4) Commitment to resilience, and 5) Deference to expertise. They provide additional context around each concept and I believe anyone who reads this book will come away with a better understanding of how to apply High Reliability Organization (HRO) principles when designing, developing, testing, deploying and operating complex systems based on contemporary cloud computing platforms like Microsoft Azure.

6.    The Black Swan and Antifragile by Nassim Nicholas Taleb
These are two books that every services engineering lead should read. Why? The concepts Taleb discusses in both books are directly applicable to how a highly functioning engineering organization could, or should, systematically approach the software development lifecycle when it is striving to build a resilient online service. I would also argue that much of what Taleb advocates applies to operational practices as well. To be clear, I am not conflating the notion of resilience with the notion of “antifragile”. While a resilient online service is often the best outcome an IT organization can hope for, the advent of “big data” and increasingly sophisticated cloud computing platforms means the future may very well include “antifragile” computing environments, rather than merely resilient ones.

I hope you find some of these resources useful. I also encourage you to find out more about Microsoft’s continued focus on reliability by visiting the Trustworthy Computing website.