Failure Is Not An Option (It’s a Core Feature!)
This blog post is based on a presentation I gave at DevOpsDays Newcastle, 25th October 2018. As I prepared the talk, I realised that many of the things I wanted to say I had already printed on T-shirts. That led to what I believe was the first DevOpsDays talk by T-shirt. I’ll link to a video of the presentation when it is available. This blog post is an abridged version of the talk, but does contain many of the T-shirts!
Many DevOps best practices are counter-intuitive in some way. It’s my belief that failure is actually a feature. That is, it’s not enough just to recover well from failure if it happens, but rather “failure” is necessary to succeed, and that (for some definition of failure) we want it to happen.
Having said that, no-one gets up in the morning and comes to work to do a bad job. No-one actually wants to fail. So why am I claiming failure is a feature?
The First Way: Flow
As you probably know, the DevOps movement was born out of a need to bridge the gap that often exists between development and operations. This gap exists because of what Goldratt calls the core chronic conflict. The very reason developers exist is to bring about change. Whether the changes deliver new features or fix old ones, every piece of development work is intended to change something. Operations, on the other hand, is tasked with keeping everything up, running, stable, and reliable. Change is the enemy of stability.
So the First Way of DevOps is to create a fast flow of work, removing any impediments, not only from development into operations, but all the way from business concept to being used by customers.
Fail safely–Version Control
One of the tools we use to create a fast flow of work is version control. Having code checked into a repository lets us work from a common base, make changes, try them out, and revert them if they fail. You can think of version control as an enabler of failing safely: it allows you both to fix forward and to roll back. If all else fails, reach for the version control ejector seat.
Fail intentionally–Move Faster
Perhaps the most identifiable characteristic of DevOps, however, is the build pipeline, often depicted as an infinite cycle of plan, code, build, test, release, deploy, operate, monitor, repeat. I want to suggest that the most important part of creating flow is not the automation of these processes, but the automation of stopping incomplete or incorrect work from reaching the next stage.
One of the measures recorded in a lean value stream mapping is the percentage of complete and accurate work received from the upstream work centre. Any work received that requires more information, or needs to be reworked, slows down flow.
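The percent complete and accurate figure matters so much because it compounds: a rough rolled figure for the whole stream is the product of each work centre’s percentage. A minimal sketch in Python, with invented step names and percentages:

```python
# Sketch: the rolled percent-complete-and-accurate for a value stream
# is roughly the product of each work centre's figure.
# The steps and percentages below are invented for illustration.
def rolled_percent_ca(step_percentages):
    """Multiply per-step complete-and-accurate fractions (e.g. 0.90)."""
    result = 1.0
    for p in step_percentages:
        result *= p
    return result

# Even modest rework rates compound quickly across a pipeline:
steps = [0.90, 0.85, 0.95]   # design, build, test (hypothetical)
print(f"Rolled %C/A: {rolled_percent_ca(steps):.0%}")  # roughly 73%
```

Three work centres that each look respectable in isolation still mean that barely seven in ten pieces of work arrive at the end without rework.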
Therefore the most important part of any build pipeline is the ability to break the build. Many organisations have some kind of IoT indicator hooked up to the build pipeline: flashing lights, traffic lights, lava lamps, audible klaxons, even USB-controlled foam rocket launchers. A broken build should be a signal for celebration, or at least for learning. By deliberately building in automated tests, static code analysis, compliance checks, and vulnerability scans that can break the build, we prevent defective work from flowing downstream, where it would be much more costly to remediate.
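Conceptually, the gate doesn’t need to be elaborate: it is just a sequence of checks where the first failure stops the line, and nothing after it runs. A toy sketch (the check names and their pass/fail outcomes are invented):

```python
# Minimal sketch of a pipeline "gate": every check must pass, or the
# build breaks and nothing flows downstream. Check names are invented.
def run_pipeline(checks):
    """Run (name, fn) checks in order; return (passed, first_failure)."""
    for name, check in checks:
        if not check():
            return False, name   # break the build: stop the line here
    return True, None

checks = [
    ("unit tests",         lambda: True),
    ("static analysis",    lambda: True),
    ("vulnerability scan", lambda: False),  # simulated failure
    ("deploy to test",     lambda: True),   # never reached
]
ok, failed_at = run_pipeline(checks)
print("build broken at:", failed_at)  # -> vulnerability scan
```

The important property is the early exit: defective work never reaches the deploy step at all.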
Fail cheaper–Infrastructure as code
Another way we create flow is to define our infrastructure as code. Gone are the days when a project manager had to run a procurement process to purchase servers before development could begin, then wait for order fulfilment, delivery, racking, networking, commissioning, hardening, and handover to the development team.
More importantly, we make it very easy to test everything in a production-identical environment. In the past, development and test environments were often cut-down, less expensive caricatures of the real thing. If the first time your software runs in a production-like environment is in actual production, then you’re really running your production in your test environment. We want to bring that kind of testing upstream, to a test environment that is production identical, so we can surface environment-dependent failures earlier.
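A nice side effect of declaring environments as code is that “production identical” becomes a testable property rather than a hope: you can simply diff the declarations. A toy sketch, with invented setting names and values:

```python
# Sketch: when an environment is declared as data, drift from production
# can be detected by diffing declarations. Settings here are invented.
def environment_drift(reference, candidate):
    """Return settings where the candidate differs from the reference env."""
    keys = set(reference) | set(candidate)
    return {k: (reference.get(k), candidate.get(k))
            for k in keys if reference.get(k) != candidate.get(k)}

production = {"os": "ubuntu-22.04", "db": "postgres-15", "tls": "1.3"}
test_env   = {"os": "ubuntu-22.04", "db": "postgres-13", "tls": "1.3"}

# A non-empty result means the test environment is NOT production identical.
print(environment_drift(production, test_env))
```

Real tooling (Terraform plans, configuration management reports) does this at far greater depth, but the principle is the same: drift is data, so drift can break the build.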
In other words – we want to fail earlier.
The Second Way: Feedback
The Second Way of DevOps is feedback, or more expansively, creating and amplifying feedback loops. A common practice in DevOps organisations is to put developers on call alongside operations. At Amazon this is famously known as “you build it, you run it”. Google does it slightly differently: all services are initially supported by their developers, and a service may be handed over to SRE support once it has proven reliable and passed a readiness review. There is also a hand-back process, should the product become unstable as a result of change.
By bringing developers closer to the failure signal, they gain a better sense of what it takes to actually create business value, and write robust software that stands up to operational pressures. This creates a greater sense of succeeding together as a business, and helps balance the tendency of product managers and IT budget owners to always prioritise new features over paying down technical debt. The net effect is that errors get fixed faster.
As one Google CRE blogger put it, “When one team develops an application and another team bears the brunt of the operational work for it, moral hazard thrives”.
Fail smarter–instrument everything
To make the most of any feedback signal (and failures are strong feedback signals), we want it to be information rich. We need to define metrics at every layer.
- business metrics – conversion rates, abandoned carts, recommendation engine success rates, time from sign up to first transaction
- application metrics – execution paths, session durations, database query times
- infrastructure metrics – CPU, RAM, disk I/O, health check response times
- deployment metrics – blue green releases, canary deployments, A/B testing events
We also need to invest in telemetry, not only to generate metrics but to correlate and overlay the events: a 30% decline in conversion rate occurred when database response times showed greater-than-normal latency, which happened after disk I/O went through the roof, which happened around the same time the patch bot deployed OS-level patches.
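The essence of that overlay is putting events from every layer on one timeline, so that when a business metric dips you can see what else happened in the window before it. A toy sketch (every event, timestamp, and layer name below is invented):

```python
# Sketch: overlay events from different layers on one timeline, so a
# business-metric anomaly can be traced backwards. Events are invented.
from datetime import datetime, timedelta

events = [
    (datetime(2018, 10, 25, 2, 0),  "deploy",   "patch bot applied OS patches"),
    (datetime(2018, 10, 25, 2, 5),  "infra",    "disk I/O saturated"),
    (datetime(2018, 10, 25, 2, 10), "app",      "db query latency above normal"),
    (datetime(2018, 10, 25, 2, 20), "business", "conversion rate down 30%"),
]

def events_before(anomaly_time, window, events):
    """Events from any layer within `window` leading up to the anomaly."""
    return [(layer, msg) for t, layer, msg in events
            if anomaly_time - window <= t <= anomaly_time]

for layer, msg in events_before(datetime(2018, 10, 25, 2, 20),
                                timedelta(minutes=30), events):
    print(f"{layer:>8}: {msg}")
```

Real telemetry platforms do this with far richer data, but the mental model is just this: one clock, many layers.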
Fail earlier–Shift left
Let’s revisit our CI/CD pipeline and talk a little about our automated test suite. We’ve already talked about the value of breaking the build to prevent bad work from proceeding downstream. I’m sure many of you are familiar with the test pyramid popularised by Martin Fowler: we should aim to have the majority of our tests at the base, where they are cheap to run, easy to automate, and provide fast feedback, and fewer tests towards the top, where they are costly, slow, and difficult or impossible to automate.
One thing we should strive to do as we amplify feedback signals is to move our failures back upstream. About a year ago I was working on changes to an application router used by almost all of my client’s web applications. All the changes seemed to be working as designed in my test environment (works on my machine, right?), and all the unit tests were passing, so the change was automatically deployed to the test account, where it proceeded to break everyone else’s applications, failing the most important integration test of all. I then spent a lot of time discovering why it was failing and writing a unit test that would catch that failure, before writing the code that made the test pass. In effect, I brought what could have been written as an integration test back upstream to a unit test. You’ll often hear the phrase “shift left” for moving things upstream in the flow of work, but in this case you could also call it “shift down” (the testing pyramid).
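As a sketch of that “shift down” pattern: suppose the integration failure turned out to be a route-matching bug. Once understood, it can be pinned as a fast unit test that runs on every build. (The router logic, routes, and bug below are invented for illustration, not my client’s actual code.)

```python
# Sketch: capturing a failure first seen in integration as a cheap unit
# test. The router, routes, and the bug it guards against are invented.
def route(path, routes):
    """Return the app registered for the longest matching route prefix."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no route for {path!r}")
    return routes[max(matches, key=len)]   # longest prefix wins

def test_longest_prefix_wins():
    # The downstream failure: '/shop/admin/...' requests were being sent
    # to the '/shop' app instead of the more specific '/shop/admin' app.
    routes = {"/shop": "shop-app", "/shop/admin": "admin-app"}
    assert route("/shop/admin/users", routes) == "admin-app"

test_longest_prefix_wins()
print("unit test passes")
```

The test now fails in seconds on a developer’s machine, instead of breaking everyone else’s applications in a shared test account.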
The Third Way: Continual Experimentation and Learning
Now that we have established flow and feedback, we are in a position to start to iterate faster, perform experiments, and continue to get better at what we do. This is the Third Way of DevOps. As Andrew Shafer put it so well, “You’re either a learning organisation or you’re losing to someone who is.”
Fail frequently–Out experiment the competition
You may have heard the phrase “out-experiment the competition”. Whether you realise it or not, out-experimenting the competition means failing more often than they do. It means having an overall cadence and tempo that allows us to learn faster. Learning what doesn’t work is just as important as, if not more important than, learning what does.
Ronny Kohavi at Microsoft found, using rigorously designed A/B experiments, that about a third of proposed new features made no measurable improvement to their target metrics, and another third actually had a negative impact, making the product worse. By running these kinds of experiments on quick prototypes before full investment, you avoid wasting money on development that does not make things better. For the third of features that made things worse, it would be a better investment to send your developers on leave than to pay them to make your product worse.
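Rigour matters here: you need enough traffic to tell a real change from noise. One common approach is a two-proportion z-test on conversion counts; a minimal sketch with invented sample sizes and conversion numbers:

```python
# Sketch: two-proportion z-test for an A/B experiment on conversion rate.
# The traffic and conversion figures below are invented for illustration.
from math import sqrt

def z_score(conv_a, n_a, conv_b, n_b):
    """z statistic comparing conversion rate of variant B against control A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control converts 5.0% of 10,000 visitors; the new feature 4.3%.
# |z| > 1.96 is significant at the 5% level -- here the new feature
# looks like one of the "makes it worse" third.
z = z_score(500, 10000, 430, 10000)
print(f"z = {z:.2f}")  # about -2.35
```

With smaller samples the same observed difference would disappear into the noise, which is why cheap, high-volume prototype experiments pay for themselves.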
Fail chaotically–Survive because you’re used to it
No discussion of failure in DevOps would be complete without covering these guys: the Simian Army. While many people know of their existence and what they’re all about, let me briefly re-introduce you to Chaos Monkey and friends, in the same way the world first heard about them.
On 21st April 2011, the AWS US-East region effectively lost an entire Availability Zone. Many large, high-profile customers, including Reddit and Quora, experienced significant outages. Netflix was seemingly unaffected, and some speculated that they were somehow given special treatment by AWS because they were such a valuable customer. That’s when the world learned about Chaos Monkey.
In 2009, when Netflix embarked upon a “cloud native” transformation, they wanted their developers to get used to the idea that the underlying infrastructure in the cloud could fail at any time, and that they needed to design their application architectures to be resilient in the face of such failure. So they taught them about designing for failure, showed them the patterns to use, gave them all pagers, and let Chaos Monkey loose in production. Chaos Monkey is, essentially, a service that randomly kills other services. Netflix survived the 2011 AWS outage because, for two years, their servers had regularly been killed without warning in production.
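The idea is simple enough to sketch in a few lines. This toy version (instance names invented; the real Chaos Monkey terminates actual cloud instances) shows the property being trained for: killing a random instance should not take the service down.

```python
# Toy sketch of the Chaos Monkey idea: randomly terminate an instance,
# then check the service still answers. Instance names are invented.
import random

class Cluster:
    def __init__(self, instances):
        self.healthy = set(instances)

    def kill_random_instance(self, rng):
        """What Chaos Monkey does: pick a victim at random and kill it."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def serve(self):
        """Resilient design: any surviving instance can take the request."""
        if not self.healthy:
            raise RuntimeError("outage: no instances left")
        return f"served by {sorted(self.healthy)[0]}"

cluster = Cluster(["web-1", "web-2", "web-3"])
print("killed:", cluster.kill_random_instance(random.Random(42)))
print(cluster.serve())  # still up -- redundancy absorbed the failure
```

An architecture that only has `web-1` fails this drill immediately, which is exactly the feedback Netflix wanted their developers to get every day, not once per regional outage.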
It may be surprising to learn this can also be achieved at the persistence layer. You may wish to check out how Netflix survived the great Amazon reboot of September 2014. In short: 218 Cassandra nodes were rebooted by Amazon, and the 22 that didn’t come back were detected by Netflix’s automation and remediated. Netflix experienced zero downtime.
As John Allspaw beautifully put it, “Incidents are unplanned investments in your company’s survival”. You have no control over when that investment occurs or how large it will be. The only thing you can control is how you maximise the ROI.
Blameless post-mortems are a way of conducting Post Incident Reviews (PIRs) so that people feel safe to relate the events that actually happened without fear of punishment. John Allspaw’s work at Etsy brought them to many people’s attention, though I have always considered them to be the logical descendants of Gerry Weinberg’s “Egoless programming” first described in his 1971 book The Psychology of Computer Programming.
Psychological safety is the belief that no one will be punished or humiliated for speaking up with ideas, questions, concerns or mistakes. This was far and away the most important dynamic that set successful teams apart at Google.
If you want to maximise the corporate learning from your unplanned investment, you need people to be able to honestly relay the events that happened, which requires a confidence that they won’t lose their job or suffer some other punitive outcome. As you reconstruct what happened it is important to not, with the benefit of hindsight, describe what should have been done. Things that are obvious once you know the conclusion, are not at all apparent when the outcomes are still unknown.
Remember that people don’t come to work to do a bad job, and the decisions they made as events unfolded were the best decisions given the available information on hand. Any negative outcome or result is not a reflection of their character.
Make sure all your PIRs are targeted at learning, and not just at fulfilling an obligation to hold them. Do you only focus on why or how things went wrong? Do you ever ask what you can learn from why the incident wasn’t worse than it was?
Moreover, don’t only focus on getting action items. Yes, we want to figure out what counter measures we can deploy to avoid this in the future–we are preparing for a future where we are just as stupid as we are today. But learning is not just action items being implemented–it’s also about individuals learning from others, and developing a better picture and mental model of how our systems work, and how they fail. Don’t be tempted to rush through the stuff that “everybody already knows”, or you will shortchange yourself and others of valuable learning opportunities.
Fail freely, but with accountability
I can imagine some of the managers I have worked for, over my career, struggling with the idea of making it safe for people to fail without fear of blame or sanction. “Senior management, and the executives hold me accountable for results. How am I supposed to get people to perform if I accept failure without consequence?”
This is a false dichotomy, and not borne out by the research. Psychological safety is not the opposite of accountability. In professions where accountability to meet demanding goals has life-and-death consequences, practitioners don’t want to find a safe haven from accountability. In many cases the responsibility that goes with the job is what attracted them to the profession in the first place, but that does not mean they won’t benefit from being able to share their thoughts and experiences in a psychologically safe manner.
Rather, psychological safety and accountability to meet demanding goals are two different dimensions. Amy Edmondson divides it like this.
- Where psychological safety is low, and accountability is low, you’re in the Apathy Zone. This is the realm of low performing bureaucracies, where people spend more time jockeying for position than achieving outcomes.
- If you have a high level of psychological safety, but low accountability, you’re in the Comfort Zone. Everyone is happy coming to work, but not much is getting done. This might be a small family business which makes a living for everyone, but no-one is being stretched, and they are not going to grow an empire.
- Of much greater concern is the quadrant with high accountability but low psychological safety. This is the Anxiety Zone, where people fear termination, and won’t try new things or ask for help. If you want to out-innovate the competition, this zone is death.
- Where both psychological safety and accountability are high, you’re in the Learning Zone. Obviously this is where we would all rather be: safe to speak up about concerns, mistakes, ideas, or failures, yet all committed to collaboration and learning in the service of high-performance outcomes.
I hope in reading this, you may have begun to see that while failure in itself may not usually be considered good, DevOps practices can take it captive, make it submit to the higher purposes of an organisation, and allow you to treat it like a feature. In DevOps we want and need failure–we benefit from it tremendously. Learn to see it as positive and it will transform the way you think about enabling your organisation.
Here are the Failure Is Not An Option (It’s a Core Feature!) presentation slides. Wishing you every success in your future use of failure.
Looking for a new career? Get in touch with Gladis Carvana.