The nature of failure is changing, the way our systems behave (or misbehave) as a whole is changing... In order to rise to these challenges successfully, it becomes necessary to ... change the way we build and operate software, [and to change how and what we monitor]
Opting in to the model of embracing failure entails designing our services to behave gracefully in the face of failure. In other words, this means turning hard, explicit failure modes into partial, implicit and soft failure modes. Failure modes that can be papered over with graceful degradation mechanisms like retries, timeouts, circuit breaking and rate limiting. Failure modes that can be tolerated owing to relaxed consistency guarantees with mechanisms like eventual consistency or aggressive multi-tiered caching. Failure modes that can even be triggered deliberately with load shedding in the event of increased load that has the potential to take down our service entirely, so that the service keeps operating in a degraded state rather than failing outright.
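The "retries" mechanism mentioned above can be sketched concretely. Here is a minimal retry wrapper with exponential backoff and jitter in Python; the function name and parameters are illustrative, not from the article:

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay=0.1):
    """Retry a flaky call with exponential backoff and jitter.

    This turns a hard, explicit failure (a single exception) into a
    soft one: the caller only sees an error once every attempt has
    been exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff with jitter, to avoid a thundering
            # herd of synchronized retries against a struggling service.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

The same wrapping pattern generalizes: a timeout bounds each attempt, and a circuit breaker refuses to call `fn` at all once recent failures cross a threshold.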
But all of this comes at the cost of increased overall complexity, and the buyer’s remorse, often acutely felt, is the loss of the ability to easily reason about systems.
Which brings me to the second characteristic of “monitoring” — it’s human centric. The reason we chose to “monitor” something was because we knew or suspected something could go wrong, and that when it did go wrong there were consequences. Real consequences. High severity consequences that needed to be remedied as soon as possible. Consequences that needed human intervention.
Now I’m not someone who believes that automating everything is a panacea, but the advent of platforms like Kubernetes means that several of the problems that the human- and failure-centric monitoring tools of yore helped “monitor” are already solved. Health-checking, load balancing, taking failed services out of rotation and so forth are features these platforms provide for free. That’s their primary value proposition.
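The health-checking the article refers to is declared rather than built: Kubernetes restarts a container when its liveness probe fails and removes it from Service load balancing while its readiness probe fails. A minimal, illustrative pod-spec fragment (the paths, port, and image are placeholders):

```yaml
containers:
  - name: web
    image: example/web:1.0      # placeholder image
    ports:
      - containerPort: 8080
    livenessProbe:              # failing -> container is restarted
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:             # failing -> taken out of rotation
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```

No pager goes off for either event; the platform handles the hard failure mode on its own.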
With more of the traditional monitoring responsibilities being automated away, “monitoring” has become — or will soon be — less human centric. While none of these platforms will truly make a service impregnable to failure, if used correctly, they can help reduce the number of hard failures, leaving us as engineers to contend with the subtle, nebulous, unpredictable behaviors our system can exhibit. The sort of failures that are far less catastrophic but ever more numerous than before.
....
There’s a paradox. Minimizing the number of “hard, predictable” failure modes doesn’t in any way mean that the system itself as a whole is any simpler. In other words, even as infrastructure management becomes more automated and requires less human elbow grease, application lifecycle management is becoming harder.
There's a lot more in this article by Cindy Sridharan (@copyconstruct). Here are some points that I want to expand upon:
• Embrace failure. There's too much complexity already to prevent failure. This is actually an old idea: how do you build systems that fail safely?
• Judicious use of automation, and thoughtful design for failure empower the humans in the system to actively monitor fewer individual things and to explore and investigate the operational behavior of systems of increasing complexity.
• The future is not evenly distributed. Ward's recent work at New Relic is a glimpse into my future:
Here we collect various mentions of the work we've done observing software through the metadata produced throughout its creation and operation.