The real challenge for resilience engineering is to tell management "you cannot get away with using a few metrics to assess this stuff". The appeal of metrics is that they give management something visible—so much of engineering work is simply not visible to management. How, then, do you justify the expensive work of deliberate reflection?
This page focuses on one of many great points Lorin Hochstein shares in a couple of recent podcasts:
- StaffEng: Interview with Lorin Hochstein (podcast)
- Software Misadventures: Interview with Lorin Hochstein (podcast)
The thing with resilience engineering is that you have to be able to justify this even when you can't show a metric for it.
A huge factor in whether this succeeds or fails is whether you have support from your immediate management culture. You can change your team (e.g. persuade your chain of command) or you can change your team (e.g. move to a different team or company).
I was lucky. I convinced my manager that this was important. And my skip-level manager was also into it.
I can't give you a metric of the incidents that didn't happen. That's the metric you really want, but you can't have it. So you kinda have to infect management.
More interesting than metrics is all the stuff you cannot see through metrics. How do leaders get signals about risks that are invisible to metrics?
If you can show leadership the insights drawn from incidents, that's how you justify the work.
Metrics are throwing away a lot of stuff.
The stuff we cannot see is where we're gonna get hurt.
What kills me is all the time we spend trying to decide which minute the incident started or ended. Or deciding "do we categorize this as a config change 'cos we turned on debug logs, or capacity planning 'cos we ran out of disk space?" All of that time is waste and opportunity cost.
The qualitative analysis has never felt like time wasted. I've learned something surprising and valuable every time I've interviewed someone about an incident.