Elisa Binette and Beth Adele Long at Continuous Lifecycle London 2019. Fighting Fires for Fun and Profit: How to be a great Incident Commander. We'll explore the basics of incident response and coordination, how to be effective as an incident commander (IC), and how organizations can cultivate a strong pool of ICs so that both engineers and customers are happier.
YOUTUBE pFTohdgeG1s
Elisa Binette and Beth Adele Long at Continuous Lifecycle London 2019: How to be a great Incident Commander.
The sharp end in software is the people who carry the pager and deal with incidents, akin to firefighters.
New Relic's incident response process was born after growth had outstripped existing processes. Systems were metaphorically on fire a lot. One of the lead SREs had a previous career as a firefighter and brought insights from the Incident Command System (ICS) to help design the new approach.
ICS: Defined severities to determine how many resources to devote. Three clear roles: Tech Lead, who troubleshoots; Communications Lead, the liaison to the organization and customers; and Incident Commander, who coordinates the response and is the focus of this talk.
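The severity-to-staffing mapping can be sketched as a small data model. This is illustrative only: the severity names, thresholds, and role assignments below are assumptions, not New Relic's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Illustrative severity levels; real definitions vary by organization.
    SEV1 = 1  # major customer-facing outage: full ICS response
    SEV2 = 2  # degraded service: IC plus tech lead
    SEV3 = 3  # minor impact: a single responder may suffice


@dataclass
class IncidentRoles:
    incident_commander: str    # coordinates the response
    tech_lead: str             # troubleshoots
    communications_lead: str   # liaison to the org and customers


def staffing_for(severity: Severity) -> list:
    """Map a severity to the roles that must be filled (illustrative)."""
    if severity is Severity.SEV1:
        return ["incident_commander", "tech_lead", "communications_lead"]
    if severity is Severity.SEV2:
        return ["incident_commander", "tech_lead"]
    return ["tech_lead"]
```

The point of encoding this at all is that severity decides staffing up front, so no one debates headcount mid-incident.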
In 2017, Amazon's S3 us-east region went down for 14 hours. New Relic mounted a massive response that demanded an emergency commander and several incident commanders, who fanned out to manage the response in different areas of the organization.
Even very small incidents benefit from having an incident commander.
Incident as a term has a lot of ambiguity and room for interpretation. For our purposes, an incident is an event that triggers the incident response process.
Incident Lifecycle: sometimes slow onset of symptoms, incident declared (ICS begins), triage to assess severity and impact on customers, diagnosis and troubleshooting, eventually a judgement call that things are fixed enough (ICS ends), incident retrospective, and follow-ups.
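The lifecycle above can be read as a simple state machine. A minimal sketch, with state names that are my own labels for the stages described, not an official process definition:

```python
# Illustrative state machine for the incident lifecycle; the state
# names are assumptions drawn from the stages described above.
LIFECYCLE = {
    "detected":      ["declared"],                # symptoms observed
    "declared":      ["triage"],                  # ICS begins
    "triage":        ["diagnosis"],               # assess severity and customer impact
    "diagnosis":     ["diagnosis", "mitigated"],  # troubleshooting can loop
    "mitigated":     ["retrospective"],           # judgment call: fixed enough, ICS ends
    "retrospective": ["follow_ups"],
    "follow_ups":    [],                          # terminal state
}


def can_transition(state: str, next_state: str) -> bool:
    """Check whether a lifecycle transition is allowed."""
    return next_state in LIFECYCLE.get(state, [])
```

Note the self-loop on diagnosis: troubleshooting iterates until someone makes the judgment call that things are fixed enough.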
Incident Tooling: many tools exist, but they are almost entirely focused on communications and coordination.
# Anatomy of Incidents
Incidents differ from planned work.
They are high stakes—the outcome matters.
They are also high cadence. The time to resolution matters. Trade-off decisions will be different under this time pressure.
Most importantly, almost all incidents of any note are a group activity. There are many people involved in responding.
Incident Commander regulates three flows: emotion, information, and analysis.
High stakes, high cadence, many people: emotions run high. The three classic responses to danger are fight, flight, or freeze. Just when you need your prefrontal cortex running the show, you get flooded with stress hormones and end up running around frantically, or curled up in the fetal position in a corner.
# IC Behaviors
The IC's first job is to get people out of reactive mode: to de-escalate anyone getting combative or panicking, and to help people get moving again when they freeze or get stuck banging their head against a single idea.
People in reactive mode are terrible at solving complex engineering problems. The faster you can calm the room, the faster you can get to a resolution.
Regulating the flow of information requires knowing who is engaged in the response and what they need to know.
Engineers know their systems and care about the details of what is happening.
Customer support doesn't know (or perhaps care) as much about the technical details, but cares deeply about the current impact on customers and how to set expectations for time to resolution.
Executives, security, legal all need someone to speak their language. ICs must serve as a bit of a translator.
This flow of information is about listening, filtering, and acting on what's meaningful: noticing when key expertise is missing and paging the right domain experts, and sometimes even attending to the need for water or food.
Regulating the analysis is not about knowing the answers but about asking good questions.
Challenge assumptions.
Ask for evidence to confirm or refute each hypothesis about what's happening.
Ask people to articulate their thought process. Even if the IC doesn't understand the specific technical details, they can keep a slightly wider view of the problem than the tech leads who are deep in the technical weeds.
When people are brute-forcing a solution, the IC might ask, "Can we bisect this?"
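Bisecting turns a linear search into a logarithmic one. A toy sketch of the idea, here as a hand-rolled analogue of `git bisect` over an ordered list of deploys (the deploy numbers and regression point are hypothetical):

```python
def first_bad(deploys, is_bad):
    """Binary-search an ordered list of deploys for the first one where
    is_bad(deploy) is True. Assumes the regression is monotonic: every
    deploy after the first bad one is also bad."""
    lo, hi = 0, len(deploys) - 1
    answer = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_bad(deploys[mid]):
            answer = deploys[mid]
            hi = mid - 1   # a bad deploy; look earlier for the first one
        else:
            lo = mid + 1   # still good; the first bad deploy is later
    return answer


# Hypothetical: deploys 0-99, regression introduced at deploy 42.
deploys = list(range(100))
print(first_bad(deploys, lambda d: d >= 42))  # 42, found in ~7 checks
```

Checking 100 deploys one by one could take 100 tests; bisecting needs at most seven, which is exactly the kind of shortcut an IC can suggest without knowing the system's internals.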
# IC Competencies
Reasonably fluent in your systems, both technical and human.
How do the technical components fit together? What areas are under the most strain these days?
How does the organization fit together? Who do we call for specific domain expertise?
Need to be familiar with the incident response process. Not memorized, but with some sense of muscle memory.
Do they know where the docs are?
Know the company's priorities so they can make informed sacrifice decisions when necessary.
# Training Great Incident Commanders
Basics of incident command system are taught to all engineers.
Three traits matter for high-pressure incident commanders.
Strong technical vocabulary. They need to follow a conversation that's half in the room and half virtual. They don't need to be experts in all the things, but they do need to understand the jargon.
They need to be well calibrated in their own expertise. They need to know what they don't know and who to ask at those boundaries of their own knowledge.
They need to have skills at self-regulation. They need to be able to project calm while still keeping a motivated sense of urgency in the room.
They need to be willing. They need to know this work is valuable. The skills they will learn and understanding of the platform and organization are valuable.
The organization must recognize engineers for this excellence. Think career ladders, bonuses, promotions.
# Blameless Culture
Incident commanders will be forced in some situations to make sacrifice decisions under extreme pressure without knowing what the outcome will be.
We mentioned before the importance of understanding company priorities.
It's vital that the ICs know the organization has their back and will support them regardless of the outcome of difficult, time-pressured decisions.
Back to the Amazon S3 outage: remember that the triggering event was an authorized engineer, following an official playbook, who typed in the wrong hostname or something similar. Amazon didn't blame the engineer. They said you shouldn't be able to take down S3 with a one-line command. That is a system problem, not "human error."
A good organization, for its own sake and for the sake of its people, understands this and will support the tech leads doing the troubleshooting and the incident commanders making the tough decisions.
People make the best decisions they can at the time with the best of their knowledge.
# Practice
Game days
Fault injection
Adversarial game days
(Use ICS for routine maintenance)
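Fault injection during a game day can be as simple as wrapping a dependency so that some fraction of calls fail. A toy sketch; the failure rate and exception type here are illustrative choices, not a prescription:

```python
import random


def inject_faults(fn, failure_rate=0.2, rng=None):
    """Wrap fn so that roughly failure_rate of calls raise, simulating a
    flaky dependency during a game day. The failure rate and the
    TimeoutError are illustrative; real tools inject latency, errors,
    and resource exhaustion too."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault: upstream timed out")
        return fn(*args, **kwargs)

    return wrapped


# Usage: wrap a dependency call and exercise retry, fallback,
# and alerting paths against the flaky version.
flaky_fetch = inject_faults(lambda: "ok", failure_rate=0.5)
```

The value isn't in the wrapper itself but in what it exercises: alerts fire, a responder gets paged, and an incident commander gets low-stakes practice running the response.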
Shadow experienced Incident Commanders and have the veterans shadow newer commanders through their first major incidents.
# Excellence
Here are habits of our top-notch incident commanders.
One of our NERFs (New Relic Emergency Response Force) calls it being familiar with the explody bits of the system.
They are constantly tuned into what's happening in the incident space and aware of vulnerabilities of the system.
This helps incident commanders develop an intuition when something weird is happening. You know what is likely to explode and can often shortcut the search by pointing people at the likely culprits.
Knowing the boundaries of your own expertise, and knowing the organization well enough to know who is on the other side of those boundaries.
Being willing to draw on other resources in the company.
Excellent ICs know that in these high-stakes situations you have to make the whole greater than the sum of its parts. Instead of trying to be the resource themselves, good ICs know it scales much better to gather the right expertise.
They know that the best resolution does not come from finding the one best answer, but from gathering multiple perspectives from people who understand different aspects of the system and comparing them for useful answers.
There is no right answer but there are more and less useful answers.