Incident Command System

The Incident Command process focuses on clear communication, delegation, and trust between teams working in harmony.

Source: "nrrd 911 ic me: The Incident Commander Role," Alice Goldfuss, SREcon 2016 (USENIX).

Video (mp4): https://2459d6dc103cb5933875-c0245c5c937c5dedcca3f1764ecc9b2f.ssl.cf2.rackcdn.com/srecon16/goldfuss.mp4

ICS was born in Arizona in 1968, among firefighters.

The Incident Commander orchestrates; they do not fix.

Technical Leads fix the problem and communicate updates to keep the Incident Commander in the loop.

The Communications Lead handles external communications, acting as the bridge to customers for both inbound and outbound comms.

Severity Levels:

Sev 5: everything is okay...for now
Sev 4: a thing is smoldering
Sev 3: a part of a thing exploded
Sev 2: one whole thing exploded
Sev 1: everything exploded

At Sev 1 you unlock bonus roles.

Emergency Commander: manages communications with the rest of the business. The Incident Commander is focused on the technical fix; the Emergency Commander is focused on all the other stakeholders.

Logistics Lead: orders pizza, manages shifts, and sends people home (so they can take over the next shift if the incident goes into overtime).
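The severity scale and the roles it unlocks can be written down as data. The following is a minimal sketch, not anything shown in the talk: the names `Severity`, `CORE_ROLES`, `SEV1_ROLES`, and `roles_for` are hypothetical, and it assumes the three core roles are staffed at every severity level, which the notes above do not state explicitly.

```python
from enum import IntEnum
from typing import Tuple


class Severity(IntEnum):
    """Severity levels from the talk; a lower number means a worse incident."""
    EVERYTHING_EXPLODED = 1
    ONE_WHOLE_THING_EXPLODED = 2
    PART_OF_A_THING_EXPLODED = 3
    A_THING_IS_SMOLDERING = 4
    OKAY_FOR_NOW = 5


# Assumption: these three roles are filled for every incident.
CORE_ROLES = ("Incident Commander", "Technical Lead", "Communications Lead")

# The "bonus roles" that the talk says are unlocked at Sev 1.
SEV1_ROLES = ("Emergency Commander", "Logistics Lead")


def roles_for(severity: Severity) -> Tuple[str, ...]:
    """Return the roles to staff for an incident at the given severity."""
    if severity == Severity.EVERYTHING_EXPLODED:
        return CORE_ROLES + SEV1_ROLES
    return CORE_ROLES
```

Calling `roles_for(Severity.EVERYTHING_EXPLODED)` returns all five roles, while any other level returns only the three core roles.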

Why adopt ICS?

Your distributed system is large enough that incidents have become a regular occurrence.

Without clear roles: confusion and misallocated resources.

ICS prevents panic, coordinates effort, maintains reliable lines of communication, and allows the best possible resolution.

"The middle of an incident is the worst possible time to figure out who's doing what when."

How to adopt?

Train everyone.

Yes, even the brand-new people. Create a culture that trusts people to use their best judgement. Training gives them a foundation, so that witnessing others run incidents deepens their own understanding.

Training plan.

Retrain everyone regularly.

Coordinate training of ICs, TLs, CLs.

Roleplay with hands-on activities.

They all need practice playing their roles under pressure—even if it's a roleplay of pressure.

Tooling.

Hubot: a CoffeeScript chat bot framework.

    nrrd 911 ic me                  # Alice claims commander role
    nrrd 911 set status Zombies are attacking the data center
    nrrd 911 cl me                  # Yoni claims communications lead
    nrrd 911 over                   # close the incident
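To make the chat commands above concrete, here is a rough sketch of how a bot might interpret them. It is written in Python rather than as a real Hubot script, and it is not New Relic's implementation: the `Incident` class, the `handle` function, and the reply strings are all hypothetical. Only the four commands shown above are handled; anything else is reported as unknown.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

PREFIX = "nrrd 911 "

# Only the role shortcuts that appear in the notes above.
ROLE_COMMANDS = {"ic": "Incident Commander", "cl": "Communications Lead"}


@dataclass
class Incident:
    """Hypothetical in-memory state for a single incident."""
    roles: Dict[str, str] = field(default_factory=dict)
    status: Optional[str] = None
    is_open: bool = True


def handle(incident: Incident, user: str, message: str) -> str:
    """Apply one chat message to the incident and return the bot's reply."""
    if not message.startswith(PREFIX):
        return ""  # not addressed to the bot
    if not incident.is_open:
        return "No open incident."
    command = message[len(PREFIX):].strip()

    # "<role> me": the speaker claims a role.
    claim = re.fullmatch(r"(\w+) me", command)
    if claim and claim.group(1) in ROLE_COMMANDS:
        role = ROLE_COMMANDS[claim.group(1)]
        incident.roles[role] = user
        return f"{user} is now {role}."

    # "set status <text>": replace the current status line.
    if command.startswith("set status "):
        incident.status = command[len("set status "):]
        return f"Status: {incident.status}"

    # "over": close the incident.
    if command == "over":
        incident.is_open = False
        return "Incident closed."

    return f"Unknown command: {command}"
```

Keeping the state in one small object means the same handler could sit behind a backup chat system, which matters given that tools break during incidents.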

Upboard: a database that collects each incident's start, end, status changes, and severity changes.
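The kind of record Upboard is described as keeping can be pictured as an append-only event log. This is an illustrative sketch only; Upboard's actual schema is not described in the talk notes, and the `Event` and `IncidentLog` names, the `kind` strings, and the helper methods are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    at: datetime
    kind: str        # assumed values: "start", "status", "severity", "end"
    value: str = ""  # status text or severity level; empty for start/end


@dataclass
class IncidentLog:
    """Append-only timeline for one incident (hypothetical structure)."""
    events: List[Event] = field(default_factory=list)

    def record(self, at: datetime, kind: str, value: str = "") -> None:
        self.events.append(Event(at, kind, value))

    def duration(self) -> Optional[timedelta]:
        """Time from the first start to the first end, or None if still open."""
        start = next((e.at for e in self.events if e.kind == "start"), None)
        end = next((e.at for e in self.events if e.kind == "end"), None)
        if start is None or end is None:
            return None
        return end - start

    def severity_history(self) -> List[str]:
        """Severity values in the order they were recorded."""
        return [e.value for e in self.events if e.kind == "severity"]
```

Storing raw events rather than a single summary row keeps the full timeline available for a retrospective, since durations and severity escalations can be derived afterwards.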

Google docs

New Relic products

Blameless retros: people get along better and want to fix things if you aren't coming along, pointing fingers, and blaming them. It's not the people who broke; it's the systems.

Lessons learned from a few years of running ICS

Single source of truth. We started with the training program in the hands of the training team, and nrrdbot and Upboard in the hands of engineering teams. Those sources diverged. It worked better to put those responsibilities together.

Tools break. Your chat system might not be as hip as you thought it was. Have a backup chat system.

Iterate. Training started at 90 minutes and was refined to about 45 minutes. Blameless retros came late to New Relic's process.


By 2018, Alice was no longer at New Relic, but was a living legend among my coworkers. Two of the best SREs at the time described Alice as the best incident commander they'd ever seen.

The team I joined at New Relic was responsible for the tools, docs, processes, and training materials Alice described.

I described working on that team in a talk at DevOps Boulder in fall 2020.

bletchley punk (@alicegoldfuss), in a tweet: "ask the Doc about a cloud backdoor he found during a datacenter outage. one of my fave incidents."