John Allspaw reviews current software industry practice for learning from incidents and suggests how we can improve to get more value out of our incidents.
YOUTUBE M8mYPyRG1fQ John Allspaw, presenting at an internal Spotify conference, offers specific improvements to incident analysis and the rationale behind them.
Interview people ahead of the retro meeting.
Run retros that reveal more data about the incident by contrasting different points of view.
Change the focus of incident analysis to creating a report that tells a compelling story of the messy details:
• What was difficult to understand?
• What was surprising?
• How do people understand the origins of the incident?
• What mysteries still remain?
Experiments for your incident review process:
• Separate generation of follow-up items from the incident retro.
• Record in the report who responded to the incident and who attended the retro.
• Capture in the report things that were done after the incident but before the retro.
• Ask brand-new engineers to review the report and record any and all questions they have. Use their questions to make things easy for your reader, e.g. by adding links in the report that explain company-specific jargon.
• Ask more people to draw diagrams in interviews and retros and include them in the report.
• Have someone not involved in the event lead the analysis.
• Analyze the incident for people who were not in the incident.
• Focus on increasing the number of people who want to read reports and attend retros. Resist focusing on reducing the number of incidents. (Eric adds: trust that everyone in the room is already motivated to prevent future incidents.) What people learn from the reports and retros will influence the work they do that goes into not having incidents in the first place.
Interesting documents get read. Compelling documents also get shared with others. Fascinating documents draw comments and questions, are referenced in code comments, pull requests, architecture diagrams, other incident reports, new-hire onboarding, ...