A veteran reliability engineer suggested four articles to review when writing guidelines for on-call in the software business. We liked three of them.
Emel Gogrusoz, Director of Product Management for OpsGenie. Advice to Management Teams While Enrolling Changes to On-Call Systems. article
— transparency, training, maintenance of alerting and systems, review on-call reports
Cody Wilbourn. Sustainable On-Call. blog
— keep both frequency and volume of alerts in mind when grooming alerts.
Tammy Butow, Principal SRE, Gremlin. How to Establish a High Severity Incident Management Program. blog
— this is a good-enough outline of ways to organize incident response teams. A decent place to start for establishing severity levels, roles of commander and tech lead, and similar.
.
Set clear guidelines about other work while on-call. Don't try to code the same day you're on-call. Encourage rest for engineers following nighttime pages.
Develop a training plan for on-call responsibilities.
A healthy on-call system can only exist if it is improved constantly by fine-tuning processes and systems. Remove noisy alerts. Keep only alerts that are actionable by the person paged for the alert. Update runbooks as the systems change. Teams should monitor their on-call and alert reports to identify systems that need operational improvements.
Management guidance: review on-call reports to monitor for burnout; arrange means to hear feedback from on-call people and respond quickly to requested changes.
Frequency and volume are key categories of on-call stressors. Reduce frequency by having a larger group of people in the on-call rotation and by ruthlessly grooming alerts to keep only actionable alerts. Alert volume requires attention to human-computer coordination.
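As a rough illustration of what grooming alerts from an on-call report might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the Page record, the alert and responder names, and the two thresholds are hypothetical and do not come from any of the articles or from a particular paging tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    alert: str        # hypothetical alert name
    responder: str    # person who was paged
    actionable: bool  # did the responder actually have something to do?


def grooming_candidates(pages, min_pages=3, max_actionable_ratio=0.5):
    """Return (alert, page count, actionable ratio) for alerts that fire
    often but are rarely actionable, noisiest first.

    The thresholds are illustrative defaults, not recommendations.
    """
    total = Counter(p.alert for p in pages)
    acted = Counter(p.alert for p in pages if p.actionable)
    candidates = [
        (name, count, acted[name] / count)
        for name, count in total.items()
        if count >= min_pages and acted[name] / count <= max_actionable_ratio
    ]
    # Sort by page count (descending), then by name for a stable order.
    return sorted(candidates, key=lambda c: (-c[1], c[0]))


def pages_per_responder(pages):
    """Count how many pages each person received (the frequency side)."""
    return Counter(p.responder for p in pages)


# Made-up sample data for the sketch.
SAMPLE_PAGES = (
    [Page("disk-usage-warning", "ana", False)] * 3
    + [Page("disk-usage-warning", "ben", False)]
    + [Page("checkout-errors", "ben", True)] * 3
    + [Page("cert-expiry", "ana", True)]
)

if __name__ == "__main__":
    for name, count, ratio in grooming_candidates(SAMPLE_PAGES):
        print(f"{name}: {count} pages, {ratio:.0%} actionable")
    print(dict(pages_per_responder(SAMPLE_PAGES)))
```

With the sample data, only disk-usage-warning is flagged: it paged four times and never required action. The per-responder count is the kind of number a team could watch when deciding whether the rotation needs more people.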
"The on-call culture has to understand that people are delivering the software, and people maintain the software. High quality software demands high quality operations."
"On-call really gets to customer needs and maintaining your software."
Managers should know when their team members were paged. Managers should go out of their way to express gratitude for the impact on family and personal life.
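One way a manager could keep track of who was paged at night is a small report over the page log. The sketch below is only an illustration under stated assumptions: the (engineer, timestamp) input shape, the names, and the 22:00 to 06:00 definition of "night" are hypothetical, and the timestamps are assumed to already be in each engineer's local time.

```python
from collections import defaultdict
from datetime import datetime

# Assumed definition of "night" for this sketch: 22:00 up to 06:00 local time.
NIGHT_START_HOUR = 22
NIGHT_END_HOUR = 6


def is_night(ts: datetime) -> bool:
    """True if the timestamp falls inside the assumed night window."""
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def night_pages_by_engineer(pages):
    """Group night-time pages by engineer.

    pages: iterable of (engineer, local datetime) tuples.
    Engineers with no night pages are left out of the result.
    """
    result = defaultdict(list)
    for engineer, ts in pages:
        if is_night(ts):
            result[engineer].append(ts)
    return dict(result)


# Made-up sample data for the sketch.
SAMPLE_LOG = [
    ("ana", datetime(2024, 3, 4, 2, 15)),
    ("ana", datetime(2024, 3, 4, 14, 0)),
    ("ben", datetime(2024, 3, 5, 23, 40)),
    ("cy", datetime(2024, 3, 6, 10, 5)),
]

if __name__ == "__main__":
    for engineer, stamps in night_pages_by_engineer(SAMPLE_LOG).items():
        times = ", ".join(ts.strftime("%a %H:%M") for ts in stamps)
        print(f"{engineer}: {len(stamps)} night page(s) at {times}")
```

A list like this is a prompt for a conversation and a thank-you, and for encouraging rest the next day. It is not a substitute for hearing feedback from the people on-call.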
.
Something is missing from some of these articles. We need to emphasize that all of these stories about on-call processes and structures are useful fictions.
Real incidents are messy and do not conform to one model of an incident lifecycle. But learning how to respond during incidents is greatly aided by pretending there is such a lifecycle.