Better Apology for Atlassian

On Tuesday, April 5th, 2022, starting at 7:38 UTC, 775 Atlassian customers lost access to their Atlassian products. The outage spanned up to 14 days for a subset of these customers. We acknowledge up front that their report tries to serve many audiences. The incident itself was painful, and we feel for everyone at Atlassian. Nevertheless, we offer some suggestions to improve their report.

The center of our objection is the executive summary, which is the only part many people will read.

The job of that executive summary is to reassure all customers that they can continue to trust Atlassian at the center of their business process automation. For us, their executive summary undermines that trust.

We needed to read a lot more of “we know what a big incident this was. We know we have a lot of work ahead of us to rebuild your trust.” We especially needed some signal that they understand the need to rebuild that trust with all their customers, not just the ones directly impacted.

They brag about their “open company, no bullshit” value and almost immediately offer this:

> Although this was a major incident, no customer lost more than five minutes of data. In addition, over 99.6% of our customers and users continued to use our cloud products without any disruption during the restoration activities.

That is factually true. But it ignores the elephant in the room: a two-week outage. All 200,000+ of us customers saw it, and now we face contingency planning: what if the worst case happened to us? Two weeks without the runbooks in our wiki, without the acceptance criteria in all our software tickets, without the hooks into our own incident management process and follow-ups, without the hiring process documentation.

Here's what I wish they said:

> This was a major incident. Any number of customers being unable to use our products for as long as two weeks is embarrassing. We are deeply sorry. We know we have work to do to repair your trust in us: not just the trust of those directly impacted, but of all Atlassian customers.
>
> Even amid this failure, there are some things our team accomplished that make us proud. No customer lost more than five minutes of data. In addition, over 99.6% of our customers and users continued to use our cloud products without any disruption even as we were so focused on the restoration activities.

This rephrasing does two things. It admits the harm. And it turns the successes into applause for the heroic efforts of the teams involved, letting leadership publicly celebrate the recovery. I think that’s important for internal morale, and an important signal to customers that the people at the sharp end are respected.

One last thing… the measures they’re taking include improving incident response. I think they also need to build a learning-from-incidents team, and to proclaim that they will invest in deeper learning from their smaller incidents so they can adapt earlier in ways that prevent larger ones. (Though I’ll admit, many in the business don’t yet know why that is a better signal to customers.)