SNAFUcatchers and the resilience engineering community tell us seemingly contradictory things. Five Whys will mislead you. There is no Root Cause. MTTR is a meaningless metric. Go learn from your incidents. If we can't ask five whys or look for root cause and can't trust our metrics, how do we learn from incidents?
Incidents surprise us. We didn't know those two or three or four things could affect each other that way. Or maybe we did know they were related but didn't know it could get this bad. In some incidents, we designed and built the system to keep those things isolated from each other and none of us noticed when a recent change connected them.
When you think you've found the root cause, or the key decision that you wish had gone differently...
1. ...ask questions of the participants until you understand why that choice at that time and under those stressors looked like the best option. You will learn more about the current state of the system by trusting that everyone made the best choices they could at the time. Our instinctual or habitual response is to look for a bad actor or to look for some example of stupidity or carelessness. When we follow where those antagonistic questions lead, we cut short our chance to learn. We will find "solutions" that actually undermine resilience.
2. ...ask questions until you understand what else could have gone wrong but was prevented by these actions. Expand your investigation to clarify what other obstacles were capturing the attention of everyone involved at the time. When complex systems are actively failing, the operators will feel like they're navigating a minefield. Now that things have calmed down a bit, see if the gift of hindsight can help you triangulate some of the mines in the field that you all managed to avoid.
3. ...try to get different teams into the same room to compare their mental models and their alerting and dashboards and runbooks around the sub-components involved in the incident. Build bridges or signaling systems with the other teams to improve coordination.