Whenever something breaks in production, an INCIDENT ticket gets created on Jira. An engineer or developer figures out what went wrong and attaches a Root Cause Analysis (RCA) document to the ticket. This article covers what happens after the problem has been figured out, and how you can encourage other developers to learn from mistakes made in the past.
At RazorpayX, we have a scheduled RCA meet every Thursday. We discuss issues that occurred in the past week and what we learned from them. The atmosphere is one of curiosity, and the aim is to understand and learn.
RCA meetings are also called postmortems at some organisations. The two terms refer to the same thing.
An RCA meet, or Root Cause Analysis meet, is a formal meeting where you discuss why and how a particular problem occurred. The practice is not specific to the tech industry: the term "postmortem" is borrowed from medicine, where it refers to examining the cause of a patient's death. It is just as useful for debugging software problems and educating developers about them.
The goal of this exercise
The goal of this exercise is to understand all contributing factors and causes, document the incident for future reference, and create action items that reduce the likelihood of the problem occurring again. The exercise can also be performed asynchronously, but from what I have seen, most folks prefer scheduling a meeting to discuss the document.
The incident could be a frontend bug, a backend failure or even a DevOps issue. In every case, the goal is to inform and educate other stakeholders about what went wrong and to analyse the cause of the problem. These meetings are not meant to assign blame to an engineer or to interrogate the team about why a particular system failed. If that is what happens at your RCA meets, you need to restructure how you run them.
An engineering leader organises and leads the meeting. The meeting should be conducted with a clear goal and theme in mind, and the participants should be informed about the topic in advance. Usually, an RCA document is created and shared with stakeholders so that they can read it before joining the RCA meet. If the problem has affected parties outside your organisation, a separate external RCA containing the relevant information may also be created and sent to the third party.
The RCA Document
The RCA document is a story where your bug is the antagonist. It explains the bug's rise to power, its eventual downfall, and the changes it brought about.
There are a few things that an RCA document must contain:
1. Problem Description
This is a no-brainer. Every RCA document should open with a succinct description of the problem and a brief overview.
2. Timeline of events
The timeline should record when the issue was discovered, by whom, and how it was reported. It should also contain the first response time to the issue as well as every escalation event. If the issue has been fixed, it should include the events related to the fix as well.
3. The 5 WHYs
The 5 WHYs are a great way to explain the issue. This is an interrogative technique in which the answer to each question forms the basis for the next question. A very simple example would be:
Why does the Earth have seasons?
Ans) Because the Earth's axis is tilted as it revolves around the Sun
Why does the Earth revolve around the Sun?
Ans) Because of the Sun's gravitational pull
This is a great tool because each answer takes you one step closer to the root of the problem.
4. Action Items
Hurray! You identified the problem... but now what? Every RCA document should contain action items describing what is being done, or was done, to resolve the issue. These items help the broader audience understand how the problem was mitigated and might help them in the future.
The document should also state the impact of the issue. An issue can affect one customer or thousands of customers, so it is important to understand the scale at which the problem occurred and the impact it had. This also helps estimate the severity of the problem.
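The parts of an RCA document described above can be sketched as a small data structure. This is only an illustration, not how RazorpayX stores its RCAs: the class names, field names, event labels and the sample incident are all hypothetical. It shows how a recorded timeline lets you derive the first response time and the turnaround time, and how a quick check can flag sections that are still empty before the document is shared.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class TimelineEvent:
    at: datetime
    kind: str  # assumed labels: "reported", "first_response", "escalated", "resolved"
    note: str = ""


@dataclass
class RCADocument:
    problem: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    five_whys: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)
    action_items: List[str] = field(default_factory=list)
    customers_impacted: Optional[int] = None  # None means "not assessed yet"

    def first_time(self, kind: str) -> Optional[datetime]:
        """Earliest timestamp of an event of the given kind, if any."""
        times = [e.at for e in self.timeline if e.kind == kind]
        return min(times) if times else None

    def elapsed(self, start_kind: str, end_kind: str) -> Optional[timedelta]:
        """Time between the first start_kind event and the first end_kind event."""
        start, end = self.first_time(start_kind), self.first_time(end_kind)
        if start is None or end is None:
            return None
        return end - start

    def missing_sections(self) -> List[str]:
        """Names of the sections that have not been filled in yet."""
        filled = {
            "problem": bool(self.problem.strip()),
            "timeline": bool(self.timeline),
            "five_whys": bool(self.five_whys),
            "action_items": bool(self.action_items),
            "impact": self.customers_impacted is not None,
        }
        return [name for name, ok in filled.items() if not ok]


# A made-up incident, purely for illustration.
doc = RCADocument(
    problem="Checkout page failed to load for some users",
    timeline=[
        TimelineEvent(datetime(2024, 1, 4, 10, 0), "reported", "Raised by support"),
        TimelineEvent(datetime(2024, 1, 4, 10, 12), "first_response"),
        TimelineEvent(datetime(2024, 1, 4, 10, 30), "escalated"),
        TimelineEvent(datetime(2024, 1, 4, 11, 45), "resolved", "Rolled back"),
    ],
    five_whys=[
        ("Why did the page fail to load?", "A config value was missing"),
        ("Why was the value missing?", "The deploy skipped a validation step"),
    ],
    action_items=["Add config validation to the deploy pipeline"],
    customers_impacted=120,
)

print("First response time:", doc.elapsed("reported", "first_response"))
print("Turnaround time:", doc.elapsed("reported", "resolved"))
print("Missing sections:", doc.missing_sections())
```

Keeping the timeline as timestamped events, rather than free text, is a design choice made here for the sake of the example: it means figures such as the turnaround time can be computed instead of typed in by hand.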
It is a good idea to time-box the meeting, because a fixed limit helps keep discussions on point. At RazorpayX, our RCA meets are limited to one hour. Depending on how complex each problem is, we sometimes discuss more than one issue in a single meet.
Ask the participants to read the document before joining the meeting. This saves time that would otherwise go into explaining the problem during the meeting.
A member of the team that performed the postmortem takes the participants through the RCA document. At the start of the meeting, describe the problem and clarify any doubts about the problem itself. After that, get into the probable causes, discuss the timeline and talk through the action items. At this point, participants may suggest changes to the action items and ask for more information about the issue.
You should also leave time at the end for questions from the participants. As you go through the document with them, keep the discussion focused on the points above.
What you shouldn't do in the meeting
- Place blame on someone: as I stated before, these meetings aren't for blaming other people. They are there to offer learnings and help stakeholders understand the problem.
- Get sidetracked into side issues: focus on the issue at hand and don't rant about an irrelevant issue.
- Become a victim of hindsight bias: in hindsight, everything is obvious.
In these meetings, you will often hear that the resolutions were "common sense" and the problems were all "obvious oversights". You will also encounter many flawed assumptions and learn about different systems and how they interact with each other. See these meetings as learning opportunities at your company rather than a way to shift the problem onto someone else. Engineers should not be penalised because, frankly, bugs are expected even in software with years of maturity.
At the end of an RCA meet, you should have the following points clarified:
- How did the issue occur?
- How was the issue resolved?
- What was the turnaround time for resolution?
- What were the process gaps that led to the issue?
- Were there any architectural mistakes that led to it?
- How can similar problems be mitigated in the future?
- What changes could be made to reduce issues like these?