Step by Step

Below are the high-level steps involved in performing a postmortem. The sections that follow describe how to perform each step in detail.

  1. Create a new postmortem for the incident.
  2. Populate the incident timeline with important changes in status/impact and key actions taken by responders.
    • For each item in the timeline, include a metric or some third-party page where the data came from.
  3. Schedule a postmortem meeting within the required timeframe for all required and optional attendees.
  4. Analyze the incident.
    • Identify superficial and root causes.
    • Consider technology and process.
  5. Open any immediate follow-up action tickets.
    • These, as well as additional ones, will be discussed in the postmortem meeting.
  6. Ask for review.
  7. Attend the postmortem meeting.
  8. Share the postmortem.
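The steps above can be tracked as a simple checklist. Below is a minimal Python sketch (the step names and class are hypothetical, for illustration only):

```python
from dataclasses import dataclass, field

@dataclass
class PostmortemChecklist:
    """Tracks completion of the high-level postmortem steps."""
    steps: dict = field(default_factory=lambda: {
        "create_postmortem": False,
        "populate_timeline": False,
        "schedule_meeting": False,
        "analyze_incident": False,
        "open_followup_tickets": False,
        "request_review": False,
        "attend_meeting": False,
        "share_postmortem": False,
    })

    def complete(self, step: str) -> None:
        # Fail loudly on a typo rather than silently tracking nothing.
        if step not in self.steps:
            raise KeyError(f"unknown step: {step}")
        self.steps[step] = True

    def remaining(self) -> list:
        return [s for s, done in self.steps.items() if not done]

checklist = PostmortemChecklist()
checklist.complete("create_postmortem")
# checklist.remaining() now lists the seven steps still to do
```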

Owner Responsibilities#

At the end of a major incident call, the incident response team decides on one responder to own the postmortem. Writing the postmortem will ultimately be a collaborative effort, but selecting a single owner will help ensure it gets done.

The owner of a postmortem is responsible for creating the postmortem document and updating it with all relevant information.

Administration#

  1. Create the document.
  2. Add all responders to it.
  3. Schedule the meeting.

The postmortem owner's first step is to create a new, empty postmortem for the incident. Go through the history in Slack to identify the responders and add them to the page so they can help populate the postmortem.

Next, schedule the postmortem meeting for 30 minutes to an hour, depending on the complexity of the incident. Scheduling the meeting at the beginning of the process helps ensure the postmortem is completed within the SLA. The meeting should be scheduled within a week of the incident. Don't worry about finding the best time for all attendees; the priority is to schedule within this timeframe, and attendees should adjust their schedules accordingly.
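As a small sketch of the scheduling constraint, the one-week window can be expressed as a simple date check (the seven-day SLA value and function names here are assumptions for illustration):

```python
from datetime import date, timedelta

def meeting_deadline(incident_date: date, sla_days: int = 7) -> date:
    """Latest acceptable date for the postmortem meeting."""
    return incident_date + timedelta(days=sla_days)

def is_within_sla(incident_date: date, proposed: date, sla_days: int = 7) -> bool:
    """True if the proposed meeting date falls inside the SLA window."""
    return incident_date <= proposed <= meeting_deadline(incident_date, sla_days)

# An incident on March 1 must have its meeting by March 8.
assert is_within_sla(date(2024, 3, 1), date(2024, 3, 6))
assert not is_within_sla(date(2024, 3, 1), date(2024, 3, 12))
```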

Invite the following people to the postmortem meeting:

Create a Timeline#

Begin by focusing on the timeline. Document the facts of what happened during the incident; avoid evaluating what should or should not have been done, and avoid drawing conclusions about what caused the incident. Presenting only the facts here helps avoid blame and supports a deeper analysis. Note that the incident may have started before responders became aware of it and began the response effort. The timeline includes important changes in status/impact and key actions taken by responders. To avoid hindsight bias, start your timeline at a point before the incident and work forward, rather than backwards from resolution.

Review the incident log in Slack to find key decisions made and actions taken during the response effort. Also include information the team didn't know during the incident that, in hindsight, you wish you had known. Find this additional information by looking at monitoring, logs, and deployments related to the affected services. You'll take a deeper look at monitoring during the analysis step, but start here by adding key events related to the incident, including changes to incident status and impact, to the timeline.

For each item in the timeline, identify a metric or some third-party page where the data came from. This helps illustrate each point clearly and ensures you remain rooted in fact rather than opinions. This could be a link to a monitoring graph, a log search, a tweet, etc.—anything that shows the data point you're trying to illustrate in the timeline.
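A timeline entry, then, pairs a factual observation with the data source that backs it. A minimal sketch of that shape in Python (the fields, entries, and URLs are hypothetical examples, not a prescribed format):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEntry:
    timestamp: datetime
    description: str  # a fact: a status/impact change or a responder action
    source: str       # link to a monitoring graph, log search, tweet, etc.

timeline = [
    TimelineEntry(datetime(2024, 3, 1, 14, 2),
                  "Error rate on checkout service exceeded 5%",
                  "https://monitoring.example.com/checkout-errors"),
    TimelineEntry(datetime(2024, 3, 1, 14, 10),
                  "Responder rolled back the 13:55 deploy",
                  "https://deploys.example.com/checkout"),
]

# Every entry should cite where the data came from.
assert all(entry.source for entry in timeline)
```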

Key Takeaways

  • Stick to the facts.
  • Include changes to incident status and impact.
  • Include key decisions and actions taken by responders.
  • Illustrate each point with a metric.

Document Impact#

Impact should be described from a few perspectives.

Analyze the Incident#

Now that you have an understanding of what happened during the incident, look further back in time to find the contributing factors that led to it. Technology is a complex system with a network of relationships (organizational, human, technical) that is continuously changing.

In his paper, "How Complex Systems Fail," Dr. Richard Cook says that because complex systems are heavily defended against failure, it is a unique combination of apparently innocuous failures that join to create catastrophic failure. Furthermore, because overt failure requires multiple faults, attributing a "root cause" is fundamentally wrong. There is no single root cause of major failure in complex systems, but a combination of contributing factors that together lead to failure. The postmortem owner's goal in analyzing the incident is not to identify the root cause, but to understand the multiple factors that created an environment where this failure became possible.

Cook also says the effort to find the "root cause" does not reflect an understanding of the system, but rather the cultural need to blame specific, localized forces for events. Blamelessness is essential for an effective postmortem. An individual's action should never be considered a root cause. Effective analysis goes deeper than human action. In the cases where someone's mistake did contribute to a failure, it is worth anonymizing this in your analysis to avoid attaching blame to any individual. Assume any team member could have made the same mistake. According to Cook, "all practitioner actions are actually gambles, that is, acts that take place in the face of uncertain outcomes."

The postmortem owner should start their analysis by looking at the monitoring for the affected services. Search for irregularities, like sudden spikes or flatlining, leading up to and at the start of the incident. Include any commands or queries used to look up data, graph images, or links from monitoring tooling alongside this analysis so others can see how the data was gathered. If there is no monitoring for this service or behavior, make building monitoring an action item for this postmortem. More on writing action items below.
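As a sketch of what spotting "spikes or flatlining" can look like mechanically, here is a minimal Python heuristic over a metric series. The thresholds are arbitrary assumptions for illustration, not part of any prescribed tooling:

```python
from statistics import mean, stdev

def find_spikes(series, threshold=2.0):
    """Indices of points more than `threshold` standard deviations from the mean."""
    if len(series) < 2:
        return []
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mu) > threshold * sigma]

def find_flatline(series, min_run=5):
    """Start index of the first run of `min_run` identical values, or None."""
    run_start, run_len = 0, 1
    for i in range(1, len(series)):
        if series[i] == series[i - 1]:
            run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_start, run_len = i, 1
    return None

# A sudden spike at index 5, and a metric that goes flat at index 3:
assert find_spikes([10, 11, 10, 12, 11, 95, 10, 11]) == [5]
assert find_flatline([3, 4, 3, 0, 0, 0, 0, 0]) == 3
```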

Importance of Monitoring

Puppet's 2018 State of DevOps Report highlights making monitoring configurable by the team operating the service as a foundational practice for successful DevOps. Empowering teams to define, manage, and share their own measurement of performance contributes to a culture of continuous improvement.

Another helpful strategy for targeting what caused an incident is reproducing it in a non-production environment. Experiment by modifying variables to isolate the phenomenon. If you modify or remove some input does the incident still occur?
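The experiment loop described above, removing one input at a time and checking whether the failure still reproduces, can be sketched in a few lines of Python. The `reproduce` callable and the input names here are hypothetical stand-ins for whatever your non-production reproduction actually is:

```python
def isolate_cause(reproduce, inputs):
    """Remove one input at a time; report which removals make the failure stop."""
    baseline = reproduce(inputs)
    contributing = []
    for key in inputs:
        modified = {k: v for k, v in inputs.items() if k != key}
        # If the failure reproduced before but not without this input,
        # the input is a contributing factor worth investigating.
        if baseline and not reproduce(modified):
            contributing.append(key)
    return contributing

# Hypothetical failure that only occurs when a bad config meets high load:
reproduce = lambda inp: "bad_config" in inp and "high_load" in inp
factors = isolate_cause(
    reproduce, {"bad_config": True, "high_load": True, "cache_miss": True})
assert factors == ["bad_config", "high_load"]
```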

This level of analysis will uncover the superficial causes of the incident. Next, ask why the system was designed in a way to make this possible. Why did those design decisions seem to be the best decisions at the time? Answering these questions will help you uncover root causes.

Some questions to help the postmortem owner identify the class of a particular problem are listed under "Questions to Ask" below.

Though it may not be a root cause, consider the process in your analysis. Did the way that people collaborate, communicate, and/or review work contribute to the incident? This is also an opportunity to evaluate and improve the incident response process. Consider what worked well and didn't work well within the incident response process during the incident.

Write a summary of the findings in the postmortem. The team may find further learnings and identify additional causes through discussion in the meeting, but the owner should do as much pre-work and documentation as possible to ensure a productive discussion.

Questions to Ask#

Below is a non-exhaustive list to help stimulate deep analysis. Ask "how" and "what" questions rather than "who" or "why" to discourage blame and encourage learning.

Cues
  • What were you focusing on?
  • What was not noticed?
  • What differed from what was expected?
Previous Knowledge/Experience
  • Was this an anticipated class of problem or did it uncover a class of issue that was not architecturally anticipated?
  • What expectations did participants have about how things were going to develop?
  • Were there similar incidents in the past?
Goals
  • What goals governed your actions at the time?
  • How did time pressure or other limitations influence choices?
  • Was there work the team chose not to do in the past that could have prevented or mitigated this incident?
Assessment
  • What mistakes (for example, in interpretation) were likely?
  • How did you view the health of the services involved prior to the incident?
  • Did this incident teach you something that should change views about this service's health?
Taking Action
  • How did you judge you could influence the course of events?
  • What options were taken to influence the course of events? How did you determine that these were the best options at the time?
  • How did other influences (operational or organizational) help determine how you interpreted the situation and how you acted?
Help
  • Did you ask anyone for help?
  • What signal brought you to ask for support?
  • Were you able to contact the people you needed to contact?
Process
  • Did the way that people collaborate, communicate, and/or review work contribute to the incident?
  • What worked well in your incident response process and what did not work well?

Key Takeaways

  • Find contributing factors, not the root cause.
  • Focus on the system, not the humans.
  • Look for anomalies in monitoring.
  • Reproduce and experiment in a non-production environment.
  • Don't forget to review your processes.

Follow-Up Actions#

After identifying what caused the incident, ask what needs to be done to prevent it from happening again. Based on your analysis, you may also have proposals that reduce the occurrence of this whole class of problem, not just prevent this specific incident from recurring.

It may not be possible (or worth the effort) to completely eliminate the possibility of this same incident or a similar incident from happening again, so also consider how you can improve detection and mitigation of future incidents. Does the team need better monitoring and alerting around this class of problem so they can respond faster in the future? If this class of incident does happen again, how can the team decrease the severity or duration? Remember to identify any actions that can make the incident response process better, too. Go through the incident history in Slack to find any to-do items raised during the incident and make sure these are documented as tickets as well. (At this phase, you are only opening tickets. There is no expectation that tasks will be completed before the postmortem meeting.)

Create tickets for all proposed follow-up actions in your task management tool. Label all tickets with the postmortem-action-item label. Provide as much context and proposed direction on the tickets as you can so the team's product owner will have enough information to prioritize the task against other work and the eventual assignee will have enough information to complete the task.
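A minimal sketch of opening those tickets programmatically is below. The `client.create_ticket` call and its parameters are a hypothetical task-tracker API, not a real library; only the `postmortem-action-item` label comes from the process above:

```python
def create_followup_tickets(client, project, actions):
    """Open one labeled ticket per proposed follow-up action.

    `client` is a hypothetical task-tracker client exposing `create_ticket`;
    each action supplies a title plus enough context for the product owner
    to prioritize it and for the eventual assignee to complete it.
    """
    tickets = []
    for action in actions:
        tickets.append(client.create_ticket(
            project=project,
            title=action["title"],
            description=action["context"],
            labels=["postmortem-action-item"],
        ))
    return tickets
```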

In the ;login: magazine article, "Postmortem Action Items: Plan the Work and Work the Plan," John Lunney, Sue Lueder, and Betsy Beyer write about how Google writes postmortem action items to ensure they are completed quickly and easily. They advise all action items to be written as actionable, specific, and bounded.

  • Poorly worded: "Investigate monitoring for this scenario."
    Better (actionable): "Add alerting for all cases where this service returns >1% errors."
  • Poorly worded: "Fix the issue that caused the outage."
    Better (specific): "Handle invalid postal code in user address form input safely."
  • Poorly worded: "Make sure engineer checks that database schema can be parsed before updating."
    Better (bounded): "Add automated presubmit check for schema changes."

Source: ;login: Spring 2017 Vol. 42, No. 1.
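As a rough illustration of the "actionable, specific, and bounded" guidance, a team could even lint ticket titles with a heuristic like the one below. The verb lists are arbitrary assumptions; this is a sketch, not a complete check:

```python
# Concrete verbs suggest an actionable task; vague phrases suggest the
# opposite. Both lists are illustrative assumptions, not an official rule.
ACTION_VERBS = {"add", "remove", "fix", "handle", "create", "automate",
                "upgrade", "replace", "document", "migrate"}
VAGUE_PHRASES = {"investigate", "look into", "consider", "make sure"}

def looks_actionable(title: str) -> bool:
    """Heuristic: starts with a concrete verb and avoids vague phrasing."""
    lowered = title.lower()
    if any(phrase in lowered for phrase in VAGUE_PHRASES):
        return False
    return lowered.split()[0] in ACTION_VERBS

assert looks_actionable(
    "Add alerting for all cases where this service returns >1% errors")
assert not looks_actionable("Investigate monitoring for this scenario")
```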

If there are any proposed follow-up actions that need discussion before tickets can be created, make a note to add these items to the postmortem meeting agenda. These may be proposals that need team validation or clarification. Discussing these items in the meeting will help decide how best to proceed.

Be careful with creating too many tickets. Only create tickets that are P0/P1s; i.e., tasks that absolutely should be dealt with. There will be some trade-offs here, and that's fine. Sometimes the ROI isn't worth the effort that would go into performing an action that may reduce the recurrence of the incident. When that is the case, it is worth documenting that decision in the postmortem. Understanding why the team is choosing not to perform an action helps avoid learned helplessness.

Note that the person who creates the ticket is not responsible for completing it. Tickets are opened under the projects for the teams that own the affected service. At least one representative from each team responsible for a follow-up action is invited to the postmortem meeting.

Key Takeaways

  • What needs to be done to reduce the likelihood of this, or a similar, incident happening again?
  • How can you detect this type of incident sooner?
  • How can you decrease the severity or duration of this type of incident?
  • Write actionable, specific, and bounded tasks.