Product Management Post-Mortem: Best Practices for Business

We've all been there: launch day arrives, and despite countless hours of planning, something goes wrong. Maybe it's a minor bug that slipped through QA, or perhaps it's a critical issue that affects the user experience. Either way, the team is left scratching their heads, wondering, "What went wrong?" But what if I told you that these moments, as uncomfortable as they are, hold the key to future success? Welcome to the world of post-mortems, a practice that can transform setbacks into stepping stones.

What Is a Post-Mortem?

A Post-Mortem is a structured process for analyzing the events and decisions that led to a particular outcome in a project or initiative. It's not just a meeting to point fingers or assign blame; it's a collaborative effort to understand the "why" behind the "what happened." In essence, it's a diagnostic tool for your project's health, akin to a medical post-mortem that seeks to understand the cause of death. Except, in this case, the patient can be revived and made stronger than before.

The Value of a Post-Mortem in Project Management

You might be wondering, "Why go through the trouble?" The answer lies in the invaluable insights that a well-executed Post-Mortem can provide. Here are some key benefits:

  • Continuous Improvement: Post-Mortems offer a structured way to evaluate performance, making it easier to implement improvements systematically. They serve as a feedback loop that can be tighter than any agile sprint cycle.
  • Team Alignment: The process ensures that everyone is on the same page about what happened and why, fostering a culture of transparency and shared responsibility.
  • Risk Mitigation: By understanding the root causes of issues, you can proactively address them, reducing the likelihood of recurrence in future projects.
  • Accountability and Learning: A Post-Mortem creates a culture where accountability is embraced, and learning from mistakes is prioritized.

What a Post-Mortem Isn’t

  • A Blame Game: The objective is not to find a scapegoat but to understand the contributing factors to an outcome. It's a nuanced approach that looks at the system, not just individual actions.
  • A One-Time Event: Post-Mortems are most effective when they are part of an ongoing strategy for continuous improvement. They should be as regular as your team meetings or sprint reviews.
  • A Quick Fix: While Post-Mortems can provide actionable insights, they are not a substitute for thorough planning and execution. They are a tool in your toolkit but should not be the only one you rely on.

Why Do We Do a Post-Mortem?

The main purpose of a post-mortem is to accomplish two things. First, it aims to dig deep and identify the root causes that led to a specific issue or set of issues. Understanding the "why" behind the "what" is crucial for any team looking to improve. Second, the post-mortem serves as a forum for discussing how to prevent similar issues from happening again. It's worth noting that the goal isn't necessarily to come up with immediate solutions. Some problems require more in-depth analysis, and given the time constraints of a post-mortem meeting, the focus should be on generating actionable items and setting up follow-up plans.

Who Should Schedule a Post-Mortem Meeting?

In most cases, the person who worked directly on the issue - often referred to as the "agent" - should be the one to schedule the post-mortem. If you're in this role and unsure how to proceed, this article aims to answer most of your questions. One critical point to remember is that post-mortems shouldn't be confined to internal team discussions. They should be publicly announced to ensure transparency and collective learning. For example, making an announcement in a designated Slack channel like #on-duty can be an effective way to keep everyone in the loop.

What Incidents "Trigger" a Post-Mortem?

Determining what incidents should trigger a post-mortem is crucial for its effectiveness. Here are some guidelines:

  • Customer-Noticed Incidents: Any incident that a customer has noticed should automatically trigger a post-mortem. Customer experience is paramount, and any disruption to it warrants a thorough investigation.
  • Potential Customer-Noticed Incidents: Even if customers haven't directly reported an issue, but it's something they could have noticed, it's better to be safe and conduct a post-mortem.
  • Internal Issues: Sometimes, the issues may not be customer-facing but are still significant enough to warrant investigation. These could be issues that consume a lot of time, recur frequently, or are particularly challenging to address.
  • Logged Incidents: If an issue has been formally logged in an incident log, it's a clear indicator that a post-mortem is necessary.
  • When in Doubt: If you're uncertain, it's always better to consult with your team or ask in a designated channel like #on-duty.
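The trigger guidelines above can be read as a simple decision rule. The following is a minimal sketch in Python, not a prescribed implementation: the `Incident` class, its field names, and the returned strings are all hypothetical and chosen only to mirror the bullets.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, one per guideline in the list above.
    customer_noticed: bool = False
    customer_could_have_noticed: bool = False
    significant_internal_issue: bool = False  # time-consuming, recurring, or hard to address
    in_incident_log: bool = False
    unsure: bool = False


def post_mortem_decision(incident: Incident) -> str:
    """Return 'schedule', 'ask the team', or 'not required'."""
    if (incident.customer_noticed
            or incident.customer_could_have_noticed
            or incident.significant_internal_issue
            or incident.in_incident_log):
        return "schedule"
    if incident.unsure:
        # When in doubt, consult the team or a channel such as #on-duty.
        return "ask the team"
    return "not required"


print(post_mortem_decision(Incident(customer_could_have_noticed=True)))
```

Any single trigger is enough to schedule a post-mortem; the "ask" branch only applies when none of the clear triggers is present.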

When to Schedule a Post-Mortem?

Timing is of the essence when it comes to post-mortems. The general rule of thumb is to schedule them as soon as possible after resolving the incident. Here's a quick guide:

  • Morning Incidents: If something goes wrong in the morning, aim to have the post-mortem later the same day.
  • Afternoon or Evening Incidents: If the issue arises later in the day, schedule the post-mortem for the following day.
  • Weekend or Holiday Incidents: In these cases, the post-mortem should be scheduled for the next working day.

The reason for this urgency is simple: details fade from memory quickly. The longer you wait, the more likely you are to forget key facts or start rationalizing the issue. That's why it's highly recommended for those involved in the incident to document facts as soon as they can.
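The scheduling guide above can be sketched as a small helper. This is an illustration under stated assumptions rather than a fixed rule: "morning" is taken to mean before 12:00, "the following day" is interpreted as the next working day (so a Friday afternoon incident lands on Monday), and the `holidays` parameter and function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def next_working_day(day: date, holidays=frozenset()) -> date:
    """Return the first day after `day` that is neither a weekend nor a holiday."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5 or candidate in holidays:
        candidate += timedelta(days=1)
    return candidate


def post_mortem_date(resolved_at: datetime, holidays=frozenset()) -> date:
    """Pick the post-mortem date from the time the incident was resolved."""
    day = resolved_at.date()
    is_working_day = day.weekday() < 5 and day not in holidays
    # Assumption: "morning" means before noon on a working day.
    if is_working_day and resolved_at.hour < 12:
        return day
    # Afternoon, evening, weekend, and holiday incidents move to the next working day.
    return next_working_day(day, holidays)


print(post_mortem_date(datetime(2024, 1, 15, 9, 30)))  # Monday morning: same day
print(post_mortem_date(datetime(2024, 1, 19, 15, 0)))  # Friday afternoon: next Monday
```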

Who Should Be Part of a Post-Mortem?

Determining who should attend a post-mortem is crucial for its success. Generally, the people involved in the incident, whether engineers or not, should be present. The agent on duty at the time of the incident and other engineers involved in resolving the issue must attend. Additionally, consider involving a representative from the customer success department and either a solution architect or project manager for issues that have impacted or may impact customers. However, try to keep the number of attendees to a maximum of six people, excluding the moderator. Anyone else interested can read the post-mortem meeting notes afterward.

The Role of the Moderator

The moderator's job is to ensure the post-mortem runs smoothly and adheres to a set process. They are responsible for taking notes and publishing them on the relevant Confluence page. The moderator should enforce a code of conduct that includes no blaming, no finger-pointing, and no guessing if facts can be easily collected. They have the authority to remove attendees or postpone the meeting under specific conditions. When choosing a moderator, opt for someone not involved in the incident or anyone familiar with the process.

How to Conduct a Post-Mortem

  • Introduction: The moderator starts by explaining why the meeting has been called and what the issue was.
  • Timeline: An agent provides a rundown of events leading to the incident, focusing solely on facts. Use tools like Slack History to collect this information.
  • Discussion: With the timeline as a base, the team works to identify the root causes of the incident.
  • Solutions: Identify both short-term and long-term solutions to prevent the issue from recurring.

Short-Term and Long-Term Solutions

  • Short-Term: Discuss immediate actions taken to resolve the issue and prevent its recurrence. Examples include deploying in pairs or adding additional alarms.
  • Long-Term: Discuss broader changes needed to prevent the issue from happening again, such as adding validation steps.

What to Focus On During the Discussion

  • How was the issue discovered?
  • How did it escape testing?
  • Has this incident occurred before?

Tracking Action Items

  • Short-Term: If it's code-related, create and assign a ticket immediately after the post-mortem, labeled with a post-mortem tag.
  • Long-Term: If it involves discussion with the product team, assign someone to take ownership of the solution.

Bi-Weekly Follow-Up Meetings

To ensure that action items from the post-mortem are being addressed, schedule bi-weekly open meetings that anyone interested can attend. In these meetings, those assigned to post-mortem tickets should give an update on their status.
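One way to keep action items visible between follow-up meetings is to track them as simple records. The sketch below is a hypothetical illustration: the `ActionItem` class, its fields, and the owner names are assumptions, and a real team would more likely rely on the ticketing system's own labels and filters.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    owner: str
    horizon: str               # "short-term" or "long-term"
    done: bool = False
    tag: str = "post-mortem"   # label applied to every post-mortem ticket


def follow_up_agenda(items):
    """Return one status line per open item, short-term items first."""
    open_items = [item for item in items if not item.done]
    open_items.sort(key=lambda item: (item.horizon != "short-term", item.owner))
    return [
        f"[{item.tag}] {item.horizon}: {item.title} (owner: {item.owner})"
        for item in open_items
    ]


# Hypothetical items and owners, for illustration only.
items = [
    ActionItem("Add validation steps", "sam", "long-term"),
    ActionItem("Add additional alarms", "alex", "short-term"),
    ActionItem("Deploy in pairs", "alex", "short-term", done=True),
]
for line in follow_up_agenda(items):
    print(line)
```

Completed items drop off the agenda, so the follow-up meeting only covers what is still open.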

How to Write a Post-Mortem

Writing a post-mortem is an art that combines analytical thinking with effective communication. The goal is to create a document that not only serves as a record of events but also as a learning tool that can help prevent future incidents. Here's a step-by-step guide on how to write a compelling post-mortem:

Step 1: Introduction

Start by providing a brief overview of the incident. Include the date, time, and a high-level description of what happened. This sets the stage and gives readers context for what they're about to delve into.

Step 2: Incident Summary

In this section, outline the key details of the incident. This should include:

  • Impact: Describe how the incident affected customers, business operations, or any other relevant areas.
  • Duration: Specify how long the incident lasted.
  • Trigger: Explain what caused the incident in the first place.

Step 3: Timeline of Events

Create a detailed timeline that chronicles the incident from start to finish. Use precise timestamps and include all significant events, such as when the incident was detected, when the team was alerted, and what steps were taken to resolve it.

Step 4: Root Cause Analysis

This is the core of the post-mortem. Use the 5 Whys technique or a similar method to drill down into the root cause of the incident. Be thorough but avoid blaming individuals; focus on processes and systems.

Step 5: Lessons Learned and Action Items

List the key takeaways from the incident and the post-mortem discussion. Then, outline the action items, specifying who is responsible for each and setting deadlines.

Step 6: Conclusion

Wrap up the post-mortem by summarizing the key points and reiterating the action items. This serves as a quick reference for anyone revisiting the document later.
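If the team writes post-mortems regularly, the six steps can be captured in a reusable template. The following is a minimal sketch, assuming a plain-text report; the `PostMortemReport` class and its field names are hypothetical and simply follow the steps described above.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortemReport:
    # Hypothetical fields, one group per step of the guide.
    introduction: str
    impact: str
    duration: str
    trigger: str
    timeline: list = field(default_factory=list)      # (timestamp, event) pairs
    root_cause: str = ""
    lessons: list = field(default_factory=list)       # plain strings
    action_items: list = field(default_factory=list)  # (task, owner, deadline) triples
    conclusion: str = ""

    def render(self) -> str:
        """Render the report as plain text, one section per step."""
        lines = [
            "Introduction", self.introduction, "",
            "Incident Summary",
            f"  Impact: {self.impact}",
            f"  Duration: {self.duration}",
            f"  Trigger: {self.trigger}", "",
            "Timeline of Events",
        ]
        lines += [f"  {stamp}: {event}" for stamp, event in self.timeline]
        lines += ["", "Root Cause Analysis", self.root_cause, "",
                  "Lessons Learned and Action Items"]
        lines += [f"  Lesson: {lesson}" for lesson in self.lessons]
        lines += [f"  Action: {task} (owner: {owner}, due: {deadline})"
                  for task, owner, deadline in self.action_items]
        lines += ["", "Conclusion", self.conclusion]
        return "\n".join(lines)
```

Making every action item a (task, owner, deadline) triple means the template cannot be filled in without naming who is responsible and by when.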

Post-Mortem Example:

Here's a simplified example of a hypothetical incident where a software deployment caused a service outage:

  • Introduction:
    On January 15th, a software deployment led to a 30-minute service outage affecting approximately 500 users.
  • Incident Summary:
    • Impact: 500 users experienced service downtime
    • Duration: 30 minutes
    • Trigger: Software deployment to production
  • Timeline:
    • 2:00 PM: Deployment initiated
    • 2:05 PM: Outage detected
    • 2:10 PM: Rollback initiated
    • 2:30 PM: Service restored
  • Root Cause Analysis:
    The deployment script had a bug that wasn't caught during testing.
  • Lessons Learned and Action Items:
    • Improve testing procedures
    • Implement a rollback plan
  • Conclusion:
    The incident has led to a review of our deployment and testing procedures to prevent similar issues in the future.
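The timeline in this example also yields a few useful numbers, such as how long detection and response took. Here is a small sketch of that arithmetic, assuming all timestamps fall on the same day; the helper name `minutes_between` and the metric names are hypothetical.

```python
from datetime import datetime

# Timeline copied from the example above.
timeline = [
    ("2:00 PM", "Deployment initiated"),
    ("2:05 PM", "Outage detected"),
    ("2:10 PM", "Rollback initiated"),
    ("2:30 PM", "Service restored"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day timestamps written like '2:05 PM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


stamps = {event: stamp for stamp, event in timeline}
time_to_detect = minutes_between(stamps["Deployment initiated"], stamps["Outage detected"])
time_to_respond = minutes_between(stamps["Outage detected"], stamps["Rollback initiated"])
total_duration = minutes_between(stamps["Deployment initiated"], stamps["Service restored"])

print(time_to_detect, time_to_respond, total_duration)  # 5 5 30
```

The 30-minute total matches the duration stated in the incident summary, which is a quick consistency check worth doing before publishing the document.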

Summary and Next Steps

In this comprehensive guide, we've delved into the intricacies of conducting and writing post-mortems in a business context. From understanding its value to knowing who should be involved and how to effectively document and learn from incidents, a well-executed post-mortem is an invaluable tool for continuous improvement. It's not just about identifying what went wrong; it's about creating a culture of accountability and learning that drives your team and your product forward.
