Commit b589333

Add Rituals Guide (#56)
This PR adds a new guide for common rituals, and kicks things off with guides for Retrospectives and Postmortems.
1 parent 6afaf92 commit b589333

File tree

3 files changed

+101
-0
lines changed

README.md

+1
@@ -11,3 +11,4 @@ Conventions, processes and notes about how we do things.
1111
- **[Today I Learned](./til/)**
1212
- **[Setup npm Publish](./npm)**
1313
- **[Open Source Guide](./opensource)**
14+
- **[Rituals Guide](./rituals)**

rituals/README.md

+11
@@ -0,0 +1,11 @@
1+
# Rituals
2+
3+
This guide documents some common rituals we encourage in our work.
4+
5+
These are not ironclad processes we should always follow. Think of them as a helpful jumping-off point if you're on a project where they would be useful.
6+
7+
## [Postmortems](./postmortems.md)
8+
9+
> “A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.” — Google's _Site Reliability Engineering_
10+
11+
This ritual is held when a serious incident happens, such as a site outage, and the team needs to understand what happened and how to avoid it in the future.

rituals/postmortems.md

+89
@@ -0,0 +1,89 @@
1+
# Incident Postmortems
2+
3+
Postmortem documents are a ritual designed to examine serious incidents or outages. Google’s [book on Site Reliability Engineering](https://landing.google.com/sre/book.html) says:
4+
5+
> A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
6+
7+
## Purpose
8+
9+
We practice postmortems to ensure we understand and address the root cause of severe incidents such as outages, data loss, or serious production bugs.
10+
11+
> "Don't make the mistake of neglecting a post-mortem after an incident. Without a post-mortem you fail to recognize what you're doing right, where you could improve, and most importantly, how to avoid making the same exact mistakes next time around. A well-designed, blameless post-mortem allows teams to continuously learn, and serves as a way to iteratively improve your infrastructure and incident response process." — [PagerDuty](https://response.pagerduty.com/after/post_mortem_process/)
12+
13+
### What is a Postmortem?
14+
15+
A postmortem is a document that examines an incident in detail, including:
16+
17+
- A summary of what happened
18+
- The incident's impact
19+
- What caused the incident
20+
- How the incident was resolved
21+
- A detailed timeline
22+
- What could have prevented the incident
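
The sections listed above can be modeled as a small record, which makes it easy to spot a draft that is still incomplete. This is a minimal sketch only: the `Postmortem` class, its field names, and the `missing_sections` helper are hypothetical and are not part of the linked template.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record mirroring the sections a postmortem should cover."""
    summary: str = ""
    impact: str = ""
    cause: str = ""
    resolution: str = ""
    timeline: List[str] = field(default_factory=list)  # one entry per event
    prevention: str = ""


def missing_sections(doc: Postmortem) -> List[str]:
    """Return the names of sections that are still empty, in definition order."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name)]


draft = Postmortem(summary="Checkout was down", impact="Orders could not be placed")
print(missing_sections(draft))
```

A check like this only confirms that every section has some content; it says nothing about whether the analysis is honest or complete.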
23+
24+
### Why Do We Do Postmortems?
25+
26+
The goal of the postmortem is to gain a detailed understanding of the root causes of the incident to avoid it happening again in the future. A secondary goal can be to reassure the client, since the actions taken during an incident response may not be visible to them.
27+
28+
For postmortems to be effective at reducing repeat incidents, the review process has to incentivize teams to honestly identify root causes and fix them. For this reason, we practice **blameless postmortems** (see below).
29+
30+
### When is a Postmortem Needed?
31+
32+
It depends on the client or project. For applications or sites with a service-level agreement (SLA), postmortems are commonly carried out for high-severity incidents that violate the SLA. For client applications or sites, a postmortem may only be called for following a major outage or quality-control problem.
33+
34+
> "Incidents in your organization should have clear and measurable severity levels. These severity levels can be used to trigger the post-mortem process. For example, any incident Sev-1 or higher triggers the postmortem process, while the postmortem can be optional for less severe incidents." — [Atlassian](https://www.atlassian.com/blog/statuspage/incident-postmortem-writing-tips)
35+
36+
The postmortem document should be produced within 24-48 hours of the incident's resolution, while it's still fresh in everyone's memory.
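
As a rough illustration of the two rules above (a severity level that triggers the process, and a 24-48 hour window after resolution), here is a minimal sketch. The `Incident` class, the numeric severity scale where 1 is the most severe, and the threshold value are all assumptions made for the example; each project should set its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

# Assumed scale: Sev-1 is the most severe, so lower numbers are worse.
POSTMORTEM_SEVERITY_THRESHOLD = 1  # hypothetical; agree on this per project


@dataclass
class Incident:
    title: str
    severity: int
    resolved_at: datetime


def postmortem_required(incident: Incident) -> bool:
    """True when the incident is at or above the triggering severity."""
    return incident.severity <= POSTMORTEM_SEVERITY_THRESHOLD


def postmortem_window(incident: Incident) -> Tuple[datetime, datetime]:
    """Return (target, deadline): 24 and 48 hours after resolution."""
    target = incident.resolved_at + timedelta(hours=24)
    deadline = incident.resolved_at + timedelta(hours=48)
    return target, deadline


outage = Incident("Site outage", severity=1, resolved_at=datetime(2024, 1, 1, 12, 0))
if postmortem_required(outage):
    target, deadline = postmortem_window(outage)
    print(f"Postmortem due between {target} and {deadline}")
```

For less severe incidents the postmortem stays optional, so a `False` result here means "use your judgment" rather than "never write one".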
37+
38+
> "Despite how painful an outage may have been, the worst thing you can do is to bury it and never properly close the incident in a clear and transparent way. Most humans come together in times of crisis and communication around outage post-mortems, in my experience, has always been met with positive energy, understanding comments, constructive suggestions and numerous offers to help." — [Daniel Doubrovkine](https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/)
39+
40+
### Who Completes the Postmortem?
41+
42+
In a small company like ours, the most senior engineer with direct knowledge of the incident should write the postmortem. It's their job and responsibility to acknowledge, understand and explain what happened. For particularly sensitive topics, it may make sense to escalate this responsibility to an engineering manager or founder.
43+
44+
> "Focusing attention away from the individual contributors allows the team to learn from the mistakes and address the root causes in time without the unnecessary stress or pressure during a crisis." — [Daniel Doubrovkine](https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/)
45+
46+
### Who is the Postmortem For?
47+
48+
The postmortem is intended for public consumption, especially by clients. It's a visible way to document not just the problem that happened, but how you addressed it and are ensuring it won't happen again. A properly written postmortem should increase your customer's faith in you.
49+
50+
> "The postmortem audience includes customers, direct reports, peers, the company's executive team and often investors. The document may be published on your website, and otherwise goes to the entire team. It's critical to bcc everyone. This is the equivalent of a locked thread, avoiding washing the laundry in public: one of the worst possible things to see is when a senior manager replies back pointing an individual who made a mistake, definitely not an email you want accidentally sent to the entire company." — [Daniel Doubrovkine](https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/)
51+
52+
### Running a Postmortem Meeting
53+
54+
Some teams hold a meeting after the postmortem document is produced. These meetings are generally short, only 15-30 minutes, and are intended to be a wrap-up of the postmortem process. We discuss what happened, what could have gone better, and any follow-up actions we need to take. The point of the meeting is to ensure there's no disagreement on the analysis and to spread wider awareness of the problems the team is facing.
55+
56+
## Blameless Postmortems
57+
58+
Many teams have adopted “blameless” postmortems, which focus on systemic problems and root causes without naming individuals or casting blame onto people or teams. Here's John Allspaw, from [Blameless Postmortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/):
59+
60+
> Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:
61+
62+
> - what actions they took at what time,
63+
> - what effects they observed,
64+
> - expectations they had,
65+
> - assumptions they had made,
66+
> - and their understanding of timeline of events as they occurred.
67+
68+
> …and that they can give this detailed account without fear of punishment or retribution.
69+
70+
> Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.
71+
72+
For a good example of why blameless postmortems matter, I strongly encourage you to watch [Who Destroyed Three Mile Island?](https://www.youtube.com/watch?v=hMk6rF4Tzsg), a talk by Nickolas Means from Lead Dev London 2018.
73+
74+
## Tips
75+
76+
- Make sure the timeline is an accurate representation of events.
77+
- Use the [Five Whys](https://en.wikipedia.org/wiki/5_Whys) technique to traverse the causal chain until you find the true root cause.
78+
- Don't change details or events to make things "look better". We need to be honest in our postmortems, even with ourselves; otherwise they lose their effectiveness.
79+
- Don't name and shame someone. We keep our postmortems blameless. If someone deployed a change that broke things, it's not their fault; it's our fault for having a system that allowed them to deploy a breaking change.
80+
- Avoid the concept of "human error". This is related to the point above about naming and shaming, but there's a subtle difference: the mistake is very rarely rooted in a human performing an action. There are often several contributing factors (the script the human ran didn't have rate limiting, the documentation was out of date, etc.) that can and should be addressed.
81+
82+
## Resources
83+
84+
* [Postmortem Template](https://docs.google.com/document/d/12Prd33SDG1U0yE_gwXUgwa85Vn6dS_Qo5RdEWwFzFEo) document on Google Drive
85+
* [Postmortem Handbook from Atlassian](https://www.atlassian.com/incident-management/handbook/postmortems)
86+
* [Postmortem Process from PagerDuty](https://response.pagerduty.com/after/post_mortem_process/)
87+
* [Effective Postmortem Tips from PagerDuty](https://response.pagerduty.com/after/effective_post_mortems/)
88+
* [Blameless Postmortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
89+
* [How to Write Great Outage Postmortems](https://artsy.github.io/blog/2014/11/19/how-to-write-great-outage-post-mortems/)
