Post-Incident Reviews (RCAs)
This directory holds the post-incident review (RCA) for every P0 / P1 / P2 incident that gets driven through the ../handover/oncall-runbook.md procedure. P3s are tracked in the backlog only — no RCA required.
Audience: anyone (engineering, customer team, leadership). RCAs are customer-safe by default. Anything that names a vendor under contract, names a real customer, or contains an unredacted security finding goes under internal/ (see ../STYLE.md §8). For the post-handover customer team, that subfolder is {{TBD: not yet created}}.
Conventions
File naming
Name each file <date>-<slug>.md, with the date written as YYYY-MM-DD. Examples:
- 2026-04-25-bullmq-worker-deadlock.md
- 2026-05-12-postgres-pool-exhaustion.md
The slug is lowercase, hyphenated, and describes the failure mode, not the symptom. ("postgres-pool-exhaustion" not "site-was-slow"; "bullmq-worker-deadlock" not "bets-broken".) Same slug discipline as the runbooks directory.
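The naming rule can be checked mechanically. The sketch below is one possible validator, not an existing tool in this repository: the function name parse_rca_filename is hypothetical, and the YYYY-MM-DD date format and the letters-digits-hyphens slug alphabet are assumptions read off the two example filenames.

```python
import re
from datetime import date

# Assumed shape: ISO date, a lowercase hyphenated slug, then ".md".
RCA_FILENAME = re.compile(r"(\d{4}-\d{2}-\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md")


def parse_rca_filename(name: str) -> tuple[date, str]:
    """Return (date, slug), or raise ValueError if the name is malformed."""
    match = RCA_FILENAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid RCA filename: {name!r}")
    # date.fromisoformat also rejects impossible dates such as 2026-13-40.
    return date.fromisoformat(match.group(1)), match.group(2)
```

A check like this only covers the shape of the name. Whether the slug describes the failure mode rather than the symptom still needs a human reviewer.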
Numbering
The template is 0000-template.md. Real incidents are dated, not numbered — sorting by filename gives chronological order automatically. Don't try to keep a rolling counter; date prefixes are good enough.
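Date prefixes sort correctly because zero-padded year-month-day strings compare in the same order as the dates they encode. A minimal sketch, assuming YYYY-MM-DD prefixes as in the examples (the helper name rca_files_by_date is hypothetical):

```python
TEMPLATE = "0000-template.md"


def rca_files_by_date(filenames: list[str], newest_first: bool = False) -> list[str]:
    """Order RCA filenames by their date prefix, leaving the template out."""
    incidents = [name for name in filenames if name != TEMPLATE]
    # Plain string sorting is enough: zero-padded ISO dates sort chronologically.
    return sorted(incidents, reverse=newest_first)
```

Passing newest_first=True gives the order used by the index, where new RCAs go at the top.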
Blameless wording
Every RCA is blameless. Name systems and decisions; never name people in the failure narrative. People show up only in the IC field, the action-item owner field, and the "what went well" section (positive recognition is fine — negative attribution is not). Anonymize aggressively if the timeline would otherwise identify someone.
Authoring workflow
- Within 24 hours of resolution: the IC clones 0000-template.md to <date>-<slug>.md and fills in the entire template. No {{TBD}} markers may remain in the merged file.
- Within 5 business days: a blameless retro is held using the format in ../handover/oncall-runbook.md §5. Action items are captured in the RCA's "Action items" section.
- PR review: a senior engineer outside the incident participants reviews the RCA before merge. They check that every action item has an owner and a date, that the timeline is fact-only, and that no individual is named in the failure narrative.
- Action item tracking: every action item is mirrored into the team's tracker ({{TBD: customer-team to specify Jira / Linear / GitHub Issues link}}) with the RCA filename as a back-reference.
- Closure review: at the next retro after the original, action-item progress is reviewed. Any item past its due date without progress is escalated.
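Two of the workflow rules above lend themselves to automation: the ban on leftover {{TBD}} markers and the two deadlines. The following is a rough sketch only. All function names are hypothetical, business days are assumed to be Monday to Friday with public holidays ignored, and counting is assumed to start on the day after resolution.

```python
import re
from datetime import date, datetime, timedelta

# Matches {{TBD}} as well as annotated forms such as {{TBD: not yet created}}.
TBD_MARKER = re.compile(r"\{\{\s*TBD[^}]*\}\}")


def find_tbd_markers(text: str) -> list[tuple[int, str]]:
    """Return (line number, marker) for every TBD placeholder left in an RCA."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TBD_MARKER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def rca_draft_due(resolved_at: datetime) -> datetime:
    """The RCA file is due within 24 hours of resolution."""
    return resolved_at + timedelta(hours=24)


def retro_due(resolved_on: date, business_days: int = 5) -> date:
    """Last day for the retro, counting Monday-to-Friday days only (assumption)."""
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current
```

A pre-merge check could fail the PR whenever find_tbd_markers returns a non-empty list for the new RCA file.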
Index
Add new RCAs to the top of this list. Keep one row per incident; cluster related incidents only via the "Related" column.
| Date | Title | Severity | IC | Status | Related |
|---|---|---|---|---|---|
| (no incidents recorded yet) | | | | | |
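Adding a row at the top of the index can be scripted. This is a sketch under assumptions, not a tool that exists in this repository: the helper name add_index_row is hypothetical, it reads the column names from the table's own header row, and it drops the placeholder row once a real incident is recorded.

```python
def add_index_row(table: str, row: dict[str, str]) -> str:
    """Insert a new incident row directly under the header separator."""
    lines = table.strip().splitlines()
    header, separator, body = lines[0], lines[1], lines[2:]
    columns = [cell.strip() for cell in header.strip("|").split("|")]
    # The placeholder row is only meaningful while the index is empty.
    body = [line for line in body if "no incidents recorded yet" not in line]
    # Columns missing from `row` are left as empty cells.
    new_row = "| " + " | ".join(row.get(col, "") for col in columns) + " |"
    return "\n".join([header, separator, new_row, *body])
```

Because the newest row always goes directly under the separator, repeated calls keep the list in newest-first order without re-sorting.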
See also
- ../handover/oncall-runbook.md: the live procedure that produces these documents
- 0000-template.md: clone this for every new RCA
- ../runbooks/: symptom-keyed cheat sheets the IC consults during the incident
- ../security-register.md: known security findings; if a finding becomes an incident cause, link the RCA back to its SF-### number