Post-Incident Reviews (RCAs)

This directory holds the post-incident review (RCA) for every P0, P1, or P2 incident handled through the ../handover/oncall-runbook.md procedure. P3 incidents are tracked in the backlog only and do not require an RCA.

Audience: anyone (engineering, customer team, leadership). RCAs are customer-safe by default. Anything that names a vendor under contract or a real customer, or that contains an unredacted security finding, goes under internal/ (see ../STYLE.md §8). For the post-handover customer team, that subfolder is {{TBD: not yet created}}.


Conventions

File naming

<YYYY-MM-DD>-<slug>.md

Examples:

  • 2026-04-25-bullmq-worker-deadlock.md
  • 2026-05-12-postgres-pool-exhaustion.md

The slug is lowercase, hyphenated, and describes the failure mode, not the symptom. ("postgres-pool-exhaustion" not "site-was-slow"; "bullmq-worker-deadlock" not "bets-broken".) Same slug discipline as the runbooks directory.
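The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the repository's tooling: the function name `is_valid_rca_filename` and the exact regular expression are assumptions chosen for illustration. It verifies only the shape of the name (a real calendar date plus a lowercase, hyphenated slug); whether the slug describes the failure mode rather than the symptom still needs a human reviewer.

```python
import re
from datetime import date

# Assumed pattern: ISO date, a hyphen, a lowercase hyphenated slug, then ".md".
RCA_NAME = re.compile(r"([0-9]{4}-[0-9]{2}-[0-9]{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md")


def is_valid_rca_filename(name: str) -> bool:
    """Return True if name looks like <YYYY-MM-DD>-<slug>.md with a real date."""
    match = RCA_NAME.fullmatch(name)
    if match is None:
        return False
    try:
        # Rejects impossible dates such as 2026-13-40.
        date.fromisoformat(match.group(1))
    except ValueError:
        return False
    return True
```

With this sketch, both example filenames above pass, while 0000-template.md and names with uppercase letters or underscores in the slug are rejected.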

Numbering

The template is 0000-template.md. Real incidents are dated, not numbered — sorting by filename gives chronological order automatically. Don't try to keep a rolling counter; date prefixes are good enough.
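As a small illustration of why date prefixes are sufficient, a plain lexicographic sort of the filenames already yields chronological order, with the template first. The helper name `chronological` is hypothetical and exists only for this sketch.

```python
def chronological(filenames: list[str]) -> list[str]:
    """Sort RCA filenames; ISO date prefixes make string order match date order."""
    return sorted(filenames)


ordered = chronological([
    "2026-05-12-postgres-pool-exhaustion.md",
    "0000-template.md",
    "2026-04-25-bullmq-worker-deadlock.md",
])
# ordered[0] is the template, followed by the incidents from oldest to newest.
```

This works because every date component is zero-padded and ordered from most to least significant, and "0000" sorts before any four-digit year in use.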

Blameless wording

Every RCA is blameless. Name systems and decisions; never name people in the failure narrative. People show up only in the IC field, the action-item owner field, and the "what went well" section (positive recognition is fine — negative attribution is not). Anonymize aggressively if the timeline would otherwise identify someone.


Authoring workflow

  1. Within 24 hours of resolution: the IC copies 0000-template.md to <YYYY-MM-DD>-<slug>.md and fills in the entire template. The merged file must contain no {{TBD}} markers.
  2. Within 5 business days: a blameless retro is held using the format from ../handover/oncall-runbook.md §5. Action items are captured in the RCA's "Action items" section.
  3. PR review: a senior engineer outside the incident participants reviews the RCA before merge. They check: every action item has an owner + date; the timeline is fact-only; no individual is named in the failure narrative.
  4. Action item tracking: every action item is mirrored into the team's tracker ({{TBD: customer-team to specify Jira / Linear / GitHub Issues link}}) with the RCA filename as a back-reference.
  5. Closure review: at the next retro after the original, action-item progress is reviewed. Any item past its due date without progress is escalated.
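Two of the workflow rules above lend themselves to simple automation: the "no {{TBD}} markers" rule from step 1 and the 5-business-day retro deadline from step 2. The sketch below is illustrative only. The names `find_tbd_markers` and `add_business_days` are hypothetical, and the business-day calculation assumes a Monday-to-Friday week with no public holidays, which the team's actual calendar may not match.

```python
import re
from datetime import date, timedelta

# Matches both bare markers ({{TBD}}) and annotated ones ({{TBD: some note}}).
TBD_MARKER = re.compile(r"\{\{TBD[^}]*\}\}")


def find_tbd_markers(text: str) -> list[str]:
    """Return every {{TBD...}} marker still present in an RCA's text."""
    return TBD_MARKER.findall(text)


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri). Holidays are ignored (assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


# Example: an incident resolved on Monday 2026-04-27 has a retro due by 2026-05-04.
retro_due = add_business_days(date(2026, 4, 27), 5)
```

A pre-merge check could fail the PR whenever `find_tbd_markers` returns a non-empty list for the new RCA file, leaving the reviewer in step 3 to focus on the timeline and blameless wording.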

Index

Add new RCAs to the top of this list. Keep one row per incident; cluster related incidents only via the "Related" column.

Date | Title | Severity | IC | Status | Related
(no incidents recorded yet)

See also