Incident YYYY-MM-DD —
Clone this file to <YYYY-MM-DD>-<slug>.md and fill in every field. Delete this top-of-file note before merging. Slug = the failure mode (postgres-pool-exhaustion), not the symptom (site-was-slow). See README.md for conventions.
Header
| Field | Value |
|---|---|
| Severity | P0 / P1 / P2 |
| Started | YYYY-MM-DDTHH:MM:SSZ (first symptom observed) |
| Resolved | YYYY-MM-DDTHH:MM:SSZ (full recovery confirmed) |
| Duration | |
| IC | |
| Detected by | alert / customer report / engineer (which?) |
| Affected services | ebit-api / ebit-rt / ebit-fe / ebit-admin-fe / ebit-bj / ebit-bo / ebit-speed-roulette |
| User impact | one-line player-visible summary |
| Related | links to prior RCAs with the same root cause, if any |
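
The Duration field can be derived from Started and Resolved rather than worked out by hand. A minimal sketch, assuming both values use the YYYY-MM-DDTHH:MM:SSZ format shown in the table; the function names and the output format are illustrative, not a project convention:

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a YYYY-MM-DDTHH:MM:SSZ timestamp into an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def incident_duration(started: str, resolved: str) -> str:
    """Return the time between Started and Resolved as 'Hh MMm SSs'."""
    delta = parse_utc(resolved) - parse_utc(started)
    if delta.total_seconds() < 0:
        raise ValueError("Resolved is earlier than Started")
    total_minutes, seconds = divmod(int(delta.total_seconds()), 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"
```

Raising on a negative duration catches the common mistake of swapping the two fields or mixing local time with UTC.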
Summary
Two or three sentences. What broke, what users saw, how it was fixed. Plain language; the rest of the document provides the detail.
Timeline
All timestamps in UTC. Keep entries fact-only — no judgment, no people-naming. One line per material event.
| Time (UTC) | Event |
|---|---|
| HH:MM | Symptom first observed: <…> |
| HH:MM | Pager fired |
| HH:MM | IC ack; opened ebit-perf-test and perf-system Grafana dashboards |
| HH:MM | Hypothesis: <…> based on <…> |
| HH:MM | Hypothesis confirmed via <…> |
| HH:MM | Mitigation: <…> applied |
| HH:MM | Full recovery confirmed: <…> |
| HH:MM | Status page → resolved |
Root cause
The actual technical cause. Be specific. Cite file:line from the source tree. If the cause is a known finding, reference its SF-### from ../security-register.md. If it's a known weakness from the architecture doc, reference its AF-#.
Example shape:
The bet-settled queue worker deadlocked on a Prisma transaction that wrapped a remote RPC call (apps/api/src/bet/queue/bet-settled.processor.ts:42-78). The RPC stalled when the speed-roulette container restarted; the transaction held its DB connection until the Prisma timeout (60s). With concurrency=8 workers, the pool was exhausted within 480 seconds.
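
The failure shape in this example, a connection held open across a slow remote call, can be reproduced with a toy model. The sketch below is not Prisma and not the bet-settled processor: the Pool class, both job functions, the pool size of 10, and the delay are hypothetical stand-ins, with a semaphore playing the role of the connection pool.

```python
import asyncio


class Pool:
    """Toy connection pool: a counting semaphore plus a peak-usage counter."""

    def __init__(self, size: int) -> None:
        self._sem = asyncio.Semaphore(size)
        self.in_use = 0
        self.peak = 0

    async def __aenter__(self) -> "Pool":
        await self._sem.acquire()
        self.in_use += 1
        self.peak = max(self.peak, self.in_use)
        return self

    async def __aexit__(self, *exc) -> None:
        self.in_use -= 1
        self._sem.release()


async def slow_rpc(delay: float) -> None:
    """Stand-in for a remote call that stalls for `delay` seconds."""
    await asyncio.sleep(delay)


async def rpc_inside_transaction(pool: Pool, delay: float) -> None:
    async with pool:           # connection checked out for the transaction
        await slow_rpc(delay)  # remote call stalls while the connection is held


async def rpc_before_transaction(pool: Pool, delay: float) -> None:
    await slow_rpc(delay)      # remote call finishes first, no connection held
    async with pool:
        pass                   # stand-in for a short DB write


async def peak_connections(job, workers: int = 8, pool_size: int = 10,
                           delay: float = 0.05) -> int:
    """Run `workers` concurrent jobs and report the most connections held at once."""
    pool = Pool(pool_size)
    await asyncio.gather(*(job(pool, delay) for _ in range(workers)))
    return pool.peak
```

With 8 workers, the first variant should hold 8 connections at once for the full length of the stall, while the second holds at most one at a time, because no connection is checked out while the remote call is pending. A real RCA should replace this with the measured pool size and hold times.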
Contributing factors
Bulleted. Anything that made the incident worse or longer than it had to be — bad alerting, missing runbook, ambiguous signals, recent change that increased blast radius. Not synonyms for the root cause.
- …
- …
What went well
Bulleted. Blameless positives. Aim for 2–4 entries; not every incident will have many. Use this section to recognize good practice (fast detection, clean comms, useful runbook).
- …
- …
What didn't go well
Bulleted. Blameless negatives. Each entry must be a system / process observation, not a person observation.
- …
- …
Action items
Numbered. Each entry needs an owner and a date. Mirror these into the team tracker; cite the tracker ID here once it exists.
| # | Action | Owner | Due | Tracker |
|---|---|---|---|---|
| 1 | Add an alert on pg_stat_activity.state = 'idle in transaction' count > 5 | | YYYY-MM-DD | {{TBD: tracker link}} |
| 2 | Lift the speed-roulette restart sequence into ../runbooks/db-down.md §A | | YYYY-MM-DD | {{TBD}} |
| 3 | Audit Prisma transactions for embedded RPC calls; file follow-up tickets per offender | | YYYY-MM-DD | {{TBD}} |
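
The alert condition in the first example row can be written down precisely, which helps when the action item is handed to someone else. A rough sketch, assuming the check is fed the state column of pg_stat_activity by whatever database client or exporter is already in use; the constant and function names are illustrative, and the threshold of 5 is the one from the example row:

```python
from typing import Iterable, Optional

# Query that yields the number the alert compares against the threshold.
# pg_stat_activity and the 'idle in transaction' state are standard PostgreSQL.
IDLE_IN_TX_SQL = """
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
"""


def count_idle_in_tx(states: Iterable[Optional[str]]) -> int:
    """Count sessions idle inside an open transaction.

    `states` is the state column of pg_stat_activity; it can be NULL (None)
    for background processes, which are ignored here.
    """
    return sum(1 for state in states if state == "idle in transaction")


def should_alert(states: Iterable[Optional[str]], threshold: int = 5) -> bool:
    """Fire only when the count is strictly greater than the threshold."""
    return count_idle_in_tx(states) > threshold
```

The comparison is strict, matching "count > 5" in the action item, so exactly five idle-in-transaction sessions does not page.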
Detection signals captured
Useful for future runbook authoring — the actual signals that would have detected this earlier, not just the ones that did.
- Grafana panel that first showed the deviation: <dashboard> > <panel>
- Loki query that first surfaced the error: {service_name="ebit-api"} |= "<keyword>"
- Jaeger search that found the failing trace: service=ebit-api, operation=<…>, tags=error=true
- Trace ID(s) of representative failing requests: <traceID>
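
To make the Loki signal reproducible later, it can help to record the query as a replayable URL over the incident window. A minimal sketch that only builds the URL for Loki's query_range endpoint and sends nothing; the helper names are illustrative, and the base URL passed in is whatever the Loki instance actually is:

```python
from urllib.parse import urlencode


def logql_line_filter(service_name: str, keyword: str) -> str:
    """Build a LogQL query of the shape {service_name="..."} |= "..."."""
    # Escape backslashes and double quotes so the keyword stays one string literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f'{{service_name="{service_name}"}} |= "{escaped}"'


def loki_query_range_url(base_url: str, query: str, start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """Return a /loki/api/v1/query_range URL for the given time window.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = urlencode({"query": query, "start": start_ns,
                        "end": end_ns, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
```

Pasting the resulting URL into the RCA pins both the query and the time window, so the same log lines can be found again before retention removes them.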
References
- ../handover/oncall-runbook.md: first-response procedure used during the incident
- ../runbooks/<runbook-used-during-fix>.md: runbook(s) consulted during recovery
- ../flows/<relevant-flow>.md: flow doc(s) that explain the affected subsystem
- ../security-register.md: if a known finding contributed
- PR(s) shipped as mitigation:
- Alert / page that fired (or should have):