Skip to content

Incident YYYY-MM-DD —

Clone this file to <YYYY-MM-DD>-<slug>.md and fill in every field. Delete this top-of-file note before merging. Slug = the failure mode (postgres-pool-exhaustion), not the symptom (site-was-slow). See README.md for conventions.


Field Value
Severity P0 / P1 / P2
Started YYYY-MM-DDTHH:MM:SSZ (first symptom observed)
Resolved YYYY-MM-DDTHH:MM:SSZ (full recovery confirmed)
Duration
IC
Detected by alert / customer report / engineer (which?)
Affected services ebit-api / ebit-rt / ebit-fe / ebit-admin-fe / ebit-bj / ebit-bo / ebit-speed-roulette
User impact one-line player-visible summary
Related links to prior RCAs with the same root cause, if any

Summary

Two or three sentences. What broke, what users saw, how it was fixed. Plain language; the rest of the document provides the detail.


Timeline

All timestamps in UTC. Keep entries fact-only — no judgment, no people-naming. One line per material event.

Time (UTC) Event
HH:MM Symptom first observed: <…>
HH:MM Pager fired
HH:MM IC ack; opened ebit-perf-test and perf-system Grafana dashboards
HH:MM Hypothesis: <…> based on
HH:MM Hypothesis confirmed via <…>
HH:MM Mitigation: <…> applied
HH:MM Full recovery confirmed: passed
HH:MM Status page → resolved

Root cause

The actual technical cause. Be specific. Cite file:line from the source tree. If the cause is a known finding, reference its SF-### from ../security-register.md. If it's a known weakness from the architecture doc, reference its AF-#.

Example shape:

The bet-settled queue worker deadlocked on a Prisma transaction that wrapped a remote RPC call (apps/api/src/bet/queue/bet-settled.processor.ts:42-78). The RPC stalled when the speed-roulette container restarted; the transaction held its DB connection until the Prisma timeout (60s). With concurrency=8 workers, the pool was exhausted within 480 seconds.


Contributing factors

Bulleted. Anything that made the incident worse or longer than it had to be — bad alerting, missing runbook, ambiguous signals, recent change that increased blast radius. Not synonyms for the root cause.


What went well

Bulleted. Blameless positives. Aim for 2–4 entries; not every incident will have many. Use this section to recognize good practice (fast detection, clean comms, useful runbook).


What didn't go well

Bulleted. Blameless negatives. Each entry must be a system / process observation, not a person observation.


Action items

Numbered. Each entry needs an owner and a date. Mirror these into the team tracker; cite the tracker ID here once it exists.

# Action Owner Due Tracker
1 Add an alert on pg_stat_activity.state = 'idle in transaction' count > 5 YYYY-MM-DD {{TBD: tracker link}}
2 Lift the speed-roulette restart sequence into ../runbooks/db-down.md §A YYYY-MM-DD {{TBD}}
3 Audit Prisma transactions for embedded RPC calls; file follow-up tickets per offender YYYY-MM-DD {{TBD}}

Detection signals captured

Useful for future runbook authoring — the actual signals that would have detected this earlier, not just the ones that did.

  • Grafana panel that first showed the deviation: <dashboard> > <panel>
  • Loki query that first surfaced the error: {service_name="ebit-api"} |= "<keyword>"
  • Jaeger search that found the failing trace: service=ebit-api, operation=<…>, tags=error=true
  • Trace ID(s) of representative failing requests: <traceID>

References