Incident YYYY-MM-DD — <…>

Clone this file to <YYYY-MM-DD>-<slug>.md and fill in every field. Delete this top-of-file note before merging. The slug names the failure mode (postgres-pool-exhaustion), not the symptom (site-was-slow). See README.md for conventions.

| Field | Value |
| --- | --- |
| Severity | P0 / P1 / P2 |
| Started | YYYY-MM-DDTHH:MM:SSZ (first symptom observed) |
| Resolved | YYYY-MM-DDTHH:MM:SSZ (full recovery confirmed) |
| Duration | |
| IC | |
| Detected by | alert / customer report / engineer (which?) |
| Affected services | ebit-api / ebit-rt / ebit-fe / ebit-admin-fe / ebit-bj / ebit-bo / ebit-speed-roulette |
| User impact | one-line player-visible summary |
| Related | links to prior RCAs with the same root cause, if any |
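
The Duration field follows directly from Started and Resolved. A minimal sketch of that arithmetic, assuming both values use the YYYY-MM-DDTHH:MM:SSZ form shown in the table; `incidentDuration` is a hypothetical helper name, not part of any existing tooling:

```typescript
// Hypothetical helper: derive the Duration field from Started and Resolved.
// Both inputs are ISO 8601 UTC timestamps (YYYY-MM-DDTHH:MM:SSZ).
function incidentDuration(started: string, resolved: string): string {
  const startMs = Date.parse(started);
  const endMs = Date.parse(resolved);
  if (Number.isNaN(startMs) || Number.isNaN(endMs)) {
    throw new Error("timestamps must be valid ISO 8601 strings");
  }
  if (endMs < startMs) {
    throw new Error("Resolved must not be earlier than Started");
  }
  // Work in whole seconds, then split into hours, minutes, seconds.
  const totalSeconds = Math.floor((endMs - startMs) / 1000);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return `${hours}h ${minutes}m ${seconds}s`;
}
```

For example, `incidentDuration("2024-01-01T10:00:00Z", "2024-01-01T11:30:15Z")` returns `"1h 30m 15s"`. The output format is an assumption; use whatever the README prescribes.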

Summary

Two or three sentences. What broke, what users saw, how it was fixed. Plain language; the rest of the document provides the detail.

Timeline

All timestamps in UTC. Keep entries fact-only — no judgment, no people-naming. One line per material event.

| Time (UTC) | Event |
| --- | --- |
| HH:MM | Symptom first observed: <…> |
| HH:MM | Pager fired |
| HH:MM | IC ack; opened `ebit-perf-test` and `perf-system` Grafana dashboards |
| HH:MM | Hypothesis: <…> based on <…> |
| HH:MM | Hypothesis confirmed via <…> |
| HH:MM | Mitigation: <…> applied |
| HH:MM | Full recovery confirmed: <…> passed |
| HH:MM | Status page → resolved |

Root cause

The actual technical cause. Be specific. Cite file:line from the source tree. If the cause is a known finding, reference its SF-### from ../security-register.md. If it's a known weakness from the architecture doc, reference its AF-#.

Example shape:

The bet-settled queue worker deadlocked on a Prisma transaction that wrapped a remote RPC call (apps/api/src/bet/queue/bet-settled.processor.ts:42-78). The RPC stalled when the speed-roulette container restarted; the transaction held its DB connection until the Prisma timeout (60s). With concurrency=8 workers, the pool was exhausted within 480 seconds.
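
When the root cause has this shape, a short sketch of the corrected structure helps reviewers see the fix. The following is a minimal illustration, not the actual processor: `Db`, `Tx`, `FetchResult`, `settleBet`, and the 2000 ms budget are all hypothetical stand-ins. The idea is to finish the remote call, bounded by a timeout, before the transaction opens, so no DB connection is held while waiting on the network.

```typescript
// Hypothetical interfaces; the real processor and Prisma client will differ.
interface Tx {
  markSettled(betId: string, payout: number): Promise<void>;
}
interface Db {
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}
type FetchResult = (betId: string) => Promise<number>;

// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Call the remote service first, then open a short transaction for the write only.
async function settleBet(db: Db, fetchResult: FetchResult, betId: string): Promise<void> {
  const payout = await withTimeout(fetchResult(betId), 2000); // assumed 2 s budget
  await db.transaction((tx) => tx.markSettled(betId, payout));
}
```

If the remote call stalls, `settleBet` fails after the timeout without ever taking a connection from the pool, which bounds the blast radius to the one job.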

Contributing factors

Bulleted. Anything that made the incident worse or longer than it had to be — bad alerting, missing runbook, ambiguous signals, recent change that increased blast radius. Not synonyms for the root cause.

…
…

What went well

Bulleted. Blameless positives. Aim for 2–4 entries; not every incident will have many. Use this section to recognize good practice (fast detection, clean comms, useful runbook).

…
…

What didn't go well

Bulleted. Blameless negatives. Each entry must be a system / process observation, not a person observation.

…
…

Action items

Numbered. Each entry needs an owner and a date. Mirror these into the team tracker; cite the tracker ID here once it exists.

| # | Action | Owner | Due | Tracker |
| --- | --- | --- | --- | --- |
| 1 | Add an alert on `pg_stat_activity.state = 'idle in transaction'` count > 5 | {{TBD: owner}} | YYYY-MM-DD | {{TBD: tracker link}} |
| 2 | Lift the speed-roulette restart sequence into `../runbooks/db-down.md` §A | {{TBD: owner}} | YYYY-MM-DD | {{TBD}} |
| 3 | Audit Prisma transactions for embedded RPC calls; file follow-up tickets per offender | {{TBD: owner}} | YYYY-MM-DD | {{TBD}} |
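
An action item such as the alert in row 1 is easier to review when the rule is written out. A minimal sketch, assuming the check runs a count against PostgreSQL's `pg_stat_activity` view and compares it with the threshold from the table; the constant and function names are hypothetical, and the real alert would likely live in the monitoring stack rather than in application code:

```typescript
// SQL the check is assumed to run; pg_stat_activity and its state column are standard PostgreSQL.
const IDLE_IN_TX_SQL =
  "SELECT count(*)::int AS n FROM pg_stat_activity WHERE state = 'idle in transaction'";

// Threshold taken from action item 1 (alert when the count exceeds 5).
const IDLE_IN_TX_THRESHOLD = 5;

// Pure decision function, so the rule can be unit-tested without a database.
function shouldAlertIdleInTx(
  count: number,
  threshold: number = IDLE_IN_TX_THRESHOLD,
): boolean {
  return count > threshold;
}
```

Keeping the comparison in a pure function separates the rule from how the count is collected.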

Detection signals captured

Useful for future runbook authoring — the actual signals that would have detected this earlier, not just the ones that did.

Grafana panel that first showed the deviation: <dashboard> > <panel>
Loki query that first surfaced the error: {service_name="ebit-api"} |= "<keyword>"
Jaeger search that found the failing trace: service=ebit-api, operation=<…>, tags=error=true
Trace ID(s) of representative failing requests: <traceID>

References

../handover/oncall-runbook.md — first-response procedure used during the incident
../runbooks/<runbook-used-during-fix>.md — runbook(s) consulted during recovery
../flows/<relevant-flow>.md — flow doc(s) that explain the affected subsystem
../security-register.md — if a known finding contributed
PR(s) shipped as mitigation:
Alert / page that fired (or should have):