Runbook: Postgres under high load¶
Symptom¶
ebit-api p95 latency climbs across the board; one or more of:
prisma:client:operationspans dominate the trace waterfall — most of the request budget is in DB calls.- Postgres container CPU > 80% in the
perf-systemGrafana dashboard (Container CPU panel). - Container memory RSS climbing past
shared_buffersheadroom (Container Memory RSS panel). - New requests pile up in the Nest event loop while existing ones wait for a DB connection.
- BullMQ queues lag because their workers can't open a Prisma connection inside
removeOnComplete/removeOnFailretention.
This runbook covers the four shapes of "DB is hot": connection-pool exhaustion, slow query, lock contention, autovacuum stalled.
Likely causes¶
- Missing index on a hot column — a recently shipped feature filters or joins on an unindexed column; the query plan turns into a sequential scan.
- Unbounded
ORDER BY— admin or bet-history endpoint sorts a large table without a matching index, falling back to in-memory sort. - Lock contention — two transactions hold incompatible locks; one blocks the chain.
- Autovacuum stalled — dead tuples accumulate; the planner picks bad plans; table/index bloat grows.
- Connection-pool exhaustion — the Prisma client default pool is small; long-running transactions or stuck queries hold connections, new requests wait.
Detection¶
- Grafana —
perf-system: Container CPU panel forebit-db; Container Memory RSS panel; both should stay < 70% under steady-state. - Grafana —
prisma-postgresdashboard: top-N slowest queries; Postgres lock-wait counter; rate of long transactions. - Loki:
{service_name="ebit-api"} |= "PrismaClientKnownRequestError" or "Timed out fetching a new connection". - Jaeger: filter
ebit-apitraces byerror=trueor sort by latency descending; the failing span will beprisma:client:operationwith the model + method as attributes.
Triage¶
1. Snapshot live state with the introspection script¶
This captures the state of pg_stat_activity (state counts, wait events), non-granted locks, pg_stat_database deltas (tup_returned/sec, tup_fetched/sec, blks_hit/sec, blks_read/sec, cache hit ratio), and autovacuum activity into /tmp/pg-introspect-<ts>/. Skim summary.txt first — falling cache hit ratio + rising blks_read/sec tells you which side of "hot" you're on (CPU vs IO).
2. Connection count by state¶
docker exec ebit-db psql -U ebit -d ebit -c "
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'ebit'
GROUP BY state ORDER BY 2 DESC;"
Watch for:
- Many
idle in transactionrows — application-side bug, a transaction was opened and never closed. This eats connections faster than anything else. activecount nearmax_connections— true pool exhaustion.idlecount nearmax_connections— pool sized too aggressively, but not the immediate problem.
3. Top blocking queries¶
docker exec ebit-db psql -U ebit -d ebit -c "
SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event,
left(query, 200) AS query
FROM pg_stat_activity
WHERE state != 'idle' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY duration DESC NULLS LAST
LIMIT 10;"
The longest-running active query is your prime suspect.
4. Lock graph — who is blocking whom¶
docker exec ebit-db psql -U ebit -d ebit -c "
SELECT blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
now() - blocked.query_start AS blocked_for
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
ORDER BY blocked_for DESC;"
If non-empty, you have lock contention. Note both PIDs — you'll need them in §Fix.
5. Top queries by total time¶
docker exec ebit-db psql -U ebit -d ebit -c "
SELECT calls,
total_exec_time::int AS total_ms,
mean_exec_time::int AS mean_ms,
left(query, 200) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;"
Match the offender against an EXPLAIN ANALYZE if you need the plan.
Fix¶
A. Kill the offending query (immediate mitigation, P0/P1)¶
# Cancel (lets the query exit cleanly)
docker exec ebit-db psql -U ebit -d ebit -c "SELECT pg_cancel_backend(<pid>);"
# Terminate (force; only if cancel doesn't return in 10s)
docker exec ebit-db psql -U ebit -d ebit -c "SELECT pg_terminate_backend(<pid>);"
Watch the connection count drop. If killed query was in a transaction, downstream ebit-api requests get a Prisma error and retry on next request — no manual rollback needed.
B. Resolve lock contention¶
Identify the blocking_pid from §Triage step 4. Decide:
- If it's a long-running admin query (analytics, bulk export): kill it.
- If it's an application transaction stuck on an external call (RPC waiting for
speed-roulette): the bug is upstream — kill the connection and patch the code to use a shorter transaction boundary.
Recurring lock contention is a code smell, not a one-off — file a follow-up.
C. Add a missing index (durable fix, requires deploy)¶
Build a candidate index in a migration (ebit-api/):
For the migration file, prefer CREATE INDEX CONCURRENTLY so the build doesn't block writes:
CONCURRENTLY cannot run inside a Prisma migration transaction — split into a follow-up migration that runs raw SQL outside of a BEGIN. See ../recipes/add-prisma-model.md for the pattern.
D. Increase max_connections — usually wrong, sometimes right¶
Most "we need more connections" requests are actually "we need to release connections faster" (stuck idle in transaction, oversized Prisma pool per replica). Before raising the cap:
- Confirm the real issue isn't an
idle in transactionleak (§Triage 2). - Confirm the Prisma pool size × replica count fits the current
max_connectionswith headroom for admin tools.
If you genuinely need more, edit the Postgres config (env var or postgresql.conf) — note that each connection costs ~10 MB of RSS, so doubling the cap doubles memory pressure. For Evospin's local stack the default is fine; for production, sizing is in {{TBD: terraform/perf/README.md once production-shaped sizing is documented}}.
E. Restart pgbouncer (if applicable)¶
Local stack does not run pgbouncer; production deployments often do. If pgbouncer is in front of Postgres and connections appear pinned to it: docker compose restart pgbouncer (or the equivalent ECS / K8s rollout). Confirm via:
F. Kick autovacuum¶
If pg_stat_user_tables shows large n_dead_tup and last_autovacuum is hours old:
For chronic bloat, lower autovacuum_vacuum_scale_factor for the table — see Postgres docs.
Verification¶
After applying a fix, confirm recovery:
- Connection count drops back to baseline (compare to a healthy snapshot from before the incident).
pg_stat_statements— the offender'smean_exec_timeis < 50 ms or it has been replaced by a different query at the top.- p95 latency in Grafana —
perf-testdashboard sign-in / bet / balance Trends back below thresholds (200 ms / 200 ms / 100 ms). - Replication lag (production only):
SELECT now() - pg_last_xact_replay_timestamp()on the replica returns < 5 s. Local stack runs no replica. - Run a smoke perf test against the affected service to confirm the fix holds under load:
Prevention¶
- Index discipline: every new query that filters or sorts on a column needs an index review at code review time. Catch it in PR, not in production.
- Transaction boundaries: never wrap a remote RPC, an external HTTP call, or a
setTimeoutinside a Prisma transaction. The transaction holds a connection; the remote call dictates the duration. - Bounded
ORDER BY: every paginated endpoint must paginate by an indexed column with a hardtakelimit. The bet-history flow (../flows/dropbet-bet-history.md) shows the pattern; admin-bets (../flows/admin-bets.md) was historically the offender. - Alert on
pg_stat_activitystate = 'idle in transaction'count > N for > 1 min — that always indicates an application-side leak. - Alert on cache hit ratio < 95% sustained for > 5 min.
- Schedule
VACUUM ANALYZEweekly via cron for the largest tables (bet,transaction,admin_action_log).
Cross-references¶
../runbooks/db-down.md— when high load tips into unreachability../runbooks/bullmq-job-stuck.md— workers blocked on DB show up here too../flows/dropbet-bet-place.md— the highest-volume DB write path../observability.md— Prisma instrumentation + spanmetrics setup../performance-test-report.md— DB-side baseline numbers from the latest perf run../../tests-perf/deep-metrics/postgres-introspect.sh— the live-state capture script used in §Triage 1