Runbook: Postgres under high load¶

Symptom¶

ebit-api p95 latency climbs across the board; one or more of:

prisma:client:operation spans dominate the trace waterfall — most of the request budget is in DB calls.
Postgres container CPU > 80% in the perf-system Grafana dashboard (Container CPU panel).
Container memory RSS climbing past shared_buffers headroom (Container Memory RSS panel).
New requests pile up in the Nest event loop while existing ones wait for a DB connection.
BullMQ queues lag because their workers can't open a Prisma connection inside removeOnComplete/removeOnFail retention.

This runbook covers the four shapes of "DB is hot": connection-pool exhaustion, slow query, lock contention, autovacuum stalled.

Likely causes¶

Missing index on a hot column — a recently shipped feature filters or joins on an unindexed column; the query plan turns into a sequential scan.
Unbounded ORDER BY — admin or bet-history endpoint sorts a large table without a matching index, falling back to in-memory sort.
Lock contention — two transactions hold incompatible locks; one blocks the chain.
Autovacuum stalled — dead tuples accumulate; the planner picks bad plans; table/index bloat grows.
Connection-pool exhaustion — the Prisma client default pool is small; long-running transactions or stuck queries hold connections, new requests wait.

Detection¶

Grafana — perf-system: Container CPU panel for ebit-db; Container Memory RSS panel; both should stay < 70% under steady-state.
Grafana — prisma-postgres dashboard: top-N slowest queries; Postgres lock-wait counter; rate of long transactions.
Loki: {service_name="ebit-api"} |= "PrismaClientKnownRequestError" or "Timed out fetching a new connection".
Jaeger: filter ebit-api traces by error=true or sort by latency descending; the failing span will be prisma:client:operation with the model + method as attributes.

Triage¶

1. Snapshot live state with the introspection script¶

cd ~/ebit/tests-perf/deep-metrics
./postgres-introspect.sh 60 5    # 60s @ 5s sampling

This captures the state of pg_stat_activity (state counts, wait events), non-granted locks, pg_stat_database deltas (tup_returned/sec, tup_fetched/sec, blks_hit/sec, blks_read/sec, cache hit ratio), and autovacuum activity into /tmp/pg-introspect-<ts>/. Skim summary.txt first — falling cache hit ratio + rising blks_read/sec tells you which side of "hot" you're on (CPU vs IO).

2. Connection count by state¶

docker exec ebit-db psql -U ebit -d ebit -c "
  SELECT state, count(*)
  FROM pg_stat_activity
  WHERE datname = 'ebit'
  GROUP BY state ORDER BY 2 DESC;"

Watch for:

Many idle in transaction rows — application-side bug, a transaction was opened and never closed. This eats connections faster than anything else.
active count near max_connections — true pool exhaustion.
idle count near max_connections — pool sized too aggressively, but not the immediate problem.

3. Top blocking queries¶

docker exec ebit-db psql -U ebit -d ebit -c "
  SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event,
         left(query, 200) AS query
  FROM pg_stat_activity
  WHERE state != 'idle' AND query NOT ILIKE '%pg_stat_activity%'
  ORDER BY duration DESC NULLS LAST
  LIMIT 10;"

The longest-running active query is your prime suspect.

4. Lock graph — who is blocking whom¶

docker exec ebit-db psql -U ebit -d ebit -c "
  SELECT blocked.pid AS blocked_pid,
         blocked.query AS blocked_query,
         blocking.pid AS blocking_pid,
         blocking.query AS blocking_query,
         now() - blocked.query_start AS blocked_for
  FROM pg_stat_activity blocked
  JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
  ORDER BY blocked_for DESC;"

If non-empty, you have lock contention. Note both PIDs — you'll need them in §Fix.

5. Top queries by total time¶

docker exec ebit-db psql -U ebit -d ebit -c "
  SELECT calls,
         total_exec_time::int AS total_ms,
         mean_exec_time::int AS mean_ms,
         left(query, 200) AS query
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10;"

Match the offender against an EXPLAIN ANALYZE if you need the plan.

Fix¶

A. Kill the offending query (immediate mitigation, P0/P1)¶

# Cancel (lets the query exit cleanly)
docker exec ebit-db psql -U ebit -d ebit -c "SELECT pg_cancel_backend(<pid>);"

# Terminate (force; only if cancel doesn't return in 10s)
docker exec ebit-db psql -U ebit -d ebit -c "SELECT pg_terminate_backend(<pid>);"

Watch the connection count drop. If killed query was in a transaction, downstream ebit-api requests get a Prisma error and retry on next request — no manual rollback needed.

B. Resolve lock contention¶

Identify the blocking_pid from §Triage step 4. Decide:

If it's a long-running admin query (analytics, bulk export): kill it.
If it's an application transaction stuck on an external call (RPC waiting for speed-roulette): the bug is upstream — kill the connection and patch the code to use a shorter transaction boundary.

docker exec ebit-db psql -U ebit -d ebit -c "SELECT pg_terminate_backend(<blocking_pid>);"

Recurring lock contention is a code smell, not a one-off — file a follow-up.

C. Add a missing index (durable fix, requires deploy)¶

Build a candidate index in a migration (ebit-api/):

cd ~/ebit/ebit-api
npm run db:migrate:dev -- --name add-<table>-<column>-index

For the migration file, prefer CREATE INDEX CONCURRENTLY so the build doesn't block writes:

CREATE INDEX CONCURRENTLY IF NOT EXISTS "<table>_<column>_idx"
ON "<table>" ("<column>");

CONCURRENTLY cannot run inside a Prisma migration transaction — split into a follow-up migration that runs raw SQL outside of a BEGIN. See ../recipes/add-prisma-model.md for the pattern.

D. Increase `max_connections` — usually wrong, sometimes right¶

Most "we need more connections" requests are actually "we need to release connections faster" (stuck idle in transaction, oversized Prisma pool per replica). Before raising the cap:

Confirm the real issue isn't an idle in transaction leak (§Triage 2).
Confirm the Prisma pool size × replica count fits the current max_connections with headroom for admin tools.

If you genuinely need more, edit the Postgres config (env var or postgresql.conf) — note that each connection costs ~10 MB of RSS, so doubling the cap doubles memory pressure. For Evospin's local stack the default is fine; for production, sizing is in {{TBD: terraform/perf/README.md once production-shaped sizing is documented}}.

E. Restart pgbouncer (if applicable)¶

Local stack does not run pgbouncer; production deployments often do. If pgbouncer is in front of Postgres and connections appear pinned to it: docker compose restart pgbouncer (or the equivalent ECS / K8s rollout). Confirm via:

docker exec ebit-pgbouncer psql -h localhost -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"

F. Kick autovacuum¶

If pg_stat_user_tables shows large n_dead_tup and last_autovacuum is hours old:

docker exec ebit-db psql -U ebit -d ebit -c "VACUUM ANALYZE \"<table>\";"

For chronic bloat, lower autovacuum_vacuum_scale_factor for the table — see Postgres docs.

Verification¶

After applying a fix, confirm recovery:

Connection count drops back to baseline (compare to a healthy snapshot from before the incident).
pg_stat_statements — the offender's mean_exec_time is < 50 ms or it has been replaced by a different query at the top.
p95 latency in Grafana — perf-test dashboard sign-in / bet / balance Trends back below thresholds (200 ms / 200 ms / 100 ms).
Replication lag (production only): SELECT now() - pg_last_xact_replay_timestamp() on the replica returns < 5 s. Local stack runs no replica.
Run a smoke perf test against the affected service to confirm the fix holds under load:
```
BASE_URL=http://localhost:4000 k6 run ~/ebit/tests-perf/profiles/smoke.js
```

Prevention¶

Index discipline: every new query that filters or sorts on a column needs an index review at code review time. Catch it in PR, not in production.
Transaction boundaries: never wrap a remote RPC, an external HTTP call, or a setTimeout inside a Prisma transaction. The transaction holds a connection; the remote call dictates the duration.
Bounded ORDER BY: every paginated endpoint must paginate by an indexed column with a hard take limit. The bet-history flow (../flows/dropbet-bet-history.md) shows the pattern; admin-bets (../flows/admin-bets.md) was historically the offender.
Alert on pg_stat_activity state = 'idle in transaction' count > N for > 1 min — that always indicates an application-side leak.
Alert on cache hit ratio < 95% sustained for > 5 min.
Schedule VACUUM ANALYZE weekly via cron for the largest tables (bet, transaction, admin_action_log).

Cross-references¶

../runbooks/db-down.md — when high load tips into unreachability
../runbooks/bullmq-job-stuck.md — workers blocked on DB show up here too
../flows/dropbet-bet-place.md — the highest-volume DB write path
../observability.md — Prisma instrumentation + spanmetrics setup
../performance-test-report.md — DB-side baseline numbers from the latest perf run
../../tests-perf/deep-metrics/postgres-introspect.sh — the live-state capture script used in §Triage 1