Runbook: Postgres unreachable
Symptom
ebit-api can't connect to Postgres. Visible as one or more of:
- All write endpoints return 5xx; reads either 5xx or stale-cache hits only.
- Loki: `{service_name=~"ebit-.*"} |= "connect ECONNREFUSED"` or `|= "Can't reach database server"`.
- Prisma error in logs: ``Error: P1001: Can't reach database server at `ebit-db:5432` ``.
- Healthcheck fail: `docker compose ps` shows `ebit-db` as `unhealthy`, `exited`, or `restarting`; or shows it `(healthy)` but with the wrong IP / network from a recent compose recreate.
- BullMQ queues stop draining; `ebit-api` workers retry-loop on every job.
- Grafana: `prisma-postgres` dashboard goes blank; `perf-system` `ebit-db` CPU drops to zero (the container is gone) or pegs (the container is restart-looping).

This runbook lifts the day-7 outage drill (../onboarding/curriculum.md §7) into a real procedure.
Likely causes
- Container exited: OOM kill, manual `docker compose stop`, host reboot, disk-full preventing WAL writes.
- Container restart-looping: corrupt data volume; misconfigured `postgresql.conf`; permission error on the data directory after a host file-system change.
- Network partition: docker network was recreated (`docker compose up -d --force-recreate`) but `ebit-api` kept its old DNS-cached IP; firewall rule change in production.
- Disk full: Postgres exits when WAL writes fail; check container logs for `No space left on device`.
- Single-node down vs entire stack down: the container can be the only thing broken, or it can be a symptom of a bigger outage (host down, EBS volume detached). Confirm scope before recovery.
Detection
- Healthcheck failures aggregate in the `service-overview` Grafana dashboard.
- The `perf-system` dashboard's "Container CPU" panel for `ebit-db` going to zero is a strong signal of "container gone".
- An alert on `prisma:client:operation` span error rate > 50% for 1 min is the highest-signal pre-built canary (configured in the production observability stack; for local, you observe directly in Jaeger).
First-response: confirm scope
Run all four checks in parallel; they take < 30 seconds combined.
1. Is the container present and running?

| Output | Diagnosis |
|---|---|
| `ebit-db running (healthy)` | DB is up; the incident is elsewhere, so re-triage. |
| `ebit-db exited (1)` | Container crashed; go to §Recovery A. |
| `ebit-db restarting` | Restart-loop; go to §Recovery B. |
| (no row) | Container was removed; go to §Recovery C. |
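
The table can be applied mechanically. The following is a minimal sketch, not part of the stack: `triage_db_state` is a hypothetical helper name, and it assumes the state string is the `.State.Status` value from `docker inspect`, which is taken to be empty when the container does not exist.

```shell
# Hypothetical helper: map the ebit-db container state to a runbook branch.
# $1 is the container state ("running", "exited", "restarting", ...),
# assumed to be empty when the container does not exist.
triage_db_state() {
  case "$1" in
    running*)    echo "Container is running: confirm (healthy), then re-triage elsewhere" ;;
    exited*)     echo "Recovery A" ;;
    restarting*) echo "Recovery B" ;;
    "")          echo "Recovery C" ;;
    *)           echo "Unrecognized state '$1': read the container logs (check 4)" ;;
  esac
}
```

One possible invocation is `triage_db_state "$(docker inspect --format '{{.State.Status}}' ebit-db 2>/dev/null)"`. The state alone does not show health, so a `running` result still needs the `(healthy)` marker confirmed in `docker compose ps`.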
2. Are other services healthy?
If only ebit-db is sad: single-node failure. If many services are sad: entire-stack outage; check host first (disk, memory, network).
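
The single-node versus whole-stack decision can be sketched as a small filter over container status lines. `classify_scope` is a hypothetical helper; it assumes input lines of the form `<name> <status>` as produced by `docker ps --format '{{.Names}} {{.Status}}'`, and it treats a container as sad when it is not `Up` or is marked `(unhealthy)`.

```shell
# Hypothetical helper: read "<name> <status...>" lines on stdin (one per
# container) and report whether the failure is single-node or stack-wide.
classify_scope() {
  awk '
    {
      status = substr($0, length($1) + 2)   # everything after the name
      # "sad" = not running, or running but failing its healthcheck
      if (status !~ /^Up/ || status ~ /\(unhealthy\)/) {
        bad = bad " " $1
        n++
      }
    }
    END {
      if (n == 0)      print "no unhealthy containers"
      else if (n == 1) print "single-node failure:" bad
      else             print "multi-service outage, check host first:" bad
    }'
}
```

A possible invocation, assuming all stack containers share the `ebit-` name prefix: `docker ps -a --filter name=ebit- --format '{{.Names}} {{.Status}}' | classify_scope`.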
3. Quick host-level checks
```shell
df -h | head -10                 # disk full?
free -m                          # OOM headroom
docker network ls | grep ebit    # network present?
docker network inspect ebit-net | jq '.[0].Containers | keys'   # IDs of attached containers
```
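
To make the disk check less dependent on eyeballing `df` output, a threshold filter can flag full filesystems. This is an illustrative sketch: `full_filesystems` is a hypothetical helper, and the default of 80% simply mirrors the data-volume alert threshold under Prevention.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every
# filesystem at or above the given use percentage (default 80).
# Mount points containing spaces are not handled.
full_filesystems() {
  awk -v limit="${1:-80}" '
    NR > 1 {                       # skip the header row
      pct = $5
      sub(/%/, "", pct)            # "95%" -> "95"
      if (pct + 0 >= limit + 0) print $6 " is " $5 " full"
    }'
}
```

Usage: `df -P | full_filesystems 80`. No output means no filesystem is at or above the limit.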
4. Recent docker logs from the DB container
Look for: `database system was shut down`, `FATAL: data directory ... has invalid permissions`, `PANIC: could not write to file`, `out of memory`. Any of those points to the root cause and tells you which Recovery branch to take.
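
The four signatures above can be pulled out of the log stream with a filter. This is a sketch under assumptions: `scan_db_logs` is a hypothetical helper, and the 200-line tail in the usage note is an arbitrary choice.

```shell
# Hypothetical helper: read container logs on stdin and print only the lines
# that match one of the failure signatures listed in this check.
# Exits non-zero when nothing matches (grep's normal behaviour).
scan_db_logs() {
  grep -E \
    -e 'database system was shut down' \
    -e 'data directory .* has invalid permissions' \
    -e 'could not write to file' \
    -e 'out of memory'
}
```

Usage: `docker compose logs --tail 200 ebit-db | scan_db_logs`.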
Recovery
A. Container exited cleanly: restart sequence
The boot order matters: data store first, then the apps that read from it.
```shell
docker compose start ebit-db
docker compose logs -f ebit-db | sed '/database system is ready to accept connections/q'
```
The `sed` ends the log follow once Postgres reports that it is ready. Then restart the apps that depend on it.
Why restart the apps: Prisma's connection pool caches DNS and connection state; on DB restart the existing connections are dead but Prisma may not re-resolve until the next failure + retry. Restarting clears the pool cleanly.
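
A minimal sketch of the app restart follows. The service list is an assumption taken from the containers named under Verification; `APP_SERVICES` and `restart_apps` are hypothetical names, so adjust them to match the compose file.

```shell
# Assumed app service names (from the Verification section); adjust as needed.
APP_SERVICES="ebit-api ebit-rt ebit-speed-roulette ebit-bj ebit-bo"

# Restart every app that holds a Prisma connection pool against ebit-db.
restart_apps() {
  # Unquoted on purpose: the list must split into one argument per service.
  docker compose restart $APP_SERVICES
}
```

Call `restart_apps` only after the DB has logged that it is ready, so the apps do not immediately fail their first connection attempt.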
Watch each app until it reports (healthy).
Typical recovery time: 5–10 seconds for the DB itself, 30–60 seconds for all apps to settle.
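
Waiting for (healthy) can be scripted instead of watched by hand. The sketch below assumes every service defines a compose healthcheck, since without one `.State.Health` is absent and the loop simply times out. `wait_healthy` is a hypothetical helper, and the 60-second default follows the upper end of the settle time quoted above.

```shell
# Hypothetical helper: poll a container's health status until it is "healthy".
# $1 = container name, $2 = timeout in seconds (default 60).
wait_healthy() {
  svc="$1"
  timeout="${2:-60}"
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Empty when the container is missing or has no healthcheck.
    status="$(docker inspect --format '{{.State.Health.Status}}' "$svc" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      echo "$svc: healthy after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$svc: not healthy after ${timeout}s" >&2
  return 1
}
```

Usage: `for svc in ebit-api ebit-rt ebit-speed-roulette; do wait_healthy "$svc" 60; done`.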
B. Container restart-looping: diagnose before flapping more
Stop the loop so you can read the logs without the restart timer interrupting them.
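
A minimal sketch of that step, assuming the compose service is named `ebit-db` as elsewhere in this runbook. `stop_and_read_db_logs` is a hypothetical helper and the 100-line default is arbitrary.

```shell
# Hypothetical helper: stop the restart loop, then print recent DB log lines.
# $1 = number of log lines to show (default 100).
stop_and_read_db_logs() {
  docker compose stop ebit-db
  docker compose logs --tail "${1:-100}" ebit-db
}
```

Logs of a stopped container remain readable, so there is no need to race the restart timer.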
Common causes + fixes:
- `could not write to file ... No space left on device`: host disk full. Free space (`docker system prune -af --volumes` if safe, otherwise grow the volume). Caution: `docker system prune` deletes all stopped containers, including `ebit-db` if you have just stopped it, and `--volumes` also deletes volumes that no container references. Verify that the Postgres data volume will survive before running it. Then restart per §A.
- `data directory has invalid permissions`: host UID/GID mismatch on the Postgres data volume. Fix with `sudo chown -R 999:999 <volume mount>` (the Debian-based official Postgres image runs as UID 999; the Alpine variant uses UID 70). Then restart per §A.
- `PANIC: could not locate a valid checkpoint record`: WAL corruption. This is data-loss territory; stop here and escalate to Tier 2 / Tier 3 before doing anything destructive. Point-in-time recovery from backup is the path forward; see §Recovery D.
C. Container removed: recreate from compose
```shell
docker compose up -d ebit-db
docker compose logs -f ebit-db | sed '/database system is ready to accept connections/q'
```
The `ebit-prisma-migrate` container re-runs migrations automatically the next time you run `docker compose up`. Then restart the apps as in §A.
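
One caveat with the log-follow wait: `docker compose logs -f` first replays existing output, so a ready line from an earlier boot can end the `sed` early. An alternative sketch polls `pg_isready` inside the container. It assumes the `ebit` user and database used under Verification; `wait_db_ready` is a hypothetical helper and the 60-second default is arbitrary.

```shell
# Hypothetical helper: poll pg_isready inside the container until Postgres
# accepts connections. $1 = timeout in seconds (default 60).
wait_db_ready() {
  timeout="${1:-60}"
  waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -q suppresses output; the exit status is 0 once connections are accepted.
    if docker exec ebit-db pg_isready -q -U ebit -d ebit 2>/dev/null; then
      echo "ebit-db ready after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "ebit-db not ready after ${timeout}s" >&2
  return 1
}
```

Usage: `docker compose up -d ebit-db && wait_db_ready 60`.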
D. Data volume corrupted: point-in-time recovery (production only)
Local stack: re-seed from scratch with npm run db:reset from ebit-api/. Destroys all data. Acceptable for dev only.
Production: {{TBD: production point-in-time recovery procedure — depends on customer-team's backup/restore tooling. Typical shape: identify last good WAL position from backup metadata; restore base backup to a new volume; replay WAL up to last good timestamp; cut over with DNS / Prisma URL flip}}. The current ebit-api production stack does not ship with a preconfigured PITR procedure — this is a documented gap on the engineering team, see ../handover/oncall-runbook.md §3.
Verification
After recovery, confirm the full path:
1. DB-side smoke
```shell
docker exec ebit-db psql -U ebit -d ebit -c "SELECT now(), version();"
docker exec ebit-db psql -U ebit -d ebit -c "SELECT count(*) FROM \"User\";"
```
Both must return without error.
2. App-side smoke from each container
```shell
for svc in ebit-api ebit-rt ebit-speed-roulette; do
  echo "--- $svc ---"
  # Run a trivial query through each app's own Prisma client.
  docker exec "$svc" node -e "
    const { PrismaClient } = require('@prisma/client');
    const p = new PrismaClient();
    p.\$queryRaw\`SELECT 1\`
      .then(() => console.log('$svc: OK'))
      .catch(e => console.error('$svc: FAIL', e.message))
      .finally(() => p.\$disconnect());
  "
done
```
(`ebit-bj` and `ebit-bo` use slightly different bootstrapping; for those, run `docker compose logs --tail 20 ebit-bj` (and likewise for `ebit-bo`) and look for the post-startup banner.)
3. End-to-end smoke
```shell
curl -sf http://localhost:4000/swagger > /dev/null && echo "api OK"
curl -sf http://localhost:3000 > /dev/null && echo "fe OK"
```
Then place a dice bet via the UI (`../onboarding/day-one.md` §7) and confirm:
- Bet succeeds.
- Trace appears in Jaeger for `POST /casino/games/house/dice/bet`.
- `bet_settled_queue` BullMQ job processes (Grafana `bullmq` dashboard).

If all three pass, recovery is complete.
4. Check for stranded BullMQ jobs
While the DB was down, queue workers retried and may have built up a backlog. Drain check:
```shell
# "wait" is a Redis list (LLEN); BullMQ keeps "failed" as a sorted set, so it needs ZCARD.
docker exec ebit-redis redis-cli -a cache LLEN bull:bet_settled_queue:wait
docker exec ebit-redis redis-cli -a cache ZCARD bull:bet_settled_queue:failed
```
If the failed count is > 0, follow `bullmq-job-stuck.md` §Fix to retry them.
Prevention
- Connection-retry pattern: every Prisma client must have explicit retry + backoff on `P1001` so transient DB blips don't cascade to user-facing errors. Currently set in `ebit-api/libs/_prisma/src/client.ts:{{TBD: line, confirm config}}`. The client retries 5× with exponential backoff up to 30s.
- Healthcheck `start_period`: `ebit-db` healthcheck has `start_period: 30s` to allow boot. If you see flapping immediately after `compose up`, raise `start_period` rather than tweaking the healthcheck logic.
- Replication setup (production): `{{TBD: streaming replication config; currently single-node Postgres in compose. Production target is primary + 1 hot standby with automatic failover. Engineering team to specify}}`.
- Disk monitoring: alert on Postgres data volume > 80% used. Alert on WAL volume > 70% used (writes continue, replay stalls).
- Runbook awareness: every new on-call engineer runs the §Recovery drill once during onboarding (`../onboarding/curriculum.md` §7).
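
The retry schedule in the first bullet can be illustrated in shell. This is not the code in `client.ts`, only a sketch of the same pattern: it reads "5×" as five attempts in total, assumes a 1-second base delay, and uses a hypothetical `SLEEP` override so the delays can be skipped when testing.

```shell
# Illustrative only: run a command up to 5 times, doubling the delay after
# each failure and capping it at 30 seconds. The 1-second base delay and the
# SLEEP override are assumptions made for this sketch.
retry_with_backoff() {
  attempt=1
  delay=1
  max_attempts=5
  max_delay=30
  while :; do
    "$@" && return 0                                  # success: stop retrying
    [ "$attempt" -ge "$max_attempts" ] && return 1    # out of attempts
    ${SLEEP:-sleep} "$delay"
    delay=$((delay * 2))
    [ "$delay" -gt "$max_delay" ] && delay=$max_delay
    attempt=$((attempt + 1))
  done
}
```

Usage: `retry_with_backoff docker exec ebit-db pg_isready -q -U ebit -d ebit`. With these assumed values the waits are 1, 2, 4, and 8 seconds, so the 30-second cap only matters if the attempt count or base delay is raised.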
Cross-references
- `db-high-load.md`: when load tips into unreachability before the container crashes
- `bullmq-job-stuck.md`: what to do with the queue backlog after recovery
- `../handover/oncall-runbook.md`: first-response procedure that brought you here
- `../onboarding/curriculum.md` §7: the drill that exercises this runbook in dev
- `../architecture/service-map.md`: which apps depend on `ebit-db`