Runbook: Postgres unreachable

Symptom

ebit-api can't connect to Postgres. Visible as one or more of:

  • All write endpoints return 5xx; reads either 5xx or stale-cache hits only.
  • Loki: {service_name=~"ebit-.*"} |= "connect ECONNREFUSED" or |= "Can't reach database server".
  • Prisma error in logs: Error: P1001: Can't reach database server at `ebit-db`:`5432`.
  • Healthcheck fail: docker compose ps shows ebit-db as unhealthy, exited, or restarting; or shows it (healthy) but with the wrong IP / network from a recent compose recreate.
  • BullMQ queues stop draining; ebit-api workers retry-loop on every job.
  • Grafana: prisma-postgres dashboard goes blank; perf-system ebit-db CPU drops to zero (the container is gone) or pegs (the container is restart-looping).

This runbook lifts the day-7 outage drill (../onboarding/curriculum.md §7) into a real procedure.

Likely causes

  1. Container exited — OOM kill, manual docker compose stop, host reboot, disk-full preventing WAL writes.
  2. Container restart-looping — corrupt data volume; misconfigured postgresql.conf; permission error on the data directory after a host file-system change.
  3. Network partition — docker network was recreated (docker compose up -d --force-recreate) but ebit-api kept its old DNS-cached IP; firewall rule change in production.
  4. Disk full — Postgres exits when WAL writes fail; check container logs for No space left on device.
  5. Single-node down vs entire stack down — the container can be the only thing broken, or it can be a symptom of a bigger outage (host down, EBS volume detached). Confirm scope before recovery.

Detection

  • Healthcheck failures aggregate in the service-overview Grafana dashboard.
  • The perf-system dashboard's "Container CPU" panel for ebit-db going to zero is a strong signal of "container gone".
  • An alert on prisma:client:operation span error rate > 50% for 1 min is the highest-signal pre-built canary (configured in the production observability stack — for local, you observe directly in Jaeger).

First-response — confirm scope

Run all four checks in parallel; they take < 30 seconds combined.

1. Is the container present and running?

docker compose ps ebit-db
Output                      | Diagnosis
ebit-db running (healthy)   | DB is up; incident is elsewhere, re-triage.
ebit-db exited (1)          | Container crashed; go to §Recovery A.
ebit-db restarting          | Restart-loop; go to §Recovery B.
(no row)                    | Container was removed; go to §Recovery C.
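
The mapping above can also be expressed as a small shell helper, which is useful when the check runs from a script rather than by eye. This is a minimal sketch, not something the stack ships: triage_state and db_state are hypothetical names, and it assumes the container is literally named ebit-db (as the docker exec commands in §Verification suggest).

```shell
# Map a container state (as reported by `docker inspect`) to a Recovery branch.
# An empty state is treated as "container does not exist".
triage_state() {
  case "$1" in
    running)    echo "re-triage: DB container is running" ;;
    exited)     echo "Recovery A" ;;
    restarting) echo "Recovery B" ;;
    "")         echo "Recovery C" ;;
    *)          echo "unknown state: $1" ;;
  esac
}

# Read the live state; prints an empty line if the container is missing.
# Assumption: the container name is ebit-db.
db_state() {
  local s
  s="$(docker inspect --format '{{.State.Status}}' ebit-db 2>/dev/null)" || s=""
  printf '%s\n' "$s"
}
```

Call it as triage_state "$(db_state)". The state alone does not include health, so a running but unhealthy container still needs the docker compose ps check.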

2. Are other services healthy?

docker compose ps

If only ebit-db is unhealthy, this is a single-node failure. If many services are unhealthy, treat it as an entire-stack outage and check the host first (disk, memory, network).
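
The scope decision can be sketched as a filter over container status lines. This is an illustrative helper under stated assumptions: outage_scope is a hypothetical name, all stack containers are assumed to share the ebit- name prefix, and the input is the name and status pair that docker ps can print with a Go template.

```shell
# Read "name<TAB>status" lines on stdin and report how wide the outage is.
# Expected input shape: docker ps -a --format '{{.Names}}\t{{.Status}}'
outage_scope() {
  awk -F'\t' '
    $1 ~ /^ebit-/ {
      # Anything not "Up", or "Up" but flagged (unhealthy), counts as bad.
      if ($2 !~ /^Up/ || $2 ~ /\(unhealthy\)/) { bad++; names = names " " $1 }
    }
    END {
      if (bad == 0)      print "none-unhealthy"
      else if (bad == 1) print "single-node:" names
      else               print "multi-service:" names
    }'
}
```

Usage: docker ps -a --format '{{.Names}}\t{{.Status}}' | outage_scope. A removed container produces no line at all, so a "none-unhealthy" result does not rule out §Recovery C.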

3. Quick host-level checks

df -h | head -10                       # disk full?
free -m                                # OOM headroom
docker network ls | grep ebit          # network present?
docker network inspect ebit-net | jq -r '.[0].Containers[].Name'   # attached containers, by name

4. Recent docker logs from the DB container

docker compose logs --tail=200 ebit-db

Look for: database system was shut down, FATAL: data directory ... has invalid permissions, PANIC: could not write to file, out of memory. Any of these usually points at the root cause and tells you which Recovery branch to take.
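
As a rough aid, the log signatures named in this runbook can be matched mechanically. The sketch below assumes bash; classify_pg_log is a hypothetical helper, and the mapping of "out of memory" to §Recovery A is an assumption based on the OOM-kill cause listed under "Container exited".

```shell
# Read Postgres log text on stdin and print the first matching diagnosis.
classify_pg_log() {
  local log
  log="$(cat)"
  if   grep -q 'No space left on device' <<<"$log"; then
    echo "disk-full: Recovery B"
  elif grep -q 'invalid permissions' <<<"$log"; then
    echo "permissions: Recovery B"
  elif grep -q 'could not locate a valid checkpoint record' <<<"$log"; then
    echo "wal-corruption: escalate, Recovery D"
  elif grep -qi 'out of memory' <<<"$log"; then
    echo "oom: Recovery A"
  else
    echo "no-known-signature"
  fi
}
```

Usage: docker compose logs --tail=200 ebit-db | classify_pg_log. A "no-known-signature" result means only that none of these strings appeared; read the log by hand before choosing a branch.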

Recovery

A. Container exited cleanly — restart sequence

The boot order matters: data store first, then the apps that read from it.

docker compose start ebit-db
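docker compose logs -f --since 1m ebit-db | sed '/database system is ready to accept connections/q'

The sed quits the follow once Postgres logs that it is ready. The --since 1m window keeps a ready line from an earlier boot, which is still in the container's log history, from matching first; narrow the window if the previous boot was within the last minute. Then: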

docker compose restart ebit-api ebit-rt ebit-bj ebit-bo ebit-speed-roulette

Why restart the apps: Prisma's connection pool caches DNS and connection state; on DB restart the existing connections are dead but Prisma may not re-resolve until the next failure + retry. Restarting clears the pool cleanly.

Watch each app reach (healthy):

docker compose ps | grep ebit-

Typical recovery time: 5–10 seconds for the DB itself, 30–60 seconds for all apps to settle.
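
If you would rather gate the app restart on an actual readiness probe than on a log line, a bounded wait loop is one option. This is a sketch only: wait_until is a hypothetical helper, and the probe shown in the usage note assumes pg_isready is available inside the ebit-db container and that the ebit user and database from §Verification apply.

```shell
# Retry a command once per second until it succeeds or the timeout elapses.
# Usage: wait_until <timeout-seconds> <command> [args...]
wait_until() {
  local timeout="$1"; shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # gave up: command never succeeded within the timeout
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Usage: wait_until 60 docker exec ebit-db pg_isready -U ebit -d ebit && docker compose restart ebit-api ebit-rt ebit-bj ebit-bo ebit-speed-roulette. The 60-second budget is an arbitrary choice that comfortably covers the 5–10 second DB recovery time quoted above.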

B. Container restart-looping — diagnose before flapping more

Stop the loop so you can read the logs without the container restarting underneath you:

docker compose stop ebit-db
docker compose logs --tail=500 ebit-db > /tmp/ebit-db-crash.log

Common causes + fixes:

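  • "could not write to file ... No space left on device": the host disk is full. Free space, or grow the volume. Be careful with docker system prune -af --volumes: it removes all stopped containers (including the ebit-db you just stopped) and unused volumes, which can take the Postgres data volume with it. Run it only after confirming the data volume is safe. Then restart per §A (or §C if the container was removed).
  • "data directory has invalid permissions": host UID/GID mismatch on the Postgres data volume. Fix with sudo chown -R 999:999 <volume mount> (the Debian-based official Postgres image runs as UID 999; Alpine variants use UID 70, so check which image you run). Then restart per §A.
  • "PANIC: could not locate a valid checkpoint record": WAL corruption. This is data-loss territory; stop here and escalate to Tier 2 / Tier 3 before doing anything destructive. Point-in-time recovery from backup is the path forward; see §Recovery D.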

C. Container removed — recreate from compose

docker compose up -d ebit-db
docker compose logs -f ebit-db | sed '/database system is ready to accept connections/q'

The ebit-prisma-migrate container re-runs migrations automatically the next time you run docker compose up. Then restart the apps as in §A.

D. Data volume corrupted — point-in-time recovery (production only)

Local stack: re-seed from scratch with npm run db:reset from ebit-api/. Destroys all data. Acceptable for dev only.

Production: {{TBD: production point-in-time recovery procedure — depends on customer-team's backup/restore tooling. Typical shape: identify last good WAL position from backup metadata; restore base backup to a new volume; replay WAL up to last good timestamp; cut over with DNS / Prisma URL flip}}. The current ebit-api production stack does not ship with a preconfigured PITR procedure — this is a documented gap on the engineering team, see ../handover/oncall-runbook.md §3.

Verification

After recovery, confirm full path:

1. DB-side smoke

docker exec ebit-db psql -U ebit -d ebit -c "SELECT now(), version();"
docker exec ebit-db psql -U ebit -d ebit -c "SELECT count(*) FROM \"User\";"

Both must return without error.

2. App-side smoke from each container

for svc in ebit-api ebit-rt ebit-speed-roulette; do
  echo "--- $svc ---"
  docker exec $svc node -e "
    const { PrismaClient } = require('@prisma/client');
    const p = new PrismaClient();
    p.\$queryRaw\`SELECT 1\`.then(() => console.log('$svc: OK')).catch(e => console.error('$svc: FAIL', e.message)).finally(() => p.\$disconnect());
  "
done

(ebit-bj and ebit-bo use slightly different bootstrapping; for those just docker compose logs --tail 20 ebit-bj and look for the post-startup banner.)

3. End-to-end smoke

curl -sf http://localhost:4000/swagger > /dev/null && echo "api OK"
curl -sf http://localhost:3000          > /dev/null && echo "fe OK"

Then place a dice bet via the UI (../onboarding/day-one.md §7) and confirm:

  • Bet succeeds.
  • Trace appears in Jaeger for POST /casino/games/house/dice/bet.
  • bet_settled_queue BullMQ job processes (Grafana bullmq dashboard).

If all three pass, recovery is complete.

4. Check for stranded BullMQ jobs

While the DB was down, queue workers retried and may have built up a backlog. Drain check:

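docker exec ebit-redis redis-cli -a cache LLEN bull:bet_settled_queue:wait
docker exec ebit-redis redis-cli -a cache ZCARD bull:bet_settled_queue:failed

BullMQ keeps the wait state as a Redis list and the failed state as a sorted set, hence LLEN for one and ZCARD for the other (LLEN against the failed key returns a WRONGTYPE error). If failed > 0, follow bullmq-job-stuck.md §Fix to retry them.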
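
Because BullMQ stores job states in different Redis types, a drain check has to pick the counting command per state. The helper below is a sketch: bull_count_cmd and queue_depth are hypothetical names, the state-to-type mapping reflects my understanding of BullMQ's key layout (verify against your BullMQ version), and the bull: key prefix and the cache password are taken from the commands above.

```shell
# Print the Redis command that counts entries for a given BullMQ job state.
bull_count_cmd() {
  case "$1" in
    wait|active)              echo "LLEN" ;;   # stored as lists
    failed|completed|delayed) echo "ZCARD" ;;  # stored as sorted sets
    *) return 1 ;;                             # unknown state
  esac
}

# Count jobs in one state of one queue, e.g. queue_depth bet_settled_queue failed
queue_depth() {
  local queue="$1" state="$2" cmd
  cmd="$(bull_count_cmd "$state")" || return 1
  docker exec ebit-redis redis-cli -a cache "$cmd" "bull:${queue}:${state}"
}
```

Usage: queue_depth bet_settled_queue wait, then queue_depth bet_settled_queue failed.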

Prevention

  • Connection-retry pattern: every Prisma client must have explicit retry + backoff on P1001 so transient DB blips don't cascade to user-facing errors. Currently set in ebit-api/libs/_prisma/src/client.ts:{{TBD: line — confirm config}}. The client retries with exponential backoff up to 30s.
  • Healthcheck start_period: ebit-db healthcheck has start_period: 30s to allow boot. If you see flapping immediately after compose up, raise start_period rather than tweaking the healthcheck logic.
  • Replication setup (production): {{TBD: streaming replication config — currently single-node Postgres in compose. Production target is primary + 1 hot standby with automatic failover. Engineering team to specify}}.
  • Disk monitoring: alert on Postgres data volume > 80% used. Alert on WAL volume > 70% used (writes continue, replay stalls).
  • Runbook awareness: every new on-call engineer runs the §Recovery drill once during onboarding (../onboarding/curriculum.md §7).
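
Until a proper alert exists, the disk-monitoring thresholds above can be approximated with a cron-able check. This is a minimal sketch: df_used_pct and over_threshold are hypothetical helpers, and the volume path in the usage note is a placeholder you would replace with the real mount point of the Postgres data volume.

```shell
# Extract the used percentage (as an integer) from `df -P <path>` output on stdin.
df_used_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is strictly above the limit.
# Usage: over_threshold <used-percent> <limit-percent>
over_threshold() {
  local used="$1" limit="$2"
  [ "$used" -gt "$limit" ]
}
```

Usage: used="$(df -P /path/to/pg-data | df_used_pct)"; over_threshold "$used" 80 && echo "ALERT: Postgres data volume at ${used}%". Use 70 as the limit for the WAL volume.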

Cross-references