Troubleshooting Runbooks
Quick-reference guides for common local-dev and infrastructure issues across the Evospin platform.
Index
| Runbook | Symptom |
|---|---|
| Trace missing from Jaeger | Request spans don't appear in the Jaeger UI |
| BullMQ job stuck | Queued job stays in waiting, active, or failed state indefinitely |
| Loki missing logs | Grafana Logs dashboard shows no results for a service |
| Login fails (bcrypt) | "Invalid credentials" despite correct password on seeded user |
| reCAPTCHA fails locally | Auth endpoints return captcha validation errors in local dev |
| 2FA unknown secret | Admin account requires MFA and the TOTP secret is lost |
| npm EACCES on host | npm install fails with permission denied on node_modules/ |
| Postgres under high load | DB CPU pegged; p95 climbs across the API; Prisma errors / pool exhaustion |
| Postgres unreachable | Can't reach database server; ebit-db container exited or restart-looping |
| Speed-roulette round stuck | Shared roulette round doesn't advance for > 90 s; state queue wedged |
| Redis under memory pressure | cache Redis RSS climbing toward mem_limit; OOM command not allowed; eviction rate spike |
| Captcha provider down (break-glass) | Sign-up / sign-in failing with AUTH_INVALID_CAPTCHA at scale; reCAPTCHA upstream impaired |
| ebit-rt connection saturation | WebSocket clients refused (Too many connections / Too many requests); rt CPU pegged; single-replica ceiling hit |
Structure
Every runbook follows the same template:
- Symptom — what the developer sees
- Likely cause — the most common root cause
- Diagnosis — commands to confirm the hypothesis
- Fix — step-by-step resolution (multiple options when applicable)
- Prevention — how to avoid recurrence
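One way to keep runbooks consistent with this template is a small check that reports which sections a file is missing. The sketch below is a hypothetical helper, not something the repository is known to ship; it assumes each section is written as a Markdown heading whose text starts with the section name, and the function name `missing_sections` is made up for illustration.

```python
import re

# Section names, in the order the template lists them.
REQUIRED_SECTIONS = ["Symptom", "Likely cause", "Diagnosis", "Fix", "Prevention"]

# Assumption: sections are Markdown headings such as "## Symptom".
HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    headings = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            headings.append(match.group(1).lower())
    # A section counts as present if any heading starts with its name.
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(heading.startswith(name.lower()) for heading in headings)
    ]
```

Calling `missing_sections(text)` on a runbook's contents returns an empty list when all five sections are present, so it could serve as a pre-commit or CI check if the heading assumption matches how the runbooks are actually written.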
Adding a new runbook
- Create docs/runbooks/<short-slug>.md using the template above
- Add a row to the index table in this file
- Keep the slug lowercase, hyphenated, and under 30 characters
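The slug rule can be checked mechanically before a file is created. The following is a minimal sketch, not an existing project script: the names `is_valid_slug` and `runbook_path` are hypothetical, the slug `loki-missing-logs` in the usage note is an illustrative guess rather than a confirmed filename, and allowing digits in a slug is an assumption, since the rule only says lowercase and hyphenated.

```python
import re

# Lowercase letters and digits in groups separated by single hyphens.
# Assumption: digits are allowed; the rule only says "lowercase, hyphenated".
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

# "Under 30 characters" means at most 29.
MAX_SLUG_LENGTH = 29


def is_valid_slug(slug: str) -> bool:
    """Check a runbook slug against the length and character rules."""
    return (
        len(slug) <= MAX_SLUG_LENGTH
        and SLUG_PATTERN.fullmatch(slug) is not None
    )


def runbook_path(slug: str) -> str:
    """Build the runbook file path, rejecting slugs that break the rules."""
    if not is_valid_slug(slug):
        raise ValueError(f"invalid runbook slug: {slug!r}")
    return f"docs/runbooks/{slug}.md"
```

For example, `runbook_path("loki-missing-logs")` yields `docs/runbooks/loki-missing-logs.md`, while a slug with uppercase letters, a doubled hyphen, or 30 or more characters raises `ValueError`.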