On-call Runbook
The single operational procedure for incident response on Evospin / dropbet — symptom classification, first-five-minute checklist, common-pattern triage, comms templates, post-incident. Read it once during onboarding week 1 (../onboarding/curriculum.md §"Operational toolkit"). Keep it open during every shift. Linked from the on-call channel topic.
Audience: anyone holding the pager, whether Tier 1 first responder, Tier 2 senior, or Tier 3 ebit-team escalation. The procedure is the same; the time budgets and authority differ (see support-model.md).
Severity rubric: this document is the canonical source for the P0/P1/P2/P3 severity scheme used by all operational on-call work. The customer-shareable security incident policy at ../security/security-incident-policy.md uses a separate Critical/High/Medium/Low scheme for external disclosure SLAs; the two schemes are deliberately different and not interchangeable.
1. Incident classification
Severity is set by the symptom, not by the cause. Pick the highest matching row.
| Severity | Symptom (any one is sufficient) | Page on-call? | Resolution target |
|---|---|---|---|
| P0 | Site fully down for all users; payments broken; data loss in progress; security breach in progress; player wallets show wrong balances at scale | Page immediately + status page red | < 1 hour to mitigation |
| P1 | Significant degraded service: error rate > 5% for > 5 min; sign-in unavailable; bet-place unavailable for > 1 game; admin panel cannot ban a fraud account | Page immediately + status page yellow | < 4 hours |
| P2 | Partial degradation: a single non-critical endpoint failing; one queue lagging without business impact; observability gap (Loki / Jaeger blind); known finding actively exploited at low volume | Notify on-call channel; ack within 15 min | < 1 business day |
| P3 | Cosmetic, transient, or single-user: typo in UI; one-off failed bet retried successfully; question or anomaly worth investigating | Ticket in backlog | Next backlog grooming |
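The "pick the highest matching row" rule can be sketched as a small shell helper. This is a minimal illustration, not part of the procedure; the function name `highest_severity` is hypothetical, and it assumes you have already decided which rows match.

```shell
# highest_severity: print the most severe label among its arguments.
# Labels follow the P0..P3 scheme from the table; P0 is the most severe.
highest_severity() {
  best=""
  for sev in "$@"; do
    case "$sev" in
      P[0-3]) ;;                       # accept only known labels
      *) echo "unknown severity: $sev" >&2; return 1 ;;
    esac
    # A lower digit means higher severity, so keep the smallest digit seen.
    if [ -z "$best" ] || [ "${sev#P}" -lt "${best#P}" ]; then
      best="$sev"
    fi
  done
  [ -n "$best" ] && echo "$best"
}

# Example: one queue lagging (P2) while wallets show wrong balances (P0).
# highest_severity P2 P0 P3   # prints P0
```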
Symptom → severity reference (the most common ones):
| Symptom | Severity |
|---|---|
| Player can't sign in (any account) | P1 |
| Site returns 5xx on > 5% of requests | P1 |
| Player balance is wrong (one user) | P2 |
| Player balance is wrong (many users) | P0 |
| Bet placed but settlement didn't fire | P2 (single) / P1 (rate ≥ 1/min) |
| Leaderboard not updating | P3 |
| Admin can't ban a user | P2 |
| Admin can't ban a user mid-fraud-event | P1 |
| Speed-roulette round stuck (no advance for > 90s) | P1 |
| Blackjack hand abandoned (player browser closed mid-hand) | P3 (per-player) — known design (fund lockup, see ../flows/dropbet-blackjack.md) |
| Loki has no logs for ebit-api | P2 (observability blind, not user-facing) |
| Jaeger has no traces for ebit-api | P2 |
| WS connections dropping at scale | P1 |
| OTel trace propagation gap to bj/speed-roulette | Not an incident — known limitation (see §3 below) |
When in doubt, pick the higher severity. Downgrading mid-incident is fine; under-classifying wastes minutes you don't have.
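The two numeric thresholds in the tables (error rate above 5% for more than 5 minutes, and missed settlements at 1 per minute or more) can be checked with integer arithmetic. The sketch below is illustrative only; both function names are hypothetical, and treating every sub-1/min settlement miss as P2 is an assumption that goes beyond the "single" case in the table.

```shell
# error_rate_severity ERRORS TOTAL MINUTES
# Prints P1 when more than 5% of requests failed for more than 5 minutes,
# otherwise prints "below-P1" (classify by the other rows in that case).
error_rate_severity() {
  errors=$1 total=$2 minutes=$3
  # Integer form of errors/total > 5%, which avoids floating point in sh.
  if [ "$total" -gt 0 ] && [ $((errors * 100)) -gt $((total * 5)) ] && [ "$minutes" -gt 5 ]; then
    echo P1
  else
    echo below-P1
  fi
}

# settlement_miss_severity MISSES MINUTES
# P1 at one or more missed settlements per minute; anything slower is
# treated as P2 here (assumption, see the note above the block).
settlement_miss_severity() {
  misses=$1 minutes=$2
  if [ "$minutes" -gt 0 ] && [ "$misses" -ge "$minutes" ]; then
    echo P1
  else
    echo P2
  fi
}
```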
2. First-response checklist: first 5 minutes
Run these in order. Don't skip ahead. Every step is annotated with a time budget.
Step 1: Acknowledge in the incident channel (≤ 30 s)
Post in #oncall (or your customer-team equivalent — see support-model.md):
This claims the IC role. Until someone else explicitly takes IC, you own communication, scope, and resolution authority.
Step 2: Open the two anchor Grafana dashboards (≤ 30 s)
Both should already be bookmarked. If not, fix that now:
- ebit-perf-test: k6-derived RED metrics + threshold lights (../../observability/grafana/provisioning/dashboards/perf-test.json)
- perf-system: host-level CPU / mem / disk / net from node_exporter (../../observability/grafana/provisioning/dashboards/perf-system.json)
Set both to "last 1 hour" with a 10 s auto-refresh. The shape of the spike (sudden, cliff, sawtooth, stair) narrows the cause faster than any log query.
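If the bookmarks are missing, a link with the time range and refresh already set can be built from Grafana's standard `from`, `to`, and `refresh` URL parameters. This is a sketch under assumptions: the base URL is a placeholder, and using the dashboard name as its UID is a guess that should be checked against the provisioning JSON.

```shell
# grafana_dashboard_url BASE_URL DASHBOARD_UID
# Builds a link that opens a dashboard on the last hour with a 10 s
# auto-refresh. Both arguments are placeholders for your own values.
grafana_dashboard_url() {
  # ${1%/} drops one trailing slash so the path is not doubled.
  printf '%s/d/%s?from=now-1h&to=now&refresh=10s\n' "${1%/}" "$2"
}

# Example (hypothetical host and UID):
# grafana_dashboard_url http://localhost:3000 ebit-perf-test
```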
Step 3: Query Jaeger for the last 15 min of error spans (≤ 1 min)
- Service: ebit-api (start there; iterate to other services if it's clearly elsewhere).
- Lookback: 15m.
- Tags: `error=true` (or sort by latency descending).
Click the three slowest / loudest traces. Note the service, the operation, the failing span, and the error attribute. If the same operation appears in all three, you have a hypothesis.
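The "same operation in all three" check is a frequency count. As a minimal sketch, assuming you have jotted down one operation name per trace, the hypothetical helper below prints the most frequent one with its count; the operation names in the example are invented.

```shell
# top_operation: read one operation name per line on stdin and print the
# most frequent one, prefixed by its count.
top_operation() {
  sort | uniq -c | sort -rn | head -n 1 | sed 's/^ *//'
}

# Example with three noted traces (hypothetical operation names):
# printf 'POST /bets\nPOST /bets\nPOST /bets\n' | top_operation
# A count of 3 means all three traces share the operation.
```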
Step 4: Check Loki for error logs (≤ 1 min)
In Grafana → Explore → Loki:
Narrow by:
- `{service_name="ebit-api"} |= "ERROR" |= "<a keyword from the trace>"`
- `{service_name="ebit-api"} | json | trace_id="<the failing trace_id from step 3>"` to pivot from the trace into its log lines.
If Loki has no logs at all, treat it as a separate P2 (../runbooks/loki-no-logs.md) — do not let observability blindness mask the underlying incident. Switch to docker logs / cloud-native logs and continue.
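When Grafana itself is slow, the trace-to-logs pivot can also be run against Loki's HTTP API. The sketch below builds the LogQL selector and sends it to the `query_range` endpoint; `LOKI_URL` and its default are assumptions, as are both function names, and the JSON stage only works if the service actually logs JSON with a `trace_id` field.

```shell
# logql_for_trace SERVICE TRACE_ID
# Prints the LogQL selector that pivots from a trace into its log lines.
logql_for_trace() {
  printf '{service_name="%s"} | json | trace_id="%s"\n' "$1" "$2"
}

# loki_query_range QUERY
# Sends QUERY to Loki's query_range endpoint for the last 15 minutes.
# LOKI_URL is an assumed variable; point it at your Loki instance.
loki_query_range() {
  # Loki accepts the start time as a nanosecond Unix epoch.
  start_ns="$(( $(date +%s) - 900 ))000000000"
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "start=$start_ns" \
    --data-urlencode "limit=100"
}

# Example:
# loki_query_range "$(logql_for_trace ebit-api <trace_id>)"
```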
Step 5: Identify recurrence (≤ 1 min)
Search prior incidents:
- Search ../incidents/ for prior post-incident reviews. The directory holds RCAs in chronological order plus a ../incidents/0000-template.md you should clone for new incidents.
- For now, grep ../runbooks/ for a runbook keyed to your symptom. Common matches:
  - "trace missing" → ../runbooks/trace-missing.md
  - "queue stuck" → ../runbooks/bullmq-job-stuck.md
  - "logs missing" → ../runbooks/loki-no-logs.md
  - "sign-in fails" → ../runbooks/login-fails-bcrypt.md
  - "captcha fails" → ../runbooks/recaptcha-fails-locally.md
  - "MFA / 2FA" → ../runbooks/2fa-unknown-secret.md
- Also search the output of `git log --since='30 days ago' --oneline` for a recent commit that touches the failing area; recurrence often correlates with a recent deploy.
If a runbook matches: jump to it now, run its diagnosis section, follow its fix.
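The keyword-to-runbook pairs above can be captured in a lookup so the match takes seconds rather than a directory scan. This is only a sketch: the function name `runbook_for` is hypothetical, and the glob patterns are a loose reading of the listed keywords rather than an agreed matching rule.

```shell
# runbook_for SYMPTOM: print the runbook path for a symptom phrase,
# using the keyword pairs from the list. Returns 1 when nothing matches.
runbook_for() {
  symptom=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$symptom" in
    *trace*missing*)   echo ../runbooks/trace-missing.md ;;
    *queue*stuck*)     echo ../runbooks/bullmq-job-stuck.md ;;
    *logs*missing*)    echo ../runbooks/loki-no-logs.md ;;
    *sign-in*fail*)    echo ../runbooks/login-fails-bcrypt.md ;;
    *captcha*fail*)    echo ../runbooks/recaptcha-fails-locally.md ;;
    *mfa*|*2fa*)       echo ../runbooks/2fa-unknown-secret.md ;;
    *) return 1 ;;
  esac
}

# Example:
# runbook_for "Queue stuck on settlement"   # prints ../runbooks/bullmq-job-stuck.md
```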
Step 6: Page on-call lead if P0 or P1 (≤ 1 min)
For P0/P1, page the lead via PagerDuty and post in the incident channel:
For P2/P3, no page; you continue solo until handoff or end of shift.
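If paging from a terminal is ever needed, PagerDuty's Events API v2 accepts a JSON "trigger" event. The sketch below is an assumption-heavy illustration: this runbook does not say how the team's PagerDuty service is wired, the P-level to PagerDuty severity mapping is invented, the `source` value and both function names are hypothetical, and the routing key must come from your own integration.

```shell
# pd_severity P-LEVEL: map a runbook severity to a PagerDuty event
# severity. The mapping is an assumption; only P0/P1 page at all.
pd_severity() {
  case "$1" in
    P0) echo critical ;;
    P1) echo error ;;
    *)  return 1 ;;   # P2/P3 do not page
  esac
}

# pd_trigger_json ROUTING_KEY P-LEVEL SUMMARY
# Prints an Events API v2 "trigger" body. SUMMARY must not contain
# double quotes or backslashes, since this sketch does no JSON escaping.
pd_trigger_json() {
  sev=$(pd_severity "$2") || return 1
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"%s","source":"oncall-runbook","severity":"%s"}}\n' \
    "$1" "$3" "$sev"
}

# Example (PD_ROUTING_KEY is a placeholder for your integration key):
# pd_trigger_json "$PD_ROUTING_KEY" P1 "sign-in unavailable" |
#   curl -s -X POST https://events.pagerduty.com/v2/enqueue \
#     -H 'Content-Type: application/json' -d @-
```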
After step 6, you're past the first 5 minutes. The next phase is structured triage — see §3.
3. Common-pattern triage
Each subsection below is a complete triage tree for one recurring failure mode. Pick the one that matches your symptom from steps 3–4 above.
High DB load
Symptom: p95 climbs across ebit-api; prisma:client:operation spans dominate the waterfall; Postgres CPU > 80% in perf-system.
Triage:
```shell
# Top queries by total time (run on the SUT or production host)
docker compose exec ebit-db psql -U ebit -c "
SELECT query, calls, total_exec_time::int AS total_ms, mean_exec_time::int AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;"
# Active waiters / locks
docker compose exec ebit-db psql -U ebit -c "
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;"
```
Recovery:
- Kill the offending PID(s) if a single query is hammering the DB: `SELECT pg_terminate_backend(<pid>);`
- Cache hot reads: see ../recipes/ for cache patterns.
- See ../runbooks/db-high-load.md for the full triage tree (top-blocking-queries SQL, lock graph, connection state, autovacuum check) and ../runbooks/db-down.md when load tips into unreachability.
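Before terminating a backend, it helps to see which PIDs have actually been running for a long time. The sketch below generates such a query from `pg_stat_activity`; the function name `long_running_sql` and the 30-second example threshold are assumptions, not values taken from the linked runbooks.

```shell
# long_running_sql SECS: print a query listing non-idle backends that
# have been running longer than SECS seconds, oldest first.
long_running_sql() {
  # Accept digits only, so the value is safe to splice into the SQL.
  case "$1" in ''|*[!0-9]*) echo "usage: long_running_sql SECS" >&2; return 1 ;; esac
  cat <<SQL
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '$1 seconds'
ORDER BY query_start;
SQL
}

# Example, reusing the container and user from the triage commands:
# docker compose exec -T ebit-db psql -U ebit -c "$(long_running_sql 30)"
```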
Redis memory pressure
Symptom: Redis dashboard shows memory > 80% of maxmemory; eviction rate > 0; cache hit rate dropping.
Triage:
```shell
docker compose exec ebit-redis-cache redis-cli -a cache INFO memory
docker compose exec ebit-redis-cache redis-cli -a cache --bigkeys
docker compose exec ebit-redis-cache redis-cli -a cache MEMORY STATS | head -20
```
Two Redis instances run — confirm which is the hot one:
| Instance | Port | Password | Holds |
|---|---|---|---|
| cache | 6379 | cache | All app caches + most BullMQ queues |
| bot | 6380 | bot | Only bot-related BullMQ queues |
Recovery:
- If it's BullMQ keys (`bull:*`) that have grown unboundedly, a queue's removeOnComplete / removeOnFail retention is misconfigured; fix and redeploy.
- If it's app cache keys, raise maxmemory (config + restart) or expire aggressively. `redis-cli FLUSHDB` will work but causes a cold-cache latency spike, and it deletes every key in the selected database, including any BullMQ keys stored alongside the cache.
- See ../runbooks/bullmq-job-stuck.md for the queue-side checks.
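The "memory > 80% of maxmemory" symptom can be checked directly from `INFO memory` output. This is a minimal sketch with a hypothetical function name; it relies on the `used_memory` and `maxmemory` fields that Redis reports and strips the carriage returns that `redis-cli` output carries.

```shell
# redis_memory_pct: read `INFO memory` output on stdin and print
# used_memory as a whole-number percentage of maxmemory.
# Prints "unlimited" when maxmemory is 0 (no limit configured).
redis_memory_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) print "unlimited"
      else printf "%d\n", used * 100 / max
    }'
}

# Example, reusing the triage command:
# docker compose exec -T ebit-redis-cache redis-cli -a cache INFO memory | redis_memory_pct
```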
BullMQ back-pressure
Symptom: bullmq dashboard shows queue waiting count climbing without ceiling; bet settlements visible in DB but side-effects (leaderboard, rakeback, affiliate) not firing.
Triage:
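```shell
# Per-queue depth; the whole pipeline runs inside the Redis container
docker compose exec -T ebit-redis-cache sh -c '
  redis-cli -a cache --scan --pattern "bull:*:wait" | while read -r key; do
    printf "%s %s\n" "$key" "$(redis-cli -a cache LLEN "$key")"
  done'
```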
Recovery:
- Identify the bottleneck: is it processor capacity (workers can't keep up) or processor failure (every job fails)? Check the length of `bull:<queue>:failed`.
- Pause new jobs into the slow queue if the upstream is something you can throttle (e.g., bot bets): use the BullMQ admin UI or the paused flag in `bull:<queue>:meta`.
- Drain: bring up additional workers. For ebit-api queues, scale ebit-api; for speed-roulette, scale ebit-speed-roulette.
- Resume when depth is back to baseline.
Full procedure: ../runbooks/bullmq-job-stuck.md.
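With many queues, the per-queue depth listing is easier to read when sorted. The hypothetical helper below takes "key depth" pairs, one per line, and prints the deepest queues first; the queue names in the example are invented.

```shell
# deepest_queues [N]: read "key depth" pairs on stdin and print the
# N deepest queues, deepest first. N defaults to 5.
deepest_queues() {
  sort -k2,2nr | head -n "${1:-5}"
}

# Example with hypothetical queue names:
# printf 'bull:settle:wait 12\nbull:leaderboard:wait 340\n' | deepest_queues 1
```

Running it twice a minute apart and comparing the top entries is a quick way to tell a climbing queue from one that is merely deep.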
WebSocket handshake storm
Symptom: ebit-rt shows a sudden surge of connection events; throttler hits climb; CPU pegged on rt service.
Triage:
- Open the browser RUM dashboard — is the storm correlated with a player-facing event (promo, big-win social share, marketing push)?
- Check the throttler logs in Loki: `{service_name="ebit-rt"} |= "throttle"`.
- Determine source IP distribution. If concentrated, suspected scraper / abusive client. If broad, legitimate traffic spike.
Recovery:
- Legitimate spike: raise the throttler limit temporarily in the rt service env. Monitor; revert after the spike subsides.
- Abusive client: block at the edge (CloudFront / WAF / ALB rule): {{TBD: production edge rule procedure — depends on customer-team's CDN choice}}.
- Scaling note: ebit-rt is currently single-replica. Scaling out requires @socket.io/redis-adapter (not installed); see known weakness AF-3 in ../architecture.md. Until the adapter is added, vertical scaling only.
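The "concentrated or broad" question from the triage list comes down to how much of the traffic the busiest source accounts for. As a sketch, assuming you can extract one client IP per connection event from the logs, the hypothetical helper below prints the top IP and its share. The runbook gives no cut-off for "concentrated", so pick a threshold that fits your traffic.

```shell
# top_ip_share: read one client IP per line on stdin and print the
# busiest IP with its share of all connections.
top_ip_share() {
  sort | uniq -c | sort -rn | awk '
    NR == 1 { top = $2; top_count = $1 }
    { total += $1 }
    END { if (total > 0) printf "%s %d%%\n", top, top_count * 100 / total }'
}

# Example with documentation-range addresses:
# printf '203.0.113.7\n203.0.113.7\n203.0.113.7\n198.51.100.2\n' | top_ip_share
# prints: 203.0.113.7 75%
```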
Sign-in failure spike
Symptom: POST /auth/sign-in 4xx/5xx rate climbs; error logs show captcha or bcrypt errors.
Triage:
- Captcha provider: hit the provider's status page. If down, captcha verification is the choke point.
- Auth service: check ebit-api logs for bcrypt errors or pool exhaustion against the User table.
- Lockout cascade: too many lockouts firing? See SF-002: the lockout counter resets after TTL, which can lead to flapping.
Recovery:
- Captcha down: bypass via env (`CAPTCHA_DISABLED=true` or the equivalent). Only with security/IC approval; this is a documented break-glass.
- Auth service slow: scale ebit-api; investigate DB connection-pool sizing.
- See ../runbooks/login-fails-bcrypt.md and ../runbooks/recaptcha-fails-locally.md.
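To decide quickly whether captcha or bcrypt errors dominate, the error lines can be tallied by keyword. This is a rough sketch with a hypothetical function name; the log lines in the example are invented, and real messages may need different keywords.

```shell
# signin_error_counts: read log lines on stdin and print how many mention
# captcha and how many mention bcrypt (case-insensitive).
signin_error_counts() {
  awk '
    { line = tolower($0) }
    line ~ /captcha/ { captcha++ }
    line ~ /bcrypt/  { bcrypt++ }
    END { printf "captcha=%d bcrypt=%d\n", captcha, bcrypt }'
}

# Example (hypothetical log lines):
# printf 'ERROR reCAPTCHA verify timeout\nERROR bcrypt compare failed\n' | signin_error_counts
# prints: captcha=1 bcrypt=1
```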
Trace propagation gap (orphan traces in bj / speed-roulette)
Symptom: a trace that starts in the FE or ebit-api ends abruptly when the call hops over the Nest Redis pub/sub transport into ebit-bj or ebit-speed-roulette. The downstream work appears as a separate trace root.
This is not an incident. The Redis pub/sub transport doesn't propagate W3C traceparent — see ../adr/0005-no-traceparent-on-redis-rpc.md and the architecture doc's known-weakness register (AF-2).
Action: document in your timeline notes ("trace gap consistent with known AF-2; correlated downstream work via timestamp + user_id"). Continue triage; do not let the gap distract you. If you need to confirm the downstream call ran, search the orphan-root traces in Jaeger by service + time window + user attribute.
4. Communication template
The IC owns three audiences during an incident. Cover all three at every state change.
Status page (customer-facing)
For P0/P1 only. Update at acknowledge, at hypothesis confirmed, and at resolved.
```text
Title: <one-line player-visible symptom>
Body: We are investigating <symptom>. <When known> reports indicate <area> is affected.
No action required from players. We will update again at <ISO timestamp>.
Status: investigating | identified | monitoring | resolved
```
Customer comms (partner / customer team)
Send at acknowledge if the customer team isn't already in the incident channel.
```text
Heads up: <severity> incident in progress on ebit / dropbet.
Symptom: <one-liner>
Started: <ISO timestamp>
IC: <name>
Suspected scope: <area>
Next update: <when>
```
Internal Slack (#oncall)
Higher cadence. Every 15 minutes for P0, every 30 min for P1, on hypothesis change for P2.
Put the timeline in a thread, not the channel root. The channel root is for the running summary; the thread holds the play-by-play.
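The internal update cadence maps directly to an interval per severity, which makes it easy to compute the "next update" time promised in the templates. This is a sketch with hypothetical function names; `next_update_utc` relies on GNU `date`, so it is an assumption that your shell host has it.

```shell
# update_interval_minutes P-LEVEL: minutes between internal updates.
# P2 updates happen on hypothesis change, so it has no fixed interval.
update_interval_minutes() {
  case "$1" in
    P0) echo 15 ;;
    P1) echo 30 ;;
    *)  return 1 ;;
  esac
}

# next_update_utc P-LEVEL: ISO 8601 UTC time of the next update.
# Uses GNU date's -d option; BSD/macOS date needs -v+15M style instead.
next_update_utc() {
  mins=$(update_interval_minutes "$1") || return 1
  date -u -d "+${mins} minutes" '+%Y-%m-%dT%H:%MZ'
}

# Example:
# echo "Next update: $(next_update_utc P0)"
```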
5. Post-incident
Within 24 hours of resolution, the IC produces:
RCA template (clone ../incidents/0000-template.md to ../incidents/<date>-<slug>.md)
```markdown
# Incident <date> — <one-line title>
**Severity**: P<n>
**Duration**: <start ISO> → <end ISO> (<minutes>)
**IC**: <name>
**Detected by**: <alert / customer / engineer>
## Summary
<2–3 sentences. What broke, what users saw, how we fixed it.>
## Timeline
- HH:MM — symptom observed
- HH:MM — pager fired
- HH:MM — IC ack, dashboards opened
- HH:MM — hypothesis: <…>
- HH:MM — confirmed: <…>
- HH:MM — mitigation deployed
- HH:MM — full recovery
## Root cause
<the actual technical cause — be specific, cite file:line>
## Contributing factors
<anything that made the incident worse or longer than necessary — bullet list>
## What went well
<bulleted, blameless>
## What didn't go well
<bulleted, blameless>
## Action items
<numbered, each with an owner and a date — e.g., "Add alert for X by 2026-05-15 (owner: Y)">
```
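Cloning the template to `../incidents/<date>-<slug>.md` needs a slug derived from the incident title. The sketch below shows one way to build the path; both function names are hypothetical, and the ISO date format and the slug rules (lowercase, dashes for anything non-alphanumeric) are assumptions, since the runbook does not define them.

```shell
# slugify TITLE: lowercase, replace runs of non-alphanumerics with "-",
# and trim one leading and one trailing dash.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9][^a-z0-9]*/-/g' -e 's/^-//' -e 's/-$//'
}

# rca_path DATE TITLE: path for a new RCA next to the template.
rca_path() {
  printf '../incidents/%s-%s.md\n' "$1" "$(slugify "$2")"
}

# Example:
# cp ../incidents/0000-template.md "$(rca_path 2026-05-01 'WS storm')"
```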
Blameless retro
Schedule within 5 business days. Required attendees: IC, all engineers who participated, one observer from outside the team. Customer team representative invited for P0/P1.
The format follows the RCA doc but discussion is verbal — written record is the action items, not a transcript.
Action item tracking
Every action item in the RCA goes into the team's tracker ({{TBD: tracker — Jira / Linear / GitHub Issues, customer team to specify}}) with an explicit due date. The IC owns nudging items to closure; closure is reviewed at the next retro.
6. References
- ../runbooks/: symptom-keyed cheat sheets (linked above per-pattern)
- ../architecture/: service map + tracing flow + sequence diagrams
- ../observability.md: OTel collector, Grafana, Jaeger, Loki configuration
- ../security-register.md: known findings (some of which become incident causes)
- support-model.md: service tiers, response SLAs, who-does-what
- escalation-matrix.md: severity × time → who's notified