On-call Runbook

The single operational procedure for incident response on Evospin / dropbet — symptom classification, first-five-minute checklist, common-pattern triage, comms templates, post-incident. Read it once during onboarding week 1 (../onboarding/curriculum.md §"Operational toolkit"). Keep it open during every shift. Linked from the on-call channel topic.

Audience: anyone holding the pager — Tier 1 first responder, Tier 2 senior, or Tier 3 ebit-team escalation. The procedure is the same; the time budgets and authority differ — see support-model.md.

Severity rubric: this document is the canonical source for the P0/P1/P2/P3 severity scheme used by all operational on-call work. The customer-shareable security incident policy at ../security/security-incident-policy.md uses a separate Critical/High/Medium/Low scheme for external disclosure SLAs — those are deliberately different and not interchangeable.


1. Incident classification

Severity is set by the symptom, not by the cause. Pick the highest matching row.

| Severity | Symptom (any one is sufficient) | Page on-call? | Resolution target |
|---|---|---|---|
| P0 | Site fully down for all users; payments broken; data loss in progress; security breach in progress; player wallets show wrong balances at scale | Page immediately + status page red | < 1 hour to mitigation |
| P1 | Significant degraded service: error rate > 5% for > 5 min; sign-in unavailable; bet-place unavailable for > 1 game; admin panel cannot ban a fraud account | Page immediately + status page yellow | < 4 hours |
| P2 | Partial degradation: a single non-critical endpoint failing; one queue lagging without business impact; observability gap (Loki / Jaeger blind); known finding actively exploited at low volume | Notify on-call channel; ack within 15 min | < 1 business day |
| P3 | Cosmetic, transient, or single-user: typo in UI; one-off failed bet retried successfully; question or anomaly worth investigating | Ticket in backlog | Next backlog grooming |

Symptom → severity reference (the most common ones):

| Symptom | Severity |
|---|---|
| Player can't sign in (any account) | P1 |
| Site returns 5xx on > 5% of requests | P1 |
| Player balance is wrong (one user) | P2 |
| Player balance is wrong (many users) | P0 |
| Bet placed but settlement didn't fire | P2 (single) / P1 (rate ≥ 1/min) |
| Leaderboard not updating | P3 |
| Admin can't ban a user | P2 |
| Admin can't ban a user mid-fraud-event | P1 |
| Speed-roulette round stuck (no advance for > 90s) | P1 |
| Blackjack hand abandoned (player browser closed mid-hand) | P3 (per-player); known design (fund lockup, see ../flows/dropbet-blackjack.md) |
| Loki has no logs for ebit-api | P2 (observability blind, not user-facing) |
| Jaeger has no traces for ebit-api | P2 |
| WS connections dropping at scale | P1 |
| OTel trace propagation gap to bj/speed-roulette | Not an incident; known limitation (see §3 below) |

When in doubt, pick the higher severity. Downgrading mid-incident is fine; under-classifying wastes minutes you don't have.
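
The "pick the highest matching row" rule can be sketched in a few lines of Python. The symptom keys below are hypothetical labels invented for illustration, not identifiers from any existing tooling; the severities are copied from the reference table above.

```python
# Severity rank: a lower number means a more severe incident.
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}

# Hypothetical symptom labels; severities copied from the reference table.
SYMPTOM_SEVERITY = {
    "sign_in_down": "P1",
    "5xx_over_5_percent": "P1",
    "balance_wrong_one_user": "P2",
    "balance_wrong_many_users": "P0",
    "leaderboard_stale": "P3",
    "loki_no_logs": "P2",
    "ws_drops_at_scale": "P1",
}


def settlement_severity(missed_per_minute: float) -> str:
    """Bet placed but not settled: P1 at a rate of 1/min or more, else P2."""
    return "P1" if missed_per_minute >= 1 else "P2"


def classify(symptoms: list[str]) -> str:
    """Return the highest severity among the matching symptoms.

    Raises KeyError for an unknown label and ValueError for an empty list.
    """
    matched = [SYMPTOM_SEVERITY[s] for s in symptoms]
    return min(matched, key=SEVERITY_RANK.__getitem__)
```

For example, `classify(["leaderboard_stale", "balance_wrong_many_users"])` yields `"P0"`, because one matching P0 symptom outranks everything else.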


2. First-response checklist — first 5 minutes

Run these in order. Don't skip ahead. Every step is annotated with a time budget.

Step 1 — Acknowledge in the incident channel (≤ 30 s)

Post in #oncall (or your customer-team equivalent — see support-model.md):

ack — investigating <one-line symptom>
sev: <P0/P1/P2/P3>
ic: <your name>

This claims the IC role. Until someone else explicitly takes IC, you own communication, scope, and resolution authority.

Step 2 — Open the two anchor Grafana dashboards (≤ 30 s)

Both should already be bookmarked; if not, bookmark them now.

Set both to "Last 1 hour" with a 10 s auto-refresh. The shape of the spike (sudden, cliff, sawtooth, stair) narrows the cause faster than any log query.

Step 3 — Query Jaeger for the last 15 min of error spans (≤ 1 min)

http://<grafana-host>/jaeger/  (or http://localhost:16686 locally)
  • Service: ebit-api (start there; iterate to other services if it's clearly elsewhere).
  • Lookback: 15m.
  • Tags: error=true (or sort by latency descending).

Click the three slowest / loudest traces. Note the service, the operation, the failing span, and the error attribute. If the same operation appears in all three, you have a hypothesis.
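
The "same operation in all three traces" check is simple enough to write down. This is a minimal sketch that assumes you have noted the failing spans by hand; `FailingSpan` and `shared_operation` are hypothetical names, not part of Jaeger or any project code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailingSpan:
    service: str    # e.g. "ebit-api"
    operation: str  # operation name of the failing span
    error: str      # value of the span's error attribute


def shared_operation(spans: list[FailingSpan]) -> Optional[tuple[str, str]]:
    """Return (service, operation) when every inspected span shares it, else None."""
    keys = {(s.service, s.operation) for s in spans}
    return next(iter(keys)) if len(keys) == 1 else None
```

A non-`None` result is only a hypothesis to test in step 4, not a confirmed cause.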

Step 4 — Check Loki for error logs (≤ 1 min)

In Grafana → Explore → Loki:

{service_name=~"ebit-.*"} |= "ERROR" | json | line_format "{{.service_name}} {{.body}}"

Narrow by:

  • {service_name="ebit-api"} |= "ERROR" |= "<a keyword from the trace>"
  • {service_name="ebit-api"} | json | trace_id="<the failing trace_id from step 3>" to pivot from the trace into its log lines.

If Loki has no logs at all, treat it as a separate P2 (../runbooks/loki-no-logs.md) — do not let observability blindness mask the underlying incident. Switch to docker logs / cloud-native logs and continue.
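
If you prefer to pivot from a trace to its logs outside Grafana, the same LogQL can be sent to Loki's `query_range` HTTP endpoint. The sketch below only builds the query string and URL; the base URL (`http://localhost:3100`, Loki's default port) and direct access to Loki are assumptions about your environment, and both function names are hypothetical.

```python
from urllib.parse import urlencode


def trace_logql(service: str, trace_id: str) -> str:
    """LogQL that selects one service's log lines for a single trace."""
    return f'{{service_name="{service}"}} | json | trace_id="{trace_id}"'


def query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int,
                    limit: int = 200) -> str:
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = urlencode({
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest lines first
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"
```

The resulting URL can be fetched with `curl` or any HTTP client; authentication and tenant headers, if your deployment needs them, are not covered here.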

Step 5 — Identify recurrence (≤ 1 min)

Search prior incidents (../incidents/) and the runbooks (../runbooks/) for the same symptom. If a runbook matches, jump to it now, run its diagnosis section, and follow its fix.

Step 6 — Page on-call lead if P0 or P1 (≤ 1 min)

For P0/P1, page the lead via PagerDuty and post in the incident channel:

PAGED: <lead-name>, sev <P0/P1>
ETA on second engineer: <time>

For P2/P3, no page; you continue solo until handoff or end of shift.

After step 6, you're past the first 5 minutes. The next phase is structured triage — see §3.
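
The six time budgets add up to exactly five minutes. A small sketch makes the cumulative deadlines explicit; the step labels and budgets are copied from the headings above, while the helper names are hypothetical.

```python
# (step number, label, budget in seconds), copied from the checklist headings.
STEPS = [
    (1, "Acknowledge in the incident channel", 30),
    (2, "Open the two anchor Grafana dashboards", 30),
    (3, "Query Jaeger for error spans", 60),
    (4, "Check Loki for error logs", 60),
    (5, "Identify recurrence", 60),
    (6, "Page on-call lead if P0 or P1", 60),
]


def deadlines(steps=STEPS) -> dict[int, int]:
    """Map step number to the seconds-after-page by which it should be done."""
    elapsed = 0
    result = {}
    for number, _label, budget in steps:
        elapsed += budget
        result[number] = elapsed
    return result


def overdue_steps(completed_at: dict[int, int], steps=STEPS) -> list[int]:
    """Return the steps finished later than their cumulative deadline.

    completed_at maps step number to seconds elapsed since the page.
    """
    limits = deadlines(steps)
    return [n for n, t in sorted(completed_at.items()) if t > limits[n]]
```

Recording when each step actually finished gives the post-incident timeline a concrete measure of where the first five minutes went.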


3. Common-pattern triage

Each subsection below is a complete triage tree for one recurring failure mode. Pick the one that matches your symptom from steps 3–4 above.

High DB load

Symptom: p95 climbs across ebit-api; prisma:client:operation spans dominate the waterfall; Postgres CPU > 80% in perf-system.

Triage:

# Top queries by total time (run on the SUT or production host)
docker compose exec ebit-db psql -U ebit -c "
  SELECT query, calls, total_exec_time::int AS total_ms, mean_exec_time::int AS mean_ms
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10;"

# Active waiters / locks
docker compose exec ebit-db psql -U ebit -c "
  SELECT pid, state, wait_event_type, wait_event, query
  FROM pg_stat_activity
  WHERE state != 'idle'
  ORDER BY query_start;"

Recovery:

  • Kill the offending PID(s) if a single query is hammering the DB: SELECT pg_terminate_backend(<pid>);.
  • Cache hot reads — see ../recipes/ for cache patterns.
  • See ../runbooks/db-high-load.md for the full triage tree (top-blocking-queries SQL, lock graph, connection state, autovacuum check) and ../runbooks/db-down.md when load tips into unreachability.
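
To judge whether "a single query is hammering the DB", one rough heuristic is the top query's share of total execution time among the rows returned by the first triage query. The 50% threshold is an assumption, not a documented rule, and `pg_stat_statements` counters are cumulative since the last reset, so treat the result as a hint only.

```python
from typing import Optional

# Each row mirrors the SELECT above: (query, calls, total_ms, mean_ms).
Row = tuple[str, int, int, int]


def dominant_query(rows: list[Row], threshold: float = 0.5) -> Optional[str]:
    """Return the query text whose total_ms share meets the threshold, else None."""
    total = sum(row[2] for row in rows)
    if total <= 0:
        return None
    top = max(rows, key=lambda row: row[2])
    return top[0] if top[2] / total >= threshold else None
```

When this returns a query, look for its PID in the `pg_stat_activity` output before deciding to terminate anything.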

Redis memory pressure

Symptom: Redis dashboard shows memory > 80% of maxmemory; eviction rate > 0; cache hit rate dropping.

Triage:

docker compose exec ebit-redis-cache redis-cli -a cache INFO memory
docker compose exec ebit-redis-cache redis-cli -a cache --bigkeys
docker compose exec ebit-redis-cache redis-cli -a cache MEMORY STATS | head -20

Two Redis instances run — confirm which is the hot one:

| Instance | Port | Password | Holds |
|---|---|---|---|
| cache | 6379 | cache | All app caches + most BullMQ queues |
| bot | 6380 | bot | Only bot-related BullMQ queues |

Recovery:

  • If it's BullMQ keys (bull:*) that have grown unboundedly, a queue's removeOnComplete/removeOnFail retention is misconfigured — fix and redeploy.
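  • If it's app cache keys, raise maxmemory (config + restart) or expire keys aggressively. redis-cli FLUSHDB is a last resort: it removes every key in the selected database, including any bull:* queue keys that share it with the app caches, and it causes a cold-cache latency spike.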
  • See ../runbooks/bullmq-job-stuck.md for the queue-side checks.
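
The 80% symptom threshold can be checked directly from the `INFO memory` output. This is a sketch that parses the text printed by `redis-cli INFO memory`; the function names are hypothetical, and a `maxmemory` of 0 is treated as "no limit configured".

```python
from typing import Optional


def parse_info(text: str) -> dict[str, str]:
    """Parse redis-cli INFO output ("key:value" lines, "#" section headers)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_ratio(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no maxmemory is set."""
    fields = parse_info(info_text)
    maxmemory = int(fields.get("maxmemory", "0"))
    if maxmemory == 0:
        return None
    return int(fields["used_memory"]) / maxmemory


def under_pressure(info_text: str, threshold: float = 0.8) -> bool:
    """True when memory use exceeds the threshold share of maxmemory."""
    ratio = memory_usage_ratio(info_text)
    return ratio is not None and ratio > threshold
```

Run it against both instances from the table above to confirm which one is the hot one.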

BullMQ back-pressure

Symptom: bullmq dashboard shows queue waiting count climbing without ceiling; bet settlements visible in DB but side-effects (leaderboard, rakeback, affiliate) not firing.

Triage:

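# Per-queue depth. The whole pipeline runs inside the container so that both
# redis-cli calls hit the cache instance; -T disables the TTY so output is clean.
docker compose exec -T ebit-redis-cache sh -c '
  redis-cli -a cache --scan --pattern "bull:*:wait" |
  while read -r key; do printf "%s %s\n" "$key" "$(redis-cli -a cache LLEN "$key")"; done'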

Recovery:

  1. Identify the bottleneck: is it processor capacity (workers can't keep up) or processor failure (every job fails)? Check bull:<queue>:failed length.
  2. Pause new jobs into the slow queue if the upstream is something you can throttle (e.g., bot bets): use the BullMQ admin UI or bull:<queue>:meta paused flag.
  3. Drain: bring up additional workers — for ebit-api queues, scale ebit-api. For speed-roulette, scale ebit-speed-roulette.
  4. Resume when depth is back to baseline.

Full procedure: ../runbooks/bullmq-job-stuck.md.
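
Step 1 of the recovery (capacity versus failure) can be approximated by comparing two samples of a queue taken a minute or so apart. This is a deliberately crude heuristic under stated assumptions: any growth in the failed count is read as processor failure, and the names below are hypothetical. Note that BullMQ keeps failed jobs in a sorted set, so the failed count would come from `ZCARD` rather than `LLEN`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    waiting: int  # length of bull:<queue>:wait
    failed: int   # size of bull:<queue>:failed


def bottleneck(before: QueueSample, after: QueueSample) -> str:
    """Classify a queue from two samples taken some time apart."""
    waiting_growth = after.waiting - before.waiting
    failed_growth = after.failed - before.failed
    if failed_growth > 0:
        return "processor failure"   # jobs are being picked up but failing
    if waiting_growth > 0:
        return "processor capacity"  # workers cannot keep up with intake
    return "not backing up"
```

A "processor failure" result means adding workers will not help; read the failure reasons first.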

WebSocket handshake storm

Symptom: ebit-rt shows a sudden surge of connection events; throttler hits climb; CPU pegged on rt service.

Triage:

  • Open the browser RUM dashboard — is the storm correlated with a player-facing event (promo, big-win social share, marketing push)?
  • Check the throttler logs in Loki: {service_name="ebit-rt"} |= "throttle".
  • Determine the source IP distribution. If it is concentrated, suspect a scraper or abusive client; if it is broad, it is more likely a legitimate traffic spike.

Recovery:

  • Legitimate spike: raise the throttler limit temporarily in the rt service env. Monitor; revert after the spike subsides.
  • Abusive client: block at the edge (CloudFront / WAF / ALB rule) — {{TBD: production edge rule procedure — depends on customer-team's CDN choice}}.
  • Scaling note: ebit-rt is currently single-replica. Scaling out requires @socket.io/redis-adapter (not installed); see known weakness AF-3 in ../architecture.md. Until the adapter is added, only vertical scaling is available.
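
The "concentrated versus broad" judgement on source IPs can be made less subjective with a quick tally. This sketch assumes you have already extracted one client IP per connection event (for example from the throttler logs); the 50% threshold and the function names are assumptions for illustration.

```python
from collections import Counter


def top_ip_share(ips: list[str]) -> tuple[str, float]:
    """Return the most frequent IP and its share of all connection events."""
    if not ips:
        return ("", 0.0)
    ip, count = Counter(ips).most_common(1)[0]
    return ip, count / len(ips)


def looks_concentrated(ips: list[str], threshold: float = 0.5) -> bool:
    """True when a single IP accounts for at least the threshold share."""
    return top_ip_share(ips)[1] >= threshold
```

A single NAT or mobile-carrier gateway can also look concentrated, so check the RUM dashboard before blocking anything at the edge.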

Sign-in failure spike

Symptom: POST /auth/sign-in 4xx/5xx rate climbs; error logs show captcha or bcrypt errors.

Triage:

  • Captcha provider: hit the provider's status page. If down, captcha verification is the choke point.
  • Auth service: check ebit-api logs for bcrypt errors or pool exhaustion against the User table.
  • Lockout cascade: too many lockouts firing? See SF-002 — the lockout counter resets after TTL, which can lead to flapping.

Recovery:

Trace propagation gap (orphan traces in bj / speed-roulette)

Symptom: a trace that starts in the FE or ebit-api ends abruptly when the call hops over the Nest Redis pub/sub transport into ebit-bj or ebit-speed-roulette. The downstream work appears as a separate trace root.

This is not an incident. The Redis pub/sub transport doesn't propagate W3C traceparent — see ../adr/0005-no-traceparent-on-redis-rpc.md and the architecture doc's known-weakness register (AF-2).

Action: document in your timeline notes ("trace gap consistent with known AF-2; correlated downstream work via timestamp + user_id"). Continue triage; do not let the gap distract you. If you need to confirm the downstream call ran, search the orphan-root traces in Jaeger by service + time window + user attribute.
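
The "service + time window + user attribute" search can be sketched as a filter over candidate trace roots. This assumes you have exported the orphan roots from Jaeger into simple records; `TraceRoot`, the `user_id` field, and the 5-second window are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TraceRoot:
    trace_id: str
    service: str
    started_at: datetime
    user_id: str


def correlate(upstream_end: datetime, user_id: str, candidates: list[TraceRoot],
              services: tuple[str, ...] = ("ebit-bj", "ebit-speed-roulette"),
              window: timedelta = timedelta(seconds=5)) -> list[TraceRoot]:
    """Return orphan roots for the same user that start near the upstream cut-off."""
    return [
        c for c in candidates
        if c.service in services
        and c.user_id == user_id
        and abs(c.started_at - upstream_end) <= window
    ]
```

More than one match means the window is too wide for that user's activity; narrow it rather than guessing.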


4. Communication template

The IC owns three audiences during an incident. Cover all three at every state change.

Status page (customer-facing)

For P0/P1 only. Update on acknowledgement, when the hypothesis is confirmed, and on resolution.

Title: <one-line player-visible symptom>
Body: We are investigating <symptom>. <When known> reports indicate <area> is affected.
      No action required from players. We will update again at <ISO timestamp>.
Status: investigating | identified | monitoring | resolved

Customer comms (partner / customer team)

Send at acknowledge if the customer team isn't already in the incident channel.

Heads up: <severity> incident in progress on ebit / dropbet.
Symptom: <one-liner>
Started: <ISO timestamp>
IC: <name>
Suspected scope: <area>
Next update: <when>

Internal Slack (#oncall)

Higher cadence than the other two audiences: every 15 min for P0, every 30 min for P1, and on every hypothesis change for P2.

[TIMESTAMP] <ic-name>: <update — what changed, what we now know, next action, ETA>

Put the timeline in a thread, not the channel root. The channel root is for the running summary; the thread holds the play-by-play.
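
The internal update cadence is easy to lose track of mid-incident. A minimal sketch of a "when is the next update due" helper follows; the intervals come from the cadence stated above, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Fixed cadence per severity; P2 updates are driven by hypothesis changes instead.
CADENCE = {
    "P0": timedelta(minutes=15),
    "P1": timedelta(minutes=30),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next #oncall update is due, or None if event-driven."""
    interval = CADENCE.get(severity)
    return last_update + interval if interval else None
```

Setting a timer from this value right after each post keeps the cadence independent of how absorbing the triage gets.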


5. Post-incident

Within 24 hours of resolution, the IC produces:

RCA template (clone ../incidents/0000-template.md to ../incidents/<date>-<slug>.md)

# Incident <date> — <one-line title>

**Severity**: P<n>
**Duration**: <start ISO> → <end ISO> (<minutes>)
**IC**: <name>
**Detected by**: <alert / customer / engineer>

## Summary
<2–3 sentences. What broke, what users saw, how we fixed it.>

## Timeline
- HH:MM — symptom observed
- HH:MM — pager fired
- HH:MM — IC ack, dashboards opened
- HH:MM — hypothesis: <…>
- HH:MM — confirmed: <…>
- HH:MM — mitigation deployed
- HH:MM — full recovery

## Root cause
<the actual technical cause — be specific, cite file:line>

## Contributing factors
<anything that made the incident worse or longer than necessary — bullet list>

## What went well
<bulleted, blameless>

## What didn't go well
<bulleted, blameless>

## Action items
<numbered, each with an owner and a date — e.g., "Add alert for X by 2026-05-15 (owner: Y)">
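
Two fields in the template are mechanical: the duration in minutes and the `<date>-<slug>.md` filename. The sketch below assumes `<date>` means an ISO `YYYY-MM-DD` date taken from the incident start and that the slug is the lowercased title with non-alphanumerics collapsed to hyphens; both are assumptions, as are the function names.

```python
import re
from datetime import datetime


def duration_minutes(start_iso: str, end_iso: str) -> int:
    """Whole minutes between two ISO 8601 timestamps.

    Use an explicit offset such as "+00:00"; a trailing "Z" is only accepted
    by datetime.fromisoformat on Python 3.11 and later.
    """
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return int((end - start).total_seconds() // 60)


def incident_filename(start_iso: str, title: str) -> str:
    """Build "<date>-<slug>.md" from the start timestamp and one-line title."""
    date = datetime.fromisoformat(start_iso).date().isoformat()
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{date}-{slug}.md"
```

Deriving both values from the same start timestamp keeps the filename and the Duration line consistent with the timeline.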

Blameless retro

Schedule within 5 business days. Required attendees: IC, all engineers who participated, one observer from outside the team. Customer team representative invited for P0/P1.

The format follows the RCA doc, but the discussion is verbal: the written record is the list of action items, not a transcript.

Action item tracking

Every action item in the RCA goes into the team's tracker ({{TBD: tracker — Jira / Linear / GitHub Issues, customer team to specify}}) with an explicit due date. The IC owns nudging items to closure; closure is reviewed at the next retro.


6. References