Runbook: Speed-roulette round stuck

Symptom

The shared speed-roulette round doesn't advance. Visible as one or more of:

Players in the speed-roulette UI see the timer frozen on ACCEPTING_BETS (or any other state) for > 90 seconds.
WebSocket SpeedRouletteStateUpdate broadcasts stop firing — confirm in browser DevTools → Network → WS tab.
bullmq Grafana dashboard shows speed-roulette:state queue with active=0 and delayed=0 — no job is queued, the loop has stopped.
bullmq dashboard shows speed-roulette:state queue with failed > 0 and a job that exhausted its attempts: 10 retries.
p95 of "round duration" exceeds the design budget of ~27 s for more than two consecutive rounds.

The round design and lifecycle are documented in ../flows/dropbet-speed-roulette.md §1 (timeConfig). The state queue is a single BullMQ worker with concurrency: 1 (apps/speed-roulette/src/roulette/state/roulette-state.processor.ts:23-25).

Likely causes

Job exhausted retries without enqueuing a follow-up — the process() method ends with addStateJob(stateModel.getJobData()). If process() throws on every attempt, attempts: 10 runs out; onFailed only enqueues an ERROR state if attemptsMade > attempts (off-by-one — see code at lines 72–88). A clean exhaust without ERROR follow-up = the chain dies.
concurrency: 1 deadlock — BullMQ holds the worker on the active job. If the active job is stuck on a remote call (EOS block wait, walletClient.play RPC over Redis pub/sub) the queue can't advance.
startIfNotStarted only bootstraps an empty queue — if the queue has anything (delayed/active/waiting), it returns early. If the chain dies while a job is in delayed, the bootstrap won't help.
EOS provider unreachable — WAITING_BLOCK waits for a real or stubbed EOS block. Provider down → EosWaitBlockTimeoutError → retried; if it never recovers within attempts: 10, the chain dies.
Speed-roulette container crash mid-job — the active job is parked in bull:speed-roulette:state:active with no live worker; the queue blocks until something moves it.
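
The off-by-one in the first cause can be illustrated with plain shell arithmetic. This is a sketch, not the processor's code: `should_enqueue_error` and `guard_ever_fires` are hypothetical names, and the sketch assumes attemptsMade never climbs past the configured attempts value.

```shell
# Hypothetical stand-in for the onFailed guard described above:
# succeeds (exit 0) only when attemptsMade is strictly greater than attempts.
should_enqueue_error() {
  [ "$1" -gt "$2" ]
}

# Walk attemptsMade from 1 up to the configured limit and report
# whether the guard would fire at any point.
guard_ever_fires() {
  limit=$1
  i=1
  while [ "$i" -le "$limit" ]; do
    if should_enqueue_error "$i" "$limit"; then
      echo yes
      return 0
    fi
    i=$((i + 1))
  done
  echo no
}

guard_ever_fires 10
```

Under that assumption the guard never fires for attempts: 10, which matches the "clean exhaust without ERROR follow-up" failure mode.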

Detection

Grafana — bullmq dashboard: speed-roulette:state queue depth panel. Healthy: cycles between 0 and 1. Stuck: stays at 0 with no completed-jobs flow, or stays at 1 (active) with processedOn minutes old.
Grafana — service-overview: ebit-speed-roulette request rate drops to zero on round-state transitions while inbound bet RPCs continue (or also drop if all clients leave).
Loki: {service_name="ebit-speed-roulette"} |= "Roulette job" |= "failed" — exhausted retry messages.
Alert: bull:speed-roulette:state:active LLEN > 0 with no bull:speed-roulette:state:completed increment for > 60 s.
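
The alert condition above reduces to a comparison of two samples. The helper below is a minimal sketch with a hypothetical name (`chain_stalled`); it assumes the completed count is sampled twice, more than 60 s apart, and that completed jobs are retained long enough for the count to move.

```shell
# chain_stalled ACTIVE_LEN COMPLETED_BEFORE COMPLETED_AFTER
# Exit 0 (alert) when a job is active and the completed count did not
# increase between the two samples.
chain_stalled() {
  [ "$1" -gt 0 ] && [ "$3" -le "$2" ]
}
```

The three arguments would come from LLEN on bull:speed-roulette:state:active and two reads of the bull:speed-roulette:state:completed count.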

Triage

1. Inspect the state-queue contents

docker exec -i ebit-redis redis-cli -a cache <<'EOF'
LLEN bull:speed-roulette:state:wait
LLEN bull:speed-roulette:state:active
ZCARD bull:speed-roulette:state:delayed
ZCARD bull:speed-roulette:state:failed
LRANGE bull:speed-roulette:state:active 0 -1
EOF

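Healthy: at any instant exactly one of delayed or active has a single entry and wait is zero. failed only records past failures and should not be growing.

Stuck: wait, active, and delayed are all zero (chain died; failed can be non-zero if the last job exhausted its retries), or active holds a job whose processedOn is far in the past.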
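
The healthy/stuck reading can be sketched as a small classifier over the counts returned above. `classify_queue` and its output labels are hypothetical. The failed count is deliberately left out, on the assumption that old failures can linger there without affecting the live chain.

```shell
# classify_queue WAIT ACTIVE DELAYED
# Prints a one-word reading of the three live-queue counts.
classify_queue() {
  wait_len=$1
  active_len=$2
  delayed_len=$3
  total=$((wait_len + active_len + delayed_len))
  if [ "$total" -eq 0 ]; then
    echo "chain-died"          # nothing queued: the loop has stopped
  elif [ "$total" -gt 1 ]; then
    echo "unexpected-depth"    # outside the 0-1 steady-state band
  elif [ "$active_len" -eq 1 ]; then
    echo "check-processedOn"   # may be healthy or wedged; see step 2
  else
    echo "healthy"
  fi
}
```

For example, `classify_queue 0 1 0` points you to step 2, because a single active job is only a problem when its processedOn is old.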

2. Dump the active job's payload + timestamps

ACTIVE_JOB=$(docker exec ebit-redis redis-cli -a cache LRANGE bull:speed-roulette:state:active 0 0 | tr -d '\r')
docker exec ebit-redis redis-cli -a cache HGETALL bull:speed-roulette:state:$ACTIVE_JOB

Inspect:

data — JSON with gameId, state.current (ACCEPTING_BETS / WAITING_BLOCK / ROLLING / FINISHED / ERROR).
processedOn — Unix ms when the worker picked it up. If now - processedOn > 60_000, the worker is wedged on this job.
attemptsMade vs opts.attempts — close to 10 means exhaustion is imminent; equal to 10 means it just exhausted.
failedReason — present on the failed-job path; tells you the exception class (EosWaitBlockTimeoutError, EosCurrentBlockOutdatedError, or generic).
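
The processedOn check is simple millisecond arithmetic. A minimal sketch follows. The function names are hypothetical, and it assumes a shell with 64-bit arithmetic.

```shell
# job_age_ms PROCESSED_ON_MS NOW_MS -> prints elapsed milliseconds
job_age_ms() {
  echo $(( $2 - $1 ))
}

# is_wedged PROCESSED_ON_MS NOW_MS -> exit 0 when the age exceeds 60 000 ms
is_wedged() {
  [ "$(job_age_ms "$1" "$2")" -gt 60000 ]
}
```

In practice the first argument would be the processedOn field from the HGETALL output and the second something like `$(( $(date +%s) * 1000 ))`.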

3. Check the in-flight game in Postgres

docker exec ebit-db psql -U ebit -d ebit -c "
  SELECT id, state, \"createdAt\", \"endedAt\",
         \"eosBlockNum\", \"eosBlockId\", roll, color
  FROM speed_roulette.\"speed_roulette_game\"
  ORDER BY \"createdAt\" DESC
  LIMIT 3;"

The newest row is the wedged round. Note its id (you'll need it for §Fix C) and its state. If state = ACCEPTING_BETS and createdAt is > 30 s old, the round never advanced past bet collection. If state = WAITING_BLOCK, EOS is the suspect.

4. Check speed-roulette container health

docker compose ps ebit-speed-roulette
docker compose logs --tail=100 ebit-speed-roulette | grep -iE "Roulette|error|failed|EOS"

A crashed container is the simplest case — restart resolves the deadlock (§Fix B).

Fix

A. Drain a stuck active job (queue is wedged but worker is alive)

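ACTIVE_JOB=$(docker exec ebit-redis redis-cli -a cache LRANGE bull:speed-roulette:state:active 0 0 | tr -d '\r')

# Move it to failed (bypasses BullMQ's normal failure path; you accept losing this round).
# failed is a sorted set (see the ZCARD in §Triage 1), so add the job with ZADD and a
# millisecond-timestamp score; LPUSH would fail with WRONGTYPE or create a key of the wrong type.
NOW_MS=$(( $(date +%s) * 1000 ))
docker exec ebit-redis redis-cli -a cache LREM bull:speed-roulette:state:active 0 "$ACTIVE_JOB"
# Record why and when the job was failed on the job hash.
docker exec ebit-redis redis-cli -a cache HSET "bull:speed-roulette:state:$ACTIVE_JOB" failedReason "manually drained" finishedOn "$NOW_MS"
docker exec ebit-redis redis-cli -a cache ZADD bull:speed-roulette:state:failed "$NOW_MS" "$ACTIVE_JOB"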

Then restart the worker so startIfNotStarted re-bootstraps an empty queue:

docker compose restart ebit-speed-roulette

The module's onApplicationBootstrap hook calls startIfNotStarted (roulette.module.ts:49) which only bootstraps when delayed+active+waiting are all zero. The drain above ensures that.
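
The drain commands above operate on whatever ACTIVE_JOB contains, so an empty value (no active job at the moment of the LRANGE) would write to a malformed key. A small guard, sketched here under the hypothetical name `require_job_id`, can be run before the LREM:

```shell
# require_job_id ID -> exit 0 only for a non-empty id with no spaces
require_job_id() {
  case "$1" in
    ''|*' '*)
      echo "refusing to drain: expected exactly one active job id" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

A typical use is `require_job_id "$ACTIVE_JOB" || exit 1` right after the variable is set.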

B. Restart the speed-roulette container (worker crashed)

docker compose restart ebit-speed-roulette
docker compose logs -f ebit-speed-roulette | sed -E '/Roulette state is .*active|Starting game/q'

The sed quits once you see either Roulette state is active (queue picked up where it left off) or Starting game: <id> (fresh bootstrap). If the state queue had contents (active/delayed/waiting), the worker resumes processing them. If empty, startIfNotStarted enqueues a first job.

C. Manually settle in-flight bets for a lost round

If you abandoned a round in §Fix A, refund the players who placed bets in that round. Replace <gameId> with the wedged-round ID from §Triage 3:

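docker exec -i ebit-db psql -U ebit -d ebit <<'EOF'
BEGIN;

-- 1. Find affected bets
SELECT b.id, b."userId", b."currencyId", b."betAmount"
FROM speed_roulette."speed_roulette_bet" srb
JOIN public."Bet" b ON b.id = srb."betId"
WHERE srb."gameId" = '<gameId>'
  AND b."settledAt" IS NULL;

-- 2. Refund: credit UserBalance, then mark each bet REFUNDED
-- (run after eyeballing the row count above).
-- Stakes are summed per user and currency first: UPDATE ... FROM applies only
-- one joined row to each balance row, so without the SUM a player with several
-- bets in the round would be credited for just one of them.
-- No transaction/ledger row is written by this script.
UPDATE public."UserBalance" ub
SET amount = ub.amount + r.total
FROM (
  SELECT b."userId", b."currencyId", SUM(b."betAmount") AS total
  FROM speed_roulette."speed_roulette_bet" srb
  JOIN public."Bet" b ON b.id = srb."betId"
  WHERE srb."gameId" = '<gameId>'
    AND b."settledAt" IS NULL
  GROUP BY b."userId", b."currencyId"
) r
WHERE ub."userId" = r."userId"
  AND ub."currencyId" = r."currencyId";

UPDATE public."Bet" b
SET "settledAt" = now(),
    payout = b."betAmount",
    state = 'REFUNDED'
FROM speed_roulette."speed_roulette_bet" srb
WHERE srb."gameId" = '<gameId>'
  AND b.id = srb."betId"
  AND b."settledAt" IS NULL;

-- As written, the session ends with the transaction still open, so psql
-- rolls it back (a dry run). After reviewing the output, uncomment COMMIT
-- and rerun to apply.
-- COMMIT;
EOF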

Always run inside a transaction with explicit COMMIT / ROLLBACK after eyeballing the result. The exact column names and refund semantics may need to be checked against the live schema — see ../data-model/erd-speed-roulette.md. Before running this in production, dry-run on a snapshot.
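
If a player can hold several bets in one round, it helps to know the expected credit per user and currency before comparing balances. The sketch below totals the step-1 SELECT output. `refund_totals` is a hypothetical helper, and it assumes the rows were exported pipe-separated (for example with psql's `-A -t` flags) in the column order id, userId, currencyId, betAmount. awk sums in floating point, so treat the totals as a cross-check rather than ledger-grade figures.

```shell
# refund_totals: reads "betId|userId|currencyId|betAmount" lines on stdin,
# prints "userId|currencyId|total" per player and currency, sorted.
refund_totals() {
  awk -F'|' '
    NF >= 4 { total[$2 "|" $3] += $4 }
    END     { for (k in total) printf "%s|%.8f\n", k, total[k] }
  ' | sort
}
```

Each output line should match the amount added to the corresponding UserBalance row.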

D. Nuclear option — wipe the state queue

If §Fix A doesn't unwedge:

docker exec -i ebit-redis redis-cli -a cache <<'EOF'
DEL bull:speed-roulette:state:wait
DEL bull:speed-roulette:state:active
DEL bull:speed-roulette:state:delayed
DEL bull:speed-roulette:state:failed
DEL bull:speed-roulette:state:completed
EOF
docker compose restart ebit-speed-roulette

This loses any in-flight job state. Run §Fix C for any bets stranded by the wipe.

Verification

After applying a fix:

Queue depth: bull:speed-roulette:state:active LLEN cycles between 0 and 1; delayed ZCARD has one entry between transitions. failed ZCARD doesn't grow.
Next round starts within 30 s: a fresh speed_roulette_game row appears with state = ACCEPTING_BETS.
WS broadcast resumes: open the dropbet UI, confirm the timer is counting down again. Server-side: {service_name="ebit-rt"} |= "SpeedRouletteStateUpdate" shows new emits in Loki.
End-to-end bet succeeds: place a small bet via the UI; confirm 201, balance change, and the bet appears in latestGames after FINISHED.
No further failed jobs for 5 minutes: bull:speed-roulette:state:failed ZCARD stays flat.
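
Several of the checks above are "wait until X happens within N seconds". A generic polling helper, sketched here as the hypothetical `wait_until`, keeps those checks scriptable. The command passed to it (for example a function wrapping the psql query from §Triage 3) is up to the operator.

```shell
# wait_until TIMEOUT_S INTERVAL_S COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_S seconds; exit 0 as soon as it succeeds,
# exit 1 once TIMEOUT_S seconds have elapsed without success.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}
```

For the "next round starts within 30 s" check this would look like `wait_until 30 2 new_round_exists`, where `new_round_exists` is a placeholder for your own probe.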

Prevention

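Per-job timeout: speed-roulette state jobs are enqueued without any processing timeout (addStateJob, roulette-state.processor.ts:147-160). Note that lockDuration is a BullMQ worker option that governs stalled-job detection, not a per-job timeout, and the lock keeps being renewed while the worker process is alive, so it will not fail a job that is merely blocked on a remote call. Bounding the remote calls inside process() (e.g., a 30 s deadline on the EOS block wait and the walletClient.play RPC) so a wedged job fails and retries instead of holding the worker is a code change; file a follow-up referencing {{TBD: ADR for speed-roulette job-timeout policy — currently not authored, candidate ADR}}.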
Alert on bull:speed-roulette:state queue depth > 5 for > 60 s — that's outside the steady-state band of 0–1.
Alert on consecutive Roulette job ".*" failed after 10 attempts log lines — the chain is about to die.
Alert on speed_roulette_game.state = ACCEPTING_BETS rows older than 30 s — the round didn't advance.
Bootstrap sweep cron (defensive): a periodic job that checks "is there a live state-queue entry?" and enqueues a first job if not. Currently only onApplicationBootstrap does this, so the chain dying while the service stays up (no restart) is never repaired automatically. File follow-up.
EOS provider monitoring: WAITING_BLOCK is a hard dependency. Track EOS block production lag separately; alert before it cascades into stuck rounds.

Cross-references

../flows/dropbet-speed-roulette.md — full round design, sequence diagram, EOS dependency
../data-model/erd-speed-roulette.md — schema for the manual-settle SQL in §Fix C
../adr/0005-no-traceparent-on-redis-rpc.md — why downstream calls from speed-roulette appear as orphan traces
bullmq-job-stuck.md — generic BullMQ triage; this runbook handles the speed-roulette-specific case
../handover/oncall-runbook.md — first-response procedure (this is a P1 per the severity table)