Runbook: Speed-roulette round stuck
Symptom
The shared speed-roulette round doesn't advance. Visible as one or more of:
- Players in the speed-roulette UI see the timer frozen on ACCEPTING_BETS (or any other state) for > 90 seconds.
- WebSocket SpeedRouletteStateUpdate broadcasts stop firing. Confirm in browser DevTools → Network → WS tab.
- bullmq Grafana dashboard shows the speed-roulette:state queue with active=0 and delayed=0: no job is queued, the loop has stopped.
- bullmq dashboard shows the speed-roulette:state queue with failed > 0 and a job that exhausted its attempts: 10 retries.
- p95 of "round duration" exceeds the design budget of ~27 s for more than two consecutive rounds.
The round design and lifecycle are documented in ../flows/dropbet-speed-roulette.md §1 (timeConfig). The state queue is a single BullMQ worker with concurrency: 1 (apps/speed-roulette/src/roulette/state/roulette-state.processor.ts:23-25).
Likely causes
- Job exhausted retries without enqueuing a follow-up. The process() method ends with addStateJob(stateModel.getJobData()). If process() throws on every attempt, attempts: 10 runs out; onFailed only enqueues an ERROR state if attemptsMade > attempts (off-by-one; see code at lines 72–88). A clean exhaust without an ERROR follow-up means the chain dies.
- concurrency: 1 deadlock. BullMQ holds the worker on the active job. If the active job is stuck on a remote call (EOS block wait, walletClient.play RPC over Redis pub/sub), the queue can't advance.
- startIfNotStarted only bootstraps an empty queue. If the queue has anything (delayed/active/waiting), it returns early. If the chain dies while a job is in delayed, the bootstrap won't help.
- EOS provider unreachable. WAITING_BLOCK waits for a real or stubbed EOS block. Provider down → EosWaitBlockTimeoutError → retried; if it never recovers within attempts: 10, the chain dies.
- Speed-roulette container crash mid-job. The active job is parked in bull:speed-roulette:state:active with no live worker; the queue blocks until something moves it.
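The off-by-one in the first cause can be illustrated with a minimal sketch. The interface and function names are hypothetical and are not taken from roulette-state.processor.ts; the `>=` variant is one possible fix, not the project's decided one. It relies on this runbook's own observation that attemptsMade equal to 10 means the job just exhausted.

```typescript
// Hypothetical reduction of the onFailed guard; names are illustrative only.
interface FailedJobInfo {
  attemptsMade: number; // attempts consumed when the job failed
  attempts: number;     // configured retry budget (10 for state jobs)
}

// Guard as described above: never true on a clean exhaust.
function enqueuesErrorStrict(job: FailedJobInfo): boolean {
  return job.attemptsMade > job.attempts;
}

// Candidate fix: treat "budget fully used" as exhausted.
function enqueuesErrorInclusive(job: FailedJobInfo): boolean {
  return job.attemptsMade >= job.attempts;
}

const exhausted: FailedJobInfo = { attemptsMade: 10, attempts: 10 };
console.log(enqueuesErrorStrict(exhausted));    // false: no ERROR follow-up, the chain dies
console.log(enqueuesErrorInclusive(exhausted)); // true: an ERROR state would be enqueued
```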
Detection
- Grafana, bullmq dashboard: speed-roulette:state queue depth panel. Healthy: cycles between 0 and 1. Stuck: stays at 0 with no completed-jobs flow, or stays at 1 (active) with processedOn minutes old.
- Grafana, service-overview: ebit-speed-roulette request rate drops to zero on round-state transitions while inbound bet RPCs continue (or also drop if all clients leave).
- Loki: {service_name="ebit-speed-roulette"} |= "Roulette job" |= "failed" surfaces exhausted-retry messages.
- Alert: bull:speed-roulette:state:active LLEN > 0 with no bull:speed-roulette:state:completed increment for > 60 s.
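The alert condition in the last bullet can be sketched as a check over periodic samples. This is a minimal illustration, assuming a hypothetical poller that records the active list length and the completed count at each tick; the Sample shape and function name are not part of the codebase.

```typescript
// Hypothetical sample shape: one entry per polling tick.
interface Sample {
  atMs: number;           // sample time, Unix ms
  activeLen: number;      // LLEN bull:speed-roulette:state:active
  completedCount: number; // ZCARD bull:speed-roulette:state:completed
}

// True when the newest samples show active > 0 with an unchanged completed
// count for longer than windowMs.
function stuckAlert(samples: Sample[], windowMs: number = 60_000): boolean {
  const last = samples[samples.length - 1];
  if (last === undefined) return false;
  let start = last;
  // Walk backwards while the "active, nothing completing" condition holds.
  for (let i = samples.length - 1; i >= 0; i--) {
    const s = samples[i];
    if (s === undefined || s.activeLen === 0 || s.completedCount !== last.completedCount) break;
    start = s;
  }
  return last.atMs - start.atMs > windowMs;
}
```

The window is measured between the oldest and newest samples of the unbroken run, so the polling interval should be well below 60 s for the check to be meaningful.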
Triage
1. Inspect the state-queue contents
docker exec ebit-redis redis-cli -a cache <<'EOF'
LLEN bull:speed-roulette:state:wait
LLEN bull:speed-roulette:state:active
ZCARD bull:speed-roulette:state:delayed
ZCARD bull:speed-roulette:state:failed
LRANGE bull:speed-roulette:state:active 0 -1
EOF
Healthy: at any instant exactly one of delayed or active has a single entry, others are zero.
Stuck: all four are zero (chain died), or active holds a job whose processedOn is far in the past.
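The healthy/stuck rules above can be written down as a small classifier. This is a sketch under assumptions: the type and function names are hypothetical, and it treats only wait, active, and delayed as "live" entries, since a dead chain after retry exhaustion can leave failed non-zero.

```typescript
// Hypothetical types; the counts come from the LLEN/ZCARD commands above.
interface QueueCounts {
  wait: number;
  active: number;
  delayed: number;
  failed: number;
}

type QueueVerdict = "healthy" | "chain-died" | "check-active-job" | "unexpected";

function classifyQueue(c: QueueCounts): QueueVerdict {
  const live = c.wait + c.active + c.delayed;
  if (live === 0) return "chain-died";                        // nothing will run next
  if (live === 1 && c.delayed === 1) return "healthy";        // next state is scheduled
  if (live === 1 && c.active === 1) return "check-active-job"; // healthy only if processedOn is recent
  return "unexpected";                                        // more than one live entry, or a waiting job
}

console.log(classifyQueue({ wait: 0, active: 0, delayed: 0, failed: 1 })); // chain-died
console.log(classifyQueue({ wait: 0, active: 0, delayed: 1, failed: 0 })); // healthy
```

A "check-active-job" verdict is where step 2 picks up: the count alone cannot distinguish a job that just started from one that has been wedged for minutes.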
2. Dump the active job's payload + timestamps
ACTIVE_JOB=$(docker exec ebit-redis redis-cli -a cache LRANGE bull:speed-roulette:state:active 0 0 | tr -d '\r')
docker exec ebit-redis redis-cli -a cache HGETALL bull:speed-roulette:state:$ACTIVE_JOB
Inspect:
- data: JSON with gameId, state.current (ACCEPTING_BETS / WAITING_BLOCK / ROLLING / FINISHED / ERROR).
- processedOn: Unix ms when the worker picked it up. If now - processedOn > 60_000, the worker is wedged on this job.
- attemptsMade vs opts.attempts: close to 10 means exhaustion is imminent; equal to 10 means it just exhausted.
- failedReason: present on the failed-job path; tells you the exception class (EosWaitBlockTimeoutError, EosCurrentBlockOutdatedError, or generic).
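The field checks above can be applied mechanically to the HGETALL output. The sketch below assumes redis-cli's non-interactive format (alternating field and value lines) and the field names listed in this runbook; the helper names are hypothetical, and field names may differ between BullMQ versions.

```typescript
// Parse redis-cli's piped HGETALL output: alternating field and value lines.
function parseHgetall(raw: string): Record<string, string> {
  const lines = raw.split("\n").map((l) => l.replace(/\r$/, ""));
  const out: Record<string, string> = {};
  for (let i = 0; i + 1 < lines.length; i += 2) {
    const key = lines[i];
    const value = lines[i + 1];
    if (key !== undefined && value !== undefined && key !== "") out[key] = value;
  }
  return out;
}

interface JobAssessment {
  wedged: boolean;           // now - processedOn > 60_000
  attemptsRemaining: number; // opts.attempts - attemptsMade
  exhausted: boolean;        // attemptsMade has reached opts.attempts
}

function assessJob(fields: Record<string, string>, nowMs: number): JobAssessment {
  const processedOn = Number(fields["processedOn"]);
  const attemptsMade = Number(fields["attemptsMade"] ?? "0");
  const opts = JSON.parse(fields["opts"] ?? "{}") as { attempts?: number };
  const attempts = opts.attempts ?? 10; // assumption: fall back to the documented budget
  return {
    wedged: isFinite(processedOn) && nowMs - processedOn > 60_000,
    attemptsRemaining: Math.max(0, attempts - attemptsMade),
    exhausted: attemptsMade >= attempts,
  };
}
```

In practice the raw string would be the captured output of the HGETALL command from this step, and nowMs would be Date.now().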
3. Check the in-flight game in Postgres
docker exec ebit-db psql -U ebit -d ebit -c "
SELECT id, state, \"createdAt\", \"endedAt\",
\"eosBlockNum\", \"eosBlockId\", roll, color
FROM speed_roulette.\"speed_roulette_game\"
ORDER BY \"createdAt\" DESC
LIMIT 3;"
The newest row is the wedged round. Note its id (you'll need it for §Fix C) and its state. If state = ACCEPTING_BETS and createdAt is > 30 s old, the round never advanced past bet collection. If state = WAITING_BLOCK, EOS is the suspect.
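The two state-based hints above amount to a short decision rule. A minimal sketch, assuming the state names listed in step 2 and a hypothetical helper name:

```typescript
type GameState = "ACCEPTING_BETS" | "WAITING_BLOCK" | "ROLLING" | "FINISHED" | "ERROR";

// Map the newest speed_roulette_game row to a first triage hint.
function triageHint(state: GameState, createdAtMs: number, nowMs: number): string {
  const ageMs = nowMs - createdAtMs;
  if (state === "ACCEPTING_BETS" && ageMs > 30_000) {
    return "round never advanced past bet collection";
  }
  if (state === "WAITING_BLOCK") {
    return "suspect the EOS provider";
  }
  return "no state-specific hint; inspect the queue and worker logs";
}

console.log(triageHint("ACCEPTING_BETS", 0, 45_000)); // round never advanced past bet collection
```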
4. Check speed-roulette container health
docker compose ps ebit-speed-roulette
docker compose logs --tail=100 ebit-speed-roulette | grep -iE "Roulette|error|failed|EOS"
A crashed container is the simplest case — restart resolves the deadlock (§Fix B).
Fix
A. Drain a stuck active job (queue is wedged but worker is alive)
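ACTIVE_JOB=$(docker exec ebit-redis redis-cli -a cache LRANGE bull:speed-roulette:state:active 0 0 | tr -d '\r')
NOW_MS=$(($(date +%s) * 1000))
# Move it to failed (bypasses BullMQ's normal failure path; you accept losing this round)
docker exec ebit-redis redis-cli -a cache LREM bull:speed-roulette:state:active 0 "$ACTIVE_JOB"
# BullMQ derives job state from list/set membership, so record the failure on the job hash rather than in a state field
docker exec ebit-redis redis-cli -a cache HSET "bull:speed-roulette:state:$ACTIVE_JOB" failedReason "manually drained" finishedOn "$NOW_MS"
# failed is a sorted set scored by finish time (hence ZCARD in Triage step 1), so use ZADD; LPUSH would fail or corrupt the key
docker exec ebit-redis redis-cli -a cache ZADD bull:speed-roulette:state:failed "$NOW_MS" "$ACTIVE_JOB"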
Then restart the worker so startIfNotStarted bootstraps the now-empty queue:
docker compose restart ebit-speed-roulette
The module's onApplicationBootstrap hook calls startIfNotStarted (roulette.module.ts:49) which only bootstraps when delayed+active+waiting are all zero. The drain above ensures that.
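The bootstrap guard described above can be reduced to a one-line predicate. This is a hypothetical sketch of the behavior as this runbook describes it, not the code at roulette.module.ts:49.

```typescript
// Hypothetical shape for the three counts the guard looks at.
interface LiveCounts {
  delayed: number;
  active: number;
  waiting: number;
}

// Bootstrap only happens when nothing at all is queued.
function wouldBootstrap(c: LiveCounts): boolean {
  return c.delayed + c.active + c.waiting === 0;
}

const parked: LiveCounts = { delayed: 0, active: 1, waiting: 0 };
console.log(wouldBootstrap(parked));                  // false: the parked job blocks the bootstrap
console.log(wouldBootstrap({ ...parked, active: 0 })); // true: after the drain, a first job is enqueued
```

This is why the drain has to come before the restart: a restart alone leaves the parked entry in place and the guard returns early.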
B. Restart the speed-roulette container (worker crashed)
docker compose restart ebit-speed-roulette
docker compose logs -f ebit-speed-roulette | sed '/Roulette state is .*active\|Starting game/q'
The sed quits once you see either Roulette state is active (queue picked up where it left off) or Starting game: <id> (fresh bootstrap). If the state queue had contents (active/delayed/waiting), the worker resumes processing them. If empty, startIfNotStarted enqueues a first job.
C. Manually settle in-flight bets for a lost round
If you abandoned a round in §Fix A, refund the players who placed bets in that round. Replace <gameId> with the wedged-round ID from §Triage 3:
docker exec ebit-db psql -U ebit -d ebit <<'EOF'
BEGIN;
-- 1. Find affected bets
SELECT b.id, b."userId", b."currencyId", b."betAmount"
FROM speed_roulette."speed_roulette_bet" srb
JOIN public."Bet" b ON b.id = srb."betId"
WHERE srb."gameId" = '<gameId>'
AND b."settledAt" IS NULL;
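-- 2. Refund: credit UserBalance and mark each bet REFUNDED
-- (run after eyeballing the row count above)
-- NOTE: no transaction/ledger row is written below; add one separately if the ledger requires it.
-- Sum per user and currency first: UPDATE ... FROM applies only one joined row per
-- target row, so a user with several bets in the round would otherwise be under-refunded.
UPDATE public."UserBalance" ub
SET amount = ub.amount + r.total
FROM (
  SELECT b."userId", b."currencyId", SUM(b."betAmount") AS total
  FROM public."Bet" b
  JOIN speed_roulette."speed_roulette_bet" srb ON srb."betId" = b.id
  WHERE srb."gameId" = '<gameId>'
    AND b."settledAt" IS NULL
  GROUP BY b."userId", b."currencyId"
) r
WHERE ub."userId" = r."userId"
  AND ub."currencyId" = r."currencyId";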
UPDATE public."Bet" b
SET "settledAt" = now(),
payout = "betAmount",
state = 'REFUNDED'
FROM speed_roulette."speed_roulette_bet" srb
WHERE srb."gameId" = '<gameId>'
AND b.id = srb."betId"
AND b."settledAt" IS NULL;
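-- Review the output. As written, the heredoc ends without COMMIT, so psql rolls the
-- transaction back on disconnect (a dry run). To apply, uncomment COMMIT and re-run.
-- COMMIT;
-- ROLLBACK;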
EOF
Always run inside a transaction with explicit COMMIT / ROLLBACK after eyeballing the result. The exact column names and refund semantics may need to be checked against the live schema — see ../data-model/erd-speed-roulette.md. Before running this in production, dry-run on a snapshot.
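When checking the refund by hand, the amount credited to each balance should equal the sum of that user's unsettled bets in the round, per currency. A minimal sketch of that aggregation follows; the UnsettledBet shape, the integer minor-unit amounts, and the key format are assumptions for illustration and do not reflect the live schema.

```typescript
// Hypothetical row shape for the result of query 1; amounts in integer minor units.
interface UnsettledBet {
  userId: string;
  currencyId: string;
  betAmountMinor: number;
}

// Sum refunds per (userId, currencyId) so a player with several bets gets the full total.
function refundTotals(bets: UnsettledBet[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const bet of bets) {
    const key = bet.userId + ":" + bet.currencyId;
    totals[key] = (totals[key] ?? 0) + bet.betAmountMinor;
  }
  return totals;
}

const totals = refundTotals([
  { userId: "u1", currencyId: "c1", betAmountMinor: 500 },
  { userId: "u1", currencyId: "c1", betAmountMinor: 250 },
  { userId: "u2", currencyId: "c1", betAmountMinor: 100 },
]);
console.log(totals["u1:c1"]); // 750
console.log(totals["u2:c1"]); // 100
```

Comparing these totals against the balance deltas before committing is a cheap guard against a partial refund.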
D. Nuclear option: wipe the state queue
If §Fix A doesn't unwedge:
docker exec ebit-redis redis-cli -a cache <<'EOF'
DEL bull:speed-roulette:state:wait
DEL bull:speed-roulette:state:active
DEL bull:speed-roulette:state:delayed
DEL bull:speed-roulette:state:failed
DEL bull:speed-roulette:state:completed
EOF
docker compose restart ebit-speed-roulette
This loses any in-flight job state. Run §Fix C for any bets stranded by the wipe.
Verification
After applying a fix:
- Queue depth: bull:speed-roulette:state:active LLEN cycles between 0 and 1; delayed ZCARD has one entry between transitions. failed ZCARD doesn't grow.
- Next round starts within 30 s: a fresh speed_roulette_game row appears with state = ACCEPTING_BETS.
- WS broadcast resumes: open the dropbet UI, confirm the timer is counting down again. Server-side: {service_name="ebit-rt"} |= "SpeedRouletteStateUpdate" shows new emits in Loki.
- End-to-end bet succeeds: place a small bet via the UI; confirm 201, balance change, and the bet appears in latestGames after FINISHED.
- No further failed jobs for 5 minutes: bull:speed-roulette:state:failed ZCARD stays flat.
Prevention
- Per-job timeout: addStateJob (roulette-state.processor.ts:147-160) sets no timeout on speed-roulette state jobs. BullMQ has no per-job timeout option, and lockDuration is a Worker option (default 30 s) whose lock is renewed while the worker process is alive, so it does not fail a job that is merely waiting on a remote call. Making a wedged job fail fast needs an explicit timeout around the remote calls inside process(), which is a code change. File a follow-up referencing {{TBD: ADR for speed-roulette job-timeout policy — currently not authored, candidate ADR}}.
- Alert on bull:speed-roulette:state queue depth > 5 for > 60 s; that's outside the steady-state band of 0–1.
- Alert on consecutive Roulette job ".*" failed after 10 attempts log lines; the chain is about to die.
- Alert on speed_roulette_game.state = ACCEPTING_BETS rows older than 30 s; the round didn't advance.
- Bootstrap sweep cron (defensive): a periodic job that checks "is there a live state-queue entry?" and enqueues first if not. Currently only onApplicationBootstrap does this, so a crash + non-empty-queue combination defeats it. File follow-up.
- EOS provider monitoring: WAITING_BLOCK is a hard dependency. Track EOS block production lag separately; alert before it cascades into stuck rounds.
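One way to bound a remote call inside a processor is to race it against a timer. The helper below is a generic sketch, not code from the speed-roulette service: withTimeout and the JobTimeoutError name are hypothetical, and the underlying call is not cancelled, only abandoned, so the caller still needs the failure to flow into the normal retry path.

```typescript
// Reject if `work` has not settled within `ms`; otherwise pass its result through.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      const err = new Error("step exceeded " + ms + " ms");
      err.name = "JobTimeoutError"; // hypothetical error name
      reject(err);
    }, ms);
    work.then(
      (value) => {
        clearTimeout(timer); // do not keep the process alive after success
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Usage: a call that never settles is cut off after 100 ms.
withTimeout(new Promise<void>(() => {}), 100).catch((err: Error) => console.log(err.name)); // JobTimeoutError
```

The timeout budget would need to sit below the round's design budget so that a failed attempt still leaves room for a retry.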
Cross-references
- ../flows/dropbet-speed-roulette.md: full round design, sequence diagram, EOS dependency
- ../data-model/erd-speed-roulette.md: schema for the manual-settle SQL in §Fix C
- ../adr/0005-no-traceparent-on-redis-rpc.md: why downstream calls from speed-roulette appear as orphan traces
- bullmq-job-stuck.md: generic BullMQ triage; this runbook handles the speed-roulette-specific case
- ../handover/oncall-runbook.md: first-response procedure (this is a P1 per the severity table)