Runbook: BullMQ job isn't running

Symptom

A background job (bet settlement, session update, leaderboard update, bot action, etc.) appears to be stuck: the user action completed, but its side effects (leaderboard update, email, stat increment) never happened.

Likely causes

Job is in the waiting state but no processor is running (service crashed or was never started)
Job failed and exhausted its retries, so it is sitting in the failed state
Job is in the active state but the processor is blocked (Prisma transaction timeout, Redis disconnect)
You are looking at the wrong Redis instance: some queues use cache (6379), others use bot (6380)
Processor threw an unhandled exception and the worker died

Diagnosis

1. Identify the queue and Redis instance

Queue	Redis	Port
`update-session`, `bet_settled_queue`, `update-user-stats`, `migrate-user-stats`, `leaderboard_queue`, `SKINDECK_DEPOSIT`, `promo-expired`, `speed-roulette-*`	cache	6379
`bots-bet`, `bots-session-scheduler`, `bots-start-session`, `challenges`	bot	6380
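
The table above can be turned into a small lookup so that scripts pick the right instance without a manual check. This is a minimal sketch: `redis_instance_for_queue` is a hypothetical helper name, and the queue names are copied from the table, so the function goes stale whenever a queue is added or moved.

```shell
# Hypothetical helper: print "<instance> <port>" for a queue, per the table above.
# Returns non-zero for a queue name that is not in the table.
redis_instance_for_queue() {
  case "$1" in
    bots-bet|bots-session-scheduler|bots-start-session|challenges)
      echo "bot 6380" ;;
    update-session|bet_settled_queue|update-user-stats|migrate-user-stats)
      echo "cache 6379" ;;
    leaderboard_queue|SKINDECK_DEPOSIT|promo-expired|speed-roulette-*)
      echo "cache 6379" ;;
    *)
      echo "unknown queue: $1" >&2
      return 1 ;;
  esac
}
```

For example, `redis_instance_for_queue bots-bet` prints `bot 6380`, and any `speed-roulette-*` queue resolves to `cache 6379`.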

2. Inspect queue state in Redis

# Connect to the right Redis instance
redis-cli -a cache -p 6379    # cache queues
redis-cli -a bot -p 6380      # bot queues

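# List all BullMQ queue namespaces
# KEYS walks the whole keyspace and blocks Redis while it runs; on a busy instance,
# run `redis-cli -a cache -p 6379 --scan --pattern 'bull:*:meta'` from the shell instead
KEYS bull:*:meta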

# Check queue depth by state
LLEN bull:<queue>:wait          # waiting jobs
LLEN bull:<queue>:active        # currently processing
ZCARD bull:<queue>:delayed      # delayed (scheduled)
ZCARD bull:<queue>:failed       # failed after retries

# Inspect a specific job
LRANGE bull:<queue>:wait 0 0    # newest waiting job ID; use -1 -1 for the job next in line
HGETALL bull:<queue>:<jobId>    # full job payload + metadata
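
The counts returned by the commands above map onto the likely causes roughly as follows. This is a sketch of the triage logic, not an exhaustive rule set: `triage_queue` is a hypothetical name, and the function only interprets the three numbers you pass in (it does not talk to Redis).

```shell
# Hypothetical helper: suggest where to look next.
#   $1 = LLEN  bull:<queue>:wait
#   $2 = LLEN  bull:<queue>:active
#   $3 = ZCARD bull:<queue>:failed
# The body runs in a subshell so its variables do not leak into the caller.
triage_queue() (
  wait_n=$1; active_n=$2; failed_n=$3
  if [ "$failed_n" -gt 0 ]; then
    echo "failed jobs present: inspect the job hash, then retry"
  elif [ "$wait_n" -gt 0 ] && [ "$active_n" -eq 0 ]; then
    echo "jobs waiting, none active: processor is probably not running"
  elif [ "$active_n" -gt 0 ]; then
    echo "jobs active: check processedOn for a blocked processor"
  else
    echo "queue empty: check the other Redis instance"
  fi
)
```

A result of `triage_queue 12 0 0`, for instance, points at the "processor not running" case, while `triage_queue 0 1 0` suggests checking how long the active job has been running.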

3. Check processor logs in Loki

# Search for queue-related logs
curl -s -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_name="ebit-api"} |= "<queue-name>"' \
  --data-urlencode 'limit=10' | python3 -m json.tool

# Or in Grafana: Logs-Trace Pivot dashboard → set $service to ebit-api
# and search for the queue name in the log body
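
When `start` and `end` are omitted, Loki's `query_range` defaults to the last hour, which can miss a job that got stuck earlier. The API accepts `start` as a Unix timestamp in nanoseconds; the sketch below computes one. `loki_start_ns` is a hypothetical helper, and the current time is passed in as an argument so the result is deterministic.

```shell
# Hypothetical helper: print a Loki `start` value (Unix nanoseconds) N minutes in the past.
#   $1 = minutes to look back
#   $2 = current time in Unix seconds, e.g. "$(date +%s)"
loki_start_ns() {
  # Convert seconds to nanoseconds by appending nine zeros,
  # which avoids doing arithmetic on very large integers.
  echo "$(( $2 - $1 * 60 ))000000000"
}
```

To widen the search to six hours, add `--data-urlencode "start=$(loki_start_ns 360 "$(date +%s)")"` to the curl call above.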

4. Check if the processor is running

# Verify the owning service is up
sudo docker ps | grep ebit-api   # most queues
sudo docker ps | grep ebit-speed-roulette   # speed-roulette queues

# Check for crash loops
sudo docker logs --tail 50 ebit-api 2>&1 | grep -iE "error|crash|unhandled|bull"

Fix

Job stuck in `waiting` — processor not running

# Restart the owning service
sudo docker compose restart ebit-api
# Or for speed-roulette queues:
sudo docker compose restart ebit-speed-roulette

Job stuck in `failed` — retry manually

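redis-cli -a cache -p 6379      # use -a bot -p 6380 for bot queues
# Move the failed job back to waiting. `failed` is a sorted set (hence ZCARD in
# the diagnosis step), so remove it with ZREM; LREM fails with a WRONGTYPE error.
ZREM bull:<queue>:failed <jobId>
LPUSH bull:<queue>:wait <jobId>
# BullMQ derives a job's state from the list or set that holds its ID, not from
# a field on the job hash, so clear the stale failure fields instead of setting one
HDEL bull:<queue>:<jobId> failedReason finishedOn processedOn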

Job stuck in `active` — processor blocked

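# Check how long the job has been active
HGET bull:<queue>:<jobId> processedOn
# processedOn is a Unix timestamp in milliseconds. If it is more than
# 5 minutes old, the processor likely died mid-job

# Requeue the stale active job (this moves it back to waiting; it does not mark it failed)
LREM bull:<queue>:active 0 <jobId>
LPUSH bull:<queue>:wait <jobId>
# No state field needs resetting: BullMQ infers a job's state from the list or set that holds its ID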

# Restart the service to clear the stale worker
sudo docker compose restart ebit-api
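
Because `processedOn` is stored as a Unix timestamp in milliseconds, the 5-minute check above is simple arithmetic. A minimal sketch follows, with `is_stale` as a hypothetical helper and the 300-second default taken from the rule of thumb above; the current time is passed in as an argument so the function is easy to test.

```shell
# Hypothetical helper: succeed (exit 0) if a job has been active longer than the limit.
#   $1 = processedOn from the job hash (Unix milliseconds)
#   $2 = current time in Unix seconds, e.g. "$(date +%s)"
#   $3 = limit in seconds (optional, defaults to 300)
# The body runs in a subshell so its variables do not leak into the caller.
is_stale() (
  processed_s=$(( $1 / 1000 ))     # milliseconds -> seconds
  age_s=$(( $2 - processed_s ))
  limit_s=${3:-300}
  echo "active for ${age_s}s"
  [ "$age_s" -gt "$limit_s" ]
)
```

A typical call would be `is_stale "$processed_on" "$(date +%s)" && echo "requeue it"`, where `$processed_on` holds the value returned by the HGET above.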

Nuclear option — drain and reset a queue

redis-cli -a cache -p 6379
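# WARNING: this deletes all jobs in the queue, resets the job ID counter (:id),
# and removes the queue metadata (:meta). The per-job hashes (bull:<queue>:<jobId>)
# are not covered by this DEL and stay behind as orphaned keys.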
DEL bull:<queue>:wait bull:<queue>:active bull:<queue>:delayed bull:<queue>:failed bull:<queue>:completed bull:<queue>:meta bull:<queue>:id

Prevention

Monitor bullmq_queue_jobs{state="failed"} in the BullMQ Grafana dashboard and alert when it is > 0
Watch for bullmq_queue_jobs{state="waiting"} trending upward, which indicates that processors can't keep up
All queue processors should have explicit error handling and logging in their onFailed() callbacks
The bet_settled_queue processor (bet.queue-processor.ts) uses @PrismaTransactional: if the Prisma transaction times out, the job fails and retries. Check the Prisma/Postgres dashboard for slow queries.