Runbook: BullMQ job isn't running
Symptom
A background job (bet settlement, session update, leaderboard update, bot action, etc.) appears to be stuck: the user action completed, but its side effects (leaderboard update, email, stat increment) did not happen.
Likely causes
- Job is in `waiting` state but no processor is running (service crashed or not started)
- Job failed and exhausted retries, so it is sitting in `failed` state
- Job is in `active` state but the processor is blocked (Prisma transaction timeout, Redis disconnect)
- Wrong Redis instance: some queues use cache (6379), others use bot (6380)
- Processor threw an unhandled exception and the worker died
Diagnosis
1. Identify the queue and Redis instance
| Queue | Redis | Port |
|---|---|---|
| update-session, bet_settled_queue, update-user-stats, migrate-user-stats, leaderboard_queue, SKINDECK_DEPOSIT, promo-expired, speed-roulette-* | cache | 6379 |
| bots-bet, bots-session-scheduler, bots-start-session, challenges | bot | 6380 |
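The table lookup can be scripted so that later commands always target the right instance. The following is a minimal sketch, assuming bash; the function name `redis_port_for_queue` is hypothetical and the queue list is copied from the table, so it needs updating whenever the table changes.

```shell
# Print the Redis port for a queue, based on the table in this runbook.
# Returns 1 (and prints to stderr) for a queue that is not in the table.
redis_port_for_queue() {
  case "$1" in
    update-session|bet_settled_queue|update-user-stats|migrate-user-stats)
      echo 6379 ;;   # cache instance
    leaderboard_queue|SKINDECK_DEPOSIT|promo-expired|speed-roulette-*)
      echo 6379 ;;   # cache instance
    bots-bet|bots-session-scheduler|bots-start-session|challenges)
      echo 6380 ;;   # bot instance
    *)
      echo "unknown queue: $1" >&2
      return 1 ;;
  esac
}
```

For example, `redis_port_for_queue leaderboard_queue` prints the cache port, and an unlisted queue name fails loudly instead of silently defaulting to one instance.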
2. Inspect queue state in Redis
# Connect to the right Redis instance
redis-cli -a cache -p 6379 # cache queues
redis-cli -a bot -p 6380 # bot queues
# List all BullMQ queue namespaces (KEYS blocks Redis while it scans; on a busy instance prefer SCAN)
KEYS bull:*:meta
# Check queue depth by state
LLEN bull:<queue>:wait # waiting jobs
LLEN bull:<queue>:active # currently processing
ZCARD bull:<queue>:delayed # delayed (scheduled)
ZCARD bull:<queue>:failed # failed after retries
# Inspect a specific job
LRANGE bull:<queue>:wait -1 -1 # ID of the next job to be picked up (workers pop from the tail of the list)
HGETALL bull:<queue>:<jobId> # full job payload + metadata
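The four depth checks above can be collected into one call. This is a sketch under stated assumptions: bash, a hypothetical function name `queue_depth`, and a hypothetical `REDIS_CLI` variable that lets the client binary be swapped out. The password is passed with `-a` exactly as in the commands above.

```shell
# Print the job count for each state of one queue, one "state=count" per line.
# Usage: queue_depth <queue> <port> <password>
queue_depth() {
  local queue="$1" port="$2" auth="$3" state cmd count
  for state in wait active delayed failed; do
    # wait and active are lists; delayed and failed are sorted sets
    case "$state" in
      wait|active) cmd=LLEN ;;
      *)           cmd=ZCARD ;;
    esac
    count="$("${REDIS_CLI:-redis-cli}" -a "$auth" -p "$port" "$cmd" "bull:${queue}:${state}")"
    printf '%s=%s\n' "$state" "$count"
  done
}
```

A call such as `queue_depth leaderboard_queue 6379 cache` gives a one-glance view of where jobs are piling up before digging into individual job hashes.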
3. Check processor logs in Loki
# Search for queue-related logs
curl -s -G "http://localhost:3100/loki/api/v1/query_range" \
--data-urlencode 'query={service_name="ebit-api"} |= "<queue-name>"' \
--data-urlencode 'limit=10' | python3 -m json.tool
# Or in Grafana: Logs-Trace Pivot dashboard → set $service to ebit-api
# and search for the queue name in the log body
4. Check if the processor is running
# Verify the owning service is up
sudo docker ps | grep ebit-api # most queues
sudo docker ps | grep ebit-speed-roulette # speed-roulette queues
# Check for crash loops
sudo docker logs --tail 50 ebit-api 2>&1 | grep -iE "error|crash|unhandled|bull"
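The queue-to-service mapping used in this step can be sketched as a small helper. It assumes bash, that every queue other than speed-roulette-* is processed by ebit-api (the runbook only says "most queues", so treat that default as an assumption), and hypothetical names `owning_service`, `service_is_up`, and `DOCKER`.

```shell
# Print the container expected to run the processor for a queue.
owning_service() {
  case "$1" in
    speed-roulette-*) echo ebit-speed-roulette ;;
    *)                echo ebit-api ;;   # assumption: default owner
  esac
}

# Succeed when a running container's name contains the given string.
# DOCKER can be overridden, for example DOCKER="sudo docker".
service_is_up() {
  ${DOCKER:-docker} ps --format '{{.Names}}' | grep -q -- "$1"
}
```

Chaining the two, `service_is_up "$(owning_service leaderboard_queue)"` reports through its exit status whether the processor's container is running at all.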
Fix
Job stuck in waiting: processor not running
# Restart the owning service
sudo docker compose restart ebit-api
# Or for speed-roulette queues:
sudo docker compose restart ebit-speed-roulette
Job stuck in failed: retry manually
redis-cli -a cache -p 6379
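# Raw-Redis fallback; prefer BullMQ's job.retry() from application code when that is available
# Remove the job from failed (a sorted set, as the ZCARD check above shows, so use ZREM rather than LREM)
ZREM bull:<queue>:failed <jobId>
# Clear the failure metadata; BullMQ derives a job's state from the list or set that holds its ID, not from a hash field
HDEL bull:<queue>:<jobId> failedReason finishedOn processedOn
# Move the job back to waiting
LPUSH bull:<queue>:wait <jobId>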
Job stuck in active: processor blocked
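# Check how long the job has been active (processedOn is a Unix timestamp in milliseconds)
HGET bull:<queue>:<jobId> processedOn
# If processedOn is old (> 5 minutes), the processor likely died mid-job
# Requeue the stale active job (this moves it back to waiting; it does not mark it failed)
LREM bull:<queue>:active 0 <jobId>
LPUSH bull:<queue>:wait <jobId>
# No state field needs to be set: BullMQ derives a job's state from the list or set that holds its ID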
# Restart the service to clear the stale worker
sudo docker compose restart ebit-api
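Because processedOn is stored in milliseconds, the "older than 5 minutes" judgement is easy to get wrong by eye. A minimal sketch of the comparison, assuming bash; the function name `is_stale_active_job` and its argument order are hypothetical.

```shell
# Succeed (exit 0) when processedOn is older than the allowed age.
# Usage: is_stale_active_job <processedOn_ms> [now_seconds] [max_age_seconds]
is_stale_active_job() {
  local processed_on_ms="$1"
  local now_s="${2:-$(date +%s)}"   # defaults to the current time
  local max_age_s="${3:-300}"       # 5 minutes, as in this runbook
  # processedOn is in milliseconds; convert to seconds before comparing
  local age_s=$(( now_s - processed_on_ms / 1000 ))
  [ "$age_s" -gt "$max_age_s" ]
}
```

Pass the value returned by the HGET above as the first argument; the optional second and third arguments exist mainly so the check can be exercised with fixed timestamps.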
Nuclear option: drain and reset a queue
redis-cli -a cache -p 6379
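# WARNING: this deletes every job list and set for the queue, plus its metadata and job ID counter.
# Per-job hashes (bull:<queue>:<jobId>) are not removed by this command.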
DEL bull:<queue>:wait bull:<queue>:active bull:<queue>:delayed bull:<queue>:failed bull:<queue>:completed bull:<queue>:meta bull:<queue>:id
Prevention
- Monitor `bullmq_queue_jobs{state="failed"}` in the BullMQ Grafana dashboard and alert on > 0
- Check whether `bullmq_queue_jobs{state="waiting"}` is trending upward, which indicates processors can't keep up
- All queue processors should have explicit error handling and logging in `onFailed()` callbacks
- The `bet_settled_queue` processor (`bet.queue-processor.ts`) uses `@PrismaTransactional`: if the Prisma transaction times out, the job fails and retries. Check the Prisma/Postgres dashboard for slow queries.