Runbook: BullMQ job isn't running

Symptom

A background job (bet settlement, session update, leaderboard update, bot action, etc.) appears to be stuck. The user action completed, but its side effects (leaderboard update, email, stat increment) never happened.

Likely causes

  1. Job is in waiting state but no processor is running (service crashed or not started)
  2. Job failed and exhausted retries — sitting in failed state
  3. Job is in active state but the processor is blocked (Prisma transaction timeout, Redis disconnect)
  4. Wrong Redis instance — some queues use cache (6379), others use bot (6380)
  5. Processor threw an unhandled exception and the worker died

Diagnosis

1. Identify the queue and Redis instance

Redis   Port   Queues
cache   6379   update-session, bet_settled_queue, update-user-stats, migrate-user-stats, leaderboard_queue, SKINDECK_DEPOSIT, promo-expired, speed-roulette-*
bot     6380   bots-bet, bots-session-scheduler, bots-start-session, challenges
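
When the lookup has to happen in a script rather than by eye, the table above can be encoded as a small map. This is only a sketch: `RedisTarget` and `redisForQueue` are hypothetical names, and the entries are copied from this runbook rather than read from the services' configuration.

```typescript
// Mapping copied from the queue table in this runbook.
// `RedisTarget` and `redisForQueue` are hypothetical names used only for this sketch.
type RedisTarget = { instance: "cache" | "bot"; port: number };

const CACHE: RedisTarget = { instance: "cache", port: 6379 };
const BOT: RedisTarget = { instance: "bot", port: 6380 };

const QUEUE_REDIS = new Map<string, RedisTarget>([
  ["update-session", CACHE],
  ["bet_settled_queue", CACHE],
  ["update-user-stats", CACHE],
  ["migrate-user-stats", CACHE],
  ["leaderboard_queue", CACHE],
  ["SKINDECK_DEPOSIT", CACHE],
  ["promo-expired", CACHE],
  ["bots-bet", BOT],
  ["bots-session-scheduler", BOT],
  ["bots-start-session", BOT],
  ["challenges", BOT],
]);

function redisForQueue(queue: string): RedisTarget | undefined {
  // speed-roulette-* is a family of queues, so it is matched by prefix.
  if (queue.startsWith("speed-roulette-")) return CACHE;
  // Unknown queues return undefined instead of guessing an instance.
  return QUEUE_REDIS.get(queue);
}
```

Returning `undefined` for an unknown queue is deliberate: connecting to the wrong instance is one of the likely causes listed above, so a silent default would hide the problem.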

2. Inspect queue state in Redis

# Connect to the right Redis instance
redis-cli -a cache -p 6379    # cache queues
redis-cli -a bot -p 6380      # bot queues

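# List all BullMQ queue namespaces
# (SCAN rather than KEYS: KEYS blocks Redis while it walks the whole keyspace)
SCAN 0 MATCH bull:*:meta COUNT 1000   # repeat with the returned cursor until it comes back as 0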

# Check queue depth by state
LLEN bull:<queue>:wait          # waiting jobs
LLEN bull:<queue>:active        # currently processing
ZCARD bull:<queue>:delayed      # delayed (scheduled)
ZCARD bull:<queue>:failed       # failed after retries

# Inspect a specific job
LRANGE bull:<queue>:wait -1 -1  # ID of the next job to be picked up (for default FIFO jobs, workers pop from the tail; index 0 is the newest)
HGETALL bull:<queue>:<jobId>    # full job payload + metadata
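
The counts gathered above can be mapped back onto the likely causes with a rough triage rule. The sketch below is an assumption about how to read the numbers, not an authoritative classifier: `QueueCounts` and `likelyCause` are hypothetical names, and the worker count has to come from elsewhere (for example from checking whether the owning service is up).

```typescript
// Hypothetical triage helper: maps queue counts to the likely causes listed in this runbook.
interface QueueCounts {
  waiting: number; // LLEN bull:<queue>:wait
  active: number;  // LLEN bull:<queue>:active
  failed: number;  // ZCARD bull:<queue>:failed
}

function likelyCause(counts: QueueCounts, workerCount: number): string {
  // Failed jobs are checked first because they will never run again on their own.
  if (counts.failed > 0) return "jobs exhausted retries and sit in failed";
  if (counts.waiting > 0 && workerCount === 0) return "no processor is running";
  // Work is piling up behind jobs that are already active.
  if (counts.active > 0 && counts.waiting > 0) return "processor may be blocked mid-job";
  if (counts.waiting === 0 && counts.active === 0) {
    return "queue is empty: check the other Redis instance";
  }
  return "no obvious problem from counts alone";
}
```

The same counts should also be available from application code through BullMQ's `queue.getJobCounts("waiting", "active", "failed")`, which avoids reading the keys by hand.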

3. Check processor logs in Loki

# Search for queue-related logs
curl -s -G "http://localhost:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={service_name="ebit-api"} |= "<queue-name>"' \
  --data-urlencode 'limit=10' | python3 -m json.tool

# Or in Grafana: Logs-Trace Pivot dashboard → set $service to ebit-api
# and search for the queue name in the log body
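
For scripted checks, the same Loki request can be built in code. This is a minimal sketch that reuses the base URL and `service_name` label from the curl command above; `lokiQueryUrl` is a hypothetical helper, and it only builds the URL, it does not send the request.

```typescript
// Builds the same query_range URL that the curl command in this runbook sends.
// `lokiQueryUrl` is a hypothetical helper name.
function lokiQueryUrl(base: string, service: string, queueName: string, limit = 10): string {
  // LogQL: select the service's log stream, then keep lines containing the queue name.
  const logql = `{service_name="${service}"} |= "${queueName}"`;
  // The query must be URL-encoded because it contains braces, quotes, and spaces.
  return `${base}/loki/api/v1/query_range?query=${encodeURIComponent(logql)}&limit=${limit}`;
}
```

Calling `lokiQueryUrl("http://localhost:3100", "ebit-api", "bet_settled_queue")` yields a URL that can be passed to `curl -s` or `fetch` as-is.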

4. Check if the processor is running

# Verify the owning service is up
sudo docker ps | grep ebit-api   # most queues
sudo docker ps | grep ebit-speed-roulette   # speed-roulette queues

# Check for crash loops
sudo docker logs --tail 50 ebit-api 2>&1 | grep -iE "error|crash|unhandled|bull"

Fix

Job stuck in waiting — processor not running

# Restart the owning service
sudo docker compose restart ebit-api
# Or for speed-roulette queues:
sudo docker compose restart ebit-speed-roulette

Job stuck in failed — retry manually

redis-cli -a cache -p 6379
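# Move failed job back to waiting (failed is a sorted set, so use ZREM; LREM returns WRONGTYPE)
ZREM bull:<queue>:failed <jobId>
LPUSH bull:<queue>:wait <jobId>
# No state field needs resetting: BullMQ derives a job's state from which list/set holds its ID.
# Note: editing keys by hand bypasses BullMQ's own retry logic (events, cleanup of failure fields).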
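
Where application code or a maintenance script has access to the queue object, the retry can go through BullMQ instead of raw Redis commands. The sketch below declares only the members it relies on (`Queue.getJob`, `Job.getState`, `Job.retry`) as structural types, on the assumption that a real BullMQ `Queue` satisfies them; `retryFailedJob` is a hypothetical helper.

```typescript
// Structural types covering only the BullMQ members this sketch relies on.
// `retryFailedJob` is a hypothetical helper, not part of BullMQ or this codebase.
interface RetryableJob {
  getState(): Promise<string>;
  retry(state?: "failed" | "completed"): Promise<void>;
}

interface JobLookup {
  getJob(jobId: string): Promise<RetryableJob | undefined>;
}

async function retryFailedJob(queue: JobLookup, jobId: string): Promise<boolean> {
  const job = await queue.getJob(jobId);
  if (!job) return false; // unknown ID, or the job was already removed

  // Only retry jobs that are really in the failed set.
  if ((await job.getState()) !== "failed") return false;

  // The library moves the job from failed back to waiting in one step.
  await job.retry("failed");
  return true;
}
```

The boolean return value lets a calling script report which job IDs were skipped instead of failing on the first missing one.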

Job stuck in active — processor blocked

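# Check if the job has been active for too long
HGET bull:<queue>:<jobId> processedOn
# processedOn is a Unix timestamp in milliseconds.
# If it is old (> 5 minutes), the processor likely died mid-job.
# A live worker keeps renewing the job lock, so confirm the lock is gone first:
EXISTS bull:<queue>:<jobId>:lock   # 0 = no worker currently holds the job

# Re-queue the stale active job (this moves it back to waiting; it does not mark it failed)
LREM bull:<queue>:active 0 <jobId>
LPUSH bull:<queue>:wait <jobId>
# No state field needs resetting: BullMQ derives a job's state from which list/set holds its ID.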

# Restart the service to clear the stale worker
sudo docker compose restart ebit-api
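
The staleness check above is simple arithmetic on `processedOn`. A minimal sketch, assuming the value is passed in exactly as `HGET` returns it (a string of epoch milliseconds, or nothing if the field is absent); `isStaleActiveJob` and `STALE_AFTER_MS` are hypothetical names.

```typescript
// Hypothetical helper: decides whether an active job looks abandoned.
const STALE_AFTER_MS = 5 * 60 * 1000; // the 5-minute rule of thumb from this runbook

function isStaleActiveJob(
  processedOn: string | null, // raw HGET result; null when the field is missing
  nowMs: number,              // current time in epoch milliseconds, e.g. Date.now()
  thresholdMs: number = STALE_AFTER_MS,
): boolean {
  // A missing or empty value means the job was never picked up.
  if (processedOn === null || processedOn.trim() === "") return false;
  const startedMs = Number(processedOn);
  // Unparseable value: do not guess, treat as not stale.
  if (!Number.isFinite(startedMs)) return false;
  return nowMs - startedMs > thresholdMs;
}
```

Passing the clock in as `nowMs` keeps the function deterministic, so the threshold can be checked without waiting five minutes.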

Nuclear option — drain and reset a queue

redis-cli -a cache -p 6379
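# WARNING: this deletes all jobs in the queue.
# It removes only the state lists/sets and queue metadata. The per-job hashes
# (bull:<queue>:<jobId>) are left behind, and deleting bull:<queue>:id resets the
# job ID counter, so new job IDs can collide with those leftover hashes.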
DEL bull:<queue>:wait bull:<queue>:active bull:<queue>:delayed bull:<queue>:failed bull:<queue>:completed bull:<queue>:meta bull:<queue>:id
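
If the reset can be run from code instead, BullMQ's `Queue.obliterate` removes the queue together with its job data. The sketch below wraps it in a name confirmation so that a wrong variable cannot wipe the wrong queue; `ObliterableQueue` and `resetQueue` are hypothetical names, and the structural type is an assumption about the minimal surface of a real `Queue`.

```typescript
// Structural type covering only Queue.name and Queue.obliterate from BullMQ.
// `resetQueue` and its confirmation argument are hypothetical.
interface ObliterableQueue {
  readonly name: string;
  obliterate(opts?: { force?: boolean }): Promise<void>;
}

async function resetQueue(queue: ObliterableQueue, confirmName: string): Promise<void> {
  // Require the caller to spell out the queue name before anything is deleted.
  if (confirmName !== queue.name) {
    throw new Error(`refusing to reset "${queue.name}": confirmation was "${confirmName}"`);
  }
  // force: true allows the wipe even while jobs are still active.
  await queue.obliterate({ force: true });
}
```

This is as destructive as the `DEL` command above, so the same warning applies: every job in the queue is lost.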

Prevention

  • Monitor bullmq_queue_jobs{state="failed"} in the BullMQ Grafana dashboard — alert on > 0
  • Watch for bullmq_queue_jobs{state="waiting"} trending upward; a steady rise means processors can't keep up
  • All queue processors should have explicit error handling and logging in onFailed() callbacks
  • The bet_settled_queue processor (bet.queue-processor.ts) uses @PrismaTransactional — if the Prisma transaction times out, the job fails and retries. Check Prisma/Postgres dashboard for slow queries.
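
As an illustration of the logging bullet above, a failed-job hook can build one log line that says whether the job will be retried. This is a sketch under assumptions: the field names mirror BullMQ's `Job` (`id`, `name`, `attemptsMade`, `opts.attempts`), `describeFailure` is a hypothetical helper, and it assumes `attemptsMade` already counts the attempt that just failed, which may differ between BullMQ versions.

```typescript
// Hypothetical formatter for a worker "failed" hook.
// Field names mirror BullMQ's Job; everything else is illustrative.
interface FailedJobInfo {
  id?: string;
  name: string;
  attemptsMade: number;
  opts: { attempts?: number };
}

function describeFailure(queue: string, job: FailedJobInfo | undefined, err: Error): string {
  // The job can be undefined in a failed hook, e.g. when it was removed meanwhile.
  if (!job) return `[${queue}] job failed (job data unavailable): ${err.message}`;

  const maxAttempts = job.opts.attempts ?? 1; // assumed: one attempt unless attempts is set
  const exhausted = job.attemptsMade >= maxAttempts;
  const outcome = exhausted ? "retries exhausted" : "will retry";
  return (
    `[${queue}] job ${job.id ?? "?"} (${job.name}) failed on attempt ` +
    `${job.attemptsMade}/${maxAttempts}, ${outcome}: ${err.message}`
  );
}
```

Including the queue name and the attempt count in the same line makes the Loki search from the diagnosis steps enough to tell a transient failure from one that needs a manual retry.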