Runbook: Redis under memory pressure

Symptom

Redis (cache instance, :6379, password cache) is approaching its memory limit, evicting keys, or refusing writes. Visible as one or more of:

  • perf-system Grafana dashboard: container_memory_rss{name="ebit-redis"} climbing toward (or above) mem_limit.
  • Loki: {service_name=~"ebit-.*"} |= "OOM command not allowed" or |= "ENOSPC" on Redis writes.
  • BullMQ jobs failing with OOM errors in the bullmq Grafana dashboard.
  • Cache hit ratio collapsing as allkeys-lru evicts hot keys.
  • INFO memory reporting used_memory near maxmemory (or near container mem_limit if maxmemory isn't set).

The local stack runs redis/redis-stack:latest with no --maxmemory or --maxmemory-policy configured in docker-compose.yml (REDIS_ARGS: "--requirepass cache --notify-keyspace-events KEA"). That means Redis grows unbounded until the container hits its cgroup mem_limit and is OOM-killed. This runbook covers both "approaching the cgroup ceiling" and "approaching a configured maxmemory".

This runbook covers the cache instance only. The bot Redis (:6380, password bot) follows the same procedure with different connection details; see bullmq-job-stuck.md §1 for the queue→instance map.

Likely causes

  1. BullMQ queue depth growth: a processor is failing every job and removeOnFail retention isn't bounded; failed-job hashes accumulate as bull:<queue>:<jobId>.
  2. Session store unbounded: user:login-attempts:*, recaptcha:* idempotency, or socket:* keys without TTL.
  3. Redis Stack indices: redis/redis-stack ships RedisJSON + RediSearch. If an index was created without a memory bound, it grows with every write to its tracked key set.
  4. Online-users zset growth: ONLINE_USERS_KEY accumulates entries; the rt service relies on TTL sweep, not active eviction (see ../flows/rt-websocket.md §6.3).
  5. maxmemory not set: the local stack and any deployed environment that didn't override REDIS_ARGS will grow until container OOM. Production-shaped sizing is {{TBD}}.

Detection

  • Grafana — redis dashboard: ops/sec, memory usage, eviction rate. Healthy: < 70% of maxmemory (or container limit) at peak; eviction rate flat.
  • Grafana — perf-system: Container Memory RSS panel for ebit-redis. Climb without a corresponding ops/sec spike means key accumulation, not load.
  • Loki: {service_name="ebit-api"} |= "OOM command not allowed when used memory > 'maxmemory'".
  • Alert: Container Memory RSS > 80% of mem_limit for > 5 min; eviction rate > 0/sec for > 5 min on a normally-no-eviction instance.
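The alert thresholds can be sanity-checked by hand against a single RSS reading. The sketch below is an illustration only: alert_level is a hypothetical helper name, the 80% (warning) and 90% (page) cut-offs are the ones this runbook lists under Prevention, and the "for > 5 min" duration condition is not modelled.

```shell
# alert_level <rss_bytes> <mem_limit_bytes>
# Prints "<level> <percent>" where level is ok, warning (>= 80%) or page (>= 90%).
# Hypothetical helper; it classifies one reading and ignores the 5-minute window.
alert_level() {
  awk -v rss="$1" -v limit="$2" 'BEGIN {
    if (limit + 0 == 0) { print "unknown"; exit }   # no limit supplied
    pct = 100 * rss / limit
    if (pct >= 90)      level = "page"
    else if (pct >= 80) level = "warning"
    else                level = "ok"
    printf "%s %.1f\n", level, pct
  }'
}
```

Feed it the RSS value from the perf-system panel and the container's mem_limit, both in bytes.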

Triage

1. Snapshot memory state

docker exec ebit-redis redis-cli -a cache INFO memory

Read fields:

  • used_memory_human: current bytes (Redis-side, excludes process overhead).
  • used_memory_rss_human: RSS as Redis sees it (close to cgroup measurement).
  • maxmemory_human: the configured cap. 0B means no cap is set, so the only ceiling is the container's mem_limit.
  • maxmemory_policy: noeviction (default) means writes start failing at the cap. allkeys-lru evicts cold keys.
  • mem_fragmentation_ratio: > 1.5 indicates significant fragmentation; consider active defragmentation (§Fix E) or a restart.
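To turn those fields into a single headroom figure, the raw (non-_human) values can be piped through awk. This is a minimal sketch rather than existing tooling: summarize_info_memory is a hypothetical helper name, and it assumes INFO memory output arrives on stdin with the usual CRLF line endings.

```shell
# Summarise `INFO memory` output read from stdin.
# Hypothetical helper; relies on the used_memory, maxmemory, maxmemory_policy
# and mem_fragmentation_ratio fields of INFO memory.
summarize_info_memory() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory"             { used = $2 }
    $1 == "maxmemory"               { max = $2 }
    $1 == "maxmemory_policy"        { policy = $2 }
    $1 == "mem_fragmentation_ratio" { frag = $2 }
    END {
      if (max + 0 == 0)
        printf "used=%.1f MB (no maxmemory cap; container mem_limit is the only ceiling)\n", used / 1048576
      else
        printf "used=%.1f MB (%.1f%% of maxmemory)\n", used / 1048576, 100 * used / max
      printf "policy=%s fragmentation=%s\n", policy, frag
    }'
}

# Usage (assumes the container name and password used throughout this runbook):
#   docker exec ebit-redis redis-cli -a cache INFO memory | summarize_info_memory
```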

2. Find the heaviest key prefixes

docker exec ebit-redis redis-cli -a cache --bigkeys

The output lists biggest single keys per type. For prefix-level totals, sample with:

docker exec ebit-redis redis-cli -a cache --scan --pattern 'bull:*' | head -100 | \
  xargs -I{} docker exec ebit-redis redis-cli -a cache MEMORY USAGE {} | \
  awk '{s+=$1; n++} END {printf "%.2f MB sampled across %d bull:* keys\n", s/1024/1024, n}'

Repeat for user:*, socket:*, cache:*, recaptcha:*, plus any custom prefix you suspect. Common offenders by raw byte count are usually bull:* and cache:*.
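Rather than repeating the sample once per prefix, one pass over a mixed sample can be grouped by top-level prefix. The following is a sketch under assumptions: sum_by_prefix is a hypothetical helper, it expects "<key> <bytes>" pairs on stdin, and it treats everything before the first ":" as the prefix (keys containing spaces would be mis-parsed).

```shell
# Reads "<key> <bytes>" pairs on stdin and prints "<prefix> <MB> <key count>",
# largest total first. Hypothetical helper, not part of redis-cli.
sum_by_prefix() {
  awk '{
      split($1, parts, ":")          # parts[1] is the text before the first ":"
      total[parts[1]] += $2
      count[parts[1]]++
    }
    END {
      for (p in total)
        printf "%s %.2f %d\n", p, total[p] / 1048576, count[p]
    }' | sort -k2,2nr
}

# Usage (same container and password as above; samples the first 1000 keys):
#   docker exec ebit-redis redis-cli -a cache --scan | head -1000 | \
#     while IFS= read -r k; do
#       printf '%s %s\n' "$k" "$(docker exec ebit-redis redis-cli -a cache MEMORY USAGE "$k")"
#     done | sum_by_prefix
```

The result is still a sample, so treat the ranking as a hint about where to look, not as an exact breakdown.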

3. Identify keys without TTL (most leak-prone)

docker exec ebit-redis redis-cli -a cache --scan --pattern '*' | head -1000 | \
  while IFS= read -r k; do
    ttl=$(docker exec ebit-redis redis-cli -a cache TTL "$k")
    [ "$ttl" = "-1" ] && echo "$k"
  done | head -50

TTL = -1 means the key has no expiry. Sessions, idempotency locks, and per-user counters should have TTL; queue jobs and config caches usually shouldn't. Anything unexpectedly persistent is a candidate fix target.

4. Confirm whether eviction is actually happening

docker exec ebit-redis redis-cli -a cache INFO stats | grep evicted_keys

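Non-zero and rising = eviction is in progress. evicted_keys is a cumulative counter since the last restart (or CONFIG RESETSTAT), so compare two readings taken a minute apart rather than judging a single value. Under the noeviction policy the counter cannot rise; if it does, the policy was probably overridden via CONFIG SET at runtime.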
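Because the counter only ever increases, the useful number is the difference between two readings. A small sketch of that arithmetic follows; evicted_keys and eviction_rate are hypothetical helper names, and the 60-second gap in the usage comment is an arbitrary choice.

```shell
# Extract the evicted_keys counter from `INFO stats` output on stdin.
evicted_keys() {
  tr -d '\r' | awk -F: '$1 == "evicted_keys" { print $2 }'
}

# eviction_rate <first_reading> <second_reading> <seconds_between>
# Prints evictions per second to two decimal places.
eviction_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) / s }'
}

# Usage (assumed container name and password from this runbook):
#   first=$(docker exec ebit-redis redis-cli -a cache INFO stats | evicted_keys)
#   sleep 60
#   second=$(docker exec ebit-redis redis-cli -a cache INFO stats | evicted_keys)
#   eviction_rate "$first" "$second" 60
```

A result of 0.00 on an instance that normally does not evict matches the healthy state described under Detection.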

Fix

A. Set maxmemory-policy allkeys-lru (immediate, runtime)

If memory pressure is genuine and writes are failing with OOM command not allowed, switching to LRU eviction lets Redis trim cold keys instead of refusing writes:

docker exec ebit-redis redis-cli -a cache CONFIG SET maxmemory-policy allkeys-lru
docker exec ebit-redis redis-cli -a cache CONFIG SET maxmemory 1gb   # adjust to fit

This is non-persistent — restart loses it. Persist by adding --maxmemory 1gb --maxmemory-policy allkeys-lru to REDIS_ARGS in docker-compose.yml (or the equivalent in production deployment config) and redeploying.

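Caveat: BullMQ data lives in Redis. allkeys-lru on a Redis that holds queues will silently drop jobs, and BullMQ's documentation calls for noeviction on queue-bearing instances. If this instance holds queues, keep noeviction and bound memory with removeOnComplete / removeOnFail rather than eviction. Where some eviction is unavoidable, volatile-lru restricts it to keys that carry a TTL; BullMQ job hashes carry none by default, so they survive, but they still need removeOnComplete / removeOnFail to stay bounded.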

B. Drain stale BullMQ queues

If bull:* is the dominant prefix (§Triage 2), follow bullmq-job-stuck.md §Fix to drain the offending queue. Common quick wins:

# Inspect failed/completed job retention per queue.
# BullMQ stores the failed and completed sets as sorted sets, so count them with ZCARD (LLEN returns WRONGTYPE).
for q in bet_settled_queue update-session leaderboard_queue; do
  echo -n "$q failed: "
  docker exec ebit-redis redis-cli -a cache ZCARD "bull:$q:failed"
  echo -n "$q completed: "
  docker exec ebit-redis redis-cli -a cache ZCARD "bull:$q:completed"
done

If completed is unbounded, the queue config is missing removeOnComplete — file a follow-up.

C. Add TTL to leaking keys

Once §Triage 3 has identified keys that should have a TTL but don't, the fix is at the application layer — add an EXPIRE to the producer. Apply a one-off retroactive TTL to existing keys:

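docker exec ebit-redis redis-cli -a cache --scan --pattern 'recaptcha:*' | \
  while IFS= read -r k; do
    # NX (Redis >= 7.0) sets the TTL only on keys that have none, leaving existing expiries untouched
    docker exec ebit-redis redis-cli -a cache EXPIRE "$k" 300 NX
  done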

D. Restart the container with a higher mem_limit (legitimate growth)

If memory growth tracks legitimate user growth and is not a leak, raise the container ceiling. Edit docker-compose.yml:

ebit-redis:
  image: redis/redis-stack:latest
  mem_limit: 4g
  environment:
    REDIS_ARGS: "--requirepass cache --notify-keyspace-events KEA --maxmemory 3500mb --maxmemory-policy allkeys-lru"

Then run docker compose up -d ebit-redis. Each MB of RSS costs roughly 1 MB of host memory; size the host accordingly. For the production sizing: {{TBD: production Redis sizing — currently no production deployment definition checked in. Engineering team to author once production stack is finalized}}.
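The compose example above sets maxmemory to 3500 MB against a 4 GB mem_limit, which leaves roughly 15% of the container for process overhead and fragmentation. A sketch of that arithmetic for other limits is shown here; suggest_maxmemory_mb is a hypothetical helper, and the 85% default is an assumption read off the example, not a documented sizing rule.

```shell
# suggest_maxmemory_mb <mem_limit_mb> [percent]
# Prints a maxmemory value in MB that leaves headroom below the container limit.
# Hypothetical helper; the 85% default mirrors the 3500mb-of-4g example only.
suggest_maxmemory_mb() {
  limit_mb=$1
  pct=${2:-85}
  echo $(( limit_mb * pct / 100 ))   # integer division, rounds down
}

# Usage: suggest_maxmemory_mb 4096      (85% of a 4 GB limit)
#        suggest_maxmemory_mb 8192 80   (explicit 80% for an 8 GB limit)
```

Pass the result as --maxmemory <value>mb in REDIS_ARGS, and lower the percentage if mem_fragmentation_ratio tends to run high.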

E. Defragment without restart (if mem_fragmentation_ratio > 1.5)

docker exec ebit-redis redis-cli -a cache CONFIG SET activedefrag yes
# Re-check after 5 min:
docker exec ebit-redis redis-cli -a cache INFO memory | grep frag

Disable when mem_fragmentation_ratio is back below 1.2 (defrag itself costs CPU).
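The two thresholds form a simple hysteresis band: turn defrag on above 1.5, off below 1.2, and leave it alone in between. A minimal sketch of that decision, with defrag_action as a hypothetical helper name:

```shell
# defrag_action <mem_fragmentation_ratio>
# Prints enable (> 1.5), disable (< 1.2) or hold (anything in between).
# Hypothetical helper; the thresholds are the ones quoted in this section.
defrag_action() {
  awk -v r="$1" 'BEGIN {
    if (r > 1.5)      print "enable"
    else if (r < 1.2) print "disable"
    else              print "hold"
  }'
}

# Usage (assumed container name and password from this runbook):
#   ratio=$(docker exec ebit-redis redis-cli -a cache INFO memory | tr -d '\r' | \
#           awk -F: '$1 == "mem_fragmentation_ratio" { print $2 }')
#   defrag_action "$ratio"
```

The helper only prints a recommendation; applying it is still the CONFIG SET activedefrag yes / no command shown above.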

Verification

After applying a fix:

  1. Memory % drops back below 70% of maxmemory / mem_limit. Watch the redis Grafana dashboard memory panel for a clean down-slope.
  2. No OOM in the next 10 minutes: {service_name=~"ebit-.*"} |= "OOM" returns zero matches.
  3. Eviction rate returns to zero on a normally-no-eviction instance: INFO stats | grep evicted_keys shows the counter holding flat.
  4. BullMQ depths stable at expected steady-state (no queue pegged at the ceiling).
  5. End-to-end smoke: place a bet via the UI; confirm cache reads (e.g., GET /accounting/balances) hit the cache layer normally.

Prevention

  • Set maxmemory and maxmemory-policy explicitly in compose REDIS_ARGS. Default noeviction + no cap = unbounded growth, then an OOM kill once the container reaches mem_limit (or the host runs out of memory). Currently absent; ADR candidate: {{TBD: ADR for Redis memory cap policy}}.
  • Alert at 80% of mem_limit (warning) and 90% (page). Both thresholds in the redis dashboard.
  • TTL audit in code review: every new key written to Redis must declare its TTL strategy. The decorator pattern in apps/api/src/captcha/google/recaptcha.service.ts (@IdempotencyLock({ lockTtl })) is the example to follow.
  • BullMQ retention discipline: every queue declares removeOnComplete and removeOnFail with explicit count or age. The speed-roulette state queue uses removeOnComplete: { age: 30 } (apps/speed-roulette/src/roulette/state/roulette-state.processor.ts:156-158) as the reference pattern.
  • Online-users sweep: rely on TTL, not on zrem on disconnect (current behavior, see ../flows/rt-websocket.md §6.3). Confirm the TTL is sized to your tolerable "ghost user" window.

Cross-references