Runbook: Redis under memory pressure
Symptom
Redis (cache instance, :6379, password cache) is approaching its memory limit, evicting keys, or refusing writes. Visible as one or more of:
- perf-system Grafana dashboard: container_memory_rss{name="ebit-redis"} climbing toward (or above) mem_limit.
- Loki: {service_name=~"ebit-.*"} |= "OOM command not allowed" or |= "ENOSPC" on Redis writes.
- BullMQ jobs failing with OOM errors in the bullmq Grafana dashboard.
- Cache hit ratio collapsing as allkeys-lru evicts hot keys.
- INFO memory reporting used_memory near maxmemory (or near container mem_limit if maxmemory isn't set). MEMORY USAGE only reports the size of a single key.
The local stack runs redis/redis-stack:latest with no --maxmemory or --maxmemory-policy configured in docker-compose.yml (REDIS_ARGS: "--requirepass cache --notify-keyspace-events KEA"). That means Redis grows unbounded until the container hits its cgroup mem_limit and is OOM-killed. This runbook covers both "approaching the cgroup ceiling" and "approaching a configured maxmemory".
This runbook covers the cache instance only. The bot Redis (:6380, password bot) follows the same procedure with different connection details; see bullmq-job-stuck.md §1 for the queue→instance map.
Likely causes
- BullMQ queue depth growth: a processor is failing every job and removeOnFail retention isn't bounded; failed-job hashes accumulate as bull:<queue>:<jobId>.
- Session store unbounded: user:login-attempts:*, recaptcha:* idempotency, or socket:* keys without TTL.
- Redis Stack indices: redis/redis-stack ships RedisJSON + RediSearch. If an index was created without a memory bound, it grows with every write to its tracked key set.
- Online-users zset growth: ONLINE_USERS_KEY accumulates entries; the rt service relies on TTL sweep, not active eviction (see ../flows/rt-websocket.md §6.3).
- MAXMEMORY not set: the local stack and any deployed environment that didn't override REDIS_ARGS will grow until container OOM. Production-shaped sizing is {{TBD}}.
Detection
- Grafana, redis dashboard: ops/sec, memory usage, eviction rate. Healthy: < 70% of maxmemory (or container limit) at peak; eviction rate flat.
- Grafana, perf-system: Container Memory RSS panel for ebit-redis. Climb without a corresponding ops/sec spike means key accumulation, not load.
- Loki: {service_name="ebit-api"} |= "OOM command not allowed when used memory > 'maxmemory'".
- Alert: Container Memory RSS > 80% of mem_limit for > 5 min; eviction rate > 0/sec for > 5 min on a normally-no-eviction instance.
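The 70% / 80% / 90% thresholds can also be checked from a shell when Grafana is unavailable. The following is a minimal sketch, not existing tooling: mem_pct and mem_status are hypothetical helper names, and the "watch" label for the 70-79% band is an assumption (the runbook itself only names healthy, warning, and page). It reads INFO memory text on stdin and falls back to a caller-supplied byte ceiling when maxmemory is 0.

```shell
# Print used_memory as an integer percentage of the effective ceiling.
# stdin: output of INFO memory; $1: fallback ceiling in bytes (e.g. container mem_limit)
mem_pct() {
  tr -d '\r' | awk -F: -v fallback="${1:-0}" '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { cap = $2 }
    END {
      if (cap + 0 == 0) cap = fallback   # maxmemory:0 means no cap is configured
      if (cap + 0 == 0) { print "unknown"; exit }
      printf "%d\n", used * 100 / cap
    }'
}

# Map a percentage onto the runbook thresholds ("watch" is an assumed label).
mem_status() {
  if   [ "$1" -ge 90 ]; then echo "page"
  elif [ "$1" -ge 80 ]; then echo "warning"
  elif [ "$1" -ge 70 ]; then echo "watch"
  else                       echo "healthy"
  fi
}

# Example (4 GiB container limit as the fallback ceiling):
#   docker exec ebit-redis redis-cli -a cache INFO memory | mem_pct 4294967296
```

used_memory is typically lower than container RSS, so when the fallback ceiling is mem_limit the result reads slightly optimistic; the Grafana RSS panel remains the reference for the alert itself.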
Triage
1. Snapshot memory state
Read these fields from INFO memory:
- used_memory_human: current bytes allocated by Redis (Redis-side, excludes process overhead).
- used_memory_rss_human: RSS as Redis sees it (close to cgroup measurement).
- maxmemory_human: the configured cap. 0B means no cap is set; the only ceiling is the container's mem_limit.
- maxmemory_policy: noeviction (default) means writes start failing at the cap. allkeys-lru evicts cold keys.
- mem_fragmentation_ratio: > 1.5 indicates significant fragmentation; consider a restart or active defrag (§Fix E).
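All of the fields listed above appear in the output of INFO memory, so a single call is enough for the snapshot. A minimal sketch of pulling just those lines; triage_fields is a hypothetical helper name, and the connection details are the ones used elsewhere in this runbook:

```shell
# Keep only the triage fields from INFO memory output on stdin.
# redis-cli emits CRLF line endings, so strip the carriage returns first.
triage_fields() {
  tr -d '\r' | grep -E '^(used_memory_human|used_memory_rss_human|maxmemory_human|maxmemory_policy|mem_fragmentation_ratio):'
}

# Example:
#   docker exec ebit-redis redis-cli -a cache INFO memory | triage_fields
```

Saving this output before applying any fix gives a baseline to compare against during verification.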
2. Find the heaviest key prefixes
A big-key scan (for example redis-cli --bigkeys) lists the biggest single key per type. For prefix-level totals, sample with:
docker exec ebit-redis redis-cli -a cache --scan --pattern 'bull:*' | head -100 | \
xargs -I{} docker exec ebit-redis redis-cli -a cache MEMORY USAGE {} | \
awk '{s+=$1} END {print s/1024/1024 " MB sampled across 100 bull:* keys"}'
Repeat for user:*, socket:*, cache:*, recaptcha:*, plus any custom prefix you suspect. Common offenders by raw byte count are usually bull:* and cache:*.
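Before sampling bytes prefix by prefix, it can help to know which prefixes dominate by key count, since that tells you which patterns are worth sampling. A small sketch, assuming keys follow the prefix:rest naming used in this runbook; prefix_counts is a hypothetical helper name, and it counts keys, not bytes:

```shell
# Count keys per top-level prefix (text before the first colon).
# stdin: one key name per line, e.g. from redis-cli --scan
prefix_counts() {
  awk -F: '{ n[$1]++ } END { for (p in n) print n[p], p }' | sort -k1,1nr -k2,2
}

# Example (sample of 10000 keys):
#   docker exec ebit-redis redis-cli -a cache --scan | head -10000 | prefix_counts
```

A prefix with many small keys and a prefix with few large keys can use the same memory, so treat the counts as a pointer for where to run the MEMORY USAGE sample, not as a size ranking.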
3. Identify keys without TTL (most leak-prone)
docker exec ebit-redis redis-cli -a cache --scan --pattern '*' | head -1000 | \
  while IFS= read -r k; do
    # TTL returns -1 for a key that exists but has no expiry
    ttl=$(docker exec ebit-redis redis-cli -a cache TTL "$k")
    [ "$ttl" = "-1" ] && echo "$k"
  done | head -50
TTL = -1 means the key has no expiry. Sessions, idempotency locks, and per-user counters should have TTL; queue jobs and config caches usually shouldn't. Anything unexpectedly persistent is a candidate fix target.
4. Confirm whether eviction is actually happening
Check the evicted_keys counter in INFO stats. Non-zero and rising means eviction is in progress. Under the noeviction policy this should be impossible; if you see it, the policy was probably overridden via CONFIG SET at runtime.
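Because evicted_keys is a cumulative counter, "rising" only shows up when two samples are compared. A minimal sketch of that comparison; evicted_count and eviction_verdict are hypothetical helper names, and the 60-second gap in the example is an arbitrary choice:

```shell
# Extract the evicted_keys counter from INFO stats output on stdin.
evicted_count() {
  tr -d '\r' | awk -F: '$1 == "evicted_keys" { print $2 }'
}

# Compare two samples: $1 = earlier count, $2 = later count.
eviction_verdict() {
  if [ "$2" -gt "$1" ]; then
    echo "evicting (+$(( $2 - $1 )) keys)"
  else
    echo "flat"
  fi
}

# Example:
#   a=$(docker exec ebit-redis redis-cli -a cache INFO stats | evicted_count)
#   sleep 60
#   b=$(docker exec ebit-redis redis-cli -a cache INFO stats | evicted_count)
#   eviction_verdict "$a" "$b"
```

A non-zero but flat counter means eviction happened at some point since the last restart and has since stopped.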
Fix
A. Set maxmemory-policy allkeys-lru (immediate, runtime)
If memory pressure is genuine and writes are failing with OOM command not allowed, switching to LRU eviction lets Redis trim cold keys instead of refusing writes:
docker exec ebit-redis redis-cli -a cache CONFIG SET maxmemory-policy allkeys-lru
docker exec ebit-redis redis-cli -a cache CONFIG SET maxmemory 1gb # adjust to fit
This is non-persistent — restart loses it. Persist by adding --maxmemory 1gb --maxmemory-policy allkeys-lru to REDIS_ARGS in docker-compose.yml (or the equivalent in production deployment config) and redeploying.
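Caveat: BullMQ data lives in Redis. Aggressive LRU on a Redis that holds queues will silently drop jobs. For BullMQ-heavy deployments, prefer volatile-lru (evicts only keys that have a TTL) and bound queue growth with removeOnComplete / removeOnFail. Those options prune finished jobs by count or age rather than setting Redis TTLs, so queue keys stay out of volatile-lru's reach. BullMQ's own production guidance is to run its Redis with maxmemory-policy noeviction.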
B. Drain stale BullMQ queues
If bull:* is the dominant prefix (§Triage 2), follow bullmq-job-stuck.md §Fix to drain the offending queue. Common quick wins:
# Inspect completed/failed retention per queue.
# BullMQ keeps the completed and failed sets as sorted sets, so count them with ZCARD (LLEN returns WRONGTYPE).
for q in bet_settled_queue update-session leaderboard_queue; do
  printf '%s failed: ' "$q"
  docker exec ebit-redis redis-cli -a cache ZCARD "bull:$q:failed"
  printf '%s completed: ' "$q"
  docker exec ebit-redis redis-cli -a cache ZCARD "bull:$q:completed"
done
If completed is unbounded, the queue config is missing removeOnComplete — file a follow-up.
C. Add TTL to leaking keys
Once §Triage 3 has identified keys that should have a TTL but don't, the fix is at the application layer — add an EXPIRE to the producer. Apply a one-off retroactive TTL to existing keys:
docker exec ebit-redis redis-cli -a cache --scan --pattern 'recaptcha:*' | \
  while IFS= read -r k; do
    docker exec ebit-redis redis-cli -a cache EXPIRE "$k" 300
  done
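A plain EXPIRE resets the TTL on every matching key, including keys that already carry a shorter expiry. If that matters, EXPIRE accepts an NX flag (Redis 7.0 and later) that sets the expiry only when the key has none. A minimal sketch, assuming the redis/redis-stack:latest image is on Redis 7.0+; expire_if_missing and REDIS_CMD are hypothetical names introduced for illustration:

```shell
# Command prefix used to reach Redis; override REDIS_CMD for another instance.
REDIS_CMD="${REDIS_CMD:-docker exec ebit-redis redis-cli -a cache}"

# Apply a TTL only to keys that currently have no expiry.
# stdin: one key name per line; $1: TTL in seconds
expire_if_missing() {
  ttl="$1"
  while IFS= read -r k; do
    # $REDIS_CMD is intentionally unquoted so it splits into command + arguments
    $REDIS_CMD EXPIRE "$k" "$ttl" NX
  done
}

# Example:
#   docker exec ebit-redis redis-cli -a cache --scan --pattern 'recaptcha:*' | expire_if_missing 300
```

Each call prints 1 when a TTL was set and 0 when the key already had one (or no longer exists), which gives a rough count of how many keys were actually leaking.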
D. Restart the container with a higher mem_limit (legitimate growth)
If memory growth tracks legitimate user growth and is not a leak, raise the container ceiling. Edit docker-compose.yml:
ebit-redis:
  image: redis/redis-stack:latest
  mem_limit: 4g
  environment:
    REDIS_ARGS: "--requirepass cache --notify-keyspace-events KEA --maxmemory 3500mb --maxmemory-policy allkeys-lru"
Then run docker compose up -d ebit-redis. Each MB of RSS costs roughly one MB of host memory; size the host accordingly. For the production sizing: {{TBD: production Redis sizing — currently no production deployment definition checked in. Engineering team to author once production stack is finalized}}.
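The example sets maxmemory below mem_limit on purpose: Redis counts only its own allocations against maxmemory, while the cgroup limit also covers allocator fragmentation and process overhead. A small sketch of the headroom arithmetic, assuming both values are expressed in MiB (Compose's 4g is 4096 MiB, Redis's 3500mb is 3500 MiB); headroom_pct is a hypothetical helper name:

```shell
# Percentage of the container limit left above maxmemory (integer, rounded down).
# $1 = maxmemory in MiB, $2 = container mem_limit in MiB
headroom_pct() {
  echo $(( ($2 - $1) * 100 / $2 ))
}

# Example: the compose values above
#   headroom_pct 3500 4096
```

For the values above this yields 14, meaning about 14% of the container limit is left for overhead. Whether that is enough depends on the observed mem_fragmentation_ratio: a ratio of 1.5 would push RSS well past the limit before maxmemory is ever reached.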
E. Defragment without restart (if mem_fragmentation_ratio > 1.5)
docker exec ebit-redis redis-cli -a cache CONFIG SET activedefrag yes
# Re-check after 5 min:
docker exec ebit-redis redis-cli -a cache INFO memory | grep frag
Disable when mem_fragmentation_ratio is back below 1.2 (defrag itself costs CPU).
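The two thresholds (enable above 1.5, disable below 1.2) form a hysteresis band, so defrag is not toggled on every small fluctuation. A minimal sketch of that decision; defrag_action is a hypothetical helper name, and it only prints a recommendation, leaving the CONFIG SET to the operator:

```shell
# Recommend an activedefrag action from INFO memory output on stdin.
defrag_action() {
  tr -d '\r' | awk -F: '
    $1 == "mem_fragmentation_ratio" {
      if ($2 > 1.5)      print "enable"    # significant fragmentation
      else if ($2 < 1.2) print "disable"   # recovered; stop paying the CPU cost
      else               print "leave"     # inside the band: keep the current setting
    }'
}

# Example:
#   docker exec ebit-redis redis-cli -a cache INFO memory | defrag_action
```

Between 1.2 and 1.5 the helper prints "leave", so a defrag that is already running continues until the ratio drops below 1.2.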
Verification
After applying a fix:
- Memory % drops back below 70% of maxmemory / mem_limit. Watch the redis Grafana dashboard memory panel for a clean down-slope.
- No OOM in the next 10 minutes: {service_name=~"ebit-.*"} |= "OOM" returns zero matches.
- Eviction rate returns to zero on a normally-no-eviction instance: INFO stats | grep evicted_keys shows the counter holding flat.
- BullMQ depths stable at expected steady-state (no queue pegged at the ceiling).
- End-to-end smoke: place a bet via the UI; confirm cache reads (e.g., GET /accounting/balances) hit the cache layer normally.
Prevention
- Set maxmemory and maxmemory-policy explicitly in compose REDIS_ARGS. Default noeviction + no cap = unbounded growth + sudden write failure when host runs out of memory. Currently absent. ADR candidate: {{TBD: ADR for Redis memory cap policy}}.
- Alert at 80% of mem_limit (warning) and 90% (page). Both thresholds in the redis dashboard.
- TTL audit in code review: every new key written to Redis must declare its TTL strategy. The decorator pattern in apps/api/src/captcha/google/recaptcha.service.ts (@IdempotencyLock({ lockTtl })) is the example to follow.
- BullMQ retention discipline: every queue declares removeOnComplete and removeOnFail with explicit count or age. The speed-roulette state queue uses removeOnComplete: { age: 30 } (apps/speed-roulette/src/roulette/state/roulette-state.processor.ts:156-158) as the reference pattern.
- Online-users sweep: rely on TTL, not on zrem on disconnect (current behavior, see ../flows/rt-websocket.md §6.3). Confirm the TTL is sized to your tolerable "ghost user" window.
Cross-references
- bullmq-job-stuck.md: when the offender is a stuck queue
- db-down.md: Redis sits next to Postgres in the dependency chain; their failure modes interact under load
- ../flows/rt-websocket.md: ONLINE_USERS_KEY zset and TTL sweep
- ../observability.md: Redis spanmetrics / ioredis instrumentation
- ../adr/0003-bullmq-not-rabbitmq.md: why everything async lives in Redis