Runbook: ebit-rt connection saturation / scale-out

Symptom

The realtime gateway can't accept more websocket clients. Visible as one or more of:

ebit-rt container CPU near 100% sustained.
New websocket clients get Too many connections (per-IP cap) or Too many requests (throttler block) and disconnect.
WS_THROTTLER_BLOCK_DURATION (default 600 000 ms = 10 min) bans appearing in Loki.
ws_active connections plateau at a hard ceiling rather than tracking real traffic.
socket.io handshake EIO=4 requests 503'ing or hanging.
Browser DevTools → Network → WS shows the upgrade failing or repeated reconnects.

The gateway runs as a single container (ebit-rt :4001, namespace /events, websocket-only — polling disabled). All per-instance state (clientSockets Map, WsThrottlerService.ipConnections Map) lives in Node memory. Until a Redis adapter ships, scaling out is broken — see §Cause and AF-3 in the architecture doc.

Cause

Two layered ceilings:

MAX_CONNECTIONS_PER_IP — libs/ws-throttler/src/const.ts:3-6. Default 10. Hard-rejects the 11th simultaneous socket from a single IP. Implementation: WsThrottlerService.onConnection increments an in-memory Map<ip, count>; if count + 1 > MAX_CONNECTIONS_PER_IP, emits error: 'Too many connections' and disconnects after 1 s.
WS_THROTTLER_LIMIT — env var, default 120 requests per WS_THROTTLER_TTL (default 60 000 ms). When tripped, WsThrottlerGuard (libs/ws-throttler/src/ws-throttler.guard.ts:39-53) emits error: 'Too many requests', disconnects, and bans the tracker key for WS_THROTTLER_BLOCK_DURATION (default 600 000 ms).
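
The two ceilings can be sketched in a few lines. This is a minimal illustration, not the libs/ws-throttler source: the function names (admitSocket, releaseSocket, admitRequest), the fixed-window counting, and the per-tracker state shape are assumptions. Only the defaults and the count + 1 > MAX_CONNECTIONS_PER_IP check come from the description above.

```typescript
// Defaults quoted in this runbook.
const MAX_CONNECTIONS_PER_IP = 10;
const WS_THROTTLER_LIMIT = 120; // requests per window
const WS_THROTTLER_TTL = 60_000; // window length in ms
const WS_THROTTLER_BLOCK_DURATION = 600_000; // ban length in ms

// Ceiling 1: simultaneous sockets per IP, held in process memory.
const ipConnections = new Map<string, number>();

function admitSocket(ip: string): boolean {
  const count = ipConnections.get(ip) ?? 0;
  if (count + 1 > MAX_CONNECTIONS_PER_IP) {
    return false; // caller would emit 'Too many connections' and disconnect
  }
  ipConnections.set(ip, count + 1);
  return true;
}

function releaseSocket(ip: string): void {
  const count = ipConnections.get(ip) ?? 0;
  if (count <= 1) ipConnections.delete(ip);
  else ipConnections.set(ip, count - 1);
}

// Ceiling 2: request rate per tracker key, with a ban once the limit trips.
// Assumption: a fixed window; the real storage may count differently.
interface ThrottleState {
  windowStart: number;
  hits: number;
  blockedUntil: number;
}
const throttleStates = new Map<string, ThrottleState>();

function admitRequest(tracker: string, now: number): boolean {
  let state = throttleStates.get(tracker);
  if (state === undefined) {
    state = { windowStart: now, hits: 0, blockedUntil: 0 };
    throttleStates.set(tracker, state);
  }
  if (now < state.blockedUntil) return false; // still banned
  if (now - state.windowStart >= WS_THROTTLER_TTL) {
    state.windowStart = now; // start a fresh window
    state.hits = 0;
  }
  state.hits += 1;
  if (state.hits > WS_THROTTLER_LIMIT) {
    state.blockedUntil = now + WS_THROTTLER_BLOCK_DURATION;
    return false; // caller would emit 'Too many requests' and disconnect
  }
  return true;
}
```

With these defaults the 11th simultaneous socket from one IP is refused, and the 121st request inside one 60 s window bans that tracker key for the next 10 minutes. That is why a single noisy client can keep failing long after it has slowed down.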

Beyond the throttler, the single replica means:

All sockets share one Node event loop; CPU pegs before connection count alone exhausts kernel limits.
Per-user emit (message.user-targeted) requires the local clientSockets map; a second instance can't see sockets registered on the first.
The @socket.io/redis-adapter is not installed — see ../flows/rt-websocket.md §6, AF-3 in ../architecture.md.
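
The per-user delivery problem is easiest to see with two replicas side by side. The sketch below is a toy model under stated assumptions: GatewayReplica and its methods are hypothetical names, and the registry is reduced to a userId to socketId map.

```typescript
// Each replica holds its own userId -> socketId registry in process memory.
class GatewayReplica {
  private clientSockets = new Map<string, string>();

  register(userId: string, socketId: string): void {
    this.clientSockets.set(userId, socketId);
  }

  // Returns the socket id the message would be sent to, or null when this
  // replica has no socket for the user and the message is silently lost.
  emitToUser(userId: string): string | null {
    return this.clientSockets.get(userId) ?? null;
  }
}

const replicaA = new GatewayReplica();
const replicaB = new GatewayReplica();

// The user's browser happened to connect to replica A.
replicaA.register("user-42", "socket-abc");

console.log(replicaA.emitToUser("user-42")); // prints socket-abc
console.log(replicaB.emitToUser("user-42")); // prints null
```

Any emit that originates on replica B never reaches the user, which is the gap a shared adapter is meant to close.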

Real connection ceiling per replica is workload-dependent — measure under perf, don't guess. Smoke profile (50 VU, 1 min) leaves headroom; the stepped-ramp profile is the one that finds the wall.

Detection

Grafana — service-overview: ebit-rt request rate panel; CPU panel; reconnect rate spike.
Grafana — bullmq: any rt-fed queue depth climbing — when rt can't deliver, downstream BullMQ pushers back-pressure.
Loki:
{service_name="ebit-rt"} |= "Too many connections" — per-IP cap hits.
{service_name="ebit-rt"} |= "Too many requests" — throttler bans.
{service_name="ebit-rt"} |= "WS_THROTTLER_BLOCK_DURATION" — currently blocked client tracker keys.
Alert: rt CPU > 80% for > 2 min; throttler ban count > 0/sec sustained for > 1 min.
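
The "above threshold for longer than N minutes" condition in the alert can be sketched as follows. This is an illustration of the logic only, not the alerting rule itself: MetricSample and sustainedAbove are hypothetical names, and the 30 s sampling interval is an assumption.

```typescript
interface MetricSample {
  timestampMs: number;
  value: number;
}

// True when `value` stays above `threshold` for longer than `minDurationMs`
// without interruption. Samples must be sorted by timestamp.
function sustainedAbove(
  samples: MetricSample[],
  threshold: number,
  minDurationMs: number,
): boolean {
  let breachStart: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (breachStart === null) breachStart = sample.timestampMs;
      if (sample.timestampMs - breachStart > minDurationMs) return true;
    } else {
      breachStart = null; // any dip resets the clock
    }
  }
  return false;
}

// CPU sampled every 30 s: a brief spike, a dip, then a sustained breach.
const cpuPercent: MetricSample[] = [95, 60, 85, 90, 92, 88, 91, 97].map(
  (value, i) => ({ timestampMs: i * 30_000, value }),
);
console.log(sustainedAbove(cpuPercent, 80, 120_000)); // prints true
```

The dip to 60% resets the timer, so a short spike alone does not page anyone; only the later unbroken run above 80% does.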

Triage

1. Confirm rt is the bottleneck (not api)

docker compose ps ebit-rt
docker stats --no-stream ebit-rt

CPU near 100% on a single core: rt is the bottleneck. CPU low but connections refused: throttler config issue, not capacity.

2. Inspect throttler state (live config)

docker exec ebit-rt sh -c 'env | grep -E "WS_THROTTLER|MAX_CONN"'

Expected baseline:

MAX_CONNECTIONS_PER_IP=10
WS_THROTTLER_DISABLE=false
WS_THROTTLER_TTL=60000
WS_THROTTLER_LIMIT=120
WS_THROTTLER_BLOCK_DURATION=600000

Any diff from baseline is your first hypothesis (env override, recent deploy).

3. Distinguish legitimate load from abuse

Sample IP distribution from rt logs over the last 5 min:

docker logs --since 5m ebit-rt 2>&1 | \
  grep -oE '"clientIp":"[^"]+"' | sort | uniq -c | sort -rn | head -20

Pattern	Interpretation
Top IPs each at < 10 connections, broad spread of IPs	Legitimate user growth
One IP with 11+ rejections	Single misbehaving client (browser bug, VPN concentrator, NAT egress)
Few IPs concentrating thousands of attempts	Abuse / DoS
Steady throttler bans on the same tracker keys	Block duration is hiding a chronic offender
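
The first three rows of the table can be turned into a rough classifier. This is a sketch under assumptions: tallyClientIps and classifyTraffic are hypothetical helpers, the 1000-attempt abuse threshold is illustrative rather than tuned, and log-line counts measure attempts, not concurrent sockets. The fourth row (chronic bans on the same tracker keys) needs ban history and is not covered.

```typescript
type TrafficVerdict =
  | "legitimate-growth"
  | "single-misbehaving-client"
  | "abuse"
  | "needs-review";

// Count occurrences of each clientIp in raw log lines
// (the same field the grep pipeline extracts).
function tallyClientIps(logLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  const pattern = /"clientIp":"([^"]+)"/;
  for (const line of logLines) {
    const ip = pattern.exec(line)?.[1];
    if (ip !== undefined) counts.set(ip, (counts.get(ip) ?? 0) + 1);
  }
  return counts;
}

// Thresholds are illustrative assumptions, not tuned values.
function classifyTraffic(
  counts: Map<string, number>,
  perIpCap = 10,
  abuseThreshold = 1000,
): TrafficVerdict {
  const perIp = Array.from(counts.values());
  if (perIp.some((n) => n >= abuseThreshold)) return "abuse";
  const overCap = perIp.filter((n) => n > perIpCap).length;
  if (overCap === 0) return "legitimate-growth";
  if (overCap === 1) return "single-misbehaving-client";
  return "needs-review"; // several IPs over the cap: look at them by hand
}
```

Treat the verdict as a starting hypothesis; a corporate NAT egress and an abusive client look identical at this level.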

4. Confirm the gateway still accepts handshakes

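docker exec ebit-rt sh -c '
node -e "
// Probe the Engine.IO endpoint from inside the container.
require(\"http\").get(\"http://localhost:4001/socket.io/?EIO=4\", r => {
  console.log(\"handshake\", r.statusCode);
  r.resume();
}).on(\"error\", e => {
  // Connection refused or reset: the process is not serving at all.
  console.error(\"handshake failed:\", e.message);
  process.exit(1);
});
"'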

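A prompt HTTP response shows the gateway process is still serving requests; a hang, reset, or refused connection means the gateway itself is unhealthy, not just the throttler. Expect a 200 only if the probe matches a transport the server accepts: with polling disabled, Engine.IO may answer a plain HTTP GET with a 400 even when healthy, so check the status code against a known-good instance first. Note that this probe tests handshake reachability, not the number of active sockets.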

5. Check per-IP bans in storage

The throttler uses redis-throttler-storage.ts — bans are keys in cache Redis (:6379):

docker exec ebit-redis redis-cli -a cache --scan --pattern 'throttle:*' | head -20
docker exec ebit-redis redis-cli -a cache TTL throttle:<one-of-them>

This tells you who is currently banned and how long until they clear.

Fix

A. Short-term: raise per-instance throttler limits via Doppler (test first)

Use only when traffic is confirmed legitimate (§Triage 3 shows broad IP spread).

doppler secrets set --project ebit --config <env> WS_THROTTLER_LIMIT=240
doppler secrets set --project ebit --config <env> MAX_CONNECTIONS_PER_IP=20
docker compose restart ebit-rt

Caveats:

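docker compose restart reuses the existing container and does not re-read Compose environment changes. If the secrets are injected when the container is created, recreate it instead (docker compose up -d --force-recreate ebit-rt), then confirm the new values with the env check in §Triage 2. Either way every connected socket is dropped and reconnects at once.
Doubling WS_THROTTLER_LIMIT doesn't double rt capacity: CPU is the true ceiling. Watch CPU after the bump.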
Raising MAX_CONNECTIONS_PER_IP accommodates corporate NAT / mobile carrier concentrators but also gives an abuser more room. Keep it bounded.
Test the new limits against the perf profile before committing — tests-perf/k6/scenarios/ws-storm.js is the right exercise.

B. Medium-term: horizontal scale-out (requires Redis adapter, code change)

Currently blocked. Required to ship:

npm install @socket.io/redis-adapter in ebit-api/.
Wire the adapter in apps/rt/src/main.ts so emits fan out across replicas via cache Redis pub/sub.
Move WsThrottlerService.ipConnections from in-process Map to a Redis-backed sorted set (the throttler guard already uses a Redis storage; the per-IP connection counter does not).
Verify that clientSockets (per-user delivery) still works — this is the highest-risk change.
ADR required: {{TBD: ADR for socket.io Redis adapter and per-IP counter migration — currently not authored}}. Reference ../adr/ for the format.
Roll out behind a flag; canary one replica before full multi-replica.
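
The per-IP counter migration in the list above could look roughly like this. It is a sketch under assumptions, not a design: SortedSetStore, InMemorySortedSets, SharedIpConnectionCounter, the ws:ipconn: key prefix, and the lease-based expiry are all hypothetical. The store interface mirrors the Redis ZADD, ZREM, ZREMRANGEBYSCORE, and ZCARD commands so a real client could sit behind it, and the in-memory stand-in exists only so the sketch runs without Redis.

```typescript
// Subset of Redis sorted-set commands the counter needs (hypothetical interface).
interface SortedSetStore {
  zadd(key: string, score: number, member: string): Promise<void>;
  zrem(key: string, member: string): Promise<void>;
  zremrangebyscore(key: string, min: number, max: number): Promise<void>;
  zcard(key: string): Promise<number>;
}

// In-memory stand-in so the sketch runs without Redis.
class InMemorySortedSets implements SortedSetStore {
  private sets = new Map<string, Map<string, number>>();

  private setFor(key: string): Map<string, number> {
    let set = this.sets.get(key);
    if (set === undefined) {
      set = new Map<string, number>();
      this.sets.set(key, set);
    }
    return set;
  }
  async zadd(key: string, score: number, member: string): Promise<void> {
    this.setFor(key).set(member, score);
  }
  async zrem(key: string, member: string): Promise<void> {
    this.setFor(key).delete(member);
  }
  async zremrangebyscore(key: string, min: number, max: number): Promise<void> {
    const set = this.setFor(key);
    for (const [member, score] of Array.from(set.entries())) {
      if (score >= min && score <= max) set.delete(member);
    }
  }
  async zcard(key: string): Promise<number> {
    return this.setFor(key).size;
  }
}

// One sorted set per IP: member = socket id, score = lease expiry timestamp.
class SharedIpConnectionCounter {
  private store: SortedSetStore;
  private maxPerIp: number;
  private leaseMs: number;

  constructor(store: SortedSetStore, maxPerIp = 10, leaseMs = 120_000) {
    this.store = store;
    this.maxPerIp = maxPerIp;
    this.leaseMs = leaseMs;
  }

  // Returns false when the IP is already at the cap across all replicas.
  // Not atomic as written; real Redis would need a Lua script or MULTI.
  async tryAcquire(ip: string, socketId: string, now: number): Promise<boolean> {
    const key = "ws:ipconn:" + ip;
    // Drop leases left behind by sockets whose replica died without cleanup.
    await this.store.zremrangebyscore(key, 0, now);
    await this.store.zadd(key, now + this.leaseMs, socketId);
    if ((await this.store.zcard(key)) > this.maxPerIp) {
      await this.store.zrem(key, socketId);
      return false;
    }
    return true;
  }

  async release(ip: string, socketId: string): Promise<void> {
    await this.store.zrem("ws:ipconn:" + ip, socketId);
  }
}
```

Two counters built on the same store then enforce one cap between them, which the in-process Map cannot do. A live socket would have to renew its lease (for example on heartbeat) before it expires; that part is left out here.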

The architectural rationale and the gap are documented in ../flows/rt-websocket.md §6 and AF-3 in ../architecture.md. This is the durable fix; the short-term lever above only buys time.

C. Hot-fix: `iptables` rate-limit at the host (confirmed abuse only)

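If §Triage 3 identifies a small set of abusive IPs and edge protection (CDN / WAF) is unavailable, drop at the host. The rules below use the INPUT chain, which applies when rt listens on the host directly; if port 4001 is published by Docker, inbound traffic is forwarded to the container and bypasses INPUT, so insert the same rules into the DOCKER-USER chain instead: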

# Drop new connections from <abuse-ip> to the rt port
sudo iptables -I INPUT -p tcp --dport 4001 -s <abuse-ip> -j DROP

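# Or rate-limit per source IP: drop new connections from a source that has
# already opened 5 within the last second. The recent module has no burst
# allowance; use the hashlimit module if a burst is needed.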
sudo iptables -I INPUT -p tcp --dport 4001 -m state --state NEW \
  -m recent --set
sudo iptables -I INPUT -p tcp --dport 4001 -m state --state NEW \
  -m recent --update --seconds 1 --hitcount 5 -j DROP

Reverse with sudo iptables -D INPUT .... Document every rule you add in the incident channel; remove them at incident close.

In production the right home for these rules is the WAF / security group, not host iptables — {{TBD: production edge rate-limit procedure — depends on customer-team CDN/WAF choice (mirrors the same TBD in the master incident runbook)}}.

D. Reduce push volume (defensive, no code change)

Some pushes are best-effort and can be dropped under load. Specifically the UsersOnlineUpdated broadcast (10s cadence, see ../flows/rt-websocket.md §6.3) and any leaderboard tickers can be temporarily silenced via env flag — {{TBD: confirm env flag exists; if not, this option is unavailable until shipped}}.

Verification

After applying a fix:

Connection success rate > 99% on fresh handshakes: open dropbet in 5 incognito tabs over 30s; all should connect.
p95 handshake latency < 200 ms (Grafana service-overview rt panel). Beyond 500 ms indicates the gateway is still saturated.
No Too many connections / Too many requests errors in Loki for the next 5 min after the fix lands.
CPU returns to < 70% sustained on the rt container — docker stats ebit-rt.
Throttler ban count drops to zero: docker exec ebit-redis redis-cli -a cache --scan --pattern 'throttle:*' | wc -l.
End-to-end smoke: place a bet via the UI; confirm WS-pushed BalanceUpdated (or equivalent) arrives within ~1s.
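
For ad-hoc checks outside Grafana, the handshake latency thresholds above can be applied to a handful of measured samples. This is a minimal sketch: percentile and handshakeHealth are hypothetical helpers, the nearest-rank method is an assumption (Grafana panels typically derive quantiles differently, so numbers will not match exactly), and the runbook does not label the 200 to 500 ms band, which is called "inconclusive" here.

```typescript
// Nearest-rank percentile over raw samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

type HandshakeHealth = "ok" | "inconclusive" | "saturated";

// Applies the runbook thresholds: p95 below 200 ms passes,
// above 500 ms means the gateway is still saturated.
function handshakeHealth(latenciesMs: number[]): HandshakeHealth {
  const p95 = percentile(latenciesMs, 95);
  if (p95 < 200) return "ok";
  if (p95 > 500) return "saturated";
  return "inconclusive";
}
```

With only a few samples the p95 is effectively the slowest handshake, so collect a reasonable number before drawing conclusions.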

Prevention

Capacity planning: every release cuts a perf run that includes ws-storm.js. Track p95 handshake + active-connections-per-replica trend across releases. The ceiling is the data, not the guess.
Alert on ws active connections at 80% of measured ceiling: needs the ceiling first (perf-derived). File a follow-up.
Monitor BullMQ queue depth as a ws push proxy: if bull:*-broadcast:wait climbs > 100, the gateway is dropping or back-pressuring. Currently no Grafana alert wired — {{TBD: file alert for bullmq broadcast queue depth > 100 for > 60 s}}.
Ship the Redis adapter (§Fix B). Until that lands, ebit-rt is single-replica and every Black Friday-shaped event puts you in this runbook.
Edge rate-limit on :4001 at the CDN / WAF layer — same hygiene as auth endpoints. Currently not configured.

Cross-references

../flows/rt-websocket.md — full rt design, the AF-3 scale-out gap, online-count details
../architecture.md — AF-3 in the known-weakness register
../adr/ — ADR home for the Redis-adapter follow-up
bullmq-job-stuck.md — when ws back-pressure shows up downstream as queue depth
redis-memory-pressure.md — adapter pub/sub increases cache Redis load; sizing matters
../handover/oncall-runbook.md §3 (WebSocket handshake storm) — first-response procedure that brought you here