Runbook: ebit-rt connection saturation / scale-out
Symptom
The realtime gateway can't accept more websocket clients. Visible as one or more of:
- ebit-rt container CPU near 100% sustained.
- New websocket clients get Too many connections (per-IP cap) or Too many requests (throttler block) and disconnect.
- WS_THROTTLER_BLOCK_DURATION (default 600 000 ms = 10 min) bans appearing in Loki.
- ws_active connections plateau at a hard ceiling rather than tracking real traffic.
- socket.io handshake EIO=4 requests 503'ing or hanging.
- Browser DevTools → Network → WS shows the upgrade failing or repeated reconnects.
The gateway runs as a single container (ebit-rt :4001, namespace /events, websocket-only — polling disabled). All per-instance state (clientSockets Map, WsThrottlerService.ipConnections Map) lives in Node memory. Until a Redis adapter ships, scaling out is broken — see §Cause and AF-3 in the architecture doc.
Cause
Two layered ceilings:
- MAX_CONNECTIONS_PER_IP (libs/ws-throttler/src/const.ts:3-6). Default 10. Hard-rejects the 11th simultaneous socket from a single IP. Implementation: WsThrottlerService.onConnection increments an in-memory Map<ip, count>; if count + 1 > MAX_CONNECTIONS_PER_IP, emits error: 'Too many connections' and disconnects after 1 s.
- WS_THROTTLER_LIMIT: env var, default 120 requests per WS_THROTTLER_TTL (default 60 000 ms). When tripped, WsThrottlerGuard (libs/ws-throttler/src/ws-throttler.guard.ts:39-53) emits error: 'Too many requests', disconnects, and bans the tracker key for WS_THROTTLER_BLOCK_DURATION (default 600 000 ms).
Beyond the throttler, the single replica means:
- All sockets share one Node event loop; CPU pegs before connection count alone exhausts kernel limits.
- Per-user emit (message.user-targeted) requires the local clientSockets map; a second instance can't see sockets registered on the first.
- The @socket.io/redis-adapter is not installed; see ../flows/rt-websocket.md §6, AF-3 in ../architecture.md.
Real connection ceiling per replica is workload-dependent — measure under perf, don't guess. Smoke profile (50 VU, 1 min) leaves headroom; the stepped-ramp profile is the one that finds the wall.
Detection
- Grafana (service-overview): ebit-rt request rate panel; CPU panel; reconnect rate spike.
- Grafana (bullmq): any rt-fed queue depth climbing. When rt can't deliver, downstream BullMQ pushers back-pressure.
- Loki:
  - {service_name="ebit-rt"} |= "Too many connections": per-IP cap hits.
  - {service_name="ebit-rt"} |= "Too many requests": throttler bans.
  - {service_name="ebit-rt"} |= "WS_THROTTLER_BLOCK_DURATION": currently blocked client tracker keys.
- Alert: rt CPU > 80% for > 2 min; throttler ban count > 0/sec sustained for > 1 min.
Triage
1. Confirm rt is the bottleneck (not api)
CPU near 100% on a single core: rt is the bottleneck. CPU low but connections refused: throttler config issue, not capacity.
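The decision rule above can be sketched as a small shell helper. This is a minimal sketch under assumptions, not part of the runbook: the helper name rt_bottleneck_verdict is hypothetical, CPU is assumed to come from the CPUPerc field of docker stats (where 100% means one fully used core), and 90% is an arbitrary stand-in for "near 100%".

```shell
# Hypothetical helper (not in the codebase).
#   $1: CPU percentage as printed by docker stats, e.g. "97.35%"
#   $2: "yes" if clients are currently being refused, anything else otherwise
rt_bottleneck_verdict() {
  cpu=${1%\%}          # strip the trailing percent sign
  refused=$2
  # Assumption: 90% of one core counts as "near 100%".
  if awk -v c="$cpu" 'BEGIN { exit !(c >= 90) }'; then
    echo "rt is CPU-bound: capacity problem"
  elif [ "$refused" = "yes" ]; then
    echo "CPU low but clients refused: throttler config issue"
  else
    echo "rt does not look saturated: check api instead"
  fi
}

# Usage on the host (container name as used elsewhere in this runbook):
#   rt_bottleneck_verdict "$(docker stats --no-stream --format '{{.CPUPerc}}' ebit-rt)" yes
```

The second argument is left to the operator because "connections refused" is read from Loki or the browser, not from the container.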
2. Inspect throttler state (live config)
Expected baseline:
MAX_CONNECTIONS_PER_IP=10
WS_THROTTLER_DISABLE=false
WS_THROTTLER_TTL=60000
WS_THROTTLER_LIMIT=120
WS_THROTTLER_BLOCK_DURATION=600000
Any diff from baseline is your first hypothesis (env override, recent deploy).
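One way to surface that diff is sketched below. It assumes the throttler variables are visible in the container's process environment (readable with docker exec ebit-rt env); if Doppler injects them some other way, feed the helper from that source instead. The helper name throttler_env_diff is hypothetical. A variable that is not set at all produces no output, since the code default then applies.

```shell
# Hypothetical helper: reads KEY=VALUE lines on stdin and prints every
# throttler variable whose live value differs from the baseline above.
throttler_env_diff() {
  awk -F= '
    BEGIN {
      want["MAX_CONNECTIONS_PER_IP"]      = "10"
      want["WS_THROTTLER_DISABLE"]        = "false"
      want["WS_THROTTLER_TTL"]            = "60000"
      want["WS_THROTTLER_LIMIT"]          = "120"
      want["WS_THROTTLER_BLOCK_DURATION"] = "600000"
    }
    # Only baseline variables are compared; everything else is ignored.
    ($1 in want) && $2 != want[$1] {
      printf "%s: live=%s baseline=%s\n", $1, $2, want[$1]
    }
  '
}

# Usage on the host:
#   docker exec ebit-rt env | throttler_env_diff
```

Empty output means the live config matches the baseline, which points the investigation back at capacity rather than configuration.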
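3. Distinguish legitimate load from abuse
Sample the IP distribution from the most recent rt log lines. The command below takes the last 5000 lines; to bound the sample by time instead, replace --tail 5000 with --since 5m: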
docker logs --tail 5000 ebit-rt 2>&1 | \
grep -oE '"clientIp":"[^"]+"' | sort | uniq -c | sort -rn | head -20
| Pattern | Interpretation |
|---|---|
| Top IPs each at < 10 connections, broad spread of IPs | Legitimate user growth |
| One IP with 11+ rejections | Single misbehaving client (browser bug, VPN concentrator, NAT egress) |
| Few IPs concentrating thousands of attempts | Abuse / DoS |
| Steady throttler bans on the same tracker keys | Block duration is hiding a chronic offender |
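To read the table against real output, the pipeline above can be wrapped so that it only prints the IPs above a threshold. This is a sketch: the helper name ips_over_cap is hypothetical, it assumes log lines carry a "clientIp":"..." field (as the grep above does), and it counts log lines, which only approximates connection attempts.

```shell
# Hypothetical helper: reads rt log lines on stdin and prints "count ip"
# for every IP that appears more than $1 times (default 10), busiest first.
ips_over_cap() {
  cap=${1:-10}
  grep -oE '"clientIp":"[^"]+"' |
    cut -d'"' -f4 |                 # keep only the address
    sort | uniq -c | sort -rn |
    awk -v cap="$cap" '$1 > cap { print $1, $2 }'
}

# Usage on the host:
#   docker logs --tail 5000 ebit-rt 2>&1 | ips_over_cap 10
```

No output with a broad spread of IPs in the raw sample suggests legitimate growth; one or two lines with very large counts matches the abuse rows of the table.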
4. Inspect the active socket count
docker exec ebit-rt sh -c '
  node -e "
    require(\"http\").get(\"http://localhost:4001/socket.io/?EIO=4\", r => {
      console.log(\"handshake\", r.statusCode);
      r.resume();
    });
  "'
A 200 confirms the gateway is still accepting handshakes; a non-200 means the gateway itself is failing, not just the throttler rejecting clients.
5. Check per-IP bans in storage
The throttler uses redis-throttler-storage.ts — bans are keys in cache Redis (:6379):
docker exec ebit-redis redis-cli -a cache --scan --pattern 'throttle:*' | head -20
docker exec ebit-redis redis-cli -a cache TTL throttle:<one-of-them>
This tells you who is currently banned and how long until they clear.
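The two commands above can be combined into one listing. The sketch below assumes the same container name, password flag, and throttle:* key prefix as the commands above. The helper name list_bans and the REDIS_CMD override are hypothetical; the override exists only so the helper can be pointed at a different Redis or exercised without one.

```shell
# Hypothetical helper: prints "ttl_seconds key" for every throttle:* key,
# longest remaining ban first.
list_bans() {
  # Assumption: same invocation as the commands above unless REDIS_CMD is set.
  redis_cmd=${REDIS_CMD:-docker exec ebit-redis redis-cli -a cache}
  $redis_cmd --scan --pattern 'throttle:*' |
    while read -r key; do
      printf '%s %s\n' "$($redis_cmd TTL "$key")" "$key"
    done |
    sort -rn
}

# Usage on the host:
#   list_bans | head -20
```

A TTL of -1 would mean a key without expiry and -2 a key that vanished between the scan and the TTL call; either is worth a second look.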
Fix
A. Short-term: raise per-instance throttler limits via Doppler (test first)
Use only when traffic is confirmed legitimate (§Triage 3 shows broad IP spread).
doppler secrets set --project ebit --config <env> WS_THROTTLER_LIMIT=240
doppler secrets set --project ebit --config <env> MAX_CONNECTIONS_PER_IP=20
docker compose restart ebit-rt
Caveats:
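- Doubling WS_THROTTLER_LIMIT doesn't double rt capacity: CPU is the true ceiling. Watch CPU after the bump.
- Raising MAX_CONNECTIONS_PER_IP accommodates corporate NAT / mobile carrier concentrators but also gives an abuser more room. Keep it bounded.
- Test the new limits against the perf profile before committing; tests-perf/k6/scenarios/ws-storm.js is the right exercise.
- docker compose restart does not apply changes to a service's environment as defined by Compose. If the container receives these values from Compose rather than fetching them from Doppler at startup, recreate it instead (docker compose up -d --force-recreate ebit-rt), then confirm the live values as in §Triage 2.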
B. Medium-term: horizontal scale-out (requires Redis adapter, code change)
Currently blocked. Required to ship:
- npm install @socket.io/redis-adapter in ebit-api/.
- Wire the adapter in apps/rt/src/main.ts so emits fan out across replicas via cache Redis pub/sub.
- Move WsThrottlerService.ipConnections from in-process Map to a Redis-backed sorted set (the throttler guard already uses a Redis storage; the per-IP connection counter does not).
- Verify that clientSockets (per-user delivery) still works. This is the highest-risk change.
- ADR required: {{TBD: ADR for socket.io Redis adapter and per-IP counter migration — currently not authored}}. Reference ../adr/ for the format.
- Roll out behind a flag; canary one replica before full multi-replica.
The architectural rationale and the gap are documented in ../flows/rt-websocket.md §6 and AF-3 in ../architecture.md. This is the durable fix; the short-term lever above only buys time.
C. Hot-fix: iptables rate-limit at the host (confirmed abuse only)
If §Triage 3 identifies a small set of abusive IPs and edge protection (CDN / WAF) is unavailable, drop at the host:
# Drop new connections from <abuse-ip> to the rt port
sudo iptables -I INPUT -p tcp --dport 4001 -s <abuse-ip> -j DROP
# Or rate-limit: drop new connections from any source that has opened
# 5 or more within the last second (these rules implement no separate burst allowance)
sudo iptables -I INPUT -p tcp --dport 4001 -m state --state NEW \
  -m recent --set
sudo iptables -I INPUT -p tcp --dport 4001 -m state --state NEW \
  -m recent --update --seconds 1 --hitcount 5 -j DROP
Reverse with sudo iptables -D INPUT .... Document every rule you add in the incident channel; remove them at incident close.
In production the right home for these rules is the WAF / security group, not host iptables — {{TBD: production edge rate-limit procedure — depends on customer-team CDN/WAF choice (mirrors the same TBD in the master incident runbook)}}.
D. Reduce push volume (defensive, no code change)
Some pushes are best-effort and can be dropped under load. Specifically the UsersOnlineUpdated broadcast (10s cadence, see ../flows/rt-websocket.md §6.3) and any leaderboard tickers can be temporarily silenced via env flag — {{TBD: confirm env flag exists; if not, this option is unavailable until shipped}}.
Verification
After applying a fix:
- Connection success rate > 99% on fresh handshakes: open dropbet in 5 incognito tabs over 30 s; all should connect.
- p95 handshake latency < 200 ms (Grafana service-overview rt panel). Beyond 500 ms indicates the gateway is still saturated.
- No Too many connections / Too many requests errors in Loki for the next 5 min after the fix lands.
- CPU returns to < 70% sustained on the rt container (check with docker stats ebit-rt).
- Throttler ban count drops to zero: docker exec ebit-redis redis-cli -a cache --scan --pattern 'throttle:*' | wc -l.
- End-to-end smoke: place a bet via the UI; confirm WS-pushed BalanceUpdated (or equivalent) arrives within ~1 s.
Prevention
- Capacity planning: every release cuts a perf run that includes ws-storm.js. Track p95 handshake + active-connections-per-replica trend across releases. The ceiling is the data, not the guess.
- Alert on ws active connections at 80% of measured ceiling: needs the ceiling first (perf-derived). File a follow-up.
- Monitor BullMQ queue depth as a ws push proxy: if bull:*-broadcast:wait climbs > 100, the gateway is dropping or back-pressuring. Currently no Grafana alert wired: {{TBD: file alert for bullmq broadcast queue depth > 100 for > 60 s}}.
- Ship the Redis adapter (§Fix B). Until that lands, ebit-rt is single-replica and every Black Friday-shaped event puts you in this runbook.
- Edge rate-limit on :4001 at the CDN / WAF layer, the same hygiene as auth endpoints. Currently not configured.
Cross-references
- ../flows/rt-websocket.md: full rt design, the AF-3 scale-out gap, online-count details
- ../architecture.md: AF-3 in the known-weakness register
- ../adr/: ADR home for the Redis-adapter follow-up
- bullmq-job-stuck.md: when ws back-pressure shows up downstream as queue depth
- redis-memory-pressure.md: adapter pub/sub increases cache Redis load; sizing matters
- ../handover/oncall-runbook.md §3 (WebSocket handshake storm): first-response procedure that brought you here