Performance Testing Methodology

1. Goal

Validate service-level objectives (SLOs) under a stepped ramp from 50 to 10,000 concurrent virtual users. The ramp profile is:

50 -> 1,000 -> 2,500 -> 5,000 -> 7,500 -> 10,000

The primary deliverables are:

  • The concurrency ceiling at which each SLO first breaches.
  • A bottleneck attribution for every breach, traced back to a specific subsystem (CPU, Postgres, Redis, BullMQ, or application code).
  • Reproduction steps that allow any engineer to re-run the test and arrive at comparable numbers.

2. Tooling Rationale

k6 (synthetic load generation)

k6 drives both HTTP and WebSocket traffic. It supports Prometheus remote-write natively (--out experimental-prometheus-rw), which feeds real-time panels in Grafana without any intermediary. k6 scripts live under tests-perf/k6/ and tests-perf/profiles/.

Playwright canary (real-browser UX validation)

A small Playwright suite runs alongside k6 to measure real-browser metrics (LCP, FCP, bet-place latency) under load. This validates that the user experience degrades gracefully rather than catastrophically. Canary tests live under tests-perf/playwright-canary/.

Playwright is not used for load generation. A single Chromium instance consumes roughly 300 MB of RAM and significant CPU, so 10,000 concurrent sessions would need on the order of 3 TB of RAM. Scaling to that level is not feasible, nor is it the tool's design intent.

Existing observability stack

The platform already emits OpenTelemetry traces to Jaeger, spanmetrics-derived counters/histograms to Prometheus, and structured logs to Loki. Grafana dashboards unify all three signals. No additional instrumentation is required for bottleneck attribution.

Relevant metric families:

| Metric | Source | Labels |
| --- | --- | --- |
| calls_total | spanmetrics connector | service_name, span_name, span_kind, status_code |
| duration_milliseconds_bucket | spanmetrics connector | same as above; buckets: 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000 ms |
| http_server_duration_milliseconds_bucket | OTel HTTP instrumentation | http_route, http_status_code, service_name |
| bullmq_queue_jobs | custom gauge | queue, state |

3. Test Environment

Single-VM development setup

A single 16 GB VM runs both the load generator and all services (NestJS apps, Postgres, Redis, Docker infrastructure). This is the default for local development and CI.

Caveat: Numbers from a co-located setup are directional, not authoritative. The load generator competes with the services under test for CPU and RAM, which means observed latencies will be higher (and throughput ceilings lower) than in production. Customers should re-run on dedicated infrastructure for authoritative results.

Multi-VM production-grade setup

For isolated, reproducible results, deploy the test environment across dedicated VMs using the Terraform modules:

| Module | Path | Role |
| --- | --- | --- |
| Monitoring VM | terraform/modules/monitoring | Prometheus, Grafana, Loki, Jaeger, OTel Collector |
| Application VM | terraform/modules/app | NestJS apps (api, rt, bj, bo, speed-roulette), Postgres, Redis |
| Perf wiring | terraform/perf/ | Ties monitoring and app modules together for perf environments |

In a multi-VM layout, the load generator runs on a third machine (or on the monitoring VM, if the monitoring stack leaves enough CPU and RAM headroom). This eliminates contention between k6 and the services under test.

4. SLO Definitions

Endpoint latency targets

| Endpoint | Method | p95 Target | Notes |
| --- | --- | --- | --- |
| /auth/sign-in | POST | 150 ms | bcrypt is irreducible at ~60-80 ms per hash; the budget allows for DB lookup and session creation on top |
| /casino/games/house/dice/bet | POST | 100 ms | Baseline measured at 108 ms with 1 VU -- this endpoint does not meet SLO even at minimal load. Pre-noted as SLO-unmet; optimization work is required before this target is achievable |
| /bets | GET | 50 ms | Paginated bet history |
| /accounting/balances | GET | 50 ms | Cached balance lookup |
| rt WebSocket handshake | WS | 200 ms | Measured as time from TCP connect to receipt of AuthSuccess event |

System-level SLOs

  • Error rate: less than 0.1% per endpoint across the full test duration.
  • Queue stability: bullmq_queue_jobs{state="wait"} must not trend upward over any 2-minute window. Sustained growth indicates worker throughput is below arrival rate.
  • No OOM kills: no container may be killed by the kernel OOM killer during the test. Validated via dmesg and container exit codes.
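The queue-stability rule can be checked mechanically by fitting a line to the gauge samples in each 2-minute window and flagging a positive slope. The sketch below is one possible reading of "trend upward", not the project's implementation; the sample shape ({ t, v }), the function names, and the zero tolerance are assumptions. PromQL's deriv() applies the same least-squares idea on the server side.

```javascript
// Least-squares slope of gauge samples, in jobs per second.
// samples: [{ t: <unix seconds>, v: <gauge value> }, ...] (assumed shape)
function slopePerSecond(samples) {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((s, p) => s + p.t, 0) / n;
  const meanV = samples.reduce((s, p) => s + p.v, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.v - meanV);
    den += (p.t - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Returns true if any full window of `windowSec` seconds has a fitted
// slope above `tolerance` (jobs per second). Samples must be sorted by t.
function queueIsGrowing(samples, windowSec = 120, tolerance = 0) {
  for (let i = 0; i < samples.length; i++) {
    const start = samples[i].t;
    const win = samples.filter((p) => p.t >= start && p.t <= start + windowSec);
    const last = win[win.length - 1];
    // Skip trailing windows that do not span the full duration.
    if (last.t - start < windowSec) continue;
    if (slopePerSecond(win) > tolerance) return true;
  }
  return false;
}
```

A small positive tolerance may be preferable in practice, since short bursts can produce a briefly positive slope without indicating a sustained backlog.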

5. Stepped-Ramp Protocol

Stage definition

| Stage | Target VUs | Duration | Purpose |
| --- | --- | --- | --- |
| 1 -- Warmup | 50 | 2 min | Populate caches, establish baseline |
| 2 | 1,000 | 5 min | Light production-equivalent load |
| 3 | 2,500 | 5 min | Moderate concurrency |
| 4 | 5,000 | 5 min | High concurrency |
| 5 | 7,500 | 5 min | Stress region |
| 6 | 10,000 | 5 min | Peak target |

Total test duration: 27 minutes.
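As a cross-check on the stage table, the following sketch encodes the stages as data, sums their durations, and maps them to the { duration, target } shape that k6 uses for `stages` entries. The names stageTable, totalMinutes, and toK6Stages are illustrative; the actual contents of tests-perf/profiles/stepped-ramp.js are not shown in this document.

```javascript
// Stage table from this section; durations are in minutes.
const stageTable = [
  { name: "1 -- Warmup", targetVUs: 50, minutes: 2 },
  { name: "2", targetVUs: 1000, minutes: 5 },
  { name: "3", targetVUs: 2500, minutes: 5 },
  { name: "4", targetVUs: 5000, minutes: 5 },
  { name: "5", targetVUs: 7500, minutes: 5 },
  { name: "6", targetVUs: 10000, minutes: 5 },
];

// Sum of all stage durations, in minutes.
function totalMinutes(table) {
  return table.reduce((sum, stage) => sum + stage.minutes, 0);
}

// Convert to the { duration, target } entries used by k6 `stages`.
function toK6Stages(table) {
  return table.map((stage) => ({
    duration: `${stage.minutes}m`,
    target: stage.targetVUs,
  }));
}

console.log(totalMinutes(stageTable)); // 27
```

Note that k6 ramps VUs linearly toward each stage's target over the stage duration. If a stage is meant to hold at its target for the full 5 minutes, a short ramp entry followed by a hold entry would be needed, which would change the total duration.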

Auto-abort thresholds

The k6 scenario must abort the current stage (and log the breach) when either of the first two conditions below holds for 30 consecutive seconds, or as soon as the third is detected:

  • p95 latency exceeds 2x the SLO budget for any monitored endpoint.
  • Aggregate error rate exceeds 1%.
  • Any container is OOM-killed (detected via a sidecar health check or docker events watcher).
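The latency and error-rate conditions map naturally onto k6 thresholds with abortOnFail. The sketch below derives a thresholds object from the Section 4 p95 budgets for the HTTP endpoints. It is an approximation under stated assumptions: the `name` tag values (sign-in, dice-bet, bets, balances) are hypothetical and would have to match how the scripts tag their requests; delayAbortEval postpones evaluation rather than requiring a sustained 30-second breach; and abortOnFail stops the whole k6 run, not a single stage. OOM detection is outside k6 and is not covered here.

```javascript
// p95 SLO budgets in milliseconds (Section 4, HTTP endpoints only).
// Keys are hypothetical k6 `name` tag values, not confirmed by the scripts.
const sloP95Ms = {
  "sign-in": 150, // POST /auth/sign-in
  "dice-bet": 100, // POST /casino/games/house/dice/bet
  bets: 50, // GET /bets
  balances: 50, // GET /accounting/balances
};

// Build a k6 `thresholds` object that aborts when p95 exceeds
// `multiplier` times the budget, or the aggregate error rate exceeds 1%.
function buildAbortThresholds(slos, multiplier = 2, graceWindow = "30s") {
  const thresholds = {
    http_req_failed: [
      { threshold: "rate<0.01", abortOnFail: true, delayAbortEval: graceWindow },
    ],
  };
  for (const [name, budgetMs] of Object.entries(slos)) {
    thresholds[`http_req_duration{name:${name}}`] = [
      {
        threshold: `p(95)<${budgetMs * multiplier}`,
        abortOnFail: true,
        delayAbortEval: graceWindow,
      },
    ];
  }
  return thresholds;
}
```

In a k6 script the result would be assigned to `options.thresholds`. Because k6 evaluates thresholds over all samples collected so far, a strict "30 consecutive seconds" rule would likely need an external watcher on the Prometheus data instead.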

Artifact capture

At each breach moment, capture:

  1. A Grafana dashboard snapshot covering the 5 minutes surrounding the breach.
  2. Jaeger exemplar trace IDs for the slowest requests at the breach point.
  3. The k6 summary output at the time of abort.

6. Bottleneck-Hunting Runbook

Symptom-to-diagnosis map

| Metric Pattern | Diagnosis | PromQL Example |
| --- | --- | --- |
| duration_milliseconds p95 rising across all routes simultaneously | CPU or RAM saturation at the container level. Check container health before investigating application code. | histogram_quantile(0.95, sum(rate(duration_milliseconds_bucket{span_kind="SPAN_KIND_SERVER"}[1m])) by (le, service_name)) |
| prisma:engine:db_query p95 rising | Postgres connection pool saturation. The default pool size is ~10 connections per service instance. | histogram_quantile(0.95, sum(rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])) by (le, service_name)) |
| ioredis command rate hitting a ceiling | Cache Redis instance is saturated (CPU-bound, since Redis is single-threaded). | sum(rate(calls_total{span_kind="SPAN_KIND_CLIENT",span_name=~"(?i)(get\|set\|del\|evalsha\|hget\|hset\|expire\|publish)"}[1m])) by (service_name) |
| bullmq_queue_jobs{state="wait"} growing over time | Worker throughput is below arrival rate. Either add workers or optimize job processing time. | bullmq_queue_jobs{state="wait"} (raw gauge, watch for sustained positive slope) |
| rt socket count per instance exceeds expected capacity | The rt service stores socket state in an in-process Map. Without a Redis adapter for socket.io, sticky sessions are required and per-instance capacity is bounded by memory. | No built-in metric; monitor via docker exec ebit-rt node -e "..." or add a custom gauge. The rt service does not currently expose a connection count metric. |
| Container CPU utilization above 90% or RSS approaching memory limit | Vertical scaling ceiling reached. Scale up (larger instance) or scale out (more replicas). | Not available without cadvisor/node_exporter -- this is a known observability gap. Use docker stats as a stopgap. |
| k6_http_req_failed rising | Upstream is returning 5xx errors. Correlate the failing route with Jaeger exemplar traces to identify the root cause. | sum(rate(k6_http_req_failed[1m])) by (url) (available only when k6 remote-write is active) |
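When reading the p95 queries above, it helps to remember that histogram_quantile does not see individual request durations. It interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that interpolation for the spanmetrics bucket bounds listed in Section 2; it is a simplified illustration, not Prometheus's code, and the example counts are made up.

```javascript
// Upper bounds (ms) of the spanmetrics buckets from Section 2, plus +Inf.
const upperBoundsMs = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, Infinity];

// Estimate quantile q from cumulative bucket counts, where
// cumulativeCounts[i] is the number of observations <= bounds[i].
function histogramQuantile(q, bounds, cumulativeCounts) {
  const total = cumulativeCounts[cumulativeCounts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = cumulativeCounts.findIndex((count) => count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (bounds[i] === Infinity) return bounds[i - 1];
  const lower = i === 0 ? 0 : bounds[i - 1];
  const countBelow = i === 0 ? 0 : cumulativeCounts[i - 1];
  const inBucket = cumulativeCounts[i] - countBelow;
  if (inBucket === 0) return lower;
  // Linear interpolation within the bucket.
  return lower + (bounds[i] - lower) * ((rank - countBelow) / inBucket);
}

// Illustrative counts: 80 of 100 requests at or under 100 ms, 96 at or under 250 ms.
const exampleCounts = [0, 0, 10, 40, 80, 96, 100, 100, 100, 100, 100];
console.log(histogramQuantile(0.95, upperBoundsMs, exampleCounts)); // 240.625
```

One consequence is that targets falling between bucket bounds, such as the 150 ms sign-in budget between the 100 ms and 250 ms bounds, are only estimated by interpolation. A reported p95 near such a target should be confirmed against k6's own summary or Jaeger traces before declaring a breach.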

Investigating a specific breach

  1. Open the Grafana dashboard and identify the timestamp where the SLO breach begins.
  2. Filter Jaeger traces to the affected service and time window. Sort by duration descending.
  3. In the slowest trace, identify which span contributes the most wall-clock time (database query, Redis call, downstream HTTP, or application code).
  4. Cross-reference with Loki logs for the same trace_id to check for error messages or warnings.
  5. If the bottleneck is infrastructure (Postgres, Redis), check connection pool metrics and resource utilization. If it is application code, profile the specific handler.
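Step 3 can be made less subjective by ranking spans by self time, that is, a span's duration minus the time spent in its direct children. The sketch below assumes a simplified span shape ({ spanId, parentSpanId, name, durationMs }) rather than Jaeger's actual export format, and assumes child spans run sequentially; overlapping children would make the subtraction overcount, which is why the result is clamped at zero.

```javascript
// Rank spans by self time (own duration minus direct children's durations).
// spans: [{ spanId, parentSpanId, name, durationMs }, ...] (assumed shape)
function rankBySelfTime(spans) {
  // Total duration of direct children, keyed by parent span ID.
  const childTotalMs = new Map();
  for (const span of spans) {
    if (span.parentSpanId == null) continue;
    const soFar = childTotalMs.get(span.parentSpanId) || 0;
    childTotalMs.set(span.parentSpanId, soFar + span.durationMs);
  }
  return spans
    .map((span) => ({
      name: span.name,
      selfMs: Math.max(0, span.durationMs - (childTotalMs.get(span.spanId) || 0)),
    }))
    .sort((a, b) => b.selfMs - a.selfMs);
}

// Illustrative trace: most wall-clock time sits in the database query.
const exampleTrace = [
  { spanId: "a", parentSpanId: null, name: "GET /bets", durationMs: 120 },
  { spanId: "b", parentSpanId: "a", name: "BetsController.list", durationMs: 110 },
  { spanId: "c", parentSpanId: "b", name: "prisma:engine:db_query", durationMs: 90 },
  { spanId: "d", parentSpanId: "b", name: "get", durationMs: 5 },
];
console.log(rankBySelfTime(exampleTrace)[0]); // { name: 'prisma:engine:db_query', selfMs: 90 }
```

The span names in the example other than prisma:engine:db_query are invented for illustration.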

7. Reproduction

All commands assume the working directory is the repository root.

Start infrastructure

sudo docker compose up -d

This starts Postgres, Redis, RabbitMQ, and the observability stack (Prometheus, Grafana, Loki, Jaeger, OTel Collector).

Run a k6 smoke test

k6 run --out experimental-prometheus-rw tests-perf/k6/smoke.js

The smoke test sends minimal traffic (1-5 VUs) to verify that all endpoints respond correctly and that k6 metrics appear in Grafana.

Run the stepped ramp

k6 run --out experimental-prometheus-rw tests-perf/profiles/stepped-ramp.js

This executes the full 27-minute ramp described in Section 5. Auto-abort thresholds are configured within the script.

Run the Playwright canary

npx playwright test tests-perf/playwright-canary/

Run this in a separate terminal while k6 is active. The canary reports LCP, FCP, and bet-place latency as the system is under load.

Open dashboards

| Tool | URL |
| --- | --- |
| Grafana | http://localhost:3000 |
| Jaeger | http://localhost:16686 |

8. Limitations

  • Co-located load generation. On a single-VM setup, k6 competes with the services under test for CPU and RAM. Observed latencies will be pessimistic, and measured throughput ceilings will be understated as well, because k6 itself may become the bottleneck before the services do.

  • Stubbed external services. reCAPTCHA validation is bypassed in the test environment. The Fast Track integration is disabled (the RabbitMQ producer is stubbed). The EVO wallet is stubbed. These stubs remove real-world latency and failure modes from the test.

  • Single Postgres instance. Production may use read replicas to offload query traffic. The test environment runs a single instance, so Postgres will saturate earlier than in a replicated deployment.

  • No Redis Cluster. Redis runs as a single instance. In production, a cluster or sentinel setup would provide higher throughput and failover.

  • rt WebSocket lacks a Redis adapter. The socket.io gateway in the rt service does not use a Redis adapter for pub/sub. This means horizontal scaling requires sticky sessions, and per-instance socket capacity is bounded by the in-process connection Map.

  • k6 WebSocket scenario scope. The k6 WebSocket scenario measures transport-level metrics (handshake time, message round-trip). It does not exercise full game logic, which requires multi-step stateful interactions that are better validated by the Playwright canary.

9. Kernel Tuning Checklist (Load Generator)

These settings are required on the load generator machine to sustain 10,000+ concurrent connections. Apply them before running the stepped ramp.

| Setting | Command | Value | Rationale |
| --- | --- | --- | --- |
| Open file limit | ulimit -n 65536 | 65536 | Each TCP socket consumes one file descriptor. The default limit (typically 1024) is insufficient for 10k connections. |
| Listen backlog | sysctl net.core.somaxconn=65535 | 65535 | Increases the accept queue depth for listening sockets, preventing connection drops under burst. |
| TIME_WAIT reuse | sysctl net.ipv4.tcp_tw_reuse=1 | 1 | Allows reuse of sockets in TIME_WAIT state for new outbound connections, reclaiming ports faster. |
| Local port range | sysctl net.ipv4.ip_local_port_range="1024 65535" | 1024-65535 | Expands the ephemeral port range from the default (~28k ports) to ~64k ports. |
| FIN timeout | sysctl net.ipv4.tcp_fin_timeout=15 | 15 seconds | Reduces the time sockets spend in FIN_WAIT_2 state, accelerating teardown. Default is 60 seconds. |
| System-wide fd limit | sysctl fs.file-max=2097152 | 2097152 | Raises the kernel-level cap on open file descriptors across all processes. |

To apply sysctl settings persistently, add them to /etc/sysctl.d/99-perf.conf and run sysctl --system. The ulimit setting must be configured in /etc/security/limits.conf or the shell profile for persistence.
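The arithmetic behind the file-descriptor and port-range rows can be expressed as a small preflight check. The sketch below is illustrative only: the function name and input shape are assumptions, and it takes the worst case in which every connection targets the same destination address and port, so each one needs its own ephemeral port. Reading the live values (for example from ulimit -n and sysctl) is left out.

```javascript
// Check whether the generator's limits leave room for the target
// connection count. Worst case: all connections share one destination.
function checkGeneratorLimits({ targetConnections, fdLimit, portRangeLow, portRangeHigh }) {
  const ephemeralPorts = portRangeHigh - portRangeLow + 1;
  const problems = [];
  // One file descriptor per socket, plus headroom for the process itself.
  if (fdLimit <= targetConnections) {
    problems.push(`fd limit ${fdLimit} is not above ${targetConnections} connections`);
  }
  // One ephemeral port per outbound connection to the same destination.
  if (ephemeralPorts <= targetConnections) {
    problems.push(`only ${ephemeralPorts} ephemeral ports for ${targetConnections} connections`);
  }
  return { ephemeralPorts, problems };
}

// Typical defaults from the table: fd limit 1024, port range 32768-60999.
const defaults = checkGeneratorLimits({
  targetConnections: 10000,
  fdLimit: 1024,
  portRangeLow: 32768,
  portRangeHigh: 60999,
});
console.log(defaults.ephemeralPorts); // 28232
console.log(defaults.problems.length); // 1
```

With default settings the file-descriptor limit is the first hard stop. The wider port range matters mainly once connection churn leaves many ports in TIME_WAIT, which this static check does not model.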