
Performance Test Run Checklist

Operator-facing pre-flight, in-flight, and post-flight checklist for the stepped-ramp test. Designed to be skimmed under pressure.

Pre-flight (before terraform apply)

  • [ ] AWS credentials valid: aws sts get-caller-identity returns the perf account
  • [ ] terraform/perf/terraform.tfvars exists (copied from .tfvars.example), admin_cidrs set or left empty for auto-detect
  • [ ] cd terraform/perf && terraform init && terraform plan reports roughly 58 to add, 0 to change, 0 to destroy
  • [ ] AWS service-quota headroom checked in eu-north-1: VPC limit, EIP limit, c7g instance family vCPU quota
  • [ ] Local workstation has: awscli, docker, arm64 k6 binary, Node 22+, pnpm
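
The workstation bullet above can be scripted so that a missing tool surfaces before terraform apply. This is a minimal sketch, assuming the binaries are named aws, docker, k6, node, and pnpm on PATH (the checklist names the packages, not the binaries). It does not verify that k6 is the arm64 build.

```shell
# Prints the name of every command that is missing from PATH, one per line.
# Returns 1 if anything is missing, 0 otherwise.
missing_tools() {
  local missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Returns 0 when the major version in a string such as "v22.3.0"
# is at least the given minimum.
node_major_at_least() {
  local version="${1#v}"        # strip the leading "v"
  local major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ]
}
```

A possible invocation: missing_tools aws docker k6 node pnpm && node_major_at_least "$(node --version)" 22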

Apply + provision (~10 min)

  • [ ] terraform apply -auto-approve completes without error
  • [ ] Capture outputs — save to a scratch file for the rest of this checklist:
    terraform output -json > /tmp/perf-outputs.json
    # Keys: monitoring_url, grafana_url, jaeger_url, sut_public_ip, loadgen_public_ip
    
  • [ ] ECR login from workstation:
    aws ecr get-login-password --region eu-north-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.eu-north-1.amazonaws.com
    
  • [ ] Build + push all 5 ebit-api images + ebit-fe (per terraform/perf/README.md push loop)
  • [ ] SSH to SUT VM, pull and start services:
    ssh -i key.pem ubuntu@<sut_public_ip>
    cd /opt/ebit && docker compose pull && docker compose up -d
    
  • [ ] Wait for API health: curl -sf http://<sut_public_ip>:4000/api/health returns 200
  • [ ] Run Prisma migrations + seed:
    docker compose exec api npm run db:migrate:deploy && docker compose exec api npm run db:seed
    
  • [ ] Seed load-test users on loadgen VM (or from workstation):
    API_URL=http://<sut_public_ip>:4000 TEST_USER_COUNT=10000 npx tsx tests-perf/seed/seed-load-users.ts
    
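  • [ ] Verify users.json on loadgen has 10,000 entries: wc -l users.json (the line count only equals the entry count if the file holds one user per line; for a single JSON array, jq length users.json reports the count directly)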
  • [ ] Open Grafana at <grafana_url>, navigate to dashboard ebit-perf-test, and confirm the datasource connects ("No data" in panels is fine pre-test; red error banners are not)
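
Two of the steps above lend themselves to small helpers: reading a value back out of the saved Terraform outputs, and waiting for the API health endpoint instead of re-running curl by hand. This is a sketch under assumptions: jq is installed (it is not in the workstation tool list), and the retry count and delay are illustrative, not values from this checklist.

```shell
# Reads one value from the file written by `terraform output -json`,
# where each output name maps to an object with a "value" field.
tf_output() {
  jq -r --arg key "$1" '.[$key].value' "${2:-/tmp/perf-outputs.json}"
}

# Retries a command until it succeeds or the attempt budget is spent.
# Usage: retry_until <attempts> <delay_seconds> <command...>
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, retry_until 30 5 curl -sf -o /dev/null "http://$(tf_output sut_public_ip):4000/api/health" polls for up to about 2.5 minutes and returns non-zero if the API never comes up.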

Smoke test (50 VUs / 1 min)

  • [ ] From loadgen VM:
    k6 run --out experimental-prometheus-rw tests-perf/k6/smoke.js
    
  • [ ] Exit code 0, all 4 scenarios ran (signin, bet, history, ws)
  • [ ] Grafana: k6_vus panel shows 50, Service RED row shows non-zero request rate
  • [ ] Click a latency spike in Grafana — verify Jaeger trace link opens a real trace
  • [ ] Sign-in p95 < 200 ms (bcrypt overhead expected)
  • [ ] Dice bet p95 >= 100 ms is expected (pre-existing SLO exception — not a blocker)
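
The experimental-prometheus-rw output takes its target from environment variables rather than from the k6 command line. The fragment below is a sketch, assuming the monitoring host accepts Prometheus remote write at /api/v1/write (that path is an assumption, and provisioning may already set these variables on the loadgen VM). According to the k6 documentation, trend metrics are exported as p(99) only by default, so p95 panels need p(95) requested explicitly.

```shell
# Assumption: <monitoring_url> is the monitoring_url Terraform output and
# Prometheus there has its remote-write receiver enabled.
export K6_PROMETHEUS_RW_SERVER_URL="<monitoring_url>/api/v1/write"

# Export p95 alongside p99 so the p95 checks above have data to read.
export K6_PROMETHEUS_RW_TREND_STATS="p(95),p(99)"
```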

Stepped ramp (1k to 10k VUs / 42 min)

  • [ ] Open in separate browser tabs: Grafana perf-test dashboard, Jaeger search, SUT docker stats
  • [ ] Start screen recording if possible (evidence for report)
  • [ ] Launch ramp:
    k6 run --out experimental-prometheus-rw tests-perf/profiles/stepped-ramp.js
    
  • [ ] At each stage-transition annotation in Grafana, note:
      • p95 per endpoint (scorecard table in #67)
      • Error rate (k6 Error Rate panel)
      • WS handshake success rate
      • CPU/memory on SUT (cadvisor panel or docker stats)
  • [ ] Manual abort if any of: sign-in p95 > 500 ms, error rate > 5%, CPU pegged at 100% for > 60 s (k6 auto-abort catches SLO breaches, but manual cutoff limits blast radius)
  • [ ] Per stage, capture:
      • Grafana dashboard screenshot (full width)
      • k6 summary export
      • 2-3 Jaeger exemplar trace IDs from the slowest endpoint at that stage
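
The manual-abort rule above is three comparisons, which can be written down once so nobody has to re-derive it mid-run. This is a minimal sketch: the function name and argument order are hypothetical, and the operator still reads the three numbers off Grafana or docker stats by hand.

```shell
# Returns 0 (abort) when any manual-abort condition from the checklist holds.
# Arguments: sign-in p95 in ms, error rate in percent,
#            seconds the SUT CPU has been pegged at 100%.
should_abort() {
  awk -v p95="$1" -v err="$2" -v cpu_s="$3" 'BEGIN {
    # awk exits 0 when at least one threshold is exceeded.
    exit !(p95 > 500 || err > 5 || cpu_s > 60)
  }'
}
```

For example, should_abort 520 1.2 0 succeeds (abort, because sign-in p95 exceeds 500 ms), while should_abort 480 4.9 60 fails (continue, because no value is strictly over its limit).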

Post-test

  • [ ] Copy k6 results from loadgen to workstation: scp loadgen:~/results/* tests-perf/results/
  • [ ] Export Grafana dashboard snapshots per stage (or use screen recording)
  • [ ] Collect Jaeger trace IDs for slowest and failed requests
  • [ ] Fill docs/performance-test-report.md (#67): replace every {{TBD}} with captured numbers, trace IDs, and screenshot paths
  • [ ] Tear down infrastructure:
    cd terraform/perf && terraform destroy -auto-approve
    
  • [ ] Verify ECR repos deleted (force_delete=true should handle this)
  • [ ] Check AWS Cost Explorer after 24 hours — expected cost ~$1.50 for ~1.5 hr of c7g instances
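
Before closing out the report step, it can help to confirm that no placeholder was left behind. This is a small sketch; the helper name is hypothetical, and it counts occurrences of the literal {{TBD}} marker in whatever file it is given.

```shell
# Prints how many {{TBD}} placeholders remain in the given file.
remaining_tbd() {
  # -F matches the braces literally; -o emits one line per occurrence.
  grep -o -F '{{TBD}}' "$1" | wc -l | tr -d ' '
}
```

Running remaining_tbd docs/performance-test-report.md should print 0 once the report is complete.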

Abort-and-triage matrix

If a stage breaches, use this table to decide whether to iterate or halt.

| Breach type | Likely cause | First query to run | Decision |
|---|---|---|---|
| Prisma db_query p95 climbing | Postgres connection pool exhaustion (default ~10/service) | histogram_quantile(0.95, sum by (le) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m]))) | If pool-tunable: increase pool, re-smoke, resume. If architectural: halt, document ceiling. |
| BullMQ wait depth growing | Worker throughput < arrival rate | bullmq_queue_jobs{state="wait"} | If single queue: throttle that scenario, resume. If widespread: halt. |
| WS handshake timeouts | rt event-loop saturation or socket Map exhaustion | Check rt container CPU + RSS via docker stats. No built-in metric yet. | Halt if CPU saturated: rt can't scale horizontally without a Redis adapter. |
| HTTP 429 spike | Throttle guard triggered (test artifact, not a real bottleneck) | sum(rate(calls_total{status_code="STATUS_CODE_ERROR",span_name=~".*"}[1m])) by (span_name), then check whether the errors are 429s | Reconfigure throttle limits or add rate-limit bypass header, then resume. |
| Container OOM killed | RSS exceeded Docker memory limit | dmesg \| grep -i oom on SUT | Halt. Increase container memory limit or reduce concurrency. |
| k6 itself saturated | Loadgen CPU at 100%, k6 can't generate target VU rate | top on loadgen VM (k6 process CPU) | Results unreliable. Halt, note loadgen as bottleneck, consider distributing k6. |

When NOT to proceed to the next stage

Do not advance to stage N+1 unless stage N stabilized inside its SLO threshold within 2 minutes. "Stabilize" means p95 is flat (not still climbing), error rate is flat, and BullMQ wait depth is flat. If any metric is still trending upward at the 2-minute mark, the system has not absorbed the current load, and adding more will only produce compounding failures that obscure the original bottleneck. Halt the ramp, capture evidence at the current stage, and document the ceiling. Resuming from a higher stage later (after a fix) is always an option; pushing through a breach is not.
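
The "flat" test can be made less subjective by comparing the newest sample in the 2-minute window against the oldest. This is a rough sketch, not a definition from this checklist: the tolerance percentage is an assumption, and a first-versus-last comparison ignores spikes in between. It would be run once each for p95, error rate, and BullMQ wait depth, with samples read off the dashboard.

```shell
# Returns 0 when a series looks flat: the newest sample is at most
# tolerance_pct percent above the oldest. Returns 2 with fewer than two samples.
# Usage: is_flat <tolerance_pct> <oldest> ... <newest>
is_flat() {
  local tolerance_pct="$1"
  shift
  if [ "$#" -lt 2 ]; then
    return 2
  fi
  printf '%s\n' "$@" | awk -v tol="$tolerance_pct" '
    NR == 1 { first = $1 }   # oldest sample
    { last = $1 }            # ends up holding the newest sample
    END { exit !(last <= first * (1 + tol / 100)) }
  '
}
```

For example, is_flat 5 180 182 181 183 succeeds, while is_flat 5 180 200 230 260 fails. Because the check is relative to the first sample, any rise from a baseline of zero (for instance, an error rate that starts at 0) counts as climbing.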