Performance Test Run Checklist

Operator-facing pre-flight, in-flight, and post-flight checklist for the stepped-ramp test. Designed to be skimmed under pressure.

Pre-flight (before terraform apply)

[ ] AWS credentials valid: aws sts get-caller-identity returns the perf account
[ ] terraform/perf/terraform.tfvars exists (copied from .tfvars.example), admin_cidrs set or left empty for auto-detect
[ ] cd terraform/perf && terraform init && terraform plan shows ~58 adds, 0 change, 0 destroy
[ ] AWS service-quota headroom checked in eu-north-1: VPC limit, EIP limit, c7g instance family vCPU quota
[ ] Local workstation has: awscli, docker, arm64 k6 binary, Node 22+, pnpm
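
The plan expectation above (adds only, nothing changed or destroyed) can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the plan is run with `-no-color` so the summary line is plain text; `check_plan_summary` is a hypothetical helper, not something that exists in the repo.

```shell
# Hypothetical helper: reads `terraform plan -no-color` output on stdin and
# succeeds only when the summary line reports 0 to change and 0 to destroy.
check_plan_summary() {
  # Keep the last "Plan:" line; Terraform prints it at the end of the plan.
  summary=$(grep 'Plan:' | tail -n 1)
  if [ -z "$summary" ]; then
    echo "no 'Plan:' summary line found" >&2
    return 2
  fi
  # Pull the integers that precede "to change" and "to destroy".
  to_change=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) to change.*/\1/p')
  to_destroy=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) to destroy.*/\1/p')
  echo "$summary"
  [ "$to_change" = "0" ] && [ "$to_destroy" = "0" ]
}

# Usage (run from terraform/perf):
#   terraform plan -no-color | check_plan_summary || echo "unexpected change/destroy"
```

The helper deliberately does not assert the add count, since the checklist only gives it as approximate (~58); compare that number by eye from the echoed summary line.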

Apply + provision (~10 min)

[ ] terraform apply -auto-approve completes without error

[ ] Capture outputs — save to a scratch file for the rest of this checklist:

terraform output -json > /tmp/perf-outputs.json
# Keys: monitoring_url, grafana_url, jaeger_url, sut_public_ip, loadgen_public_ip

[ ] ECR login from workstation:

aws ecr get-login-password --region eu-north-1 | docker login --username AWS --password-stdin <account>.dkr.ecr.eu-north-1.amazonaws.com

[ ] Build + push all 5 ebit-api images + ebit-fe (per terraform/perf/README.md push loop)

[ ] SSH to SUT VM, pull and start services:

ssh -i key.pem ubuntu@<sut_public_ip>
cd /opt/ebit && docker compose pull && docker compose up -d

[ ] Wait for API health: curl -sf http://<sut_public_ip>:4000/api/health returns 200
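
Waiting for health usually means polling. A minimal sketch of a retry loop around the same `curl -sf` call follows; `wait_for_health` and its default attempt count and interval are illustrative assumptions. Note that `-f` treats any status below 400 as success, which is slightly looser than a strict 200.

```shell
# Hypothetical helper: poll a URL until curl gets a non-error HTTP status.
# Usage: wait_for_health <url> [max_attempts] [sleep_seconds]
wait_for_health() {
  url="$1"
  max_attempts="${2:-60}"
  interval="${3:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s silences progress, -f makes HTTP >= 400 a non-zero exit,
    # -o /dev/null discards the response body.
    if curl -sf -o /dev/null "$url"; then
      echo "healthy after ${attempt} attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "not healthy after ${max_attempts} attempts" >&2
  return 1
}

# Usage:
#   wait_for_health "http://<sut_public_ip>:4000/api/health" 60 5
```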

[ ] Run Prisma migrations + seed:

docker compose exec api npm run db:migrate:deploy && docker compose exec api npm run db:seed

[ ] Seed load-test users on loadgen VM (or from workstation):

API_URL=http://<sut_public_ip>:4000 TEST_USER_COUNT=10000 npx tsx tests-perf/seed/seed-load-users.ts

[ ] Verify users.json on loadgen has 10,000 entries: wc -l users.json (the line count equals the entry count only if the file stores one user per line)
[ ] Open Grafana at <grafana_url>, navigate to dashboard ebit-perf-test, and confirm the datasource connects (panels showing "No data" are fine pre-test; red error banners are not)

Smoke test (50 VUs / 1 min)

[ ] From loadgen VM:

k6 run --out experimental-prometheus-rw tests-perf/k6/smoke.js

[ ] Exit code 0, all 4 scenarios ran (signin, bet, history, ws)
[ ] Grafana: k6_vus panel shows 50, Service RED row shows non-zero request rate
[ ] Click a latency spike in Grafana and verify that the Jaeger trace link opens a real trace
[ ] Sign-in p95 < 200 ms (bcrypt overhead expected)
[ ] Dice bet p95 >= 100 ms is expected (pre-existing SLO exception — not a blocker)

Stepped ramp (1k to 10k / 42 min)

[ ] Open in separate browser tabs: Grafana perf-test dashboard, Jaeger search, SUT docker stats
[ ] Start screen recording if possible (evidence for report)

[ ] Launch ramp:

k6 run --out experimental-prometheus-rw tests-perf/profiles/stepped-ramp.js

[ ] At each stage-transition annotation in Grafana, note:
  - p95 per endpoint (scorecard table in #67)
  - Error rate (k6 Error Rate panel)
  - WS handshake success rate
  - CPU/memory on SUT (cadvisor panel or docker stats)
[ ] Manual abort if any of: sign-in p95 > 500 ms, error rate > 5%, CPU pegged at 100% for > 60 s (k6 auto-abort catches SLO breaches, but manual cutoff limits blast radius)
[ ] Per stage, capture:
  - Grafana dashboard screenshot (full width)
  - k6 summary export
  - 2-3 Jaeger exemplar trace IDs from the slowest endpoint at that stage
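
The manual-abort thresholds above (sign-in p95 > 500 ms, error rate > 5%, CPU pegged for > 60 s) can be encoded as a quick check so nobody has to do the comparison in their head mid-run. This is a sketch only: `should_abort` is a hypothetical helper, and the three inputs are assumed to be read off the Grafana panels by the operator.

```shell
# Hypothetical helper encoding the manual-abort thresholds from this checklist.
# Usage: should_abort <signin_p95_ms> <error_rate_pct> <cpu_pegged_seconds>
# Exit status 0 means "abort now"; exit status 1 means "keep going".
should_abort() {
  awk -v p95="$1" -v err="$2" -v cpu="$3" 'BEGIN {
    # "+ 0" forces numeric comparison even if a value arrives as a string.
    if (p95 + 0 > 500) { print "abort: sign-in p95 " p95 " ms > 500 ms"; exit 0 }
    if (err + 0 > 5)   { print "abort: error rate " err "% > 5%"; exit 0 }
    if (cpu + 0 > 60)  { print "abort: CPU pegged for " cpu " s > 60 s"; exit 0 }
    print "ok: below manual-abort thresholds"
    exit 1
  }'
}

# Usage:
#   should_abort 430 1.2 0 || echo "continue ramp"
```

The comparisons are strict, matching the ">" wording of the checklist: a reading exactly at a threshold does not trigger an abort.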

Post-test

[ ] Copy k6 results from loadgen to workstation: scp loadgen:~/results/* tests-perf/results/
[ ] Export Grafana dashboard snapshots per stage (or use screen recording)
[ ] Collect Jaeger trace IDs for slowest and failed requests
[ ] Fill docs/performance-test-report.md (#67): replace every {{TBD}} with captured numbers, trace IDs, and screenshot paths

[ ] Tear down infrastructure:

cd terraform/perf && terraform destroy -auto-approve

[ ] Verify ECR repos deleted (force_delete=true should handle this)
[ ] Check AWS Cost Explorer after 24 hours — expected cost ~$1.50 for ~1.5 hr of c7g instances

Abort-and-triage matrix

If a stage breaches, use this table to decide whether to iterate or halt.

| Breach type | Likely cause | First query to run | Decision |
| --- | --- | --- | --- |
| Prisma db_query p95 climbing | Postgres connection pool exhaustion (default ~10/service) | `histogram_quantile(0.95, sum by (le) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])))` | If pool-tunable: increase pool, re-smoke, resume. If architectural: halt, document ceiling. |
| BullMQ wait depth growing | Worker throughput < arrival rate | `bullmq_queue_jobs{state="wait"}` | If single queue: throttle that scenario, resume. If widespread: halt. |
| WS handshake timeouts | rt event-loop saturation or socket Map exhaustion | Check rt container CPU + RSS via `docker stats`. No built-in metric yet. | Halt if CPU saturated: rt can't scale horizontally without Redis adapter. |
| HTTP 429 spike | Throttle guard triggered (test artifact, not a real bottleneck) | `sum(rate(calls_total{status_code="STATUS_CODE_ERROR",span_name=~".*"}[1m])) by (span_name)`, then check whether the errors are 429s | Reconfigure throttle limits or add rate-limit bypass header, then resume. |
| Container OOM killed | RSS exceeded Docker memory limit | `dmesg \| grep -i oom` on SUT | Halt. Increase container memory limit or reduce concurrency. |
| k6 itself saturated | Loadgen CPU at 100%, k6 can't generate target VU rate | `top` on loadgen VM, watching k6 process CPU | Results unreliable. Halt, note loadgen as bottleneck, consider distributing k6. |
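
The PromQL queries in the matrix can also be run from a terminal through the Prometheus HTTP API (`/api/v1/query`) when Grafana is slow to respond. A minimal sketch follows, assuming a Prometheus-compatible endpoint is reachable from where you run it; `promq` and the `PROM_URL` variable are hypothetical names, and the default URL is a placeholder rather than a value taken from the Terraform outputs.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumed environment variable; the default is only a placeholder.
promq() {
  prom_url="${PROM_URL:-http://localhost:9090}"
  # -G sends the url-encoded data as a query string on a GET request;
  # -f makes HTTP errors a non-zero exit so failures are not silent.
  curl -sfG "${prom_url}/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   promq 'bullmq_queue_jobs{state="wait"}'
```

Single-quote the query so the shell leaves the braces and double quotes inside the PromQL untouched.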

When NOT to proceed to the next stage

Do not advance to stage N+1 unless stage N has stabilized within its SLO thresholds by the 2-minute mark. "Stabilized" means that p95 is flat (no longer climbing), the error rate is flat, and BullMQ wait depth is flat. If any of these is still trending upward at the 2-minute mark, the system has not absorbed the current load, and adding more will only produce compounding failures that obscure the original bottleneck. Halt the ramp, capture evidence at the current stage, and document the ceiling. Resuming from a higher stage later (after a fix) is always an option; pushing through a breach is not.
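
"Flat" is a judgment call, so it helps to agree on a number before the run. One possible definition, sketched below, compares the last sample in the observation window with the first and allows a small percentage of growth. Both the `is_flat` helper and the 5% tolerance in the usage line are assumptions for illustration, not values from the test plan.

```shell
# Hypothetical helper: decide whether a metric is "flat" over a window.
# Usage: is_flat <tolerance_pct> <sample> <sample> [<sample> ...]
# Exit 0: last sample grew by at most tolerance_pct relative to the first.
# Exit 1: still climbing. Exit 2: fewer than two samples supplied.
is_flat() {
  tolerance_pct="$1"
  shift
  printf '%s\n' "$@" | awk -v tol="$tolerance_pct" '
    NR == 1 { first = $1 + 0 }
    { last = $1 + 0 }
    END {
      if (NR < 2) exit 2
      # A zero baseline (e.g. error rate 0) is flat only if it stays at zero.
      if (first == 0) exit ((last > 0) ? 1 : 0)
      exit ((((last - first) / first * 100) > (tol + 0)) ? 1 : 0)
    }'
}

# Usage, with p95 samples (ms) read off the dashboard during the window:
#   is_flat 5 182 185 184 186 && echo "flat, OK to advance"
```

Apply the same check to each of the three signals (p95, error rate, BullMQ wait depth); advancing requires all of them to pass.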