Performance Test Run Checklist
Operator-facing pre-flight, in-flight, and post-flight checklist for the stepped-ramp test. Designed to be skimmed under pressure.
Pre-flight (before terraform apply)
- [ ] AWS credentials valid: `aws sts get-caller-identity` returns the perf account
- [ ] `terraform/perf/terraform.tfvars` exists (copied from `.tfvars.example`), `admin_cidrs` set or left empty for auto-detect
- [ ] `cd terraform/perf && terraform init && terraform plan` shows ~58 adds, 0 change, 0 destroy
- [ ] AWS service-quota headroom checked in `eu-north-1`: VPC limit, EIP limit, `c7g` instance family vCPU quota
- [ ] Local workstation has: `awscli`, `docker`, arm64 `k6` binary, Node 22+, `pnpm`
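
The workstation tooling item can be checked mechanically. A minimal Python sketch, assuming the tools are on `PATH` under the binary names below (`aws` is the executable that awscli installs). It does not verify the Node version or the k6 architecture, so those still need a manual look.

```python
import shutil

# Binary names are assumptions about how each tool appears on PATH.
REQUIRED_TOOLS = ["aws", "docker", "k6", "node", "pnpm"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means every binary was found; anything listed should be installed before `terraform apply`.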
Apply + provision (~10 min)
- [ ] `terraform apply -auto-approve` completes without error
- [ ] Capture outputs and save them to a scratch file for the rest of this checklist:
- [ ] ECR login from workstation:
- [ ] Build + push all 5 ebit-api images + ebit-fe (per `terraform/perf/README.md` push loop)
- [ ] SSH to SUT VM, pull and start services:
- [ ] Wait for API health: `curl -sf http://<sut_public_ip>:4000/api/health` returns 200
- [ ] Run Prisma migrations + seed:
- [ ] Seed load-test users on loadgen VM (or from workstation):
- [ ] Verify `users.json` on loadgen has 10,000 entries: `wc -l users.json`
- [ ] Open Grafana at `<grafana_url>`, navigate to dashboard `ebit-perf-test`, and confirm the datasource connects (panels showing "No data" is fine pre-test, but there should be no red error banners)
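
Waiting for API health can be scripted rather than re-running `curl` by hand. A minimal Python sketch: the 60 attempts and 5 s interval are assumptions, not values from this checklist, and `<sut_public_ip>` stays a placeholder to fill in from the Terraform outputs.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status of a GET, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, reset


def wait_for_health(url, attempts=60, interval_s=5.0, probe=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; False if it never does within `attempts`."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False
```

Call it as `wait_for_health("http://<sut_public_ip>:4000/api/health")` with the real address substituted; a `False` result means the services did not come up and the later steps should not start.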
Smoke test (50 VUs / 1 min)
- [ ] From loadgen VM:
- [ ] Exit code 0, all 4 scenarios ran (signin, bet, history, ws)
- [ ] Grafana: `k6_vus` panel shows 50, Service RED row shows non-zero request rate
- [ ] Click a latency spike in Grafana and verify the Jaeger trace link opens a real trace
- [ ] Sign-in p95 < 200 ms (bcrypt overhead expected)
- [ ] Dice bet p95 >= 100 ms is expected (pre-existing SLO exception — not a blocker)
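
The k6 pass/fail items above reduce to three checks. A minimal Python sketch, assuming the operator reads the exit code, the scenario names, and the sign-in p95 off the k6 output by hand; the function and its arguments are illustrative and not part of the test harness. The dice-bet p95 is deliberately not gated, matching the SLO exception.

```python
EXPECTED_SCENARIOS = {"signin", "bet", "history", "ws"}
SIGNIN_P95_LIMIT_MS = 200.0


def smoke_failures(exit_code, scenarios_ran, signin_p95_ms):
    """Return human-readable reasons the smoke test should not be accepted."""
    failures = []
    if exit_code != 0:
        failures.append(f"k6 exit code {exit_code}")
    missing = EXPECTED_SCENARIOS - set(scenarios_ran)
    if missing:
        failures.append("scenarios missing: " + ", ".join(sorted(missing)))
    if signin_p95_ms >= SIGNIN_P95_LIMIT_MS:
        failures.append(
            f"sign-in p95 {signin_p95_ms:.0f} ms >= {SIGNIN_P95_LIMIT_MS:.0f} ms"
        )
    return failures
```

An empty list means the k6 side of the smoke test passed; the Grafana and Jaeger items still need eyes on the dashboard.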
Stepped ramp (1k to 10k / 42 min)
- [ ] Open in separate browser tabs: Grafana perf-test dashboard, Jaeger search, SUT `docker stats`
- [ ] Start screen recording if possible (evidence for report)
- [ ] Launch ramp:
- [ ] At each stage-transition annotation in Grafana, note:
  - p95 per endpoint (scorecard table in #67)
  - Error rate (k6 Error Rate panel)
  - WS handshake success rate
  - CPU/memory on SUT (cadvisor panel or `docker stats`)
- [ ] Manual abort if any of: sign-in p95 > 500 ms, error rate > 5%, CPU pegged at 100% for > 60 s (k6 auto-abort catches SLO breaches, but manual cutoff limits blast radius)
- [ ] Per stage, capture:
  - Grafana dashboard screenshot (full width)
  - k6 summary export
  - 2-3 Jaeger exemplar trace IDs from the slowest endpoint at that stage
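
The manual-abort rule is easy to misread under pressure, so it helps to see it as a predicate. A minimal Python sketch using the three thresholds from the checklist; the `StageSample` type and its field names are hypothetical, and the values are assumed to be read off the Grafana panels.

```python
from dataclasses import dataclass


@dataclass
class StageSample:
    signin_p95_ms: float       # sign-in p95 latency in milliseconds
    error_rate: float          # fraction of failed requests, 0.05 == 5%
    cpu_pegged_seconds: float  # continuous time the SUT CPU has sat at 100%


def abort_reasons(sample):
    """Return the manual-abort conditions that `sample` trips (empty list: keep going)."""
    reasons = []
    if sample.signin_p95_ms > 500:
        reasons.append("sign-in p95 > 500 ms")
    if sample.error_rate > 0.05:
        reasons.append("error rate > 5%")
    if sample.cpu_pegged_seconds > 60:
        reasons.append("CPU pegged at 100% for > 60 s")
    return reasons
```

Any non-empty result is grounds to stop the ramp; the conditions are independent, so one is enough.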
Post-test
- [ ] Copy k6 results from loadgen to workstation: `scp loadgen:~/results/* tests-perf/results/`
- [ ] Export Grafana dashboard snapshots per stage (or use screen recording)
- [ ] Collect Jaeger trace IDs for slowest and failed requests
- [ ] Fill `docs/performance-test-report.md` (#67): replace every `{{TBD}}` with captured numbers, trace IDs, and screenshot paths
- [ ] Tear down infrastructure:
- [ ] Verify ECR repos deleted (`force_delete=true` should handle this)
- [ ] Check AWS Cost Explorer after 24 hours: expected cost ~$1.50 for ~1.5 hr of `c7g` instances
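
Leftover `{{TBD}}` markers in the report are easy to miss by eye. A minimal Python sketch that lists the lines still carrying one; the helper name is illustrative, and it assumes the placeholder is spelled exactly `{{TBD}}`.

```python
from pathlib import Path

PLACEHOLDER = "{{TBD}}"


def remaining_placeholders(text):
    """Return 1-based line numbers in `text` that still contain the placeholder."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if PLACEHOLDER in line
    ]


if __name__ == "__main__":
    report = Path("docs/performance-test-report.md")
    if report.exists():
        print("unfilled lines:", remaining_placeholders(report.read_text(encoding="utf-8")))
```

The report is complete for this step once the list comes back empty.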
Abort-and-triage matrix
If a stage breaches its SLO or a manual-abort condition, use this table to decide whether to iterate or halt.
| Breach type | Likely cause | First query to run | Decision |
|---|---|---|---|
| Prisma db_query p95 climbing | Postgres connection pool exhaustion (default ~10/service) | `histogram_quantile(0.95, sum by (le) (rate(duration_milliseconds_bucket{span_name="prisma:engine:db_query"}[1m])))` | If pool-tunable: increase pool, re-smoke, resume. If architectural: halt, document ceiling. |
| BullMQ wait depth growing | Worker throughput < arrival rate | `bullmq_queue_jobs{state="wait"}` | If single queue: throttle that scenario, resume. If widespread: halt. |
| WS handshake timeouts | rt event-loop saturation or socket Map exhaustion | Check rt container CPU + RSS via `docker stats`. No built-in metric yet. | Halt if CPU saturated: rt can't scale horizontally without Redis adapter. |
| HTTP 429 spike | Throttle guard triggered (test artifact, not a real bottleneck) | `sum(rate(calls_total{status_code="STATUS_CODE_ERROR",span_name=~".*"}[1m])) by (span_name)`, then check whether the errors are 429s | Reconfigure throttle limits or add rate-limit bypass header, then resume. |
| Container OOM killed | RSS exceeded Docker memory limit | `dmesg \| grep -i oom` on SUT | Halt. Increase container memory limit or reduce concurrency. |
| k6 itself saturated | Loadgen CPU at 100%, k6 can't generate target VU rate | `top` on loadgen VM (k6 process CPU) | Results unreliable. Halt, note loadgen as bottleneck, consider distributing k6. |
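
The PromQL expressions in the matrix can also be run outside Grafana. A sketch, assuming the metrics sit in a Prometheus-compatible backend that exposes the standard `/api/v1/query` instant-query endpoint; the base URL below is hypothetical and should be replaced with whatever the Grafana datasource points at.

```python
import urllib.parse

# Hypothetical address; not taken from this checklist.
PROMETHEUS_BASE_URL = "http://localhost:9090"

BULLMQ_WAIT_DEPTH = 'bullmq_queue_jobs{state="wait"}'


def instant_query_url(expr, base_url=PROMETHEUS_BASE_URL):
    """Build the instant-query URL for a PromQL expression, URL-encoding the query."""
    query_string = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    print(instant_query_url(BULLMQ_WAIT_DEPTH))
```

The printed URL can be passed to `curl`; encoding matters because the braces and quotes in PromQL selectors are not URL-safe.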
When NOT to proceed to the next stage
Do not advance to stage N+1 if stage N has not stabilized under its SLO threshold within 2 minutes of the stage starting. "Stabilize" means: p95 is flat (not still climbing), error rate is flat, and BullMQ wait depth is flat. If any metric is still trending upward at the 2-minute mark, the system has not absorbed the current load, and adding more will only produce compounding failures that obscure the original bottleneck. Halt the ramp, capture evidence at the current stage, and document the ceiling. Resuming from a higher stage later (after a fix) is always an option; pushing through a breach is not.
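
"Flat" can be made concrete with a simple trend test over the samples from the first 2 minutes of a stage. A minimal Python sketch under stated assumptions: the half-versus-half comparison and the 5% tolerance are illustrative choices, not part of the test plan, and the check covers only flatness, so the SLO threshold itself still has to be compared separately.

```python
from statistics import mean


def is_flat(samples, tolerance=0.05):
    """True if the later half of `samples` is not meaningfully higher than the earlier half."""
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to judge a trend")
    half = len(samples) // 2
    early, late = mean(samples[:half]), mean(samples[half:])
    # Allow a small relative rise; the tiny absolute floor keeps an all-zero series flat.
    return late - early <= max(tolerance * abs(early), 1e-9)


def stage_stabilized(p95_ms, error_rate, queue_wait_depth, tolerance=0.05):
    """All three series must be flat before the ramp may advance."""
    return all(
        is_flat(series, tolerance)
        for series in (p95_ms, error_rate, queue_wait_depth)
    )
```

A falling series counts as flat here, since only an upward trend signals that the system has not absorbed the load.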