Phased Rollout

Eight phases from first kickoff call to General Availability. Each phase is a self-contained slice with a goal, duration estimate, hard inputs, numbered activities, deliverables (artifacts the customer keeps), acceptance criteria (objective and testable — see Acceptance Criteria), and the common pitfalls we have seen.

Phases are sequential by default. Phases 4 and 5 can overlap if the customer engineering team is sized to run two work-streams in parallel.

#	Phase	Duration (low / typical / high)
1	Discovery	1 / 1.5 / 2 wk
2	Local stack	1 / 3 / 5 days
3	Stage environment	1 / 1.5 / 2 wk
4	Integration (payments / KYC / branding / flags)	2 / 4 / 6 wk
5	Performance validation	3 days / 1 wk / 1.5 wk
6	UAT	2 / 3 / 4 wk
7	Pilot launch	1 / 2 / 3 wk
8	GA	3 days / 1 wk / 2 wk

End-to-end typical path: ~14 calendar weeks. High path (with vendor approval lag, design iteration, and a re-baseline on perf): ~22 weeks.
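The headline figures can be sanity-checked by summing the table. The sketch below rests on two assumptions that are ours, not the schedule's: phases run strictly one after another, and day-based rows convert at 5 working days per week.

```typescript
// Duration estimates from the phase table, expressed in weeks.
// Assumption: day-based rows convert at 5 working days per week.
const DAYS_PER_WEEK = 5;
const days = (n: number): number => n / DAYS_PER_WEEK;

interface PhaseEstimate {
  name: string;
  low: number;
  typical: number;
  high: number;
}

const phaseEstimates: PhaseEstimate[] = [
  { name: "Discovery", low: 1, typical: 1.5, high: 2 },
  { name: "Local stack", low: days(1), typical: days(3), high: days(5) },
  { name: "Stage environment", low: 1, typical: 1.5, high: 2 },
  { name: "Integration", low: 2, typical: 4, high: 6 },
  { name: "Performance validation", low: days(3), typical: 1, high: 1.5 },
  { name: "UAT", low: 2, typical: 3, high: 4 },
  { name: "Pilot launch", low: 1, typical: 2, high: 3 },
  { name: "GA", low: days(3), typical: 1, high: 2 },
];

// Sums one column, assuming the phases run strictly one after another.
function totalWeeks(column: "low" | "typical" | "high"): number {
  return phaseEstimates.reduce((sum, phase) => sum + phase[column], 0);
}

console.log(totalWeeks("typical").toFixed(1)); // "14.6"
console.log(totalWeeks("high").toFixed(1)); // "21.5"
```

A straight sum lands in the same range as the rounded ~14 and ~22 week figures; overlapping Phases 4 and 5 shortens the typical path.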

Phase 1 — Discovery (1 / 1.5 / 2 wk)

Goal

Confirm the platform fits the customer's commercial, regulatory, and architectural envelope. Produce a written gap analysis the customer's leadership can sign off on. Kick off the long-lead-time vendor approvals (Sumsub KYC and any payment provider) so they finish before they block.

Inputs

People: customer PM, customer architect, customer compliance / legal sponsor, Evospin delivery lead. One workshop facilitator on each side.
Materials: customer's target-market list (jurisdictions), brand guidelines (logo, palette), planned currency + locale set, draft game catalog selection, planned payment methods.
Decisions deferred: none — Discovery is precisely the place to surface and book them.

Activities

Walk through the architecture doc, external services doc, and the 15 flow docs. Flag any flow that is out-of-scope for the customer's product (for example, the customer may not enable speed-roulette).
Run the dependencies checklist end-to-end with the customer. For each row, confirm "have / will obtain / N/A".
Make the scope decisions that block downstream phases:
Game catalog (which house games, sportsbook on/off, blackjack on/off, speed-roulette on/off — see AF-4 in architecture.md re: orphan ebit-bj app).
Game-provider integrations (PM8 / Softswiss / both / neither, i.e. the out-of-the-box catalog only).
Payment provider stack (CCPAYMENT and/or NowPayments and/or {{TBD: customer-preferred provider}}).
KYC vendor (Sumsub vs alternative — see Risks #3).
Locale set (en, de ship today; additional locales add a Phase-4 task).
Currency set (DBC native + which fiat / crypto pairs).
Open the long-lead vendor accounts in parallel:
Sumsub sandbox application (typical 2–4 wk approval).
Payment provider sandbox application (varies by provider; CCPAYMENT typical 1–2 wk).
GeeTest / reCAPTCHA Enterprise key (same-day if the customer already has Google Cloud).
SendGrid / SMTP provider (same-day).
{{TBD: gaming license — jurisdiction dependent; this can run for months and is the single biggest schedule risk}}.
Plan the customer's AWS landing zone: account, VPC strategy, KMS key for Doppler, IAM principals, certificate authority. Reference terraform/perf/README.md for the Phase-3 baseline.

Deliverables

Gap analysis document (customer-owned). For each subsystem in architecture.md, one of: "in scope, no change", "in scope, needs config", "in scope, needs code change", "out of scope".
Scope decision log (customer-owned). One row per decision listed in Activities §3, with owner, decision, date.
Vendor request tracker. Each long-lead account (KYC, payments, captcha, email, license) with state requested / pending-review / approved / rejected and an expected-by date.
Provisional schedule. Working assumption for Phases 2–8 with the customer's named owners.

Acceptance criteria

Scope decision log signed by customer PM and customer compliance sponsor.
All long-lead vendor requests submitted (state requested or later); none remain in the not yet asked state.
See Acceptance Criteria → Phase 1.
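The vendor request tracker and its exit check are simple enough to model directly. The sketch below is one possible shape; the type names, field names, and the `not-yet-asked` spelling are illustrative, not taken from an existing tool.

```typescript
// States from the vendor request tracker, plus the "not yet asked"
// state that the acceptance criterion rules out.
type VendorState =
  | "not-yet-asked"
  | "requested"
  | "pending-review"
  | "approved"
  | "rejected";

interface VendorRequest {
  vendor: string;
  kind: "kyc" | "payments" | "captcha" | "email" | "license";
  state: VendorState;
  expectedBy?: string; // ISO date, filled in once the request is submitted
}

// Returns the vendors that still block the Phase 1 exit gate.
function unsubmitted(tracker: VendorRequest[]): string[] {
  return tracker
    .filter((request) => request.state === "not-yet-asked")
    .map((request) => request.vendor);
}

// Example rows only; states here are made up for illustration.
const exampleTracker: VendorRequest[] = [
  { vendor: "Sumsub", kind: "kyc", state: "requested" },
  { vendor: "CCPAYMENT", kind: "payments", state: "pending-review" },
  { vendor: "Gaming license", kind: "license", state: "not-yet-asked" },
];

console.log(unsubmitted(exampleTracker)); // ["Gaming license"]
```

Phase 1 can exit on this criterion only when the returned list is empty.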

Common pitfalls

Skipping the gap analysis. Customer assumes the platform "just runs" against their existing payment processor; Phase 4 then surfaces a 6-week integration. Mitigation: make the gap analysis a hard exit gate.
Letting KYC slip to Phase 4. Sumsub approval typically arrives 2–4 weeks after submission; if requested in Phase 4 it blocks UAT. Mitigation: open the vendor requests on Day 1 of Phase 1.
Soft scope on game catalog. "We'll decide later" pushes a multi-week integration into Phase 6. Mitigation: Activity §3 is a hard checklist with named decision owners.

Phase 2 — Local stack (1 / 3 / 5 days)

Goal

Customer engineering team has the entire Evospin stack running on each developer laptop. A test bet is placed end-to-end and a trace appears in Jaeger. This phase exists so the customer's engineers learn the system before they have to operate it.

Inputs

People: every customer engineer who will touch the codebase, plus one Evospin handover engineer for office hours.
Materials: customer-issued laptops with at least 16 GB RAM and 40 GB free disk; Docker 24+, Node 22, pnpm 9.11, npm 10+, git 2.40+.
Optional: customer Doppler workspace (otherwise the .env.example plain-.env path works).
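A quick pre-flight check on tool versions saves a round of support questions. The sketch below treats every version in the Materials line as a minimum, which is an assumption (Node 22 and pnpm 9.11 may be intended as exact pins); the function names are hypothetical.

```typescript
// Minimum versions taken from the Materials line above.
// Assumption: all of them are lower bounds rather than exact pins.
const MINIMUMS: Record<string, string> = {
  docker: "24.0.0",
  node: "22.0.0",
  pnpm: "9.11.0",
  npm: "10.0.0",
  git: "2.40.0",
};

// Parses "v22.3.0" or "2.40" into numeric parts; non-numeric parts become 0.
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/, "")
    .split(".")
    .map((part) => parseInt(part, 10) || 0);
}

// Compares part by part, so 9.9 is correctly older than 9.11.
function meetsMinimum(actual: string, minimum: string): boolean {
  const a = parseVersion(actual);
  const m = parseVersion(minimum);
  for (let i = 0; i < Math.max(a.length, m.length); i++) {
    const diff = (a[i] ?? 0) - (m[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return true;
}

// Returns the tools that are missing or older than required.
function belowMinimum(found: Record<string, string | undefined>): string[] {
  return Object.keys(MINIMUMS).filter((tool) => {
    const actual = found[tool];
    const minimum = MINIMUMS[tool];
    return actual === undefined || minimum === undefined || !meetsMinimum(actual, minimum);
  });
}

console.log(
  belowMinimum({ docker: "24.0.7", node: "v22.3.0", pnpm: "9.9.0", npm: "10.8.1", git: "2.40" }),
); // ["pnpm"]
```

Feeding it the output of each tool's version command is left to the caller; the comparison is numeric per segment, which is why pnpm 9.9 fails against 9.11.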

Activities

Each engineer follows docs/onboarding/day-one.md end-to-end. Stop at the §10 exit checklist.
Pair-walk through the bet pipeline using docs/flows/dropbet-bet-place.md, then docs/flows/dropbet-sign-in.md and docs/flows/rt-websocket.md. The architecture doc's "read order suggestions for new hires" section names which three flows fit each role.
Run the cross-service Playwright canary: cd tests-e2e && pnpm test. Every spec referenced by the architecture doc §4 should pass against the local stack.
Walk through the runbooks library — bullmq-job-stuck.md, trace-missing.md, loki-no-logs.md, recaptcha-fails-locally.md — so the team has hit each at least once before stage.
Skim docs/recipes/ so the team knows where the patterns for adding a REST endpoint, BullMQ queue, OTel span, RT socket event, Prisma model, Grafana dashboard, and Playwright spec live.

Deliverables

Each engineer has placed a bet locally and seen a trace.
Customer-side notes / wiki entry capturing any environment-specific deviations (proxy, certificate trust, IDE config) for the next person who joins the team.

Acceptance criteria

docker compose ps shows all services healthy with zero restarts on each engineer's machine.
Cross-service Playwright suite green on at least one customer-side machine.
See Acceptance Criteria → Phase 2.

Common pitfalls

First build OOMs on 8 GB laptops. Build peak is ~7 GB. Mitigation: require 16 GB; document workaround (build images one-at-a-time on memory-constrained machines).
Engineers ignore the pnpm vs npm discipline. Mixing the two package managers corrupts lockfiles. Mitigation: pin the rule in CONTRIBUTING.md; lint-staged catches violations at commit.
Customer expects npx prisma to work. It does not: Prisma commands are always wrapped with env-cmd (see day-one.md §9). Mitigation: add a runbook entry; the npm scripts are the only supported path.

Phase 3 — Stage environment (1 / 1.5 / 2 wk)

Goal

Provision a production-shaped stage environment in the customer's AWS account using the published Terraform. End-to-end traffic from the customer's own DNS reaches dropbet, traces appear in Jaeger, dashboards populate in Grafana.

Inputs

People: customer SRE / platform engineer, customer DNS owner, Evospin handover engineer.
Materials: AWS account with the IAM permissions in terraform/perf/README.md; existing VPC and subnet IDs; EC2 key pair; admin CIDR (workstation IPs); domain names for dropbet, admin, Grafana, Jaeger; ACM or upstream-issued TLS certificates.
Decisions: Doppler workspace structure (one config per environment: dev, dev_stage, dev_perf, prd); KMS key for Doppler; ECR repo policy.

Activities

Terraform apply per terraform/perf/README.md. Three arm64 EC2 hosts: SUT (c7g.4xlarge), monitoring (c7g.2xlarge), load-gen (c7g.4xlarge); one ECR repo per service.
Build + push all 7 service images (api, rt, bj, bo, speed-roulette, ebit-fe, ebit-admin-fe) to ECR per the README §1.
Doppler perf-config setup. Mirror the audit in docs/audits/doppler-perf-audit.md for stage: NODE_ENV=production, DEFAULT_LOG_LEVEL=info, all 17 critical secrets populated, FASTTRACK_JWT_* stub values until Fast Track is enabled.
DNS + TLS. Point dropbet.{customer}, admin.{customer}, grafana.{customer}, jaeger.{customer} at the SUT and monitoring public IPs (or behind an ALB; the Terraform baseline is direct-to-EC2 — the customer can layer ALB on top).
CI/CD wiring. Customer CI builds container images on push to main, tags them <sha> and latest, and pushes them to ECR. The SUT then pulls the new images and runs docker compose up -d. The customer can extend this with blue/green by running two compose files behind a load balancer.
Smoke tests from the customer workstation: open swagger_url, sign in on dropbet_url, place a dice bet, confirm the trace in jaeger_url, confirm the service-overview dashboard populates in grafana_url.
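The tagging convention in the CI/CD step (one <sha> tag and one latest tag per service) can be sketched as a small helper. The registry host and the assumption that each ECR repository is named after its service are ours; the Terraform README is the authority on actual repository names.

```typescript
// The seven services built and pushed in the build step above.
const SERVICES = [
  "api",
  "rt",
  "bj",
  "bo",
  "speed-roulette",
  "ebit-fe",
  "ebit-admin-fe",
] as const;

// Builds the image references CI would push for one commit.
// Assumption: repository name equals service name.
function imageRefs(registry: string, sha: string): string[] {
  const refs: string[] = [];
  for (const service of SERVICES) {
    refs.push(`${registry}/${service}:${sha}`);
    refs.push(`${registry}/${service}:latest`);
  }
  return refs;
}

// "registry.example" is a placeholder, not a real ECR endpoint.
const refs = imageRefs("registry.example", "abc1234");
console.log(refs.length); // 14
```

Deploying by the <sha> tag rather than latest makes a rollback a matter of pointing compose at the previous tag.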

Deliverables

Terraform state committed to the customer's S3 backend (the README ships a state section with the migration command).
ECR repositories with latest and a <sha> tag for each of the 7 images.
Doppler dev_stage config populated and audited; output of doppler secrets is signed off.
DNS records resolving + TLS green in the customer's browser.
A "stage walkthrough" recording or doc the customer's on-call team can replay.

Acceptance criteria

terraform plan shows zero diff after apply.
All 7 ECR repos have at least one tagged image.
End-to-end trace dropbet → ebit-api → Postgres + Redis visible in Jaeger.
Grafana ebit-perf-test dashboard populated.
See Acceptance Criteria → Phase 3.

Common pitfalls

Doppler config drift. Stage and perf configs diverge silently from each other; debug-only flags leak into prod. Mitigation: run the doppler-perf-audit playbook for each non-local config; keep the diff in a checked-in document.
Missing FastTrack stub keys. The env validator requires FASTTRACK_JWT_PRIVATE_KEY / FASTTRACK_JWT_PUBLIC_KEY even when the FastTrack producer is disabled (disabled = true); without them the app refuses to boot. Mitigation: the stage cutover script populates stubs explicitly.
network_mode: host on admin-fe. The admin-fe container compose uses host networking as a workaround for hard-coded API host (see AF-1 in architecture.md). On AWS this means admin-fe must be on its own host or behind a separate routing layer. Mitigation: documented in terraform/perf/README.md; plan to fix in a code change before GA (Risks #6).
Public-CIDR slop. Default Terraform locks every external port to data.http.my_ip. If customer engineers behind dynamic ISPs run apply, the next engineer's CIDR overrides the first. Mitigation: explicit admin_ssh_cidrs / admin_http_cidrs lists in terraform.tfvars.
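Config drift and missing stub keys are both cheap to catch mechanically. The sketch below audits a key/value export of one Doppler config against the values named in this phase. It is not the doppler-perf-audit playbook itself, and it covers only the keys this page names; the 17 critical secrets are not listed here, so they are left out.

```typescript
type ConfigExport = Record<string, string | undefined>;

// Exact values required for any non-local config, per this page.
const REQUIRED_VALUES: Record<string, string> = {
  NODE_ENV: "production",
  DEFAULT_LOG_LEVEL: "info",
};

// Keys that must be present and non-empty, stub value or not.
const REQUIRED_PRESENT = ["FASTTRACK_JWT_PRIVATE_KEY", "FASTTRACK_JWT_PUBLIC_KEY"];

// Returns one human-readable finding per violation; empty means clean.
function auditConfig(config: ConfigExport): string[] {
  const findings: string[] = [];
  for (const key of Object.keys(REQUIRED_VALUES)) {
    const expected = REQUIRED_VALUES[key];
    const actual = config[key];
    if (actual !== expected) {
      findings.push(`${key}: expected "${expected}", got "${actual ?? "<unset>"}"`);
    }
  }
  for (const key of REQUIRED_PRESENT) {
    if (!config[key]) findings.push(`${key}: missing or empty`);
  }
  return findings;
}

console.log(auditConfig({ NODE_ENV: "development", DEFAULT_LOG_LEVEL: "info" }));
```

Running the same audit over each non-local config and checking the findings into the repository gives the diffable record the drift mitigation asks for.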

Phase 4 — Integration (2 / 4 / 6 wk)

Goal

Wire the customer-specific external dependencies: payments, KYC, captcha, email, brand, locale, feature flags. This is the single most variable phase; duration depends on how many vendors the customer chose in Phase 1 and how mature their accounts are.

Inputs

People: customer engineer per integration (payments lead, KYC lead, FE engineer for branding, ops engineer for flags), Evospin handover engineer for office hours.
Materials: sandbox credentials for every approved vendor from Phase 1; brand asset pack (logo, favicon, color tokens); locale string list; feature-flag matrix (UnleashClient configuration in apps/api/src/feature-flag/).

Activities

Payments. Configure deposit/withdraw adapter under apps/api/src/payment/ for each enabled provider. CCPAYMENT and NowPayments are the two paths exercised today. Add provider keys to Doppler. Test deposit + withdraw flows in stage with sandbox credentials. Hook the webhook URL on the provider side at https://{api-host}/payment/webhook/{provider}.
KYC. Configure Sumsub (or alternative) under apps/api/src/kyc/. Set webhook + redirect URLs; map level transitions (LEVEL_0 → 1 → 2) to deposit/withdraw caps. Test full upload-and-review flow against Sumsub sandbox.
Captcha. Obtain a GeeTest (or reCAPTCHA Enterprise) production key and set it in Doppler. Verify the local bypass is gone in prod: recaptcha.service.ts:28 gates the 'pass' token on isLocal === true. With NODE_ENV=production the bypass is off (per doppler-perf-audit.md). See Risks #8.
Email. Configure SendGrid (or alternative SMTP). Templates live in apps/api/src/external-notification-sender/. Send-from address must be on a domain the customer has verified with the provider (DKIM + SPF + DMARC).
Brand + locale. ebit-fe accepts brand tokens via Tailwind theme + asset path overrides. Locale strings live in messages/en.json and messages/de.json; adding a locale = add a JSON file + update next-intl config (see CLAUDE.md → ebit-fe specifics). Admin-fe is single-language.
Feature flags. Configure Unleash (or GitLab Feature Flags) under apps/api/src/feature-flag/. Decide initial flag posture for each known toggle (race leaderboards RACE_ENABLED, signup flow variants, sportsbook on/off).
Game catalog. Run npm run db:seed against stage with the customer's catalog filter; or use admin endpoints to enable/disable house games per category.
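For the payments step, the webhook URL registered on the provider side follows the https://{api-host}/payment/webhook/{provider} template. A minimal helper might look like the sketch below; the provider slugs are an assumption, so use whatever path segment the payment module actually registers.

```typescript
// Assumed slugs for the two providers exercised today.
type ProviderSlug = "ccpayment" | "nowpayments";

// Fills in the webhook URL template for one provider.
// apiHost must be a bare host name: no scheme, no path.
function webhookUrl(apiHost: string, provider: ProviderSlug): string {
  if (apiHost.length === 0 || apiHost.indexOf("/") !== -1) {
    throw new Error(`apiHost must be a bare host name, got "${apiHost}"`);
  }
  return `https://${apiHost}/payment/webhook/${provider}`;
}

// "api.stage.example" is a placeholder host.
console.log(webhookUrl("api.stage.example", "ccpayment"));
// https://api.stage.example/payment/webhook/ccpayment
```

Generating the URL per environment from the same helper avoids registering a stage webhook against a production host by copy-paste.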

Deliverables

For each enabled vendor: sign-off that the integration is sandbox-tested and webhook-tested, and that its credentials are checked into Doppler.
Brand assets committed to ebit-fe under the customer's fork branch.
Feature-flag matrix signed off by customer ops.
Game catalog populated and visible on the dropbet stage URL.

Acceptance criteria

A test deposit completes and credits the test user's balance on stage.
A test KYC submission moves the user from LEVEL_0 to LEVEL_1 in the admin panel (or via API).
Brand tokens visible end-to-end on dropbet stage.
See Acceptance Criteria → Phase 4.
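The KYC step maps levels to deposit and withdraw caps. One way to express that mapping is a lookup table, sketched below. Every number is a placeholder, since real caps are a compliance decision made in Discovery, and the function name is hypothetical rather than part of apps/api/src/kyc/.

```typescript
type KycLevel = "LEVEL_0" | "LEVEL_1" | "LEVEL_2";

interface Caps {
  depositCap: number;
  withdrawCap: number;
}

// Placeholder caps only; units and values are illustrative.
const CAPS_BY_LEVEL: Record<KycLevel, Caps> = {
  LEVEL_0: { depositCap: 100, withdrawCap: 0 },
  LEVEL_1: { depositCap: 1000, withdrawCap: 500 },
  LEVEL_2: { depositCap: Infinity, withdrawCap: Infinity },
};

// True when the amount is positive and within the cap for the user's level.
function isWithinCap(level: KycLevel, kind: "deposit" | "withdraw", amount: number): boolean {
  const caps = CAPS_BY_LEVEL[level];
  const cap = kind === "deposit" ? caps.depositCap : caps.withdrawCap;
  return amount > 0 && amount <= cap;
}

console.log(isWithinCap("LEVEL_0", "withdraw", 10)); // false
console.log(isWithinCap("LEVEL_1", "withdraw", 10)); // true
```

With a table like this, the LEVEL_0 to LEVEL_1 transition in the acceptance test has an observable effect: the same withdrawal that was refused before review is accepted after it.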

Common pitfalls

Vendor sandbox flakiness. Sumsub sandbox occasionally returns stale state for resubmissions; CCPAYMENT sandbox webhook IPs change. Mitigation: test from sandbox to stage and from stage to sandbox; build a webhook replay tool.
Email deliverability. Test mails land in spam because customer skipped DKIM/SPF/DMARC. Mitigation: spam-test before UAT; warming requires lead time.
Captcha key swap regression. Customer rotates GeeTest keys mid-Phase-4; sign-up breaks silently in stage. Mitigation: smoke-test sign-up after every Doppler change (see docs/runbooks/recaptcha-fails-locally.md).

Phase 5 — Performance validation (3 days / 1 wk / 1.5 wk)

Goal

Run the stepped-ramp protocol from docs/performance-testing.md against stage and produce a signed-off performance-test-report.md clone. Identify and assign owners for every SLO breach.

Inputs

People: customer SRE, Evospin handover engineer.
Materials: stage environment from Phase 3; load-gen host already provisioned by Terraform; tests-perf/ checked out; 10k seeded users (the perf wiring includes a seed script — see terraform/perf/README.md).

Activities

Apply the Doppler perf-config audit playbook. Confirm FASTTRACK_JWT_* stubs present, NODE_ENV=production, DEFAULT_LOG_LEVEL=info, DEBUG_LOGS_PRETTY=false.
Apply the kernel tuning checklist on the load-gen host.
Run smoke: k6 run --out experimental-prometheus-rw tests-perf/k6/smoke.js.
Run the full stepped ramp: k6 run --out experimental-prometheus-rw tests-perf/profiles/stepped-ramp.js.
For each breach, follow the bottleneck-hunting runbook §6. Capture: Grafana snapshot (5 min around breach), Jaeger exemplar trace IDs, k6 summary at abort.
Fill out the performance-test-report.md template. Replace every {{TBD}} with measurement or attribution.

Deliverables

Filled-out performance test report (committed to the customer fork).
Grafana dashboard snapshots for each stage and each breach.
Decision log for each SLO miss: re-baseline expectations OR file optimization ticket with owner.

Acceptance criteria

The test ran through stage 4 (5,000 VUs) without breaching auto-abort thresholds, or every breach has a documented owner.
See Acceptance Criteria → Phase 5.

Common pitfalls

Dice bet SLO miss is structural. Baseline p95 is 108 ms at 1 VU (per performance-test-report.md), already over the 100 ms target before any load. The cost sits in the transactional block in PlaceBetService, which is correct by design for atomicity. Mitigation: either re-baseline the SLO to 150 ms or schedule the optimization ticket. See Risks #7.
Cross-service trace gap. Inter-Nest-app RPC over Redis pub/sub does not propagate traceparent (AF-2 in architecture.md). Some breach attributions need a Loki pivot on userId+timestamp. Mitigation: document the blind spots; bottleneck-hunting runbook calls this out.
Co-located load-gen contention. If perf is run against the SUT from the same host, k6 and the services compete for CPU. Mitigation: default Terraform splits load-gen onto its own host; do not deviate.
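To make the p95 discussion concrete, the sketch below computes a nearest-rank percentile and checks it against an SLO target. It is an illustration only: k6 reports its own percentiles, and the sample latencies are synthetic, chosen to reproduce the 108 ms baseline quoted above.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("percentile out of range");
  return value;
}

// True when the p95 latency is strictly under the target.
function meetsSlo(samplesMs: number[], targetMs: number): boolean {
  return percentile(samplesMs, 95) < targetMs;
}

// Synthetic data: 18 fast requests, one at 108 ms, one outlier.
const diceBetSamples: number[] = [];
for (let i = 0; i < 18; i++) diceBetSamples.push(90);
diceBetSamples.push(108, 120);

console.log(percentile(diceBetSamples, 95)); // 108
console.log(meetsSlo(diceBetSamples, 100)); // false
console.log(meetsSlo(diceBetSamples, 150)); // true
```

The same samples fail the 100 ms target and pass a 150 ms one, which is the re-baseline decision the first pitfall describes.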

Phase 6 — UAT (2 / 3 / 4 wk)

Goal

The customer drives end-to-end manual + scripted scenarios against stage, with real (sandbox) vendors, real domains, real branding. Defects are filed, triaged, and closed before pilot.

Inputs

People: customer QA team, customer product, customer support; Evospin handover engineer for triage office hours.
Materials: signed-off Phase-3 stage; signed-off Phase-4 vendor wiring; signed-off Phase-5 perf report; UAT script document (customer-owned).

Activities

Customer QA runs the UAT script. We recommend organizing it as: registration & auth, deposit, KYC, place bet (each enabled game), withdraw, leaderboard, challenges/promos, support tools (admin user mgmt, bet review).
File defects in the customer's tracker. Triage with Evospin weekly. Each defect carries a flow-doc reference where applicable (anchor to docs/flows/<flow>.md).
Confirm every security finding is either fixed, accepted-with-mitigation, or has a forecasted fix date that precedes GA.
Customer support team validates runbooks against the stage environment — every runbook should be runnable solo.

Deliverables

UAT defect register, signed off by customer product as "ready for pilot".
Security register annotated with customer accept/fix decisions.
Support runbook validation log.

Acceptance criteria

Zero open Critical or High security findings without an accepted mitigation or scheduled fix.
Zero P0 defects open.
See Acceptance Criteria → Phase 6.
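The two numeric exit criteria can be evaluated straight from the defect and security registers. The sketch below assumes a register shape of our own invention (field names and the P0 to P3 priority scale are hypothetical); adapt it to whatever the customer's tracker exports.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

interface SecurityFinding {
  id: string;
  severity: Severity;
  fixed: boolean;
  acceptedMitigation: boolean;
  scheduledFixDate?: string; // ISO date; should precede GA
}

interface Defect {
  id: string;
  priority: "P0" | "P1" | "P2" | "P3";
  open: boolean;
}

// Lists everything that still blocks the UAT exit gate.
function uatBlockers(findings: SecurityFinding[], defects: Defect[]): string[] {
  const blockers: string[] = [];
  for (const finding of findings) {
    const severe = finding.severity === "Critical" || finding.severity === "High";
    const covered =
      finding.fixed || finding.acceptedMitigation || finding.scheduledFixDate !== undefined;
    if (severe && !covered) blockers.push(`security ${finding.id}`);
  }
  for (const defect of defects) {
    if (defect.priority === "P0" && defect.open) blockers.push(`defect ${defect.id}`);
  }
  return blockers;
}

// Example entries are made up for illustration.
console.log(
  uatBlockers(
    [{ id: "SEC-1", severity: "High", fixed: false, acceptedMitigation: false }],
    [{ id: "BUG-7", priority: "P0", open: true }],
  ),
); // ["security SEC-1", "defect BUG-7"]
```

This check does not validate that a scheduled fix date actually precedes GA; that comparison needs the GA date and is left to the reviewer.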

Common pitfalls

UAT becomes design iteration. Customer product team wants UI changes that should have been Phase-1 brand decisions. Mitigation: enforce scope from Discovery decision log; new asks become post-GA backlog.
Known-issue surprise. Customer testers find AF-1 (admin-fe blank-page-after-sign-in) and panic. Mitigation: pre-brief on the weaknesses-register.md; admin work is via Swagger today (Risks #5).

Phase 7 — Pilot launch (1 / 2 / 3 wk)

Goal

Limited rollout to a defined cohort (geographic, percentage, invite-list). Operate the platform with on-call coverage, watch dashboards, validate the runbooks in production conditions.

Inputs

People: customer on-call rotation (24/7 if applicable), customer support, Evospin on-call shadow.
Materials: production environment provisioned (mirror of stage but with prd Doppler config + production vendor credentials + production gaming license active in the targeted jurisdiction).
Decisions: pilot cohort definition (geography or invite list); rollback criteria.

Activities

Cut over DNS to production. Watch service-overview dashboard for the first hour.
Open the cohort: feature-flag toggle, geo-IP allowlist, or invite-code gate.
Hold a daily on-call sync and watch the bullmq.json, redis.json, prisma-postgres.json, and browser-rum.json dashboards.
Run a real incident drill (e.g. inject a 500 on a non-critical route) so the customer team exercises Loki → Jaeger pivot, the trace-missing runbook, and the on-call escalation tree.
Track these metrics: error rate, p95 sign-in, p95 bet-place, queue depth, OOM kills, restart count.

Deliverables

Daily ops summary log (7 entries minimum).
Incident drill after-action report.
Updated runbook list (any gap found in pilot becomes a new runbook).

Acceptance criteria

Error rate < 0.1% sustained over 7 consecutive days.
p95 sign-in < 150 ms (or the renegotiated SLO if the dice-bet path was re-baselined in Phase 5).
Zero P0 incidents.
On-call runbook validated end-to-end at least once.
See Acceptance Criteria → Phase 7.
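The error-rate and P0 criteria can be checked against the daily ops summary log. The sketch below is one reading of "sustained over 7 consecutive days": the seven most recent entries must all be under 0.1%. It assumes the log holds one entry per calendar day in chronological order; the entry shape is hypothetical.

```typescript
interface DailyOps {
  date: string; // ISO date of the ops summary entry
  requests: number;
  errors: number;
  p0Incidents: number;
}

const ERROR_RATE_LIMIT = 0.001; // 0.1%
const REQUIRED_DAYS = 7;

// A day without traffic counts as error-free.
function errorRate(day: DailyOps): number {
  return day.requests === 0 ? 0 : day.errors / day.requests;
}

// True when the last seven entries are all under the error-rate limit
// and no entry in the whole log recorded a P0 incident.
function pilotExitReady(log: DailyOps[]): boolean {
  if (log.length < REQUIRED_DAYS) return false;
  const recent = log.slice(-REQUIRED_DAYS);
  const rateOk = recent.every((day) => errorRate(day) < ERROR_RATE_LIMIT);
  const noP0 = log.every((day) => day.p0Incidents === 0);
  return rateOk && noP0;
}
```

A bad day early in the pilot does not fail the error-rate check once seven clean days follow it, but a P0 incident on any day does fail the gate in this reading.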

Common pitfalls

rt scaling. ClientGateway.clientSockets is per-instance; without sticky sessions or @socket.io/redis-adapter, per-room emits are dropped once more than one replica runs (AF-3 in architecture.md). Mitigation: stay single-replica for pilot OR adopt the Redis adapter pre-pilot.
Online-count "spike". UsersOnlineUpdated includes fakeUserOnline padding (start 500, drift ±5, floor 180); customer ops sees "users online" even before any pilot users join. Mitigation: pre-brief; AF-5 in architecture.md.
Silent audit gap. safeLog swallows insert errors on admin_action_log. Mitigation: alert on Postgres error rate; document in docs/runbooks/ post-launch.
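The online-count padding is easier to pre-brief with a concrete model. The sketch below is one reading of the stated parameters (start 500, drift ±5, floor 180) as a bounded random walk. It is not the actual fakeUserOnline implementation, which this page does not show, and it assumes the reported figure is real users plus padding.

```typescript
const PADDING_START = 500;
const PADDING_FLOOR = 180;
const MAX_DRIFT = 5;

// One tick of the walk: move by an integer in [-5, +5], never below the floor.
// The random source is injectable so the behaviour can be tested deterministically.
function nextPadding(current: number, random: () => number = Math.random): number {
  const step = Math.floor(random() * (2 * MAX_DRIFT + 1)) - MAX_DRIFT;
  return Math.max(PADDING_FLOOR, current + step);
}

// Assumption: the figure shown to users is real users plus padding.
function reportedOnline(realUsers: number, padding: number): number {
  return realUsers + padding;
}

// With zero real users the count still starts at the padding value.
console.log(reportedOnline(0, PADDING_START)); // 500
```

Under this model, customer ops should expect a "users online" figure in the hundreds before the first pilot user signs in, and never below 180.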

Phase 8 — GA (3 days / 1 wk / 2 wk)

Goal

Open the floodgates. Hand off operational ownership to the customer's on-call team. Schedule the post-handover review.

Inputs

People: customer on-call lead, customer support lead, Evospin delivery lead. Evospin handover engineer transitions to advisory only.
Materials: clean pilot exit (Phase 7 acceptance signed); communication plan for cohort expansion; updated marketing pages; legal pages live (terms, privacy).

Activities

Remove the pilot cohort gate. Watch the dashboards for 24 h.
Announce GA externally per the customer's marketing plan.
Walk through docs/runbooks/ one more time with the on-call team. Confirm escalation matrix.
Schedule a 30-day post-GA review: SLO compliance, defect rate, customer support ticket volume, change-request backlog.
Customer-side incident drill solo (Evospin observes only).
Hand over Sentry, Doppler, ECR, and AWS account ownership artifacts in writing.

Deliverables

GA announcement (customer-owned).
Signed handover document covering: AWS, Doppler, Sentry, ECR, vendor accounts (KYC, payments, captcha, email), CI/CD repository write access.
Escalation matrix in the customer's wiki, exercised once.
30-day review meeting on the calendar.

Acceptance criteria

Customer on-call team has executed an incident drill solo without Evospin assistance.
Escalation matrix exercised end-to-end.
All vendor accounts re-keyed to customer-owned credentials (no ebit-personal tokens).
See Acceptance Criteria → Phase 8.

Common pitfalls

Soft handover. Evospin engineer answers customer pages directly post-GA; team never builds the muscle. Mitigation: "Evospin advises, customer responds" is a hard contract from GA day one.
Vendor credential leak. ebit-personal Doppler tokens still active. Mitigation: explicit credential audit in the launch checklist (see Launch Checklist → Secrets).
Skipped 30-day review. Issues accumulate silently. Mitigation: booked in calendar before GA day.