Phased Rollout¶
Eight phases from first kickoff call to General Availability. Each phase is a self-contained slice with a goal, duration estimate, hard inputs, numbered activities, deliverables (artifacts the customer keeps), acceptance criteria (objective and testable — see Acceptance Criteria), and the common pitfalls we have seen.
Phases are sequential by default. Phases 4 and 5 can overlap if the customer engineering team is sized to run two work-streams in parallel.
| # | Phase | Duration (low / typical / high) |
|---|---|---|
| 1 | Discovery | 1 / 1.5 / 2 wk |
| 2 | Local stack | 1 / 3 / 5 days |
| 3 | Stage environment | 1 / 1.5 / 2 wk |
| 4 | Integration (payments / KYC / branding / flags) | 2 / 4 / 6 wk |
| 5 | Performance validation | 3 days / 1 wk / 1.5 wk |
| 6 | UAT | 2 / 3 / 4 wk |
| 7 | Pilot launch | 1 / 2 / 3 wk |
| 8 | GA | 3 days / 1 wk / 2 wk |
End-to-end typical path: ~14 calendar weeks. High path (with vendor approval lag, design iteration, and a re-baseline on perf): ~22 weeks.
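As a rough cross-check of those totals, the table can be summed directly. This is a minimal sketch that assumes phases run strictly back to back and that entries given in days convert at 5 working days per week; both are assumptions, not rules stated in this plan.

```python
# Phase durations from the table above as (low, typical, high), in weeks.
# Assumption: entries given in days are converted at 5 working days per week.
DAYS_PER_WEEK = 5

PHASES = {
    "Discovery": (1, 1.5, 2),
    "Local stack": (1 / DAYS_PER_WEEK, 3 / DAYS_PER_WEEK, 5 / DAYS_PER_WEEK),
    "Stage environment": (1, 1.5, 2),
    "Integration": (2, 4, 6),
    "Performance validation": (3 / DAYS_PER_WEEK, 1, 1.5),
    "UAT": (2, 3, 4),
    "Pilot launch": (1, 2, 3),
    "GA": (3 / DAYS_PER_WEEK, 1, 2),
}


def sequential_totals(phases):
    """Sum each column, assuming phases run strictly back to back."""
    low = sum(p[0] for p in phases.values())
    typical = sum(p[1] for p in phases.values())
    high = sum(p[2] for p in phases.values())
    return round(low, 1), round(typical, 1), round(high, 1)


low, typical, high = sequential_totals(PHASES)
print(f"low={low} wk, typical={typical} wk, high={high} wk")
```

Under these assumptions the sequential typical and high sums come to 14.6 and 21.5 weeks, in line with the rounded figures quoted above. Overlapping Phases 4 and 5 would shorten the path.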
Phase 1 — Discovery (1 / 1.5 / 2 wk)¶
Goal¶
Confirm the platform fits the customer's commercial, regulatory, and architectural envelope. Produce a written gap analysis the customer's leadership can sign off on. Kick off the long-lead-time vendor approvals (Sumsub KYC and any payment provider) so they finish before they block.
Inputs¶
- People: customer PM, customer architect, customer compliance / legal sponsor, Evospin delivery lead. One workshop facilitator on each side.
- Materials: customer's target-market list (jurisdictions), brand guidelines (logo, palette), planned currency + locale set, draft game catalog selection, planned payment methods.
- Decisions deferred: none — Discovery is precisely the place to surface and book them.
Activities¶
- Walk through the architecture doc, external services doc, and the 15 flow docs. Flag any flow that is out-of-scope for the customer's product (for example, the customer may not enable speed-roulette).
- Run the dependencies checklist end-to-end with the customer. For each row, confirm "have / will obtain / N/A".
- Make the scope decisions that block downstream phases:
- Game catalog (which house games, sportsbook on/off, blackjack on/off, speed-roulette on/off; see AF-4 in `architecture.md` re: orphan `ebit-bj` app).
- Game-provider integrations (PM8 / Softswiss / both / neither; neither means out-of-the-box catalog only).
- Payment provider stack (CCPAYMENT and/or NowPayments and/or {{TBD: customer-preferred provider}}).
- KYC vendor (Sumsub vs alternative — see Risks #3).
- Locale set (en, de ship today; additional locales add a Phase-4 task).
- Currency set (DBC native + which fiat / crypto pairs).
- Open the long-lead vendor accounts in parallel:
- Sumsub sandbox application (typical 2–4 wk approval).
- Payment provider sandbox application (varies by provider; CCPAYMENT typical 1–2 wk).
- GeeTest / reCAPTCHA enterprise key (same-day if customer already has Google Cloud).
- SendGrid / SMTP provider (same-day).
- {{TBD: gaming license — jurisdiction dependent; this can run for months and is the single biggest schedule risk}}.
- Plan the customer's AWS landing zone: account, VPC strategy, KMS key for Doppler, IAM principals, certificate authority. Reference `terraform/perf/README.md` for the Phase-3 baseline.
Deliverables¶
- Gap analysis document (customer-owned). For each subsystem in `architecture.md`, one of: "in scope, no change", "in scope, needs config", "in scope, needs code change", "out of scope".
- Scope decision log (customer-owned). One row per decision listed in Activities §3, with owner, decision, date.
- Vendor request tracker. Each long-lead account (KYC, payments, captcha, email, license) with state `requested / pending-review / approved / rejected` and an expected-by date.
- Provisional schedule. Working assumption for Phases 2–8 with the customer's named owners.
Acceptance criteria¶
- Scope decision log signed by customer PM and customer compliance sponsor.
- All long-lead vendor requests submitted (state `requested` or later); none in `not yet asked`.
- See Acceptance Criteria → Phase 1.
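The vendor-request criterion above can be checked mechanically. The sketch below is one possible shape for the tracker rows; the `VendorRequest` class, its field names, and the `phase1_blockers` helper are hypothetical, and only the state names come from this plan.

```python
from dataclasses import dataclass
from datetime import date

# States from the vendor request tracker; anything else (including
# "not yet asked") counts as not submitted.
SUBMITTED_STATES = {"requested", "pending-review", "approved", "rejected"}


@dataclass
class VendorRequest:
    vendor: str        # illustrative label, e.g. "KYC"
    state: str
    expected_by: date


def phase1_blockers(requests):
    """Return the vendors whose request has not been submitted yet."""
    return [r.vendor for r in requests if r.state not in SUBMITTED_STATES]


tracker = [
    VendorRequest("KYC", "pending-review", date(2025, 1, 31)),
    VendorRequest("payments", "not yet asked", date(2025, 2, 14)),
]
print(phase1_blockers(tracker))
```

An empty list means the criterion is met. The dates in the sample rows are placeholders.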
Common pitfalls¶
- Skipping the gap analysis. Customer assumes the platform "just runs" against their existing payment processor; Phase 4 then surfaces a 6-week integration. Mitigation: make the gap analysis a hard exit gate.
- Letting KYC slip to Phase 4. Sumsub approval typically arrives 2–4 weeks after submission; if requested in Phase 4 it blocks UAT. Mitigation: open the vendor requests on Day 1 of Phase 1.
- Soft scope on game catalog. "We'll decide later" pushes a multi-week integration into Phase 6. Mitigation: Activity §3 is a hard checklist with named decision owners.
Phase 2 — Local stack (1 / 3 / 5 days)¶
Goal¶
Customer engineering team has the entire Evospin stack running on each developer laptop. A test bet is placed end-to-end and a trace appears in Jaeger. This phase exists so the customer's engineers learn the system before they have to operate it.
Inputs¶
- People: every customer engineer who will touch the codebase, plus one Evospin handover engineer for office hours.
- Materials: customer-issued laptops with at least 16 GB RAM and 40 GB free disk; Docker 24+, Node 22, pnpm 9.11, npm 10+, git 2.40+.
- Optional: customer Doppler workspace (otherwise the `.env.example` plain-`.env` path works).
Activities¶
- Each engineer follows `docs/onboarding/day-one.md` end-to-end. Stop at the §10 exit checklist.
- Pair-walk through the bet pipeline using `docs/flows/dropbet-bet-place.md`, then `docs/flows/dropbet-sign-in.md` and `docs/flows/rt-websocket.md`. The architecture doc's "read order suggestions for new hires" section names which three flows fit each role.
- Run the cross-service Playwright canary: `cd tests-e2e && pnpm test`. Every spec referenced by the architecture doc §4 should pass against the local stack.
- Walk through the runbooks library (`bullmq-job-stuck.md`, `trace-missing.md`, `loki-no-logs.md`, `recaptcha-fails-locally.md`) so the team has hit each at least once before stage.
- Skim `docs/recipes/` so the team knows where the patterns for adding a REST endpoint, BullMQ queue, OTel span, RT socket event, Prisma model, Grafana dashboard, and Playwright spec live.
Deliverables¶
- Each engineer has placed a bet locally and seen a trace.
- Customer-side notes / wiki entry capturing any environment-specific deviations (proxy, certificate trust, IDE config) for the next person who joins the team.
Acceptance criteria¶
- `docker compose ps` shows all services healthy with zero restarts on each engineer's machine.
- Cross-service Playwright suite green on at least one customer-side machine.
- See Acceptance Criteria → Phase 2.
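The health check in the first criterion can be scripted rather than eyeballed. A minimal sketch, assuming the output of `docker compose ps --format json` carries `Service`, `State`, and `Health` fields. Those field names and both output shapes handled here are assumptions about the local Compose version, so verify them against a real run. Restart counts are not covered and would need a separate `docker inspect` query.

```python
import json


def parse_compose_ps(output: str):
    """Parse `docker compose ps --format json` output.

    Handles both shapes seen across Compose versions: a single JSON array,
    or one JSON object per line.
    """
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def unhealthy_services(records):
    """Return services that are not running or report a failing healthcheck.

    An empty Health value is treated as "no healthcheck defined" and passes.
    """
    bad = []
    for rec in records:
        running = rec.get("State") == "running"
        health_ok = rec.get("Health", "") in ("", "healthy")
        if not (running and health_ok):
            bad.append(rec.get("Service", rec.get("Name", "?")))
    return bad
```

Feeding it the captured command output and getting an empty list back corresponds to "all services healthy" for the purposes of this gate.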
Common pitfalls¶
- First build OOMs on 8 GB laptops. Build peak is ~7 GB. Mitigation: require 16 GB; document workaround (build images one-at-a-time on memory-constrained machines).
- Engineers ignore the `pnpm` vs `npm` discipline. Mixing corrupts lockfiles. Mitigation: pin in CONTRIBUTING.md; lint-staged catches at commit.
- Customer expects `npx prisma` to work. Always wrapped with `env-cmd`; see `day-one.md` §9. Mitigation: runbook entry; the npm scripts are the only supported path.
Phase 3 — Stage environment (1 / 1.5 / 2 wk)¶
Goal¶
Provision a production-shaped stage environment in the customer's AWS account using the published Terraform. End-to-end traffic from the customer's own DNS reaches dropbet, traces appear in Jaeger, dashboards populate in Grafana.
Inputs¶
- People: customer SRE / platform engineer, customer DNS owner, Evospin handover engineer.
- Materials: AWS account with the IAM permissions in `terraform/perf/README.md`; existing VPC and subnet IDs; EC2 key pair; admin CIDR (workstation IPs); domain names for dropbet, admin, Grafana, Jaeger; ACM or upstream-issued TLS certificates.
- Decisions: Doppler workspace structure (one config per environment: `dev`, `dev_stage`, `dev_perf`, `prd`); KMS key for Doppler; ECR repo policy.
Activities¶
- Terraform apply per `terraform/perf/README.md`. Three arm64 EC2 hosts: SUT (c7g.4xlarge), monitoring (c7g.2xlarge), load-gen (c7g.4xlarge); one ECR repo per service.
- Build + push all 7 service images (`api`, `rt`, `bj`, `bo`, `speed-roulette`, `ebit-fe`, `ebit-admin-fe`) to ECR per the README §1.
- Doppler perf-config setup. Mirror the audit in `docs/audits/doppler-perf-audit.md` for stage: `NODE_ENV=production`, `DEFAULT_LOG_LEVEL=info`, all 17 critical secrets populated, `FASTTRACK_JWT_*` stub values until Fast Track is enabled.
- DNS + TLS. Point `dropbet.{customer}`, `admin.{customer}`, `grafana.{customer}`, `jaeger.{customer}` at the SUT and monitoring public IPs (or behind an ALB; the Terraform baseline is direct-to-EC2, and the customer can layer an ALB on top).
- CI/CD wiring. Customer CI builds container images on push to `main`, tags them `<sha>` and `latest`, pushes to ECR. SUT pulls and runs `docker compose up -d`. The customer can extend with blue/green by running two compose files behind a load balancer.
- Smoke tests from the customer workstation: open `swagger_url`, sign in on `dropbet_url`, place a dice bet, confirm the trace in `jaeger_url`, confirm the `service-overview` dashboard populates in `grafana_url`.
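To make the tagging scheme in the CI/CD step concrete, the sketch below lists the image references a single push to `main` would produce. It assumes each ECR repository is named after its service and uses the standard ECR registry hostname format; the account ID, region, and commit SHA in the usage line are placeholders.

```python
SERVICES = ["api", "rt", "bj", "bo", "speed-roulette", "ebit-fe", "ebit-admin-fe"]


def image_refs(account_id: str, region: str, sha: str, services=SERVICES):
    """Build the two ECR references (<sha> and latest) for each service image."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    refs = []
    for service in services:
        # Assumption: the ECR repository is named after the service.
        refs.append(f"{registry}/{service}:{sha}")
        refs.append(f"{registry}/{service}:latest")
    return refs


for ref in image_refs("123456789012", "eu-central-1", "abc1234"):
    print(ref)
```

Pinning the SUT's compose file to the `<sha>` tag rather than `latest` makes rollbacks explicit; that is a design suggestion, not something the baseline enforces.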
Deliverables¶
- Terraform state committed to the customer's S3 backend (the README ships a `state` section with the migration command).
- ECR repositories with `latest` and a `<sha>` tag for each of the 7 images.
- Doppler `dev_stage` config populated and audited; output of `doppler secrets` is signed off.
- DNS records resolving + TLS green in the customer's browser.
- A "stage walkthrough" recording or doc the customer's on-call team can replay.
Acceptance criteria¶
- `terraform plan` shows zero diff after `apply`.
- All 7 ECR repos have at least one tagged image.
- End-to-end trace `dropbet → ebit-api → Postgres + Redis` visible in Jaeger.
- Grafana `ebit-perf-test` dashboard populated.
- See Acceptance Criteria → Phase 3.
Common pitfalls¶
- Doppler config drift. Stage and perf configs diverge silently from each other; debug-only flags leak into prod. Mitigation: run the doppler-perf-audit playbook for each non-local config; keep the diff in a checked-in document.
- Missing FastTrack stub keys. The env validator requires `FASTTRACK_JWT_PRIVATE_KEY` / `FASTTRACK_JWT_PUBLIC_KEY` even though the producer is `disabled = true`. App refuses to boot. Mitigation: stage cutover script populates stubs explicitly.
- `network_mode: host` on admin-fe. The admin-fe container compose uses host networking as a workaround for hard-coded API host (see AF-1 in `architecture.md`). On AWS this means admin-fe must be on its own host or behind a separate routing layer. Mitigation: documented in `terraform/perf/README.md`; plan to fix in a code change before GA (Risks #6).
- Public-CIDR slop. Default Terraform locks every external port to `data.http.my_ip`. If customer engineers behind dynamic ISPs run `apply`, the next engineer's CIDR overrides the first. Mitigation: explicit `admin_ssh_cidrs` / `admin_http_cidrs` lists in `terraform.tfvars`.
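For the config-drift pitfall, a key-by-key comparison of two config exports is enough to produce the checked-in diff. This is a sketch under the assumption that each config has already been exported to a plain dictionary (for example from a JSON dump); the `config_drift` helper and its `ignore` parameter are hypothetical.

```python
def config_drift(stage: dict, perf: dict, ignore=frozenset()):
    """Report keys that differ between two config exports.

    Returns a dict mapping each drifting key to a (stage_value, perf_value)
    pair; a key missing on one side is reported as None on that side.
    """
    drift = {}
    for key in sorted(set(stage) | set(perf)):
        if key in ignore:
            continue
        left, right = stage.get(key), perf.get(key)
        if left != right:
            drift[key] = (left, right)
    return drift


stage_cfg = {"NODE_ENV": "production", "DEFAULT_LOG_LEVEL": "info"}
perf_cfg = {"NODE_ENV": "production", "DEFAULT_LOG_LEVEL": "debug", "DEBUG_LOGS_PRETTY": "true"}
print(config_drift(stage_cfg, perf_cfg))
```

If the diff is committed to a repository, compare hashes or presence flags for secret values rather than the raw values.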
Phase 4 — Integration (2 / 4 / 6 wk)¶
Goal¶
Wire the customer-specific external dependencies: payments, KYC, captcha, email, brand, locale, feature flags. This is the single most variable phase; duration depends on how many vendors the customer chose in Phase 1 and how mature their accounts are.
Inputs¶
- People: customer engineer per integration (payments lead, KYC lead, FE engineer for branding, ops engineer for flags), Evospin handover engineer for office hours.
- Materials: sandbox credentials for every approved vendor from Phase 1; brand asset pack (logo, favicon, color tokens); locale string list; feature-flag matrix (`UnleashClient` configuration in `apps/api/src/feature-flag/`).
Activities¶
- Payments. Configure deposit/withdraw adapter under `apps/api/src/payment/` for each enabled provider. CCPAYMENT and NowPayments are the two paths exercised today. Add provider keys to Doppler. Test deposit + withdraw flows in stage with sandbox credentials. Hook the webhook URL on the provider side at `https://{api-host}/payment/webhook/{provider}`.
- KYC. Configure Sumsub (or alternative) under `apps/api/src/kyc/`. Set webhook + redirect URLs; map level transitions (LEVEL_0 → 1 → 2) to deposit/withdraw caps. Test full upload-and-review flow against Sumsub sandbox.
- Captcha. Issue a GeeTest (or reCAPTCHA Enterprise) production key. Set in Doppler. Verify the local-bypass is gone in prod: `recaptcha.service.ts:28` gates the `'pass'` token on `isLocal === true`. With `NODE_ENV=production` the bypass is off (per `doppler-perf-audit.md`). See Risks #8.
- Email. Configure SendGrid (or alternative SMTP). Templates live in `apps/api/src/external-notification-sender/`. Send-from address must be on a domain the customer has verified with the provider (DKIM + SPF + DMARC).
- Brand + locale. ebit-fe accepts brand tokens via Tailwind theme + asset path overrides. Locale strings live in `messages/en.json` and `messages/de.json`; adding a locale = add a JSON file + update `next-intl` config (see CLAUDE.md → ebit-fe specifics). Admin-fe is single-language.
- Feature flags. Configure Unleash (or GitLab Feature Flags) under `apps/api/src/feature-flag/`. Decide initial flag posture for each known toggle (race leaderboards `RACE_ENABLED`, signup flow variants, sportsbook on/off).
- Game catalog. Run `npm run db:seed` against stage with the customer's catalog filter; or use admin endpoints to enable/disable house games per category.
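The webhook URL pattern from the Payments activity can be expanded into the exact values to register with each provider. A small sketch, assuming the `{provider}` path segment is the lowercase provider name; that slug format and the example host are assumptions to confirm against the payment module's routes.

```python
WEBHOOK_TEMPLATE = "https://{api_host}/payment/webhook/{provider}"


def webhook_urls(api_host: str, providers):
    """Return the webhook URL to register with each enabled payment provider."""
    # Assumption: the {provider} path segment is the lowercase provider name.
    return {
        p: WEBHOOK_TEMPLATE.format(api_host=api_host, provider=p.lower())
        for p in providers
    }


for name, url in webhook_urls("api.stage.example.com", ["CCPAYMENT", "NowPayments"]).items():
    print(name, url)
```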
Deliverables¶
- Each enabled vendor: signed off as sandbox-tested and webhook-tested, with credentials checked into Doppler.
- Brand assets committed to ebit-fe under the customer's fork branch.
- Feature-flag matrix signed off by customer ops.
- Game catalog populated and visible on the dropbet stage URL.
Acceptance criteria¶
- A test deposit completes and credits the test user's balance on stage.
- A test KYC submission moves the user from LEVEL_0 to LEVEL_1 in the admin panel (or via API).
- Brand tokens visible end-to-end on dropbet stage.
- See Acceptance Criteria → Phase 4.
Common pitfalls¶
- Vendor sandbox flakiness. Sumsub sandbox occasionally returns stale state for resubmissions; CCPAYMENT sandbox webhook IPs change. Mitigation: test from sandbox to stage and from stage to sandbox; build a webhook replay tool.
- Email deliverability. Test mails land in spam because customer skipped DKIM/SPF/DMARC. Mitigation: spam-test before UAT; warming requires lead time.
- Captcha key swap regression. Customer rotates GeeTest keys mid-Phase-4; sign-up breaks silently in stage. Mitigation: smoke-test sign-up after every Doppler change (see `docs/runbooks/recaptcha-fails-locally.md`).
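The webhook replay tool suggested for sandbox flakiness could start as small as the sketch below. It only builds the request and leaves sending to the caller. The capture format (`raw_body` plus `headers`) is hypothetical, and nothing here reflects how any specific provider signs its payloads.

```python
import urllib.request


def build_replay_request(target_url: str, recorded: dict) -> urllib.request.Request:
    """Turn one recorded webhook delivery into a POST request against stage.

    `recorded` is assumed to look like {"raw_body": "...", "headers": {...}};
    that capture format is hypothetical.
    """
    # Replay the body byte-for-byte so any provider signature header still matches.
    body = recorded["raw_body"].encode("utf-8")
    headers = {"Content-Type": "application/json", **recorded.get("headers", {})}
    return urllib.request.Request(target_url, data=body, headers=headers, method="POST")


recorded = {"raw_body": '{"status":"paid"}', "headers": {"X-Example-Signature": "abc"}}
req = build_replay_request("https://api.stage.example.com/payment/webhook/example", recorded)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen` sends it. Storing the raw body rather than re-serialized JSON matters if the provider signs the exact bytes.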
Phase 5 — Performance validation (3 days / 1 wk / 1.5 wk)¶
Goal¶
Run the stepped-ramp protocol from docs/performance-testing.md against stage and produce a signed-off performance-test-report.md clone. Identify and assign owners for every SLO breach.
Inputs¶
- People: customer SRE, Evospin handover engineer.
- Materials: stage environment from Phase 3; load-gen host already provisioned by Terraform; `tests-perf/` checked out; 10k seeded users (the perf wiring includes a seed script; see `terraform/perf/README.md`).
Activities¶
- Apply the Doppler perf-config audit playbook. Confirm `FASTTRACK_JWT_*` stubs present, `NODE_ENV=production`, `DEFAULT_LOG_LEVEL=info`, `DEBUG_LOGS_PRETTY=false`.
- Apply the kernel tuning checklist on the load-gen host.
- Run smoke: `k6 run --out experimental-prometheus-rw tests-perf/k6/smoke.js`.
- Run the full stepped ramp: `k6 run --out experimental-prometheus-rw tests-perf/profiles/stepped-ramp.js`.
- For each breach, follow the bottleneck-hunting runbook §6. Capture: Grafana snapshot (5 min around breach), Jaeger exemplar trace IDs, k6 summary at abort.
- Fill out the `performance-test-report.md` template. Replace every `{{TBD}}` with measurement or attribution.
Deliverables¶
- Filled-out performance test report (committed to the customer fork).
- Grafana dashboard snapshots for each stage and each breach.
- Decision log for each SLO miss: re-baseline expectations OR file optimization ticket with owner.
Acceptance criteria¶
- Test ran through stage 4 (5,000 VUs) without breaching auto-abort thresholds, or every breach has a documented owner.
- See Acceptance Criteria → Phase 5.
Common pitfalls¶
- Dice bet SLO miss is structural. Baseline p95 is 108 ms at 1 VU (per `performance-test-report.md`), already over the 100 ms target before any load. Mitigation: either re-baseline the SLO to 150 ms (it's the transactional block in `PlaceBetService`, which is correct-by-design for atomicity) or schedule the optimization ticket. See Risks #7.
- Cross-service trace gap. Inter-Nest-app RPC over Redis pub/sub does not propagate `traceparent` (AF-2 in `architecture.md`). Some breach attributions need a Loki pivot on `userId` + `timestamp`. Mitigation: document the blind spots; bottleneck-hunting runbook calls this out.
- Co-located load-gen contention. If perf is run against the SUT from the same host, k6 and the services compete for CPU. Mitigation: default Terraform splits load-gen onto its own host; do not deviate.
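The dice-bet pitfall comes down to comparing a measured p95 against a target. The sketch below uses a nearest-rank percentile, which is a simplification and may differ slightly from how k6 or Grafana compute their percentiles; the function names are illustrative.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # 1-based nearest rank, ceil(0.95 * n), computed in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_verdict(p95_ms: float, target_ms: float) -> str:
    """Classify a measured p95 against an SLO target."""
    return "ok" if p95_ms <= target_ms else "breach"


# Baseline from this plan: 108 ms at 1 VU against the 100 ms and 150 ms targets.
print(slo_verdict(108, 100), slo_verdict(108, 150))
```

With the baseline figure, the 100 ms target reports a breach and the re-baselined 150 ms target passes, which is the trade-off the mitigation describes.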
Phase 6 — UAT (2 / 3 / 4 wk)¶
Goal¶
The customer drives end-to-end manual + scripted scenarios against stage, with real (sandbox) vendors, real domains, real branding. Defects are filed, triaged, and closed before pilot.
Inputs¶
- People: customer QA team, customer product, customer support; Evospin handover engineer for triage office hours.
- Materials: signed-off Phase-3 stage; signed-off Phase-4 vendor wiring; signed-off Phase-5 perf report; UAT script document (customer-owned).
Activities¶
- Customer QA runs the UAT script. Recommend organizing as: registration & auth, deposit, KYC, place bet (each enabled game), withdraw, leaderboard, challenges/promos, support tools (admin user mgmt, bet review).
- File defects in the customer's tracker. Triage with Evospin weekly. Each defect carries a flow-doc reference where applicable (anchor to `docs/flows/<flow>.md`).
- Confirm every security finding is either fixed, accepted-with-mitigation, or has a forecasted fix date that precedes GA.
- Customer support team validates runbooks against the stage environment — every runbook should be runnable solo.
Deliverables¶
- UAT defect register, signed off by customer product as "ready for pilot".
- Security register annotated with customer accept/fix decisions.
- Support runbook validation log.
Acceptance criteria¶
- Zero open Critical or High security findings without an accepted mitigation or scheduled fix.
- Zero P0 defects open.
- See Acceptance Criteria → Phase 6.
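The two objective criteria above can be expressed as a simple exit-gate check over the security and defect registers. The `Finding` record and its fields are a hypothetical shape for a register export, not the format of any existing tracker.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    ident: str
    severity: str                      # "Critical", "High", "Medium", "Low"
    status: str                        # "open" or "fixed"
    mitigation_accepted: bool = False
    fix_scheduled: bool = False


def uat_blockers(findings, open_p0_defects: int):
    """List reasons the Phase 6 exit gate would fail; an empty list means pass."""
    reasons = []
    for f in findings:
        unhandled = not (f.mitigation_accepted or f.fix_scheduled)
        if f.status == "open" and f.severity in ("Critical", "High") and unhandled:
            reasons.append(f"{f.ident}: open {f.severity} finding with no mitigation or fix date")
    if open_p0_defects > 0:
        reasons.append(f"{open_p0_defects} P0 defect(s) still open")
    return reasons
```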
Common pitfalls¶
- UAT becomes design iteration. Customer product team wants UI changes that should have been Phase-1 brand decisions. Mitigation: enforce scope from Discovery decision log; new asks become post-GA backlog.
- Known-issue surprise. Customer testers find AF-1 (admin-fe blank-page-after-sign-in) and panic. Mitigation: pre-brief on the `weaknesses-register.md`; admin work is via Swagger today (Risks #5).
Phase 7 — Pilot launch (1 / 2 / 3 wk)¶
Goal¶
Limited rollout to a defined cohort (geographic, percentage, invite-list). Operate the platform with on-call coverage, watch dashboards, validate the runbooks in production conditions.
Inputs¶
- People: customer on-call rotation (24/7 if applicable), customer support, Evospin on-call shadow.
- Materials: production environment provisioned (mirror of stage but with `prd` Doppler config + production vendor credentials + production gaming license active in the targeted jurisdiction).
- Decisions: pilot cohort definition (geography or invite list); rollback criteria.
Activities¶
- Cut over DNS to production. Watch the `service-overview` dashboard for the first hour.
- Open the cohort: feature-flag toggle, geo-IP allowlist, or invite-code gate.
- Daily on-call sync; watch the `bullmq.json`, `redis.json`, `prisma-postgres.json`, `browser-rum.json` dashboards.
- Run a real incident drill (e.g. inject a 500 on a non-critical route) so the customer team exercises Loki → Jaeger pivot, the trace-missing runbook, and the on-call escalation tree.
- Track metrics: error rate, p95 sign-in, p95 bet-place, queue depth, OOM kills, restart count.
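For a percentage-based cohort, membership is usually decided by hashing a stable user identifier into a bucket so the same user always gets the same answer. The sketch below illustrates that idea only; it is not how the platform's feature-flag integration is implemented, and the `salt` value and function name are hypothetical. In practice the flag service's own gradual-rollout support would normally do this.

```python
import hashlib


def in_pilot_cohort(user_id: str, percent: int, salt: str = "pilot-1") -> bool:
    """Deterministically place a user in or out of a percentage cohort."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


users = [f"user-{i}" for i in range(1000)]
admitted = sum(in_pilot_cohort(u, 10) for u in users)
print(f"{admitted} of {len(users)} users admitted at 10%")
```

Because the bucket is fixed per user, raising `percent` only ever adds users to the cohort and never removes anyone already admitted.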
Deliverables¶
- Daily ops summary log (7 entries minimum).
- Incident drill after-action report.
- Updated runbook list (any gap found in pilot becomes a new runbook).
Acceptance criteria¶
- Error rate < 0.1% sustained over 7 consecutive days.
- p95 sign-in < 150 ms (or the renegotiated SLO if the dice-bet path was re-baselined in Phase 5).
- Zero P0 incidents.
- On-call runbook validated end-to-end at least once.
- See Acceptance Criteria → Phase 7.
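The error-rate criterion can be evaluated from the daily ops summary. A minimal sketch, assuming each day is recorded as an `(error_count, request_count)` pair; the treatment of zero-traffic days is an assumption to agree with the customer.

```python
def sustained_error_rate_ok(daily, threshold=0.001, days=7):
    """Check that the most recent `days` entries all stay under the threshold.

    `daily` is a chronological list of (error_count, request_count) tuples;
    the default threshold of 0.001 corresponds to 0.1%.
    """
    if len(daily) < days:
        return False
    for errors, requests in daily[-days:]:
        if requests == 0:
            continue  # assumption: a day with no traffic does not count as a breach
        if errors / requests >= threshold:
            return False
    return True
```

A single day at or above 0.1% resets the clock: the check only passes once seven consecutive clean days sit at the end of the series.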
Common pitfalls¶
- rt scaling. `ClientGateway.clientSockets` is per-instance; without sticky sessions or `@socket.io/redis-adapter`, per-room emits drop (AF-3 in `architecture.md`). Mitigation: stay single-replica for pilot OR adopt the Redis adapter pre-pilot.
- Online-count "spike". `UsersOnlineUpdated` includes `fakeUserOnline` padding (start 500, drift ±5, floor 180); customer ops sees "users online" even before any pilot users join. Mitigation: pre-brief; AF-5 in `architecture.md`.
- Silent audit gap. `safeLog` swallows insert errors on `admin_action_log`. Mitigation: alert on Postgres error rate; document in `docs/runbooks/post-launch`.
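To help ops interpret the online-count pitfall, the sketch below simulates what a padded counter with those parameters might look like over time. It is an illustration built only from the three numbers quoted above (start 500, drift ±5, floor 180) and is not the platform's actual `fakeUserOnline` logic; the update rule, function name, and seed are assumptions.

```python
import random


def simulate_padding(steps: int, start=500, drift=5, floor=180, seed=0):
    """Random walk: begin at `start`, move by at most `drift` per step, never below `floor`."""
    rng = random.Random(seed)  # seeded so repeated runs give the same series
    value = start
    series = [value]
    for _ in range(steps):
        value = max(floor, value + rng.randint(-drift, drift))
        series.append(value)
    return series


series = simulate_padding(60)
print(min(series), max(series))
```

The point for the pre-brief is that a dashboard can show several hundred "users online" with zero real pilot traffic, so real-user counts should be read from a different metric.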
Phase 8 — GA (3 days / 1 wk / 2 wk)¶
Goal¶
Open the floodgates. Hand off operational ownership to the customer's on-call team. Schedule the post-handover review.
Inputs¶
- People: customer on-call lead, customer support lead, Evospin delivery lead. Evospin handover engineer transitions to advisory only.
- Materials: clean pilot exit (Phase 7 acceptance signed); communication plan for cohort expansion; updated marketing pages; legal pages live (terms, privacy).
Activities¶
- Remove the pilot cohort gate. Watch for 24h.
- Announce GA externally per the customer's marketing plan.
- Walk through `docs/runbooks/` one more time with the on-call team. Confirm escalation matrix.
- Schedule a 30-day post-GA review: SLO compliance, defect rate, customer support ticket volume, change-request backlog.
- Customer-side incident drill solo (Evospin observes only).
- Hand over Sentry, Doppler, ECR, AWS account ownership artifacts in writing.
Deliverables¶
- GA announcement (customer-owned).
- Signed handover document covering: AWS, Doppler, Sentry, ECR, vendor accounts (KYC, payments, captcha, email), CI/CD repository write access.
- Escalation matrix in the customer's wiki, exercised once.
- 30-day review meeting on the calendar.
Acceptance criteria¶
- Customer on-call team has executed an incident drill solo without Evospin assistance.
- Escalation matrix exercised end-to-end.
- All vendor accounts re-keyed to customer-owned credentials (no ebit-personal tokens).
- See Acceptance Criteria → Phase 8.
Common pitfalls¶
- Soft handover. Evospin engineer answers customer pages directly post-GA; team never builds the muscle. Mitigation: "Evospin advises, customer responds" is a hard contract from GA day one.
- Vendor credential leak. ebit-personal Doppler tokens still active. Mitigation: explicit credential audit in the launch checklist (see Launch Checklist → Secrets).
- Skipped 30-day review. Issues accumulate silently. Mitigation: booked in calendar before GA day.