Delivery Risks
This register lists the 18 specific delivery risks that we have either seen materialize or have the strongest evidence will materialize. Each row carries:
- Probability (L / M / H) — how often this bites in similar deliveries.
- Impact (L / M / H) — what it costs if it bites (schedule slip, scope cut, defect class).
- Mitigation — concrete action with owner.
- Owner — who runs the mitigation (customer = customer-side; ebit = Evospin handover engineer; joint = both).
Risks are not generic ("things might be slow"). They are tied to specific code paths, specific runbooks, or specific vendor pipelines.
⚠ The top three to brief the customer on before quoting a timeline: Risk #1 (Doppler dry-run), Risk #3 (KYC vendor lead time), Risk #7 (dice-bet SLO is structurally unmet at 1 VU).
Risk #1 — Doppler workspace mis-config
| Field | Value |
|---|---|
| Probability | H — observed in docs/audits/doppler-perf-audit.md (perf cutover audit found 2 showstoppers + 1 pending issue in 121 keys) |
| Impact | H — boot crash on first stage/prod deploy; if undiagnosed, manifests as silent feature regression (e.g. captcha bypass off, but seed script depends on it) |
What goes wrong. Doppler config diverges silently between dev, dev_stage, dev_perf, and prd. The env validator (@IsString(), no @IsOptional(), skipMissingProperties: false) rejects boot if a single required key is missing. For example, FASTTRACK_JWT_PRIVATE_KEY and FASTTRACK_JWT_PUBLIC_KEY are validator-required even though the Fast Track producer is disabled (disabled = true).
Mitigation.
1. Run the doppler-perf-audit playbook for every non-local config. Diff against dev and against .example.env.
2. Add a CI job that runs the env validator in dry-run mode against a Doppler config snapshot before promoting it.
3. Build a "secret rotation runbook" before GA. The current runbooks library does not yet have one — see post-handover backlog.
Owner. customer (customer SRE owns Doppler post-handover); ebit runs the first audit during Phase 3.
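The audit diff and the pre-promotion dry-run described above can be approximated without booting the app. The following is a minimal sketch, assuming each Doppler config has been exported as a flat key/value snapshot. The type and function names are hypothetical, and the check only imitates the "every required key is a non-empty string" behaviour; the real gate should run the project's own env validator.

```typescript
// Hypothetical shape: a Doppler config snapshot exported as a flat key/value map.
type ConfigSnapshot = Record<string, string | undefined>;

// Returns the required keys that are absent or blank in the snapshot.
// Imitates a validator with no optional keys: each listed key must be a non-empty string.
function findMissingKeys(requiredKeys: string[], snapshot: ConfigSnapshot): string[] {
  return requiredKeys.filter((key) => {
    const value = snapshot[key];
    return typeof value !== "string" || value.trim() === "";
  });
}

// Compares key sets between a baseline config (e.g. dev) and a candidate (e.g. prd).
function diffKeys(
  baseline: ConfigSnapshot,
  candidate: ConfigSnapshot,
): { missingInCandidate: string[]; extraInCandidate: string[] } {
  return {
    missingInCandidate: Object.keys(baseline).filter((k) => !(k in candidate)).sort(),
    extraInCandidate: Object.keys(candidate).filter((k) => !(k in baseline)).sort(),
  };
}
```

A CI job could fail promotion whenever `findMissingKeys` returns a non-empty list for the target config, and attach the `diffKeys` result against dev to the audit record.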
Risk #2 — Game-provider integration scope creep (PM8 vs Softswiss vs both)
| Field | Value |
|---|---|
| Probability | M |
| Impact | H — adds 4–8 weeks to Phase 4 if decided late |
What goes wrong. Customer commits in Phase 1 to "house games only", then in Phase 6 (UAT) the customer's commercial team requires Softswiss integration to satisfy a deal. There is no Softswiss adapter today; building one is a new module under apps/api/src/casino/.
Mitigation.
1. Make the Phase-1 scope decision binary; record it in the decision log, dated and signed.
2. State the reservation in the price quote: "any change to game-provider scope after Discovery resets the Phase-4 timeline".
3. If a provider must be added mid-delivery, freeze Phase 5 and schedule a separate Phase-4b.
Owner. joint — customer commercial commits; Evospin holds the line in scope review.
Risk #3 — KYC vendor approval timeline (Sumsub typical 2–4 weeks)
| Field | Value |
|---|---|
| Probability | H |
| Impact | M — blocks Phase 4 and Phase 6 if not started in Discovery |
What goes wrong. Sumsub sandbox application sits in "pending review" for 2–4 weeks. Customer assumes KYC integration is a Phase-4 task and only opens the application in Phase 4 — Phase 4 then idles waiting on vendor approval.
Mitigation.
1. Open Sumsub (and any alternative vendor) on Day 1 of Discovery (Phase 1 Activity §4).
2. While waiting for approval, build the integration against the provider's public sandbox documentation.
3. Have a fallback identified: Veriff or Onfido. Building a new adapter is ~1 week of engineer time vs the multi-week wait.
Owner. customer (customer compliance opens the application); ebit advises on technical fallback.
Risk #4 — Tracing transport gap on Redis pub/sub RPC
| Field | Value |
|---|---|
| Probability | H — already true today (AF-2 in architecture.md) |
| Impact | M — debug productivity loss during incidents; cross-service latency breakdown requires Loki pivot |
What goes wrong. Inter-Nest-app RPC over Redis pub/sub (@ExternalControllerClient, @GatewayMethod, @MessagePattern) does not propagate W3C traceparent. The callee emits orphan root traces. During a production incident, "show me the trace from sign-in through to the rt push" is a Loki-pivot exercise (correlate on userId + timestamp), not a single Jaeger search.
Mitigation.
1. Document the blind spots explicitly during handover (already done in weaknesses-register.md AF-2).
2. The trace-missing runbook covers the Loki-pivot workaround.
3. Plan to add traceparent propagation to the pub/sub transport post-launch. Ticket scope: ~1 engineer-week. Project tracking memo lives in project_otel_microservice_transport_gap per architecture doc.
Owner. ebit documents pre-handover; customer schedules the post-handover fix or accepts permanently.
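To scope the post-launch propagation ticket, it helps to see how small the core of the fix is. The sketch below assumes the pub/sub payload can be wrapped in an envelope carrying a W3C `traceparent` string; the `RpcEnvelope` type and the helper names are hypothetical and are not part of the existing transport. The header layout (`00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`) follows the W3C Trace Context format.

```typescript
// Fields carried by a W3C traceparent header.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Hypothetical envelope: the existing RPC payload plus an optional traceparent field.
interface RpcEnvelope<T> {
  payload: T;
  traceparent?: string;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Caller side: serialize the active context into the header format.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Callee side: parse the header; undefined means "start a new root trace".
function parseTraceparent(header: string | undefined): TraceContext | undefined {
  if (header === undefined) return undefined;
  const match = TRACEPARENT_RE.exec(header);
  if (match === null) return undefined;
  const traceId = match[1] ?? "";
  const spanId = match[2] ?? "";
  const flags = match[3] ?? "00";
  // All-zero trace or span ids are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Caller side: attach the context to the outgoing message when one exists.
function wrapWithTrace<T>(payload: T, ctx: TraceContext | undefined): RpcEnvelope<T> {
  return ctx === undefined ? { payload } : { payload, traceparent: formatTraceparent(ctx) };
}
```

In a real implementation the context would come from the OpenTelemetry API on the caller and be restored as the parent of the callee's span; the sketch covers only the wire format, which is what the current transport drops.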
Risk #5 — Admin-fe auth bug cluster
| Field | Value |
|---|---|
| Probability | H — already true today (AF-1 in architecture.md) |
| Impact | M — admin operations work via Swagger or E2E specs, but customer support team UX is degraded |
What goes wrong. Four stacked bugs in ebit-admin-fe: (a) cookie-name mismatch (access_token vs jwt_access_token); (b) no @vercel/otel instrumentation; (c) no propagateContextUrls for traceparent injection; (d) hard-coded API host. Result: after sign-in the admin-fe shows a blank page; cross-service traces are invisible. ebit-admin-fe/src/middleware.ts:59-60 is the cookie culprit.
Mitigation options.
1. Pre-handover fix. Estimated 2–3 engineer-days. Recommended, especially if customer support team will use the admin UI heavily.
2. Explicit acceptance. Customer support trained to use Swagger or a dedicated CLI; admin operations via POST /admin/* with JWT in Authorization header.
3. Document workaround if neither: customer engineering forks the admin-fe and applies a patch.
Owner. joint — customer decides which option; ebit either ships the fix or supplies the workaround doc.
Risk #6 — ebit-bj orphan app
| Field | Value |
|---|---|
| Probability | H — already true (AF-4) |
| Impact | L — cosmetic confusion in topology diagrams; small operational surface that is unmonitored |
What goes wrong. apps/bj/ runs on port 4002 with full blackjack + EVO-wallet RPC, but the dropbet client uses ebit-api's /casino/games/house/blackjack/* exclusively. The compose file exposes the port; the service receives zero in-repo traffic. New engineers see "blackjack server running" in docker compose ps and assume the dropbet flow goes through it (it doesn't).
Mitigation.
1. Make the scope decision in Discovery. Two options: (a) deprecate — remove from compose, archive apps/bj/; (b) wire up — replace ebit-api's in-process blackjack with the ebit-bj service.
2. Recommended: deprecate. The orphan adds maintenance surface (build, CI, Doppler keys, EVO wallet integration) without exercising it.
3. Document the decision in the customer fork's CLAUDE.md so the orphan is not re-discovered as "unused service" in a future audit.
Owner. joint — customer makes the call; ebit executes the chosen path.
Risk #7 — Dice bet SLO unmet at 1 VU
| Field | Value |
|---|---|
| Probability | H — measured today (docs/performance-test-report.md) |
| Impact | M — Phase 5 SLO miss; if not reframed, customer reads "perf doesn't meet target" as a P0 |
What goes wrong. POST /casino/games/house/dice/bet p95 is 108 ms at 1 VU, already over the 100 ms SLO target before any load. Latency is dominated by the Prisma transactional block in PlaceBetService: user seed pop, game identity lookup, self-exclusion check, balance update, two transaction creates, bet create, and nonce increment — all inside a single serializable transaction. Trace 8aaa902b3964af1d33dec7000bb36e02 (102 spans) is the authoritative reference.
Mitigation options.
1. Re-baseline expectations. SLO becomes 150 ms (matches sign-in budget; the transactional design is correct-by-design for atomicity). Defer optimization.
2. Optimize the transactional block. Hoist the read-only steps (game identity lookup, self-exclusion check) out of the transaction; convert two transaction-create rows into a batch insert. Estimated 1 engineer-week, with regression risk on fairness invariants.
Mitigation.
- Brief the customer on this before Phase 5 starts.
- Add the decision to the Phase-5 sign-off doc.
Owner. joint — customer decides re-baseline vs optimize; ebit executes if optimization is chosen.
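Whichever option is chosen, the Phase-5 sign-off needs an unambiguous way to evaluate p95 against the agreed budget. The sketch below uses the nearest-rank percentile definition, which is an assumption: the perf tooling behind docs/performance-test-report.md may interpolate differently, so small differences from the reported figure are possible. Function names are hypothetical.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Integer arithmetic first, then divide, to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("rank out of range");
  return value;
}

// True when the measured p95 fits inside the latency budget.
function meetsSlo(p95Ms: number, budgetMs: number): boolean {
  return p95Ms <= budgetMs;
}
```

With the measured p95 of 108 ms, `meetsSlo(108, 100)` is false and `meetsSlo(108, 150)` is true, which is the whole difference between the two options as far as the sign-off gate is concerned.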
Risk #8 — Captcha bypass tied to NODE_ENV
| Field | Value |
|---|---|
| Probability | M |
| Impact | H if undetected — sign-up open to bot abuse |
What goes wrong. recaptcha.service.ts:28 gates the 'pass' token bypass on isLocal === true. With NODE_ENV=production (per docs/audits/doppler-perf-audit.md Phase-3 audit) the bypass is off. The seed script tests-perf/seed/seed-users.mjs depends on the bypass; running it post-NODE_ENV-flip fails. Worse, if the customer mis-sets NODE_ENV=local in production Doppler (e.g. typo in promotion script), the bypass is on and reCAPTCHA is silently disabled in prod.
Mitigation.
1. Verify production GeeTest / reCAPTCHA Enterprise credentials are populated before Phase 7 cutover (see Launch Checklist → Secrets).
2. Add a CI assertion: prd Doppler config must have NODE_ENV=production and a real captcha key — fail promotion otherwise.
3. Order Phase 5 perf seed before the NODE_ENV flip, or supply an alternate seed bypass.
Owner. customer SRE post-handover; ebit adds the CI assertion before handover.
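The CI assertion in step 2 can be a small pure function run against the prd config snapshot before promotion. This is a minimal sketch: the captcha key name `RECAPTCHA_SECRET_KEY` and the placeholder list are assumptions for illustration, so substitute the key the env validator actually requires.

```typescript
// Hypothetical shape: the prd Doppler config exported as a flat key/value map.
type ConfigSnapshot = Record<string, string | undefined>;

// Assumed key name and placeholder values; replace with the project's real ones.
const CAPTCHA_KEY_NAME = "RECAPTCHA_SECRET_KEY";
const PLACEHOLDER_VALUES = new Set(["", "pass", "changeme", "test"]);

// Returns a list of violations; an empty list means the config may be promoted.
function checkProdConfig(config: ConfigSnapshot): string[] {
  const errors: string[] = [];
  const nodeEnv = config["NODE_ENV"];
  if (nodeEnv !== "production") {
    errors.push(`NODE_ENV must be "production", got "${nodeEnv ?? "<unset>"}"`);
  }
  const captchaKey = (config[CAPTCHA_KEY_NAME] ?? "").trim().toLowerCase();
  if (PLACEHOLDER_VALUES.has(captchaKey)) {
    errors.push(`${CAPTCHA_KEY_NAME} is missing or a placeholder`);
  }
  return errors;
}
```

The comparison on NODE_ENV is deliberately exact, so a typo such as `local` or `Production` in the promotion script fails the gate instead of silently enabling the bypass.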
Risk #9 — RT WebSocket horizontal scaling
| Field | Value |
|---|---|
| Probability | M — only triggers if customer scales rt past one replica |
| Impact | H — silently dropped per-room emits; users miss balance/bet events |
What goes wrong. ClientGateway.clientSockets is a per-instance Map. Scaling rt past one replica without sticky-routing OR @socket.io/redis-adapter silently drops message.user-targeted emits and breaks per-room joins (AF-3 in architecture.md, SF-016 in docs/security-register.md).
Mitigation.
1. Stay single-replica through pilot.
2. Pre-GA: either adopt the Redis adapter (estimated 3–5 engineer-days) or configure sticky sessions at the load balancer.
3. Monitor rt socket count and per-instance memory in Grafana; set an alert at 80% of expected single-instance capacity.
Owner. customer SRE; ebit advises on adapter integration if the customer adopts the Redis adapter.
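The failure mode is easy to reproduce in miniature. The sketch below is a simplified stand-in, not the real ClientGateway: each instance keeps its own in-memory registry, so an emit issued on a replica the user is not connected to finds no socket and is dropped without an error. The `broadcastToUser` helper only imitates the effect a shared adapter provides (every replica gets a chance to deliver); class and function names are hypothetical.

```typescript
// Simplified stand-in for one rt replica with a per-instance socket registry.
class GatewayInstance {
  // userId -> messages delivered to that user's socket on this instance
  private readonly clientSockets = new Map<string, string[]>();

  connect(userId: string): void {
    this.clientSockets.set(userId, []);
  }

  // Returns false when the user is not connected to this instance: the emit is silently dropped.
  emitToUser(userId: string, message: string): boolean {
    const inbox = this.clientSockets.get(userId);
    if (inbox === undefined) return false;
    inbox.push(message);
    return true;
  }

  delivered(userId: string): string[] {
    return [...(this.clientSockets.get(userId) ?? [])];
  }
}

// Imitates a shared adapter: the emit is fanned out so the owning replica can deliver it.
function broadcastToUser(instances: GatewayInstance[], userId: string, message: string): boolean {
  return instances
    .map((instance) => instance.emitToUser(userId, message))
    .some((wasDelivered) => wasDelivered);
}
```

With two replicas and a user connected only to the first, an emit on the second returns false and nothing is logged, which is why the loss goes unnoticed until users report missing balance or bet events.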
Risk #10 — RabbitMQ stub silently drops events
| Field | Value |
|---|---|
| Probability | H — already true (AF-6) |
| Impact | L for launch (Fast Track CRM is not customer-facing); M if customer requires bonus tracking |
What goes wrong. apps/api/src/fast-track/rabbitmq/fast-track.rmq.module.ts:8 returns disabled = true. 11 producer call sites (bet.service.ts ×4 settlement paths + promo-effect.service.ts ×7 reward handlers) fire into a no-op stub — every settled bet and every promo effect silently drops its Fast Track event.
Mitigation.
1. Discovery-phase decision. Either (a) wire Fast Track CRM (requires FASTTRACK_JWT_PRIVATE_KEY/FASTTRACK_JWT_PUBLIC_KEY + sandbox; ~2 engineer-weeks) or (b) rip the producer classes + 11 call sites entirely (~1 engineer-week).
2. Default recommendation: rip, unless the customer has explicit Fast Track CRM commitment.
Owner. joint.
Risk #11 — Online-count inflation surprises customer ops
| Field | Value |
|---|---|
| Probability | H — already true (AF-5) |
| Impact | L — cosmetic but credibility-eroding |
What goes wrong. apps/api/src/user/online-tracker.service.ts:30-46 broadcasts zcard(ONLINE_USERS_KEY) + fakeUserOnline every 10s via Server.UsersOnlineUpdated. fakeUserOnline starts at 500, drifts by ±5, and has a floor of 180. Customer ops looks at the dashboard before any real users join, sees "500 users online", and stops trusting the metric.
Mitigation.
1. Pre-brief: include in the Phase-3 stage walkthrough and Phase-7 pre-pilot brief.
2. Decide pre-launch: either remove fakeUserOnline (1 day work) or expose the raw zcard via a separate metric for ops dashboards.
Owner. joint.
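If the customer keeps the padding for the player-facing counter, the second option amounts to reporting the raw and displayed values separately. The sketch below restates the behaviour described above (start 500, drift ±5, floor 180) under one assumption: the drift is taken to be a uniform integer step per tick, which the source does not specify. Function names are hypothetical.

```typescript
const FAKE_ONLINE_START = 500;
const FAKE_ONLINE_FLOOR = 180;
const FAKE_ONLINE_MAX_DRIFT = 5;

// Next padding value: step by an integer in [-5, +5], never dropping below the floor.
// `random` is injectable so the drift can be exercised deterministically.
function nextFakeOnline(current: number, random: () => number = Math.random): number {
  const drift = Math.floor(random() * (2 * FAKE_ONLINE_MAX_DRIFT + 1)) - FAKE_ONLINE_MAX_DRIFT;
  return Math.max(FAKE_ONLINE_FLOOR, current + drift);
}

// Keeps the real count (zcard result) separate from the padded figure that is broadcast.
function onlineCounts(realOnline: number, fakeOnline: number): { raw: number; displayed: number } {
  return { raw: realOnline, displayed: realOnline + fakeOnline };
}
```

Before any real user joins, `onlineCounts(0, FAKE_ONLINE_START)` yields a raw count of 0 and a displayed count of 500; an ops dashboard should chart only the raw value.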
Risk #12 — Critical security findings unfixed at GA
| Field | Value |
|---|---|
| Probability | M |
| Impact | H — SF-008 anonymous read of seed material, SF-013 negative-balance via to-vault, FM-C-1 promo claim 404 |
What goes wrong. docs/security-register.md lists three Critical findings; if any remains open at GA, the customer carries financial-loss risk (SF-013) or fairness-leak risk (SF-008).
Mitigation.
1. Phase 6 (UAT) hard gate: zero open Critical or High findings without an accepted mitigation or scheduled fix.
2. The Phase 6 acceptance criteria explicitly check the security register state.
3. SF-013 (to-vault overdraft) is a 1-day fix; should be done pre-handover.
Owner. ebit for pre-handover fixes; customer accepts or schedules anything deferred.
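The Phase 6 hard gate can be expressed as a filter over the security register. This is a minimal sketch, assuming the register can be exported as structured records; the `Finding` fields are hypothetical and would need to be mapped from whatever docs/security-register.md actually records.

```typescript
type Severity = "Critical" | "High" | "Medium" | "Low";

// Hypothetical record shape for one row of the security register.
interface Finding {
  id: string;
  severity: Severity;
  status: "open" | "fixed";
  acceptedMitigation?: boolean; // customer has signed off on a mitigation
  scheduledFixDate?: string; // ISO date of a committed fix, if any
}

// Findings that block UAT sign-off: open Critical or High items
// with neither an accepted mitigation nor a scheduled fix.
function blockingFindings(register: Finding[]): Finding[] {
  return register.filter(
    (finding) =>
      finding.status === "open" &&
      (finding.severity === "Critical" || finding.severity === "High") &&
      finding.acceptedMitigation !== true &&
      finding.scheduledFixDate === undefined,
  );
}
```

The gate passes only when `blockingFindings` returns an empty list; deferred items stay visible in the register because they carry either the acceptance flag or a fix date.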
Risk #13 — ECR image storage cost growth
| Field | Value |
|---|---|
| Probability | L — mitigation already in place |
| Impact | L — small cost line item if mitigation is removed |
What goes wrong. Without a lifecycle policy, ECR images accumulate one tag per CI build forever; storage cost grows linearly.
Mitigation.
- Already in place via terraform/perf/ lifecycle policy: keep 10 tagged + expire untagged after 7 days.
- Customer SRE preserves the policy in any Terraform fork.
Owner. customer SRE.
Risk #14 — Email deliverability failure pre-GA
| Field | Value |
|---|---|
| Probability | M |
| Impact | M — sign-up verification mails land in spam; users can't activate |
What goes wrong. Customer skips DKIM / SPF / DMARC on the sending domain; SendGrid (or alt SMTP) reputation is zero on a fresh domain.
Mitigation.
1. Phase-4 task: customer DNS owner adds DKIM, SPF, DMARC records.
2. Spam-test before UAT (services like mail-tester.com).
3. Warm the sending domain over Phase 4–5 with low-volume internal emails.
Owner. customer (DNS + email).
Risk #15 — Postgres connection pool saturation under load
| Field | Value |
|---|---|
| Probability | M at scale (5k+ VUs) |
| Impact | H — p95 spike across all routes |
What goes wrong. Default Prisma connection pool is ~10 connections per service instance. With 5 NestJS apps × N replicas × 10 connections each, total demand quickly fills Postgres max_connections (default 100): two replicas per app already consume all 100. Symptom: prisma:engine:db_query p95 climbs across the board (per docs/performance-testing.md §6).
Mitigation.
1. Provision Postgres with higher max_connections (300+) and use a pooler (PgBouncer in transaction mode or AWS RDS Proxy).
2. Tune ?connection_limit=N per DATABASE_URL.
3. Phase 5 perf validation surfaces this; ticket into Phase 4b.
Owner. customer Postgres DBA + SRE.
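The sizing arithmetic is worth running before Phase 5 rather than discovering it under load. The sketch below computes total connection demand and the replica ceiling for a given pool size; the `PoolPlan` shape and function names are hypothetical, and the optional `reserved` field stands for connections kept back for superusers, migrations, or monitoring.

```typescript
// Hypothetical planning input for direct (un-pooled) Postgres connections.
interface PoolPlan {
  apps: number; // number of NestJS apps connecting to Postgres
  replicasPerApp: number;
  poolSizePerInstance: number; // Prisma connection_limit per instance
  maxConnections: number; // Postgres max_connections
  reserved?: number; // connections held back for admin tooling
}

// Total connections the fleet can open at once.
function connectionDemand(plan: PoolPlan): number {
  return plan.apps * plan.replicasPerApp * plan.poolSizePerInstance;
}

// Largest replica count per app that still fits under max_connections.
function maxReplicasPerApp(plan: PoolPlan): number {
  const usable = plan.maxConnections - (plan.reserved ?? 0);
  return Math.max(0, Math.floor(usable / (plan.apps * plan.poolSizePerInstance)));
}

// True when demand fits within the usable connection budget.
function fitsWithinLimit(plan: PoolPlan): boolean {
  return connectionDemand(plan) <= plan.maxConnections - (plan.reserved ?? 0);
}
```

With 5 apps, a pool of 10, and max_connections of 100, three replicas per app demand 150 connections and do not fit; holding back even a few connections for admin access drops the ceiling to one replica per app. Lowering connection_limit or placing a pooler in front of Postgres changes the inputs, not the arithmetic.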
Risk #16 — Loki retention not configured
| Field | Value |
|---|---|
| Probability | M |
| Impact | L monetary, M compliance — unbounded log retention can violate data-retention policy |
What goes wrong. Default Loki config retains logs forever; customer's data retention policy may require 30 / 90 / 180 day caps; logs include PII (emails in sign-in failures).
Mitigation.
1. Configure Loki retention before Phase 7 (see observability/ config).
2. Confirm log redaction for sensitive fields in pino formatter.
3. Loki log-sourcing covers EvoLogger via filelog/docker receiver (per AF-7 in architecture.md) — verify retention applies uniformly.
Owner. customer SRE.
Risk #17 — Skipped 30-day post-GA review
| Field | Value |
|---|---|
| Probability | M |
| Impact | M — defects accumulate; customer team builds workarounds instead of fixes |
What goes wrong. Phase 8 hands over; Evospin advisory ends; customer team is firefighting; the 30-day review never gets booked.
Mitigation.
1. Book the calendar invite before GA day.
2. Evospin delivery lead carries the review even if advisory contract is closed.
3. Standing agenda: SLO compliance, defect rate, support ticket volume, change-request backlog, runbook gaps.
Owner. joint.
Risk #18 — Compliance / gaming-license schedule
| Field | Value |
|---|---|
| Probability | H when deploying in regulated jurisdictions |
| Impact | H — blocks Phase 7 pilot in target market |
What goes wrong. Gaming license application in target jurisdiction takes 2–18 months; customer assumes "we can launch in [country] when the platform is ready". Pilot is ready in Phase 7 but jurisdiction gating blocks public access.
Mitigation.
1. Customer compliance opens applications during Discovery (Phase 1 Activity §4).
2. Run pilot in a jurisdiction where the customer already holds a license (or in soft-launch territory) while waiting on the slower jurisdiction.
3. Treat any license as a long-pole item: time-box its schedule independently of platform readiness.
Owner. customer compliance.
Risk register maintenance
This file is a starting point. The customer's PM should:
- Re-score probability/impact at the end of each phase based on what materialized.
- Add new rows discovered in UAT and pilot — they go in the customer's ongoing risk register.
- Close out rows once mitigations are verified ("verified at" date in the row).