Engineering roadmap — future work
Forward-looking engineering backlog consolidated at handover. Sourced from runbooks, ADRs, the security register, performance-test results, the docs portal audit, and project memory. Every item has a file:line reference or memory citation; severity is tagged so the customer team can sequence the work.
Distinct from:
- business/roadmap.md — customer-adoption phases (commercial milestones).
- delivery/phased-rollout.md — week-by-week rollout timeline.
- engineering/dependencies.md — third-party SDK / vendor dependency map (sibling agent's deliverable).

Open security follow-ups live in security/internal/findings.md. This doc references but does not duplicate the SR-NNN register.
Section 1 — Executive summary
| Priority | Items | What's in this bucket |
|---|---|---|
| Critical | 4 | Production blockers — sign-in p95 collapse under load, Redis OOM-by-default, rt single-instance ceiling, withdraw-lock unit bug. |
| High | 9 | Functional gaps that will fail under production traffic, admin-fe stack, and known security follow-ups already triaged. |
| Medium | 12 | Performance v2, observability gaps, ADR-required design decisions, doc-portal {{TBD}} burndown. |
| Low | 9 | Nice-to-have refactors, abstraction polish, post-launch hardening. |
| Total | 34 | Across bugs, perf, docs, architecture, features, tech debt. |
Top 3 critical items (by Section 2 ordering):
- bcrypt-cost p95 collapse — sign-in p95 jumps 15 ms → 1.09 s between 200 and 1000 VU; lower the cost factor (effort: S).
- Redis maxmemory unset — local stack and any deployed environment that didn't override REDIS_ARGS will grow until container OOM (effort: S).
- socket.io single-instance ceiling — per-instance clientSockets Map blocks rt horizontal scaling; install @socket.io/redis-adapter (effort: M).
All three are required before the perf-test envelope is re-run at 10k VU.
Section 2 — Known bugs (with locations)
Surfaced during build, perf-test, security review, or doc audit; not yet fixed. Severity reflects production-impact, not CVSS — combine with the CVSS in security/internal/findings.md when prioritizing security work.
How to read this table: severity is "what happens if this hits us in production" not "how bad is the vulnerability in isolation". A Critical row should block GA; a High row should be addressed in the first quarter post-handover; Medium rows fit Q3 and beyond.
| # | Severity | Title | Location | Discovered by | Recommended fix |
|---|---|---|---|---|---|
| 1 | Critical | bcrypt cost too high under load — sign-in p95 = 1.09 s @ 1000 VU (70× degradation vs. 200 VU baseline) | apps/api/src/.../user.service.ts (auth) | Perf test 2026-04-25, see performance-test-report-results.md §"Stage 2" | Lower bcrypt cost from 10 → 8 (still ≥10⁵ guesses to break); 4× CPU reduction; expected p95 drop to ~150 ms at 1000 VU |
| 2 | Critical | Redis maxmemory unset → unbounded growth → host OOM under load | docker-compose.yml REDIS_ARGS; runbook trace at runbooks/redis-memory-pressure.md:23, :138, :162 | Runbook authoring, Apr 2026 | Set --maxmemory <N>gb --maxmemory-policy allkeys-lru in compose; author production-sizing ADR |
| 3 | Critical | socket.io has no Redis adapter → rt cannot horizontally scale; per-instance clientSockets Map drops cross-instance emits | apps/rt/src/.../client.gateway.ts; runbook runbooks/ws-adapter-scale-out.md; register row SR-030 | Runbook authoring + perf test | Install @socket.io/redis-adapter, wire shared Redis pub/sub, migrate per-IP counters to Redis |
| 4 | Critical | lockWithdrawOnClaimHours unit confusion — field named "Hours" but math is * 60 * 1000 (= minutes) → 24-hour lock locks for 24 minutes | ebit-api/apps/api/src/promo/utils/code.utils.ts:50; see features/bonuses-and-promos.md:203-209 | Flow-doc audit | Confirm intended semantic with product, then either rename column or fix multiplier to * 60 * 60 * 1000 |
| 5 | High | Captcha is single-provider (Google reCAPTCHA), URL hard-coded | ebit-api/apps/api/src/captcha/google/recaptcha.service.ts:51; runbook runbooks/captcha-break-glass.md:75 | Runbook authoring | Introduce CaptchaProvider interface + CAPTCHA_PROVIDER env; add hCaptcha or Cloudflare Turnstile as backup. New ADR required. |
| 6 | High | No DISABLE_CAPTCHA env break-glass; only NODE_ENV=local short-circuits | apps/api/src/captcha/...; runbook runbooks/captcha-break-glass.md:90 | Runbook authoring | 5-line code change: if (process.env.DISABLE_CAPTCHA === 'true') return; |
| 7 | High | Doppler has stale GEETEST_* keys but the live code path is reCAPTCHA — env drift | Doppler dev_perf config; see doppler-perf-audit.md | Doppler audit, Apr 2026 | Remove stale keys from Doppler; add CI check that env vars referenced in code exist and unreferenced keys are pruned |
| 8 | High | Prisma connection_limit defaults to 10 → pool exhaustion at 1000+ VU | Prisma client config; see performance-test-report-results.md:71 | Perf test, Apr 2026 | Set connection_limit=50 in DATABASE_URL; document in production-sizing ADR |
| 9 | High | ebit-admin-fe — 4 stacked auth/observability bugs blocking cross-service admin tracing (cookie-name mismatch, disable_set_cookies_and_mask_tokens=false, silent middleware fall-through, missing @vercel/otel) | ebit-admin-fe/src/middleware.ts:68-90; ebit-admin-fe/src/utils/cookies.ts:52-72; ebit-admin-fe/src/instrumentation.ts; register row SR-025 | E2E test #12 + flow-doc work; memory: project_admin_fe_auth_bugs.md | Three-pronged fix in admin-fe repo: (1) align cookie names + flip the body-token feature flag; (2) uncomment leaveFromAccount in middleware catch; (3) mirror ebit-fe instrumentation with @vercel/otel + Sentry-DSN gate |
| 10 | High | ebit-bj app is orphaned — port 4002 ships but no in-repo FE points at it; dropbet UX flows entirely through ebit-api's /casino/games/house/blackjack/* | ebit-api/apps/bj/; memory: project_ebit_bj_orphan.md; ADR-0011 §"Monorepo escape hatch" | Task #33 scoping | Disposition decision required: (a) delete apps/bj/ outright, (b) rewire dropbet through it via proxy, or (c) document as intentionally-dormant with ADR. Image-builds today; cost not free. |
| 11 | High | Bet-place idle p95 ≈ 108 ms — already exceeds 100 ms SLO at 1 VU; root cause is the synchronous Prisma transaction wrapping BullMQ enqueue + Redis pub/sub RPC | See performance-test-report-results.md:89 | Pre-existing baseline | Profile with tests-perf/deep-metrics/flame-cpu.sh; consider moving the pub/sub RPC outside the transaction or replacing with direct call |
| 12 | High | Cross-service trace propagation gap — @ExternalControllerClient Redis pub/sub transport doesn't carry W3C traceparent → speed-roulette and any future cross-app calls show as orphan roots | ebit-api/libs/gateway/src/ms-controller/; ADR-0005; memory: project_otel_microservice_transport_gap.md; register row SR-026 (accepted) | OTel coverage audit | Custom Nest microservice interceptor that serializes the active OTel context into the message envelope; re-extract on callee. Currently accepted as observability gap. |
| 13 | High | Speed-roulette per-job timeout absent — concurrency: 1 queue can deadlock if a job exhausts retries without re-adding follow-up | ebit-api/apps/speed-roulette/.../roulette-state.processor.ts:23, :147-160; runbook runbooks/speed-roulette-deadlock.md:186; register row SR-024 | Runbook authoring | Add Job.opts.timeout (e.g. lockDuration: 30_000) + watchdog cron that re-enqueues bootstrap when the queue is stale. ADR-required. |
| 14 | Medium | Reset-token JWT secret reused for email-verification — leak compromises both flows | apps/api/src/.../user.service.ts:858, 893; register row SR-011 | Security audit | Separate secrets per token-type; add DB-side one-shot token table with consumed_at (also closes SR-012) |
| 15 | Medium | BullMQ broadcast queue depth has no Grafana alert — operators have no early warning when rt is back-pressuring | Grafana provisioning under observability/grafana/...; runbook runbooks/ws-adapter-scale-out.md:180 | Runbook authoring | File alert: bull:*-broadcast:wait > 100 for > 60s |
Cross-reference: every "High" + "Critical" entry above maps to either an SR-NNN row in security/internal/findings.md or a {{TBD}} slot in a runbook — see Section 4 for the docs-portal correspondence.
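To make the unit confusion in Bug #4 concrete, here is a minimal TypeScript sketch. The function names are hypothetical and do not mirror code.utils.ts; only the two multipliers come from the bug description, and the "fixed" variant assumes product confirms that the field is meant to hold hours.

```typescript
// Hypothetical helper names; only the multipliers come from the Bug #4 description.
const MS_PER_MINUTE = 60 * 1000;
const MS_PER_HOUR = 60 * MS_PER_MINUTE;

// Behaviour as described in Bug #4: the "hours" value is scaled as if it were minutes.
function lockMsAsShipped(lockWithdrawOnClaimHours: number): number {
  return lockWithdrawOnClaimHours * 60 * 1000;
}

// Proposed behaviour, assuming the field really is meant to be hours.
function lockMsFixed(lockWithdrawOnClaimHours: number): number {
  return lockWithdrawOnClaimHours * MS_PER_HOUR;
}

const shippedMs = lockMsAsShipped(24); // 1_440_000 ms, i.e. 24 minutes
const fixedMs = lockMsFixed(24);       // 86_400_000 ms, i.e. 24 hours
```

If product instead decides that minutes were intended, the multiplier stays and the column is renamed, which is the other branch of the recommended fix.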
Section 3 — Performance bottlenecks (post-test backlog)
Direct extract of the recommendations table in performance-test-report-results.md §"Recommended remediations". Order = expected impact at 1000 VU.
Context: the most recent perf run (2026-04-25) measured a sign-in p95 of 15 ms at 200 VU and 1.09 s at 1000 VU on a c7g.4xlarge SUT (16 vCPU, 32 GB). Bet-place was not exercised under load because sign-in saturated first; bet-place idle p95 at 1 VU is already 108 ms (over the 100 ms SLO). Both observations point to the items in this section.
| # | Action | Effort | Expected effect | Cross-ref |
|---|---|---|---|---|
| 1 | Lower bcrypt cost from default 10 → 8 (still ≥10⁵ guesses to break) | S | 4× CPU reduction; p95 likely drops to ~150 ms at 1000 VU | Bug #1 in §2 |
| 2 | Increase Prisma connection_limit=50 (currently default 10) | S | Removes pool wait contention | Bug #8 in §2 |
| 3 | Add ebit-api horizontal scaling (single-instance today) — autoscaling group + ALB + sticky sessions | M | Linear capacity growth | Section 5 — new ADR required |
| 4 | Wire @socket.io/redis-adapter for rt horizontal scaling — currently per-instance clientSockets Map | M | Multi-replica rt works without dropping emits | Bug #3 in §2; SR-030 |
| 5 | Profile with tests-perf/deep-metrics/flame-cpu.sh to confirm bcrypt is THE bottleneck (not just the suspected one) | S | Confirms or refutes hypothesis #1 — if wrong, redirects effort | perf-trace-coverage-audit.md |
| 6 | Bet-place 100 ms SLO recovery — break the synchronous transaction-wraps-RPC anti-pattern | L | Restores headroom for bet-place (currently >SLO at 1 VU) | Bug #11 in §2 |
| 7 | Re-run stepped-ramp 1k → 10k VU after #1–#5 land; capture clean numbers; soak 1h sustained | M | Validates the full perf-test envelope; was deferred this run | performance-test-report-results.md:149 |
| 8 | Tail-sampling rate review — current 10% probabilistic; revisit if dashboards show exemplar-density gaps | S | Maintains forensic coverage without overflowing 50 GB Badger budget | ADR-0012 §"Revisit triggers" |
The perf-test rig (Terraform terraform/perf/) is destroyed on teardown; each re-applied pass costs ~3.5 hr × $1.65/hr ≈ $5.78.
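The "4× CPU reduction" quoted for item #1 follows from how the bcrypt cost factor works: the algorithm runs 2^cost key-setup rounds, so each step down in cost halves the hashing work. A small sketch of that arithmetic, with illustrative helper names that are not part of the codebase:

```typescript
// bcrypt performs 2^cost rounds of key setup, so hashing work scales as a power of two.
function bcryptRounds(cost: number): number {
  return 2 ** cost;
}

// How many times cheaper a single hash becomes when moving from `fromCost` to `toCost`.
function relativeSpeedup(fromCost: number, toCost: number): number {
  return bcryptRounds(fromCost) / bcryptRounds(toCost);
}

const speedup = relativeSpeedup(10, 8); // 1024 / 256 = 4
```

The same factor applies to an offline attacker's per-guess cost, so the trade-off may be worth recording alongside the change. The ~150 ms p95 target is the report's estimate and is not derived from this ratio.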
Section 4 — Documentation gaps (38 TBDs from PORTAL-AUDIT)
PORTAL-AUDIT.md v3 §3 categorized 38 engineering-fillable {{TBD}} slots. They are surfaced by the CI scanner and are correctly marked — burn-down is normal-cadence work, not a customer-share blocker.
Distribution per the audit: 8 each in handover/ (customer) and versions/ (engineering); 7 each in security/ and engineering/; 6 in runbooks/; 5 in recipes/; 4 in delivery/ (customer); 3 in api/; 2 in incidents/ (customer). The customer-fillable share is real but not engineering's domain to close.
4a. Engineering-fillable (~38 slots) — ours to close
Highest-leverage to close first (mirrors the bugs in §2):
- Production CDN/WAF runbook (runbooks/ws-adapter-scale-out.md:159 + sibling references) — depends on customer-team CDN choice. Engineering authors the procedure; customer fills the vendor.
- Redis sizing tuning (runbooks/redis-memory-pressure.md:23, 138) — once production stack is finalized.
- Captcha provider abstraction ADR (runbooks/captcha-break-glass.md:75).
- socket.io Redis adapter ADR (runbooks/ws-adapter-scale-out.md:137).
- Postgres PITR procedure (runbooks/db-down.md:127).
- Production Postgres replication setup (runbooks/db-down.md:187).
- Production Postgres connection-limit sizing (runbooks/db-high-load.md:155).
- Speed-roulette job-timeout policy ADR (runbooks/speed-roulette-deadlock.md:186).
- BullMQ broadcast-queue Grafana alert (runbooks/ws-adapter-scale-out.md:180).
- FE-side reCAPTCHA token caching policy (runbooks/captcha-break-glass.md:114).
The remaining ~28 are scattered across engineering/api.md, security back-references, and recipes/-pending-features. Each is a one-paragraph fill once the corresponding decision lands.
4b. Customer-team-fillable (~14 slots) — not engineering's domain
Per PORTAL-AUDIT.md:182: contact info, PagerDuty schedule names, Slack channel names, video bridge URLs, contractual SLA wording, vendor-NDA pre-cleared phrasing. Expected empty at handover.
4c. Naturally future-state — fills with time
Post-launch incident records (none yet); SLO actuals after first week of prod traffic; recipes for features not yet shipped (mobile companion, additional locales).
4d. Structural debt
Mermaid corpus is clean as of MERMAID-AUDIT.md: 81/81 blocks parse. No structural debt there. Open delivery-track anchor links (21 broken anchors per PORTAL-AUDIT.md:127-138) are a delivery-doc fix, not engineering's.
Section 5 — Architectural intent items
ADRs that explicitly flagged "may revisit if X" triggers, plus new ADRs that need to be authored to close {{TBD}} slots.
5a. Existing ADRs with revisit triggers
| ADR | Trigger | Likelihood | If triggered |
|---|---|---|---|
| ADR-0011 §"Revisit triggers" | rt becomes the bottleneck and wants a different runtime; backend team grows past ~5; second backend stack (Python/Go) needs to coexist | Medium — rt scaling is the §3 item #4 question | Split rt into its own repo; libs/ stay clean (no Nest-specific assumptions in shared types) |
| ADR-0003 §"Future Fast Track decision" | Product ships Fast Track | Low-medium — depends on commercial commitment | Set disabled = false at fast-track.rmq.module.ts:8; populate Doppler FASTTRACK_JWT_*; verify 11 call sites; supersede this ADR |
| ADR-0009 §"Considered alternatives" | Badger sweats under high-cardinality writes | Low — perf-test held within budget | Migrate to Tempo single-binary on local backend (documented fallback); accept TraceQL learning curve |
| ADR-0012 §"Revisit triggers" | Probabilistic 10% causes exemplar-density gaps; forensic budget grows; compliance forces 100% retention | Low | Move to higher rate; revisit when EBS budget grows |
5b. New ADRs required (not yet authored)
| Topic | Driver | Sponsoring source |
|---|---|---|
| Redis memory cap policy (maxmemory + eviction strategy) | Bug #2; runbook gap | runbooks/redis-memory-pressure.md:162 |
| socket.io Redis adapter + per-IP counter migration | Bug #3; SR-030 | runbooks/ws-adapter-scale-out.md:137 |
| Captcha provider abstraction (interface + at least one backup) | Bug #5; runbook gap | runbooks/captcha-break-glass.md:75 |
| Speed-roulette job-timeout policy | Bug #13; SR-024 | runbooks/speed-roulette-deadlock.md:186 |
| Production Postgres sizing — pool, replication, PITR | Bug #8 + runbook gaps | runbooks/db-*.md |
| ebit-api horizontal-scaling topology — ASG + ALB + session affinity | Perf #3 | performance-test-report-results.md:80 |
| ebit-bj orphan disposition | Bug #10; ADR-0011 escape hatch | Memory: project_ebit_bj_orphan.md |
| OTel context propagation across @ExternalControllerClient | Bug #12; SR-026 | ADR-0005, memory: project_otel_microservice_transport_gap.md |
Each new ADR should follow the format under adr/README.md (Context / Decision / Considered alternatives / Consequences / Revisit triggers / References). Sponsoring source means: which runbook or report is currently embedding the {{TBD}} slot the ADR closes.
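As input to the captcha provider abstraction ADR, the shape proposed in Bugs #5 and #6 could look roughly like the sketch below. CAPTCHA_PROVIDER and DISABLE_CAPTCHA come from the recommended fixes; everything else (the interface members, the registry, the "recaptcha" default, the stub verifiers) is an assumption for illustration and does not describe existing ebit-api code.

```typescript
// All names here are illustrative assumptions, not existing code in ebit-api.
interface CaptchaProvider {
  readonly name: string;
  verify(token: string): Promise<boolean>;
}

// The environment is passed in explicitly so the selection logic is easy to test.
type Env = Record<string, string | undefined>;

// Stand-in used when the break-glass flag is set: accepts every token.
const disabledProvider: CaptchaProvider = {
  name: "disabled",
  verify: async () => true,
};

// Picks a provider from the environment. Real providers would call their vendor API.
function selectCaptchaProvider(
  env: Env,
  registry: Record<string, CaptchaProvider>,
): CaptchaProvider {
  if (env.DISABLE_CAPTCHA === "true") return disabledProvider;
  const key = env.CAPTCHA_PROVIDER ?? "recaptcha"; // assumed default
  const provider = registry[key];
  if (!provider) throw new Error(`Unknown CAPTCHA_PROVIDER: ${key}`);
  return provider;
}

// Stub registry: the verifiers only check that a token is present.
const registry: Record<string, CaptchaProvider> = {
  recaptcha: { name: "recaptcha", verify: async (token) => token.length > 0 },
  turnstile: { name: "turnstile", verify: async (token) => token.length > 0 },
};

const active = selectCaptchaProvider({ CAPTCHA_PROVIDER: "turnstile" }, registry);
```

Failing loudly on an unknown provider key, rather than silently falling back, is one way to avoid the kind of env drift described in Bug #7; whether that is the right behaviour is a decision for the ADR.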
Section 6 — Feature roadmap (TBD with product team)
Product / commercial wishlist items surfaced in code, recipes, or memories — not yet specified. Each carries {{TBD with product}} until the customer team scopes.
| Feature | Status | Effort | Prereqs | Customer-team input needed |
|---|---|---|---|---|
| Mobile-app companion (web-only today) | Proposed | L | API contract finalization; auth-flow review (cookie-domain semantics differ on native) | {{TBD with product}} — target platforms (iOS/Android), distribution model |
| Multi-currency improvements | Proposed | M | Resolve SR-036 (request-time FX vs row-stamped), SR-037 (TETH/ETH ambiguity) | {{TBD with product}} — supported currency list, FX-source vendor |
| Provably-fair RNG audit-ability | Proposed | M | Re-enable JwtGuard per SR-001; close SR-008 fairness-seed race | {{TBD with product}} — public-audit interface scope |
| Affiliate v2 with tier improvements | Proposed | M | {{TBD with product}} | Tier definition, commission curves |
| VIP program enhancements | Proposed | M | {{TBD with product}} | VIP-level criteria, perks |
| Live-chat integration depth (currently Intercom embed only) | Proposed | S–M | {{TBD with product}} | Vendor choice; depth (chat only vs co-browse) |
| Sportsbook integration completeness (PM8 partial) | Scoped (partial) | L | Resolve SR-033 (sportsbook bets hidden by hard-coded filter at bet.repository.ts:280) | {{TBD with product}} — odds provider, settlement flow |
| Additional locales (currently en + de) | Proposed | S per locale | next-intl is wired; copy + translation review | {{TBD with product}} — target market list |
| Withdrawal flow depth (block per lockWithdrawOnClaimHours) | In-flight | S | Bug #4 fix lands | None — engineering-driven once unit bug closed |
{{TBD with product}} markers on this section: 8.
6a. Feature scoping notes
- Mobile-app companion. Cookie-based auth from ebit-api does not transfer cleanly to native shells (SameSite=Lax semantics differ across iOS WebView and React Native fetch). A dedicated /auth/mobile/* endpoint set with bearer-token issuance is the typical pattern; coordinate with the auth team before scoping. Out-of-band: app-store review timeline (~2 weeks Apple, 1 week Google) is a hard dependency on launch date, not on engineering capacity.
- Provably-fair RNG audit-ability. Today seeds are exposed via /bets/house-games/info/<betId> (currently unguarded — SR-001). The fix lands authentication; the feature is a public endpoint or downloadable proof bundle that lets a third-party reproduce the RNG. Two designs: (a) per-bet downloadable JSON; (b) Merkle-tree commitment posted on-chain. (b) is materially more work and depends on {{TBD with product}} re: chain choice.
- Sportsbook integration completeness. PM8 partial scope already in code; SR-033 documents that bet.repository.ts:280 filters out sportsbook bets — closing that filter is a 1-line change, but the surrounding settlement flow + odds pipeline are not yet wired.
Section 7 — Quarterly milestones
Placeholder buckets. Dates {{TBD assign during handover roadmap meeting}}. Sequencing reflects the dependency chain — Q1 unblocks Q2; Q3 builds on stable Q2 foundation.
Q1 — Critical bug fixes + production sizing
Date: {{TBD}}
- Bug #1 — bcrypt cost reduction (S)
- Bug #2 — Redis maxmemory policy (S) + new ADR
- Bug #3 — @socket.io/redis-adapter installed + ADR (M)
- Bug #4 — lockWithdrawOnClaimHours unit fix (S)
- Bug #8 — Prisma connection_limit raised (S)
- Captcha provider abstraction (Bug #5) + DISABLE_CAPTCHA env (Bug #6) (M)
- ADR backlog from §5b items 1–4
Q2 — Performance v2 + horizontal scaling
Date: {{TBD}}
- Re-run stepped-ramp 1k → 10k VU on c7g.4xlarge (perf #7)
- Add ebit-api horizontal scaling — ASG + ALB (perf #3) + ADR
- Bet-place 100 ms SLO recovery (Bug #11) — break the sync-RPC-in-transaction anti-pattern
- Critical/High security findings: SR-001, SR-002, SR-003, SR-004, SR-009, SR-010, SR-011, SR-012, SR-018, SR-025 (10 items per security/internal/findings.md due Q2)
- admin-fe four-bug stack closed (Bug #9)
Q3 — New features + observability close-out
Date: {{TBD}}
- Section 6 features per product priority
- Cross-service trace propagation (Bug #12 / ADR-0005) — Nest microservice interceptor
- Speed-roulette job-timeout policy (Bug #13) + ADR
- Remaining medium security findings due Q3 (SR-007, SR-008, SR-014–SR-024 per register)
Q4 — Tech debt sweep
Date: {{TBD}}
- Section 8 items
- Monorepo split if rt scaling demands it (ADR-0011 escape hatch)
- Soak/forensic re-run; SLO actuals captured for {{TBD}} post-launch slots from §4c
- Low-severity register burndown (SR-031, SR-037, SR-039, SR-043, SR-045, SR-048, SR-049)
Section 8 — Tech debt ledger
Specific patterns flagged for refactor. Distinct from §2 bugs in that nothing is broken — these are friction points that compound over time.
| # | Item | Location | Friction note | Recommendation |
|---|---|---|---|---|
| 1 | Payment-provider abstraction is convention-only — no PaymentProviderInterface, every provider hand-wired | recipes/add-payment-provider.md:6, 130; apps/api/src/payment/... central wiring | Medium | Introduce strategy-pattern interface; auto-discover via @Inject token |
| 2 | KYC abstraction is vendor-specific (Sumsub-namespaced), not strategy-pattern — plan to rewrite, not swap | recipes/swap-kyc-provider.md:6, 71; apps/api/src/kyc/sumsub/... | Very high | Strategy pattern under KycProviderInterface + @Inject('KYC_PROVIDER') token; gated by KYC_PROVIDER_NAME env |
| 3 | OTel transport gap on @ExternalControllerClient — orphan trace roots across Redis pub/sub RPC | libs/gateway/src/ms-controller/; ADR-0005; memory: project_otel_microservice_transport_gap.md | High | Custom Nest microservice interceptor that serializes the active OTel context |
| 4 | ebit-bj app orphan — port 4002 image-builds but receives zero traffic | apps/bj/; memory: project_ebit_bj_orphan.md | High | Disposition decision (delete / rewire / document) — see §5b |
| 5 | RabbitMQ stub — broker boots but receives zero traffic (Fast Track stub at disabled = true) | apps/api/src/fast-track/rabbitmq/fast-track.rmq.module.ts:8; ADR-0003 | Low | If Fast Track ships, follow ADR-0003 §"Future Fast Track decision"; if dead, follow §"If product decides Fast Track is permanently dead" — remove broker, 11 call sites, env, ADR |
| 6 | Per-instance presence map (clientSockets Map) breaks at >1 rt replica | apps/rt/src/.../client.gateway.ts; SR-030 | High | Redis-backed presence + socket.io adapter (see Bug #3) |
| 7 | Duplicate-email race at sign-up returns 500 instead of 400 — bots can fingerprint live users | apps/api/src/.../auth.service.ts:67-86; SR-020 | Medium | Catch P2002, return 400 EMAIL_TAKEN; pre-check inside transaction |
| 8 | O(n_sockets) balance push iterates clientSockets.forEach per BalanceUpdated | apps/rt/src/.../client.gateway.ts:306-315; SR-017 | Medium | Per-user socket.io rooms: this.server.to('user:'+id).emit(...) |
| 9 | Bet status index missing — power-user list degraded under traffic | Prisma Bet schema; SR-032 | Low | Add covering index on status |
| 10 | LeaderboardQueueProducer has zero call sites | apps/api/src/leaderboard/...; SR-039 | Low | Delete or wire — confirm intent first |
| 11 | RACE_ENABLED per-handler inline guard, easy to forget | apps/api/src/leaderboard/...; SR-038 | Low | Centralize behind a feature-flag service |
| 12 | In-process Map cache 60s — api vs bo serve up to 60 s stale | SR-041 (accepted) | Low | Document staleness budget; revisit if observed in user-facing report |
| 13 | usdAmount request-time FX vs row-stamped | SR-036 | Low | Stamp FX rate on row at insert; reconcile any historical drift |
| 14 | EvoLogger / winston coexistence with nestjs-pino | ADR-0001 §"Considered options" #3; memory: project_evologger_trace_correlation.md | Low | Possible future state: drop EvoLogger, use pino everywhere; not worth the effort today per ADR-0001 |
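For ledger item #3 (and Bug #12), the interceptor idea amounts to writing a W3C traceparent value into the message envelope on the caller and parsing it back on the callee. The sketch below shows only that serialization step in plain TypeScript. The Envelope shape and both function names are assumptions, since the real @ExternalControllerClient envelope is not shown in this doc. A production version would more likely delegate to the OpenTelemetry propagation API than hand-roll the format.

```typescript
// Illustrative types; the real message envelope is not documented here.
interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

interface Envelope<T> {
  payload: T;
  headers: Record<string, string | undefined>;
}

// W3C Trace Context, version 00: version-traceid-parentid-flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Caller side: serialize the active span context into the envelope.
function injectTraceparent<T>(payload: T, ctx: SpanContext): Envelope<T> {
  const flags = ctx.sampled ? "01" : "00";
  return {
    payload,
    headers: { traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}` },
  };
}

// Callee side: recover the parent context, or undefined if absent or malformed.
function extractTraceparent(envelope: Envelope<unknown>): SpanContext | undefined {
  const match = TRACEPARENT.exec(envelope.headers.traceparent ?? "");
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // The spec treats all-zero trace and parent IDs as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Returning undefined for a missing or malformed header lets the callee start a fresh root span, which matches today's behaviour and keeps the change backward compatible with callers that have not been updated.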
8a. Friction-map summary (from recipes/integration-cookbook.md)
The integration cookbook's friction map already classifies abstraction-debt items:
- Recipe 1 (add-payment-provider) — medium friction — convention-only abstraction (#1 above).
- Recipe 6 (swap-kyc-provider) — very high friction — vendor-specific, plan to rewrite (#2 above).
Other recipes pass the friction filter cleanly (S/M effort, low surprise).
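Ledger item #7 (duplicate-email race) is small enough to sketch. Prisma reports unique-constraint violations with error code P2002; the sketch checks for that code structurally so it stays self-contained. SignUpResult, isUniqueViolation and signUp are hypothetical names, and the createUser callback stands in for whatever auth.service.ts actually does.

```typescript
// Hypothetical result type for illustration.
type SignUpResult =
  | { status: 201 }
  | { status: 400; code: "EMAIL_TAKEN" };

// Structural check: Prisma marks unique-constraint violations with code "P2002".
function isUniqueViolation(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "P2002"
  );
}

// Wraps the insert so that a lost race surfaces as 400 EMAIL_TAKEN instead of a 500.
async function signUp(createUser: () => Promise<void>): Promise<SignUpResult> {
  try {
    await createUser();
    return { status: 201 };
  } catch (err) {
    if (isUniqueViolation(err)) return { status: 400, code: "EMAIL_TAKEN" };
    throw err; // anything else remains a genuine server error
  }
}
```

Catching the constraint error covers the race window that a pre-check alone leaves open, which is presumably why the recommendation lists both.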
Section 9 — How to update this doc
9a. Quarterly review
- Owner: engineering lead.
- Cadence: end of each quarter, before the next quarter's planning meeting.
- Inputs: every new runbook authored that quarter, every ADR written/amended, every entry in security/internal/findings.md, the most recent perf-test report.
- Output: each Section's tables get a delta paragraph; closed items move to the History annex at the end of this doc (no entries yet).
9b. New finding → backport
When a runbook, ADR, or security audit surfaces a new follow-up:
- Add a row to the relevant Section (§2 bugs, §3 perf, §4 docs, §5 ADRs, §8 debt).
- Use a stable ID per row so cross-doc references survive. Convention: RFW-NNN (roadmap-future-work, sequential), parallel to SR-NNN.
- Cite source as path:line (file:line) or as an explicit memory reference (memory: <name>.md).
- If the finding maps to an SR-NNN row, link both ways.
9c. Aggregator script
To be authored: tools/docs/refresh-roadmap.sh. The script:
- Greps {{TBD}} markers in docs/runbooks/ + docs/adr/ + docs/engineering/.
- Diffs against the previous run's snapshot (committed under tools/docs/.roadmap-tbd-snapshot).
- Flags new TBDs as candidates for §4 burndown.
- Runs in CI alongside the existing link-check / mdlint / terminology / TBD-detector workflow (tools/docs/).
Until that script exists, the manual incantation is:
```shell
grep -rEn '\{\{TBD' docs/runbooks/ docs/adr/ docs/engineering/ \
  | grep -v 'roadmap-future-work.md' \
  | sort -u
```
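The snapshot-diff step of the planned aggregator could work along these lines. This is a sketch of the comparison logic only, written in TypeScript for consistency with the codebase. The function names and the sample lines are assumptions, and the real script is planned as shell. Inputs are the path:line:text lines that grep -rEn emits.

```typescript
// Hypothetical helpers; file reading and CI wiring are left out.

// Builds a comparison key that ignores the line number, so that unrelated edits
// above a marker do not make an existing TBD look new.
function tbdKey(grepLine: string): string {
  const [path, , ...rest] = grepLine.split(":");
  return `${path}:${rest.join(":").trim()}`;
}

// Returns the lines in `current` whose key was not present in `previous`.
function newTbds(previous: string[], current: string[]): string[] {
  const seen = new Set(previous.map(tbdKey));
  return current.filter((line) => !seen.has(tbdKey(line)));
}

// Sample data for illustration only.
const previousSnapshot = ["docs/runbooks/db-down.md:127:PITR procedure {{TBD}}"];
const currentScan = [
  "docs/runbooks/db-down.md:130:PITR procedure {{TBD}}", // same marker, shifted by 3 lines
  "docs/adr/example.md:12:Decision {{TBD}}",             // genuinely new
];
const candidates = newTbds(previousSnapshot, currentScan);
// candidates contains only the docs/adr/example.md entry
```

One limitation of keying on path plus text: two identical TBD lines in the same file collapse into a single key, so a second copy added later would not be flagged. Keying on an occurrence count as well would close that gap if it matters in practice.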
Cross-references
- business/roadmap.md — customer-adoption phases (commercial milestones).
- delivery/phased-rollout.md — week-by-week rollout timeline.
- engineering/dependencies.md — third-party SDK / vendor dependency map (sibling agent's deliverable; create if not yet authored).
- security/internal/findings.md — full SR-NNN security register.
- performance-test-report-results.md — perf-test bottlenecks identified.
- PORTAL-AUDIT.md §3 — engineering-fillable {{TBD}} categorization.
- MERMAID-AUDIT.md — diagram corpus health (clean as of 2026-04-25).
- runbooks/ — every operational runbook with embedded {{TBD}} markers.
- adr/ — architecture decision records with explicit revisit triggers.
History annex
Append-only log of items closed since handover. Format: {{TBD: date}} — RFW-NNN — short description — closing PR / commit.
- {{TBD}} — no entries yet; first quarterly review will populate this section.
When an item is closed, move its row from the live Section to this annex with the closure date and link to the PR or commit. Do not delete the row — preserving history is what makes the doc usable across a year of operations.