Service Catalog
CMDB-style master inventory of every service, datastore, and external dependency in Evospin / dropbet. One look-up surface for incident response: find the service in §"Application services" or §"Find by symptom", read its card, jump to its runbook + dashboard + owner.
Audience: Tier 1 / Tier 2 on-call, customer-team SRE, and any new-hire who needs the "what's in this thing?" map. Reading order: skim §"Find by symptom" first, then drill into the service card. The cards do not duplicate runbook or dashboard content — they link to it.
Cross-links: `architecture/service-map.md` for the architectural view (with Mermaid C4 diagrams), `external-services.md` for the protocol-level / fallback details on third parties.
Reading the cards
Every service card has the same shape. Fields that don't apply are explicitly marked n/a; fields the customer team owns are marked {{TBD: customer-team}}; fields engineering owns are marked {{TBD: engineering}}.
- Type: docker / NestJS app / Postgres / external SaaS / k8s / EC2
- Repo + source path: where the code lives
- Port + URL: local + perf staging
- Health endpoint: HTTP path or TCP probe
- Owner: customer team role responsible (Tier 1/2/3)
- SLA target: from business/nfr-sla.md
- Dashboard / Runbook / API ref / Logs / Traces / Doppler config
- Dependencies / Dependents
- Critical alerts / Notes
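The card shape above can be pictured as a typed record. The following is a minimal sketch, assuming a hypothetical `ServiceCard` type whose field names simply mirror the list; nothing here is taken from a schema in the repo. It shows how the `{{TBD: ...}}` placeholders could be found mechanically.

```typescript
// Hypothetical record for one catalog card. Field names mirror the list
// above and are illustrative only; "n/a" marks a field that does not apply.
interface ServiceCard {
  serviceName: string;
  type: string;
  repo: string;
  portAndUrl: string;
  healthEndpoint: string;
  owner: string;
  slaTarget: string;
  links: string; // dashboard / runbook / API ref / logs / traces / Doppler config
  dependencies: string;
  dependents: string;
  alertsAndNotes: string;
}

// Matches any placeholder of the form {{TBD}} or {{TBD: ...}}.
const TBD_PATTERN = /\{\{TBD[^}]*\}\}/;

// Returns the names of the fields that still carry a TBD placeholder.
function unresolvedFields(card: ServiceCard): string[] {
  const keys = Object.keys(card) as Array<keyof ServiceCard>;
  return keys.filter((key) => TBD_PATTERN.test(card[key]));
}
```

Running `unresolvedFields` over every card would give a quick count of how many owner and alert-routing fields are still open.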
1. Application services (5 NestJS apps)
ebit-api — main REST API
- Type: NestJS application
- Repo / source: `ebit-api/apps/api/`
- Port: 4000 (host); 4000 (container internal)
- URL (local): http://localhost:4000/swagger
- URL (perf staging): `<SUT_IP>:4000` — see `terraform/perf/.outputs.env` after `terraform apply`
- Health endpoint: `/health` (TCP probe in compose; full check via Swagger reachability)
- Owner: customer-team Tier 2 senior on-call · {{TBD: customer-team}}
- SLA target: p95 sign-in < 200 ms, p95 bet-place < 200 ms, p95 balance < 100 ms — see `business/nfr-sla.md`
- Dashboard: Grafana → `service-overview` + `ebit-perf-test` (`/observability/grafana/provisioning/dashboards/`)
- Runbooks: `runbooks/db-high-load.md` · `runbooks/db-down.md` · `runbooks/bullmq-job-stuck.md` · general triage in `handover/oncall-runbook.md`
- API reference: `api-reference/api.md`
- Logs: Loki `{service_name="ebit-api"}`
- Traces: Jaeger → service `ebit-api`
- Doppler config: `ebit-api/dev_perf` (and per-env analogs)
- Dependencies: Postgres (`ebit-db`) · Redis cache (`ebit-redis`) · Redis bot (`ebit-redis-bot`) · OTel collector
- Dependents: `ebit-fe` (SSR + browser fetches) · `ebit-admin-fe` (SSR) · `ebit-rt` (token validation) · BullMQ producers across the fleet
- Critical alerts: error rate > 5% for 5 min · p95 > 2× SLA · `pg_stat_activity.idle_in_transaction` > 5 — wired in `{{TBD: engineering — alert routing}}`
- Notes: Manual `tracer.startActiveSpan` wraps in `AuthService.login` and `UserService.authenticate` (see `adr/0007-evologger-kept-not-migrated.md`). Highest blast radius: every player flow passes through this service.
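The three critical-alert conditions on the `ebit-api` card can be read as plain threshold checks. The sketch below is an illustration under assumptions: the `ApiSnapshot` type and its field names are hypothetical, the real alert rules are not wired yet (routing is still TBD), and the "for 5 min" sustain window is deliberately not modeled.

```typescript
// Hypothetical point-in-time metrics for ebit-api; names are illustrative.
interface ApiSnapshot {
  errorRate: number; // fraction of failed requests, 0..1
  p95Ms: number; // observed p95 latency for one flow
  slaP95Ms: number; // SLA target for that flow, e.g. 200 for bet-place
  idleInTransaction: number; // Postgres sessions idle in transaction
}

// Returns the alert conditions breached by this snapshot.
// Thresholds come from the card: error rate > 5%, p95 > 2x SLA, idle-in-tx > 5.
function breachedAlerts(snapshot: ApiSnapshot): string[] {
  const breached: string[] = [];
  if (snapshot.errorRate > 0.05) {
    breached.push("error-rate");
  }
  if (snapshot.p95Ms > 2 * snapshot.slaP95Ms) {
    breached.push("p95-latency");
  }
  if (snapshot.idleInTransaction > 5) {
    breached.push("idle-in-transaction");
  }
  return breached;
}
```

For a bet-place snapshot with a 200 ms SLA, a p95 of 350 ms does not trip the latency condition, because the card's threshold is twice the SLA (400 ms).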
ebit-rt — realtime / websocket gateway
- Type: NestJS application (socket.io, websocket-only — polling disabled)
- Repo / source: `ebit-api/apps/rt/`
- Port: 4001 (host) — namespace `/events`
- Health endpoint: TCP probe on 4001
- Owner: customer-team Tier 2 senior on-call · {{TBD: customer-team}}
- SLA target: connection success > 99%, p95 handshake < 200 ms — see `business/nfr-sla.md`
- Dashboard: `service-overview` (rt panel) + browser RUM
- Runbooks: `runbooks/ws-adapter-scale-out.md`
- API reference: `api-reference/rt-events.md`
- Logs / Traces: Loki `{service_name="ebit-rt"}` · Jaeger → `ebit-rt`
- Doppler config: same as `ebit-api`
- Dependencies: Postgres, Redis cache, `ebit-api` (token validation hop)
- Dependents: `ebit-fe` (player live updates)
- Critical alerts: throttler ban count > 0/sec for 1 min · CPU > 80% for 2 min · concurrent connections > measured ceiling
- Notes: Single replica today — `@socket.io/redis-adapter` not installed. Horizontal scale-out blocked; see `runbooks/ws-adapter-scale-out.md` §Cause and AF-3 in `architecture.md`.
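The Logs fields on these cards all use the same Loki stream selector, keyed on `service_name`. As a small sketch, the helper below (a hypothetical `lokiQuery` function, not part of the repo) builds that selector and optionally appends a LogQL `|=` line filter.

```typescript
// Builds a LogQL query for one service: a stream selector on service_name,
// optionally followed by a "line contains" filter (|=).
function lokiQuery(serviceName: string, contains?: string): string {
  // Escape backslashes and double quotes so the value stays a valid LogQL string.
  const escapeValue = (value: string): string =>
    value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');

  const selector = `{service_name="${escapeValue(serviceName)}"}`;
  if (contains === undefined) {
    return selector;
  }
  return `${selector} |= "${escapeValue(contains)}"`;
}
```

For example, `lokiQuery("ebit-rt")` yields the selector shown on the card, and `lokiQuery("ebit-rt", "throttler")` narrows it to lines mentioning the throttler; the search term is only an illustration.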
ebit-bj — blackjack server (orphaned)
- Type: NestJS application
- Repo / source: `ebit-api/apps/bj/`
- Port: 4002
- Owner: engineering — currently no production callers · {{TBD: engineering — disposition decision}}
- Status: Orphaned. Has its own session-token scheme and EVO-Games wallet RPC. The dropbet client exclusively hits `ebit-api`'s `/casino/games/house/blackjack/*` instead — `bj` is never called from the in-repo FE or `api`. AF-4 in `architecture.md`. Documented for completeness; do not page on its alerts.
- Dashboard / Runbooks / API ref: `service-overview` (bj panel — usually flat) · no dedicated runbook (general triage in `handover/oncall-runbook.md`) · no Swagger
- Logs / Traces: Loki `{service_name="ebit-bj"}` · Jaeger → `ebit-bj` (orphan trace roots — `traceparent` doesn't propagate over Redis pub/sub; see `adr/0005-no-traceparent-on-redis-rpc.md`)
- Dependencies: Postgres, Redis cache
- Dependents: none in-repo
ebit-bo — back-office API
- Type: NestJS application (separate Swagger, separate route tree)
- Repo / source: `ebit-api/apps/bo/`
- Port: 4003
- URL (local): http://localhost:4003/swagger
- Health endpoint: TCP probe on 4003
- Owner: customer-team Tier 2 (admin operations) · {{TBD: customer-team}}
- SLA target: best-effort; not on the player-facing critical path
- Dashboard: `service-overview` (bo panel)
- API reference: `api-reference/bo.md`
- Logs / Traces: Loki `{service_name="ebit-bo"}` · Jaeger → `ebit-bo`
- Dependencies: Postgres, Redis cache
- Dependents: internal ops tooling only · note: `ebit-admin-fe` does not call `bo` directly today; it routes through `api` (see `flows/admin-bets.md`).
ebit-speed-roulette — multiplayer roulette state machine
- Type: NestJS application (BullMQ-driven state queue, `concurrency: 1`)
- Repo / source: `ebit-api/apps/speed-roulette/`
- Port: 4004
- Owner: customer-team Tier 2 (multiplayer game-state) · {{TBD: customer-team}}
- SLA target: round duration ≈ 27 s end-to-end (no stall > 90 s); reveal-secret correctness 100%
- Dashboard: `service-overview` + `bullmq` (queue depth panel for `speed-roulette:state`)
- Runbooks: `runbooks/speed-roulette-deadlock.md`
- Logs / Traces: Loki `{service_name="ebit-speed-roulette"}` · Jaeger → `ebit-speed-roulette` (orphan roots same as `ebit-bj`)
- Dependencies: Postgres, Redis cache (BullMQ), EOS blockchain (block source for RNG)
- Dependents: `ebit-api` proxies player calls; `ebit-rt` pushes `SpeedRouletteStateUpdate` to clients
- Notes: Single replica by design — `concurrency: 1` is the correctness gate. EOS provider lag tips into round-stall.
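The speed-roulette SLA (rounds of roughly 27 s, no stall beyond 90 s) can be expressed as a simple classification of elapsed round time. This is a sketch under assumptions: the function name, the `RoundHealth` labels, and the intermediate "slow" band are hypothetical and are not defined by the service or its runbook.

```typescript
// Values taken from the card's SLA target.
const EXPECTED_ROUND_MS = 27_000; // typical end-to-end round duration
const STALL_THRESHOLD_MS = 90_000; // anything longer counts as a stall

// "slow" is an illustrative middle band, not something the SLA names.
type RoundHealth = "ok" | "slow" | "stalled";

// Classifies a round by how long it has been running.
function roundHealth(roundStartedMs: number, nowMs: number): RoundHealth {
  const elapsedMs = nowMs - roundStartedMs;
  if (elapsedMs > STALL_THRESHOLD_MS) {
    return "stalled";
  }
  if (elapsedMs > EXPECTED_ROUND_MS) {
    return "slow";
  }
  return "ok";
}
```

A round that is "slow" but not yet "stalled" may simply be waiting on a lagging EOS provider, which the card's notes name as the usual precursor to a round stall.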
2. Frontend services (2 Next.js apps)
ebit-fe — dropbet (player site)
- Type: Next.js 14 App Router (pnpm)
- Repo / source: `ebit-fe/`
- Port: 3000
- URL (local): http://localhost:3000
- Owner: customer-team Tier 1 first-line · {{TBD: customer-team}}
- SLA target: p95 page TTFB < 400 ms, p95 LCP < 2.5 s — see `business/nfr-sla.md`
- Dashboard: `browser-rum` (`/observability/grafana/provisioning/dashboards/browser-rum.json`)
- Runbooks: page through `handover/oncall-runbook.md` §3 — no FE-specific runbook today · {{TBD: engineering — author `runbooks/fe-build-failure.md`}}
- Logs / Traces: browser RUM via `@vercel/otel`; SSR logs via Loki `{service_name="ebit-fe"}`
- Dependencies: `ebit-api` (REST + WS via `ebit-rt`)
- Notes: i18n via `next-intl`; SVG handling via `@svgr/webpack`. Connects to `ebit-rt` via `socket.io-client` (websocket transport only).
ebit-admin-fe — internal admin panel
- Type: Next.js 14 + NextUI + Ant Design charts (pnpm)
- Repo / source: `ebit-admin-fe/`
- Port: 3001
- Owner: customer-team Tier 2 (admin ops) · {{TBD: customer-team}}
- Status: Sign-in flow has 4 known integration bugs — cookie-name mismatch, missing OTel, no `propagateContextUrls`, hardcoded API host. Use Swagger directly for admin operations until fixed. See `flows/admin-sign-in.md` and `onboarding/day-one.md` §9.
- Dashboard: `service-overview` (admin-fe panel)
- Logs / Traces: Loki `{service_name="ebit-admin-fe"}`; tracing currently broken — see Status above
- Dependencies: `ebit-api` (admin endpoints)
- Operator reference: 22-screen guide at `admin/README.md` — every screen cites its admin-fe route + the matching `apps/api/src/**/admin*.controller.ts` file:line
3. Data services (3)
ebit-db — Postgres 13
- Type: Postgres 13-bullseye in compose; managed instance in production · {{TBD: customer-team — production instance type}}
- Port: 5555 (host) → 5432 (container)
- Owner: customer-team Tier 2 (DB) · {{TBD: customer-team}}
- Dashboard: `prisma-postgres`
- Runbooks: `runbooks/db-high-load.md` · `runbooks/db-down.md` · `runbooks/login-fails-bcrypt.md`
- Logs: docker logs only (not in Loki by default)
- Dependencies: none (root datastore)
- Dependents: every NestJS app
- Notes: split Prisma schema (`api`, `blackjack`, `speed_roulette`) — see `adr/0006-split-prisma-schema.md`. No replication in compose; production replication shape is `{{TBD: engineering}}`.
ebit-redis — cache Redis (:6379)
- Type: `redis/redis-stack:latest` (stdlib + RedisJSON + RediSearch)
- Port: 6379 (host) — password `cache`
- Owner: customer-team Tier 2 (DB) · {{TBD: customer-team}}
- Dashboard: `redis`
- Runbooks: `runbooks/redis-memory-pressure.md` · `runbooks/bullmq-job-stuck.md`
- Dependents: every NestJS app · all production BullMQ queues except bot queues
- Notes: No `maxmemory` configured in `docker-compose.yml`. Production sizing is `{{TBD: engineering — see redis-memory-pressure.md §Prevention}}`.
ebit-redis-bot — bot Redis (:6380)
- Type: same as cache; separate instance to isolate bot-driven load
- Port: 6380 (host) — password `bot`
- Owner: customer-team Tier 2 · {{TBD: customer-team}}
- Dashboard: `redis` (filter by instance)
- Runbooks: same patterns as cache Redis
- Dependents: bot-related BullMQ queues only (`bots-bet`, `bots-session-scheduler`, `bots-start-session`, `challenges`)
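The split between the two Redis instances comes down to one rule: the four bot-related queues live on `ebit-redis-bot`, everything else on `ebit-redis`. A minimal sketch of that rule follows; the `RedisTarget` type and `redisForQueue` function are hypothetical, and the authoritative per-queue map remains the one in `runbooks/bullmq-job-stuck.md` §1.

```typescript
// Hypothetical description of where a queue's state lives.
interface RedisTarget {
  service: string;
  port: number;
}

// Bot-related queues as listed on the ebit-redis-bot card.
const BOT_QUEUES: string[] = [
  "bots-bet",
  "bots-session-scheduler",
  "bots-start-session",
  "challenges",
];

// Resolves which Redis instance to inspect for a given BullMQ queue name.
function redisForQueue(queueName: string): RedisTarget {
  if (BOT_QUEUES.indexOf(queueName) !== -1) {
    return { service: "ebit-redis-bot", port: 6380 };
  }
  return { service: "ebit-redis", port: 6379 };
}
```

When a job looks stuck, resolving the instance first avoids inspecting the wrong Redis, for example looking for `bots-bet` keys on the cache instance.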
4. Async / messaging (2)
BullMQ — Redis-backed job queues
- Type: in-process queue library (no daemon); state lives in `ebit-redis` and `ebit-redis-bot`
- Repo / source: `ebit-api/apps/*/queue/` and `ebit-api/apps/*/bull/`
- Owner: customer-team Tier 2 · {{TBD: customer-team}}
- Dashboard: `bullmq` — depth per queue
- Runbooks: `runbooks/bullmq-job-stuck.md`
- Notes: 13 queues total. All production async work rides BullMQ — see `adr/0003-bullmq-not-rabbitmq.md`. Per-queue → Redis-instance map in `runbooks/bullmq-job-stuck.md` §1.
ebit-rabbitmq — stubbed broker (vhost ft)
- Type: RabbitMQ in compose; zero traffic — wired only to `apps/api/src/fast-track/rabbitmq/fast-track.rmq.module.ts` which returns a stubbed no-op (`disabled = true`)
- Port: 5672 (AMQP), 15672 (UI — user `rabbitmq` / pass `rabbitmq`, vhost `ft`)
- Owner: engineering — disposition pending · {{TBD: engineering}}
- Status: Container runs and passes healthchecks but is unused. Don't page on it.
- Notes: When debugging a stalled queued job, look in Redis (`KEYS bull:*`), not RabbitMQ — see `/CLAUDE.md` "Async queues — BullMQ, not RabbitMQ".
5. Observability (5)
otel-collector
- Type: OpenTelemetry Collector (OTLP gateway)
- Port: 4317 (gRPC), 4318 (HTTP), 13133 (health)
- Owner: customer-team Tier 2 (observability) · {{TBD: customer-team}}
- Runbooks: `runbooks/trace-missing.md` · `runbooks/loki-no-logs.md`
- Config: `/observability/otel-collector.yml`
- Dependents: every NestJS app + browser RUM exporter
- Notes: spanmetrics connector derives RED metrics from spans — see `adr/0002-spanmetrics-over-prisma-metrics.md`.
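For readers new to RED (rate, errors, duration), the sketch below shows what those three numbers mean when computed from a batch of spans. It is an illustration only: the `SpanSample` type is hypothetical, the p95 uses a simple nearest-rank method, and the actual spanmetrics connector emits counters and histograms that Prometheus aggregates rather than computing values this way.

```typescript
// Hypothetical, simplified view of one finished span.
interface SpanSample {
  durationMs: number;
  isError: boolean;
}

interface RedMetrics {
  ratePerSec: number; // R: requests per second over the window
  errorRatio: number; // E: fraction of spans marked as errors
  p95Ms: number; // D: 95th percentile duration (nearest-rank)
}

// Derives RED values from the spans observed in a window of windowSec seconds.
function redFromSpans(spans: SpanSample[], windowSec: number): RedMetrics {
  if (spans.length === 0) {
    return { ratePerSec: 0, errorRatio: 0, p95Ms: 0 };
  }
  const durations = spans.map((span) => span.durationMs).sort((a, b) => a - b);
  const errorCount = spans.filter((span) => span.isError).length;
  // Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * durations.length);
  return {
    ratePerSec: spans.length / windowSec,
    errorRatio: errorCount / spans.length,
    p95Ms: durations[rank - 1],
  };
}
```

The point of deriving these from spans is that any service already emitting traces gets request rate, error ratio, and latency without separate metric instrumentation.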
jaeger (v2 + Badger)
- Type: Jaeger v2 with Badger storage backend
- Port: 16686 (UI), 4317 (OTLP)
- Owner: customer-team Tier 2 · {{TBD: customer-team}}
- Config: `/observability/jaeger-v2-config.yaml`
- Notes: storage decisions documented in `audits/jaeger-storage-research.md`. Tail-sampling is `{{TBD: engineering — ADR pending}}`.
prometheus
- Type: Prometheus TSDB
- Port: 9090
- Config: `/observability/prometheus.yml`
- Owner: customer-team Tier 2 · {{TBD: customer-team}}
- Notes: scrapes `otel-collector` for spanmetrics-derived RED.
loki
- Type: Grafana Loki (log aggregator)
- Port: 3100
- Config: `/observability/loki.yml`
- Owner: customer-team Tier 2 · {{TBD: customer-team}}
- Runbooks: `runbooks/loki-no-logs.md`
grafana
- Type: Grafana (dashboards + Explore)
- Port: 3003 (`admin/grafana` for local)
- Dashboards: `/observability/grafana/provisioning/dashboards/` — 8 provisioned (`service-overview`, `perf-test`, `perf-system`, `logs-trace-pivot`, `prisma-postgres`, `bullmq`, `redis`, `browser-rum`)
- Owner: customer-team Tier 2 · {{TBD: customer-team}}
6. Performance-test infrastructure (3, transient)
These exist only when a perf run is active. Provisioned via `/terraform/perf/`; destroyed via `terraform destroy` after capture.
| Service | Instance | Role |
|---|---|---|
| SUT (System Under Test) | EC2 `c7g.4xlarge` | Runs the full Evospin stack from ECR (api / rt / bj / bo / speed-roulette + both FEs + Postgres + 2× Redis + RabbitMQ) |
| Monitoring | EC2 `c7g.xlarge` | OTel Collector + Prometheus + Grafana + Loki + Jaeger + node-exporter |
| Loadgen | EC2 `c7g.4xlarge` | k6 v0.56 + Node 22 + pnpm 9.11 + Playwright (100 concurrent Chromium) + node_exporter |
- Owner: customer-team Tier 2 (perf) · {{TBD: customer-team}}
- Dashboards: monitoring host serves the same Grafana provisioning as local
- Runbooks: `perf-run-checklist.md` · `performance-testing.md` (methodology) · `performance-test-report.md` (last run)
- Doppler config: `dev_perf` (separate from local)
7. Third-party / external dependencies
These are services we consume but don't operate. The card is shorter — what we use it for, fallback if unavailable, contract owner.
| Service | What we use it for | Fallback if unavailable | Owner |
|---|---|---|---|
| Doppler (workspace `ebit`) | Secrets distribution for runtime + CI | `.env` files (local only); production has no fallback — outage = no-deploy | customer-team {{TBD}} |
| Google reCAPTCHA v3 | Sign-up + sign-in + forgot-password gating | `runbooks/captcha-break-glass.md` — currently no backup provider configured (engineering follow-up) | customer-team {{TBD}} |
| Sumsub | KYC verification | Manual review queue | customer-team {{TBD}} |
| CCPAYMENT | Crypto payments processor | Alternative provider via {{TBD: engineering — payments-abstraction layer}} | customer-team {{TBD}} |
| NowPayments | Crypto payments processor (secondary) | CCPAYMENT primary | customer-team {{TBD}} |
| Softswiss | Slots provider | Other slot providers in catalog (`apps/api/src/casino/slots/providers/`) | customer-team {{TBD}} |
| PM8 | Slots provider | Same | customer-team {{TBD}} |
| MaxMind GeoIP | Country gating + restricted-country list | Block all on lookup failure (fail-secure) | customer-team {{TBD}} |
| CoinGecko | Exchange rates for `ExchangeRatesService.toUsd()` | Cached rates degrade quietly; alert on stale rate | customer-team {{TBD}} |
| SendGrid | Transactional email (verification, password reset, marketing) | Local mode bypasses entirely (`isLocal`); production has no fallback today — {{TBD: engineering — second-provider abstraction}} | customer-team {{TBD}} |
| Sentry | Error tracking + perf monitoring | Errors still log to Loki/stdout — Sentry is observability, not critical-path | customer-team {{TBD}} |
| EOS blockchain (public nodes) | RNG block-source for speed-roulette `WAITING_BLOCK` state | Round stalls in `WAITING_BLOCK` — see `runbooks/speed-roulette-deadlock.md` | engineering {{TBD}} |
| EVO wallet RPC (Skindeck) | Skin-deposit settlement | Deposits queue and retry; player sees pending state | customer-team {{TBD}} |
For protocol-level details (auth headers, rate limits, observed failure modes) see `external-services.md`.
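The MaxMind row describes a fail-secure fallback: if the country lookup fails, the request is blocked rather than allowed. The sketch below illustrates that policy only. The `CountryLookup` signature and `isCountryAllowed` function are hypothetical and do not reflect the real MaxMind client or the gating code in `ebit-api`.

```typescript
// Hypothetical lookup: returns an ISO country code, or null when the IP is
// unknown. It may also throw, for example when the GeoIP database is unavailable.
type CountryLookup = (ip: string) => string | null;

// Fail-secure gate: any lookup failure is treated the same as a restricted country.
function isCountryAllowed(
  ip: string,
  lookup: CountryLookup,
  restrictedCountries: string[],
): boolean {
  let country: string | null;
  try {
    country = lookup(ip);
  } catch (_err) {
    return false; // lookup threw: block
  }
  if (country === null) {
    return false; // no answer: block
  }
  return restrictedCountries.indexOf(country) === -1;
}
```

The trade-off is deliberate: a GeoIP outage blocks legitimate players, but it never lets traffic from a restricted country through unchecked.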
8. Dependency graph
Internal topology (clients · frontends · apps · datastores · observability) is covered in `architecture/service-map.md`, split into player path + admin/ops path. Read that first.
The diagram below covers what `service-map.md` deliberately omits — third-party integrations outbound from `ebit-api`, since each is an external dependency with its own runbook concerns.
flowchart LR
ebit_api["ebit-api :4000"]
subgraph fairness["Game fairness"]
eos(("EOS blockchain<br/>JSON-RPC"))
end
subgraph auth_msg["Auth + messaging"]
recaptcha(("reCAPTCHA<br/>verify token"))
sendgrid(("SendGrid<br/>SMTP"))
end
subgraph kyc_pay["KYC + payment"]
sumsub(("Sumsub<br/>KYC"))
ccp(("CCPAYMENT<br/>crypto deposits"))
end
subgraph obs["Observability (non-local)"]
sentry(("Sentry<br/>errors"))
end
ebit_api -- "JSON-RPC" --> eos
ebit_api -- "verify" --> recaptcha
ebit_api -- "send" --> sendgrid
ebit_api -- "KYC API" --> sumsub
ebit_api -- "deposit" --> ccp
ebit_api -- "errors" --> sentry
Speed-roulette is the only Nest app besides `ebit-api` reaching out to a third party (EOS). All others terminate inside the Nest monorepo or hit shared datastores — see `architecture/service-map.md` for the internal topology.
Edge convention: solid = active in production, dashed = stubbed or no callers. The ebit-bj and RabbitMQ edges are dashed for that reason.
| Node count | Edge count |
|---|---|
| 21 | 27 |
9. Find by symptom
The fastest path during an incident — symptom → likely service(s) → start here.
| Symptom | Likely service(s) | Open first |
|---|---|---|
| Bet placement slow | `ebit-api` + `ebit-db` | `runbooks/db-high-load.md` |
| Bet placement returns 5xx | `ebit-api` (Prisma transaction) + BullMQ | `runbooks/bullmq-job-stuck.md`, then `runbooks/db-down.md` |
| Live game state stuck (>90 s) | `ebit-speed-roulette` (or `ebit-bj`, but bj has no callers) | `runbooks/speed-roulette-deadlock.md` |
| Real-time updates not pushed | `ebit-rt` + `ebit-redis` (cache) | `runbooks/ws-adapter-scale-out.md` — single-replica today |
| Sign-in failing for many users | `ebit-api` + Google reCAPTCHA + Sumsub | `runbooks/captcha-break-glass.md`, then `runbooks/login-fails-bcrypt.md` |
| Trace gap mid-request | OTel Collector + Redis pub/sub transport | `runbooks/trace-missing.md` — known gap on bj/speed-roulette per `adr/0005-no-traceparent-on-redis-rpc.md` |
| Logs missing for a service | OTel Collector + Loki | `runbooks/loki-no-logs.md` |
| Wallet balance wrong (one user) | `ebit-api` + `ebit-db` | `flows/dropbet-wallet.md` — check SF-013 (no overdraft guard on `toVault`) |
| Wallet balance wrong (many users) | `ebit-api` + BullMQ `bet_settled_queue` | `handover/oncall-runbook.md` — P0; promote and page |
| Admin can't ban a user | `ebit-api` + `ebit-bo` | `flows/admin-user-mgmt.md` |
| Redis OOM / eviction spike | `ebit-redis` (cache) + BullMQ retention | `runbooks/redis-memory-pressure.md` |
| WS storm / bans climbing | `ebit-rt` (single-replica ceiling) | `runbooks/ws-adapter-scale-out.md` |
| 2FA / MFA reset for an admin | `ebit-api` (auth) + Postgres | `runbooks/2fa-unknown-secret.md` |
| Email not delivered | SendGrid + `ebit-api` | `external-services.md` §SendGrid (no production fallback today) |
| Captcha fails on real traffic | Google reCAPTCHA upstream | `runbooks/captcha-break-glass.md` |
| KYC verification stuck | Sumsub + `ebit-api` | `external-services.md` §Sumsub |
| Crypto deposit not credited | CCPAYMENT or NowPayments + BullMQ `SKINDECK_DEPOSIT` | `runbooks/bullmq-job-stuck.md` |
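The symptom table is effectively a lookup from free text to a first document to open. As a sketch of how that lookup could back a chat-ops command or CLI helper, the snippet below encodes a few rows and searches them by keyword. The `SymptomEntry` type and `findBySymptom` function are hypothetical, and only four of the rows are included for brevity.

```typescript
// Hypothetical encoding of one table row.
interface SymptomEntry {
  symptom: string;
  services: string[];
  openFirst: string;
}

// A subset of the rows from the table above.
const SYMPTOMS: SymptomEntry[] = [
  { symptom: "Bet placement slow", services: ["ebit-api", "ebit-db"], openFirst: "runbooks/db-high-load.md" },
  { symptom: "Real-time updates not pushed", services: ["ebit-rt", "ebit-redis"], openFirst: "runbooks/ws-adapter-scale-out.md" },
  { symptom: "Logs missing for a service", services: ["OTel Collector", "Loki"], openFirst: "runbooks/loki-no-logs.md" },
  { symptom: "Redis OOM / eviction spike", services: ["ebit-redis", "BullMQ"], openFirst: "runbooks/redis-memory-pressure.md" },
];

// Case-insensitive substring match against the symptom text.
function findBySymptom(query: string): SymptomEntry[] {
  const needle = query.toLowerCase();
  return SYMPTOMS.filter((entry) => entry.symptom.toLowerCase().indexOf(needle) !== -1);
}
```

A query such as `findBySymptom("redis")` returns the Redis OOM row and its runbook; a plain substring match is the simplest option and would likely need synonyms to be useful in practice.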
10. Coverage gaps (where this catalog is thinnest)
The three services with the least operational documentation today — engineering-team follow-ups:
- `ebit-fe` — no FE-specific runbook for build / SSR / hydration failures. Browser RUM dashboard exists; runbook is {{TBD: engineering — author `runbooks/fe-build-failure.md` and `runbooks/fe-hydration-mismatch.md`}}.
- `ebit-bj` — orphaned, but disposition decision (delete? rewire? keep as backup?) hasn't been made. {{TBD: engineering — file ADR for bj disposition}}.
- External payment providers (CCPAYMENT, NowPayments) — no break-glass runbook for "payment processor down". The pattern from `runbooks/captcha-break-glass.md` applies. {{TBD: engineering — author `runbooks/payments-provider-down.md`}}.
These three are tracked in the engineering follow-up backlog (task #35 in the doc-portal task list).
See also
- `architecture/service-map.md` — full architectural service map with C4 diagrams + propagation gaps
- `external-services.md` — protocol-level detail on every third-party (auth headers, rate limits, observed failure modes)
- `handover/oncall-runbook.md` — first-response procedure that uses this catalog
- `handover/escalation-matrix.md` — severity × time → who's notified
- `README.md` — portal entrypoint; this catalog is linked there as a top-level operator quick-reference