Non-Functional Requirements & SLA Targets

Audience: Customer's CTO / SRE lead / commercial team.

Read with ../performance-testing.md and ../performance-test-report.md open.

Status of numbers: Latency targets are defined in the perf methodology and measured in the perf test report. Availability, RTO, RPO, and throughput-at-scale figures are proposed starting SLAs; values marked {{TBD}} stay open until the next stepped-ramp run finishes (issues #66 / #67).


1. Latency targets (per endpoint class)

The table below lists the platform's latency SLOs from docs/performance-testing.md §4. These are the numbers we propose as contractual SLA targets.

| Endpoint class | Method | p95 target | Notes |
| --- | --- | --- | --- |
| POST /auth/sign-in | POST | 150 ms | bcrypt is irreducible at ~60–80 ms per hash; budget allows for DB lookup + session creation |
| POST /casino/games/house/dice/bet (and other house-game bets) | POST | 100 ms | Baseline at 1 VU = 108 ms, so the endpoint does not currently meet the SLO at any load. Already noted in ../performance-testing.md. Optimisation work is required before we can offer this number contractually. {{TBD: target revision after #66 run}} |
| GET /bets (paginated bet history) | GET | 50 ms | Read-mostly; cache-friendly |
| GET /accounting/balances | GET | 50 ms | Cached balance lookup |
| WebSocket handshake (rt /events) | WS | 200 ms | Time from TCP connect → AuthSuccess event |

p50 / p99: not formally targeted in the methodology today; we recommend treating p99 as 2 × p95 for SLA carve-outs. {{TBD: explicit p50 / p99 numbers per endpoint after stepped-ramp run completes}}

System-level error budget: < 0.1 % error rate per endpoint across the test window.
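The p99 carve-out and the error budget above are simple arithmetic, sketched here for reference. The names `p95TargetsMs`, `derivedP99Ms`, and `withinErrorBudget` are hypothetical and do not come from the codebase; the numbers are copied from this section.

```typescript
// p95 targets from the latency table in this section (milliseconds).
const p95TargetsMs: Record<string, number> = {
  "POST /auth/sign-in": 150,
  "POST /casino/games/house/dice/bet": 100,
  "GET /bets": 50,
  "GET /accounting/balances": 50,
  "WebSocket handshake (rt /events)": 200,
};

// Recommended carve-out: treat p99 as 2 x p95 until explicit p99 targets exist.
function derivedP99Ms(p95Ms: number): number {
  return 2 * p95Ms;
}

// Error budget: strictly less than 0.1 % failed requests per endpoint.
const ERROR_BUDGET = 0.001;

function withinErrorBudget(failed: number, total: number): boolean {
  if (total <= 0) return true; // no traffic in the window, nothing to breach
  return failed / total < ERROR_BUDGET;
}

// Example: the sign-in endpoint's derived p99 carve-out is 300 ms.
const signInP99Ms = derivedP99Ms(p95TargetsMs["POST /auth/sign-in"] ?? 0);
```

Under this rule, 9 failures in 10,000 requests stays within budget, while 20 in 10,000 breaches it.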


2. Availability target

| Tier | Proposed SLA | Basis |
| --- | --- | --- |
| Initial commercial SLA | 99.5 % monthly uptime (~3.6 h downtime / month) | Conservative starting position pending production telemetry from a customer install |
| Aspirational target | 99.9 % monthly uptime (~43 min / month) | Achievable once HA Postgres + Redis + load-balanced API replicas are deployed; not guaranteed by single-node compose |
| Game-server availability | 99.5 % for apps/speed-roulette | Hard dependency on EOS RPC (../architecture.md §1); an EOS outage stalls the round queue. Budget reflects external-dependency risk |

The 99.5 % figure is a proposal, not a measured value. Final SLA depends on the customer's deployment topology (single-region vs multi-region, HA database, hot-standby Redis). See roadmap.md Phase 5.
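The downtime figures quoted for each tier follow directly from the uptime percentage. A minimal sketch, assuming a 30-day month (the contract would need to define the measurement window explicitly); `downtimeBudgetMinutes` is a hypothetical helper name:

```typescript
// Allowed downtime, in minutes, for a monthly uptime percentage.
// Assumption: a 30-day month unless the caller says otherwise.
function downtimeBudgetMinutes(uptimePercent: number, daysInMonth = 30): number {
  if (uptimePercent < 0 || uptimePercent > 100) {
    throw new RangeError("uptimePercent must be between 0 and 100");
  }
  const minutesInMonth = daysInMonth * 24 * 60;
  const budget = ((100 - uptimePercent) / 100) * minutesInMonth;
  return Math.round(budget * 10) / 10; // rounded to one decimal place
}

const initialTierMin = downtimeBudgetMinutes(99.5); // 216 min, i.e. 3.6 h
const aspirationalTierMin = downtimeBudgetMinutes(99.9); // about 43.2 min
```

Moving from 99.5 % to 99.9 % cuts the monthly allowance by a factor of five, which is why the higher tier presumes the HA topology described above.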

Cost note: the 99.9 % aspirational tier presumes AWS Business Support (greater of $100/mo or 3–10 % of monthly AWS spend), multi-AZ Postgres + ElastiCache, paid Sentry, and PagerDuty. Costed in infrastructure-cost.md §2 (AWS) and §6 (hidden costs).


3. Throughput

The system was designed and is being load-tested for the following stepped ramp (../performance-testing.md §5):

| Stage | Target VUs | Duration |
| --- | --- | --- |
| 1 (warmup) | 50 | 2 min |
| 2 | 1,000 | 5 min |
| 3 | 2,500 | 5 min |
| 4 | 5,000 | 5 min |
| 5 | 7,500 | 5 min |
| 6 (peak) | 10,000 | 5 min |

Peak target: 10,000 concurrent virtual users with stepped ramp.

  • Sustained QPS at peak: {{TBD}}; fill from performance-test-report.md §"Per-Endpoint SLO Scorecard" once #66 runs.
  • First SLO breach point: {{TBD}} VUs on {{TBD endpoint}} (../performance-test-report.md).
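For planning purposes, the ramp can be written down as data. This is a sketch only: the `RampStage` shape and the `stageAt` helper are hypothetical, not the load tool's actual configuration format, and it assumes each stage holds its target for the full duration.

```typescript
interface RampStage {
  name: string;
  targetVus: number;
  durationMin: number;
}

// Stages copied from the stepped-ramp table in this section.
const steppedRamp: RampStage[] = [
  { name: "warmup", targetVus: 50, durationMin: 2 },
  { name: "stage 2", targetVus: 1000, durationMin: 5 },
  { name: "stage 3", targetVus: 2500, durationMin: 5 },
  { name: "stage 4", targetVus: 5000, durationMin: 5 },
  { name: "stage 5", targetVus: 7500, durationMin: 5 },
  { name: "peak", targetVus: 10000, durationMin: 5 },
];

// Total wall-clock length of one run: 2 + 5 * 5 = 27 minutes.
const totalMinutes = steppedRamp.reduce((sum, s) => sum + s.durationMin, 0);
const peakVus = Math.max(...steppedRamp.map((s) => s.targetVus));

// Which stage is active at a given elapsed minute (undefined once the run ends).
function stageAt(elapsedMin: number): RampStage | undefined {
  if (elapsedMin < 0) return undefined;
  let stageEnd = 0;
  for (const stage of steppedRamp) {
    stageEnd += stage.durationMin;
    if (elapsedMin < stageEnd) return stage;
  }
  return undefined;
}
```

A helper like `stageAt` makes it straightforward to map a latency spike's timestamp back to the VU level that was active, which is what the "first SLO breach point" figure needs.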


4. Concurrency (websocket)

| Limit | Value | Source |
| --- | --- | --- |
| Max websocket connections per source IP | 10 (configurable via MAX_CONNECTIONS_PER_IP) | libs/ws-throttler/src/const.ts:3-4; default in .example.env:111, .local.env:144, .test.env:121 |
| WebSocket transport | websocket-only (polling explicitly disabled) | apps/rt/src/main.ts boots without HTTP polling fallback (../architecture.md §2) |
| Per-instance socket count | bounded by RAM (in-process Map; no Redis socket-state adapter shipped) | ../performance-testing.md §6: known observability gap; sticky sessions required if running multiple rt replicas |

Practical implication: a single rt instance holds all socket state in process memory, so a customer expecting more than 10k concurrent player sessions should plan for horizontal scale (multiple rt replicas with sticky sessions). See ../architecture.md §2 port contract for the rt service shape.
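To illustrate why in-process socket state forces sticky sessions, here is a minimal sketch of a per-IP connection cap backed by a `Map`. It is not the code in libs/ws-throttler; the class and method names are hypothetical, and only the default of 10 comes from this document.

```typescript
// Hypothetical per-IP connection limiter; state lives in this process only.
class PerIpConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerIp: number;

  // Default mirrors the documented MAX_CONNECTIONS_PER_IP value of 10.
  constructor(maxPerIp = 10) {
    this.maxPerIp = maxPerIp;
  }

  // Call on connect; returns false when the IP is already at its cap.
  tryAcquire(ip: string): boolean {
    const current = this.counts.get(ip) ?? 0;
    if (current >= this.maxPerIp) return false;
    this.counts.set(ip, current + 1);
    return true;
  }

  // Call on disconnect; drops the entry once the last socket closes.
  release(ip: string): void {
    const current = this.counts.get(ip) ?? 0;
    if (current <= 1) this.counts.delete(ip);
    else this.counts.set(ip, current - 1);
  }

  activeFor(ip: string): number {
    return this.counts.get(ip) ?? 0;
  }
}
```

Because the counter is local to one process, two rt replicas would each allow the full cap for the same IP unless traffic is pinned to one replica or the state is moved to a shared store.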


5. Recovery objectives

| Objective | Proposed target | Basis |
| --- | --- | --- |
| RTO (Recovery Time Objective) | 1 hour | Single Postgres restore from latest snapshot + service redeploy ≤ 60 min on a sized instance. Multi-region failover not in scope of starter SLA. {{TBD: customer-specific based on their backup tooling}} |
| RPO (Recovery Point Objective) | 5 minutes | Postgres WAL-archived to object storage every 5 min in proposed deployment topology. {{TBD: actual cadence is operator-deployment-specific; Evospin ships compose configs only}} |
| Disaster scenarios covered | Single-AZ failure, single-VM loss, accidental DROP, ransomware | Multi-region, regional outage = roadmap |

Today's docker-compose.local.yml ships no automated backup. RTO/RPO numbers assume the customer (or ebit-team in a managed-deployment contract) wires Postgres WAL archival + scheduled snapshots in their infra.


6. Security posture

Authentication & authorisation

| Control | How | Where |
| --- | --- | --- |
| Password hashing | bcrypt (cost factor configurable in env) | apps/api/src/auth/auth.service.ts |
| JWT access + refresh | HS256-signed; refresh rotation; httpOnly secure cookies | apps/api/src/auth/cookies.ts, strategies/jwt-refresh-strategy.ts |
| 2FA (TOTP) | otplib; OtpGuard mandatory on SuperAdmin routes | apps/api/src/auth/guards/ |
| Role-based access | RolesGuard + PermissionGuard('<key>') | libs/auth/, apps/api/src/auth/guards/ |
| Session store | Redis (cache instance): short-lived JWT denylist + presence | apps/api/src/auth/session/session.queue-producer.ts |
| Rate limiting | Per-route sliding-window (Lua on cache Redis); per-IP WS cap | apps/api/src/captcha/, libs/ws-throttler/ |
| Captcha | GeeTest v4 + reCAPTCHA fallback | apps/api/src/captcha/ |
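For readers unfamiliar with sliding-window rate limiting, the sketch below shows the window logic in memory. It is an illustration only: the shipped limiter runs as Lua on the cache Redis, and this document does not say which sliding-window variant it uses. The timestamp-log variant shown here, and the `SlidingWindowLimiter` name, are assumptions.

```typescript
// Hypothetical in-memory sliding-window (timestamp log) limiter.
// Not the Redis/Lua implementation; it only demonstrates the window check.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // key would typically combine route and client identity; nowMs is injected for testability.
  allow(key: string, nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    // Keep only the hits that still fall inside the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit: the rejected request is not recorded
    }
    recent.push(nowMs);
    this.hits.set(key, recent);
    return true;
  }
}
```

Unlike a fixed window, the limit here applies to any trailing interval of `windowMs`, so a burst straddling a window boundary cannot double the allowed rate.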

Encryption

| Layer | Status | Notes |
| --- | --- | --- |
| In transit (player ↔ Evospin) | HTTPS / WSS required at the load balancer | Compose ships HTTP for local dev; production deployment is responsible for TLS termination (typically at ALB / Nginx). Customer responsibility |
| In transit (Evospin ↔ Postgres / Redis / external APIs) | TLS supported by all clients (Prisma, ioredis); operator-config | {{TBD per deployment}} |
| At rest (Postgres) | Operator-config (cloud-disk encryption / pgcrypto column-level) | Not enforced in the application layer |
| At rest (Redis) | Operator-config (redis password set; AOF persistence configurable) | Compose sets cache / bot passwords |
| Secrets | Doppler for dev (run_local.sh); production uses customer's secret manager (AWS Secrets Manager, Vault) | .env is gitignored; templates are .example.env / .local.env / .test.env |

Audit & traceability

  • Every state-changing endpoint is wrapped in OTel server spans → Jaeger (see ../e2e-trace-demo.md: a single bet emits 69 spans).
  • Structured pino logs to Loki with trace IDs for trace↔log pivot.
  • Admin actions surface in apps/api/src/user/admin/notes/ (admin-notes table) for human-readable audit trail.
  • Sentry captures runtime errors with source maps for all three apps.

Known security findings

Tracked in ../security-register.md and ../security-findings/. Three findings filed at last review (sf-001, sf-002, sf-003); see register for current status.


7. Compliance

Operator-licensable, not ebit-licensable. Evospin ships the technical surfaces; the operator owns the gambling licence in their jurisdiction.

| Surface | What Evospin provides | What the operator owns |
| --- | --- | --- |
| KYC | Sumsub integration end-to-end (apps/api/src/kyc/sumsub/): applicant lifecycle, doc upload, webhook-driven status state machine | Sumsub commercial relationship + jurisdiction-specific tier mapping |
| AML | Audit logs, transaction history, ledger immutability | Suspicious-transaction reporting workflow, regulator filings |
| Responsible gambling | users-limits/ self-exclusion + deposit/loss-limit framework (⚠ TBD coverage) | Jurisdiction-specific limits (UK GamStop, etc.) |
| Provably fair | apps/api/src/provably-fair/ server-seed rotation; EOS-anchored RNG for speed-roulette | Public verification page (operator-hosted; TBD) |
| Geo restriction | Country allow / deny list (apps/api/src/country/) | Jurisdiction-by-jurisdiction restrictions list |
| Data residency | Single-region Postgres by default | EU / US / regional deployment topology (operator picks) |
| GDPR / data subject rights | User profile read endpoint, account-delete pathway (⚠ TBD verify completeness) | Privacy notice, data-controller registration |

Jurisdiction support: Evospin is jurisdiction-agnostic at the code level. The operator chooses where to deploy and licence; Evospin's KYC / geo / responsible-gambling surfaces support the integration pattern but do not pre-configure for any one regulator.


8. Operational visibility (observability)

Already shipping:

| Signal | Tool | Where |
| --- | --- | --- |
| Distributed tracing | Jaeger v2 (port 16686) | observability/, ../observability.md |
| Metrics (incl. spanmetrics-derived RED metrics) | Prometheus (port 9090) | same |
| Logs | Loki (port 3100) bridged from pino | same |
| Dashboards | Grafana (port 3003) with provisioned datasources + dashboards | same |
| Errors | Sentry (per-repo sentry.*.config.ts) | each repo |
| OTel ingress | otel-collector (gRPC :4317 / HTTP :4318) | observability/otel-collector.yml |

This stack is the source of truth for SLA compliance reporting. See ../e2e-trace-demo.md for proof-of-life.


9. Maintenance windows

| Activity | Cadence | Player-visible impact |
| --- | --- | --- |
| Dependency / security patches | Monthly | Rolling restart per service; zero downtime if the API has ≥2 replicas, 1–2 min impact on single-replica deploys |
| Prisma migrations | Per release | Most are backward-compatible; destructive migrations require a maintenance window |
| Postgres major-version upgrade | Annual | 30–60 min maintenance window |
| Redis upgrade | Annual | Brief failover window if HA, else 5–10 min |

{{TBD: customer-specific maintenance contract — none ship in code today}}


10. Capacity sizing (starter)

Proposed starter sizing for a single-region deployment supporting 1,000 concurrent players at peak:

| Tier | Spec | Rationale |
| --- | --- | --- |
| API replicas (api) | 2 × 4 vCPU / 8 GB | Stateless; horizontal scale |
| RT replica (rt) | 1 × 2 vCPU / 4 GB | Sticky sessions; scale up on socket count |
| Postgres | 1 × 8 vCPU / 32 GB / 500 GB SSD with HA standby | Single source of truth; HA gives failover |
| Redis cache | 1 × 2 vCPU / 8 GB with HA | BullMQ + sessions + rate limit + pub/sub |
| Redis bot | 1 × 1 vCPU / 2 GB | Bot-fleet isolation only; can be tiny |
| RabbitMQ | 1 × 1 vCPU / 2 GB | Stub today; reserved for FastTrack roadmap |
| Game servers (bj, bo, speed-roulette) | 1 × 2 vCPU / 4 GB each | Light; speed-roulette concurrency=1 per round |
| Observability stack | 1 × 4 vCPU / 16 GB | Jaeger + Prom + Loki + Grafana co-hosted; separate from app |

Scaling to 10k concurrent: roughly linear in api and rt replica count; Postgres / Redis sized vertically. {{TBD: validate against #66 stepped-ramp run.}}
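As a first-pass estimate only, the "roughly linear" assumption can be written down as follows. The linear model is unvalidated until the #66 stepped-ramp run completes, and `estimateReplicas` is a hypothetical helper; only the starter counts (2 api, 1 rt at 1,000 concurrent players) come from the table above.

```typescript
// Starter sizing from this section: 1,000 concurrent players at peak.
const STARTER_PLAYERS = 1000;
const starterReplicas = { api: 2, rt: 1 };

// Assumption: api and rt replica counts scale linearly with concurrent players.
// Postgres and Redis are sized vertically and are deliberately not modelled here.
function estimateReplicas(targetPlayers: number): { api: number; rt: number } {
  const factor = Math.max(1, targetPlayers / STARTER_PLAYERS);
  return {
    api: Math.ceil(starterReplicas.api * factor),
    rt: Math.ceil(starterReplicas.rt * factor),
  };
}

const at10k = estimateReplicas(10000); // { api: 20, rt: 10 } under the linear assumption
```

Treat the output as an upper-level planning figure; the measured breach point from the load test should replace the linear factor once it is available.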