Non-Functional Requirements & SLA Targets
Audience: Customer's CTO / SRE lead / commercial team. Read with ../performance-testing.md and ../performance-test-report.md open.
Status of numbers: Latency targets are defined in the perf methodology and measured in the perf test report. Availability, RTO, RPO, and throughput-at-scale figures are proposed starting SLAs, marked {{TBD}} until the next stepped-ramp run finishes (issue #66 / #67).
1. Latency targets (per endpoint class)
These are the platform's SLOs from docs/performance-testing.md §4, and the contractual numbers we propose for an SLA.
| Endpoint class | Method | p95 target | Notes |
|---|---|---|---|
| POST /auth/sign-in | POST | 150 ms | bcrypt is irreducible at ~60–80 ms per hash; budget allows for DB lookup + session creation |
| POST /casino/games/house/dice/bet (and other house-game bets) | POST | 100 ms | Baseline at 1 VU = 108 ms, so the endpoint does not currently meet the SLO at any load. Pre-noted in ../performance-testing.md. Optimisation work is required before we can ship this number contractually. {{TBD: target revision after #66 run}} |
| GET /bets (paginated bet history) | GET | 50 ms | Read-mostly; cache-friendly |
| GET /accounting/balances | GET | 50 ms | Cached balance lookup |
| WebSocket handshake (rt /events) | WS | 200 ms | Time from TCP connect → AuthSuccess event |
p50 / p99: not formally targeted in the methodology today; we recommend treating p99 as 2 × p95 for SLA carve-outs. {{TBD: explicit p50 / p99 numbers per endpoint after stepped-ramp run completes}}
System-level error budget: < 0.1 % error rate per endpoint across the test window.
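A minimal sketch of how the p95 targets, the suggested p99 = 2 × p95 carve-out, and the 0.1 % error budget could be checked against a set of request samples. The names `EndpointSample`, `percentile`, and `evaluateSlo` are hypothetical, and the nearest-rank percentile method is an assumption; ../performance-testing.md remains the reference for how percentiles are actually computed.

```typescript
// Hypothetical helper names; not taken from the Evospin codebase.
interface EndpointSample {
  latencyMs: number;
  ok: boolean; // false when the request counted as an error
}

interface SloVerdict {
  p95Ms: number;
  p99Ms: number;
  errorRate: number;
  meetsP95: boolean;
  meetsP99CarveOut: boolean;
  withinErrorBudget: boolean;
}

// Nearest-rank percentile: the smallest sample with at least p % of
// all samples at or below it. This method is an assumption.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Checks one endpoint class against its p95 target, the proposed
// p99 <= 2 x p95 carve-out, and the < 0.1 % error budget.
function evaluateSlo(
  samples: EndpointSample[],
  p95TargetMs: number,
  maxErrorRate: number = 0.001,
): SloVerdict {
  const latencies = samples.map((s) => s.latencyMs);
  const p95Ms = percentile(latencies, 95);
  const p99Ms = percentile(latencies, 99);
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  return {
    p95Ms,
    p99Ms,
    errorRate,
    meetsP95: p95Ms <= p95TargetMs,
    meetsP99CarveOut: p99Ms <= 2 * p95TargetMs,
    withinErrorBudget: errorRate < maxErrorRate,
  };
}
```

Under these assumptions, a run where every dice-bet request takes 108 ms would fail `meetsP95` against the 100 ms target while still passing the 200 ms p99 carve-out, which matches the note in the table above.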
2. Availability target
| Tier | Proposed SLA | Basis |
|---|---|---|
| Initial commercial SLA | 99.5 % monthly uptime (~3.6 h downtime / month) | Conservative starting position pending production telemetry from a customer install |
| Aspirational target | 99.9 % monthly uptime (~43 min / month) | Achievable once HA Postgres + Redis + load-balanced API replicas are deployed; not guaranteed by single-node compose |
| Game-server availability | 99.5 % for apps/speed-roulette | Hard dependency on EOS RPC (../architecture.md §1): an EOS outage stalls the round queue. Budget reflects external-dependency risk |
The 99.5 % figure is a proposal, not a measured value. Final SLA depends on the customer's deployment topology (single-region vs multi-region, HA database, hot-standby Redis). See roadmap.md Phase 5.
Cost note: the 99.9 % aspirational tier presumes AWS Business Support (greater of $100/mo or 3–10 % of monthly AWS spend), multi-AZ Postgres + ElastiCache, paid Sentry, and PagerDuty. Costed in infrastructure-cost.md §2 (AWS) and §6 (hidden costs).
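The downtime figures quoted for each tier follow from simple arithmetic on the uptime percentage. The sketch below reproduces them, assuming a 30-day month (720 h); the function name `allowedDowntimeMinutes` is hypothetical, and a contract may define the measurement period differently.

```typescript
// Allowed downtime, in minutes, for a given uptime percentage.
// Assumes a 30-day month by default; the contractual definition of
// "month" may differ.
function allowedDowntimeMinutes(uptimePercent: number, periodDays: number = 30): number {
  if (uptimePercent < 0 || uptimePercent > 100) {
    throw new RangeError("uptimePercent must be between 0 and 100");
  }
  const periodMinutes = periodDays * 24 * 60;
  return ((100 - uptimePercent) * periodMinutes) / 100;
}

// 99.5 % over 30 days: 216 minutes, i.e. 3.6 hours.
const initialTierMinutes = allowedDowntimeMinutes(99.5);

// 99.9 % over 30 days: approximately 43.2 minutes.
const aspirationalTierMinutes = allowedDowntimeMinutes(99.9);
```

Moving from 99.5 % to 99.9 % cuts the monthly budget by a factor of five, which is why the aspirational tier is tied to the HA topology rather than to single-node compose.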
3. Throughput
The system was designed and is being load-tested for the following stepped ramp (../performance-testing.md §5):
| Stage | Target VUs | Duration |
|---|---|---|
| Warmup | 50 | 2 min |
| 2 | 1,000 | 5 min |
| 3 | 2,500 | 5 min |
| 4 | 5,000 | 5 min |
| 5 | 7,500 | 5 min |
| 6 (peak) | 10,000 | 5 min |
Peak target: 10,000 concurrent virtual users with stepped ramp.
Sustained QPS at peak: {{TBD}} — fill from performance-test-report.md §"Per-Endpoint SLO Scorecard" once #66 runs.
First SLO breach point: {{TBD}} VUs on {{TBD endpoint}} (../performance-test-report.md).
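For readers who want to reproduce the ramp, the stage table can be written down as data and converted into the `{ duration, target }` entries that k6 accepts under `options.stages`. This is a sketch only: the `RampStage` type and helper names are hypothetical, and the actual k6 profiles live in ../performance-testing.md. Note that k6 ramps linearly towards each stage's target over the stage duration; whether the project's profiles add separate hold stages is not stated in this section.

```typescript
// Hypothetical representation of the stepped-ramp table above.
interface RampStage {
  name: string;
  targetVus: number;
  durationMin: number;
}

const steppedRamp: RampStage[] = [
  { name: "Warmup", targetVus: 50, durationMin: 2 },
  { name: "2", targetVus: 1000, durationMin: 5 },
  { name: "3", targetVus: 2500, durationMin: 5 },
  { name: "4", targetVus: 5000, durationMin: 5 },
  { name: "5", targetVus: 7500, durationMin: 5 },
  { name: "6 (peak)", targetVus: 10000, durationMin: 5 },
];

// Convert to the { duration, target } shape used by k6's options.stages.
function toK6Stages(stages: RampStage[]): { duration: string; target: number }[] {
  return stages.map((s) => ({ duration: `${s.durationMin}m`, target: s.targetVus }));
}

// Total wall-clock length of the ramp, in minutes.
function totalDurationMin(stages: RampStage[]): number {
  return stages.reduce((sum, s) => sum + s.durationMin, 0);
}

// Highest virtual-user target reached by any stage.
function peakVus(stages: RampStage[]): number {
  return stages.reduce((max, s) => Math.max(max, s.targetVus), 0);
}
```

With the table as given, the ramp runs for 27 minutes in total (2 + 5 × 5) and peaks at 10,000 VUs.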
4. Concurrency (websocket)
| Limit | Value | Source |
|---|---|---|
| Max websocket connections per source IP | 10 (configurable via MAX_CONNECTIONS_PER_IP) | libs/ws-throttler/src/const.ts:3-4; default in .example.env:111, .local.env:144, .test.env:121 |
| WebSocket transport | websocket-only (polling explicitly disabled) | apps/rt/src/main.ts boots without HTTP polling fallback (../architecture.md §2) |
| Per-instance socket count | bounded by RAM (in-process Map; no Redis socket-state adapter shipped) | ../performance-testing.md §6 — known observability gap; sticky sessions required if running multiple rt replicas |
Practical implication: a customer wanting >10k concurrent player sessions on one rt instance should plan for horizontal scale (multiple rt replicas with sticky sessions) — see ../architecture.md §2 port contract for the rt service shape.
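To illustrate what a per-IP connection cap backed by an in-process Map implies, here is a minimal sketch. It is not the libs/ws-throttler implementation; the class name `PerIpConnectionLimiter` and its methods are hypothetical, and only the default of 10 comes from the table above.

```typescript
// Hypothetical sketch of a per-IP websocket connection cap.
// State lives in a plain in-process Map, so each rt replica counts
// connections independently of the others.
class PerIpConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerIp: number;

  constructor(maxPerIp: number = 10) {
    this.maxPerIp = maxPerIp;
  }

  // Returns true and records the connection if the IP is under its cap.
  tryAcquire(ip: string): boolean {
    const current = this.counts.get(ip) ?? 0;
    if (current >= this.maxPerIp) return false;
    this.counts.set(ip, current + 1);
    return true;
  }

  // Call on disconnect so the slot is freed for the same IP.
  release(ip: string): void {
    const current = this.counts.get(ip) ?? 0;
    if (current <= 1) {
      this.counts.delete(ip);
    } else {
      this.counts.set(ip, current - 1);
    }
  }

  // Number of connections currently counted for the IP.
  activeFor(ip: string): number {
    return this.counts.get(ip) ?? 0;
  }
}
```

Because the counter is local to one process, a cap like this is only enforced globally if all connections from a given IP land on the same replica, which is one reason sticky sessions matter when running multiple rt replicas.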
5. Recovery objectives
| Objective | Proposed target | Basis |
|---|---|---|
| RTO (Recovery Time Objective) | 1 hour | Single Postgres restore from latest snapshot + service redeploy ≤ 60 min on a sized instance. Multi-region failover not in scope of starter SLA. {{TBD: customer-specific based on their backup tooling}} |
| RPO (Recovery Point Objective) | 5 minutes | Postgres WAL-archived to object storage every 5 min in proposed deployment topology. {{TBD: actual cadence is operator-deployment-specific; Evospin ships compose configs only}} |
| Disaster scenarios covered | Single-AZ failure, single-VM loss, accidental DROP, ransomware | Multi-region, regional outage = roadmap |
Today's docker-compose.local.yml ships no automated backup. RTO/RPO numbers assume the customer (or ebit-team in a managed-deployment contract) wires Postgres WAL archival + scheduled snapshots in their infra.
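Once WAL archival and snapshots are wired up, the proposed objectives can be monitored with very little logic. The sketch below assumes the operator can obtain the timestamp of the most recently archived WAL segment and the measured step durations from a restore drill; all names (`RecoveryTargets`, `rpoExposureMinutes`, `rtoMet`, and so on) are hypothetical, and only the 60-minute and 5-minute targets come from the table above.

```typescript
// Proposed starter targets from the table above.
interface RecoveryTargets {
  rtoMinutes: number;
  rpoMinutes: number;
}

const starterTargets: RecoveryTargets = { rtoMinutes: 60, rpoMinutes: 5 };

// Data written after the last archived WAL segment would be lost on
// restore, so the current exposure is the time since that archive.
function rpoExposureMinutes(lastWalArchivedAt: Date, now: Date): number {
  return (now.getTime() - lastWalArchivedAt.getTime()) / 60000;
}

// True when the current exposure already exceeds the RPO target.
function rpoAtRisk(lastWalArchivedAt: Date, now: Date, targets: RecoveryTargets): boolean {
  return rpoExposureMinutes(lastWalArchivedAt, now) > targets.rpoMinutes;
}

// One measured step of a restore drill (snapshot restore, redeploy, ...).
interface RestoreStep {
  name: string;
  minutes: number;
}

// True when the summed drill duration fits within the RTO target.
function rtoMet(steps: RestoreStep[], targets: RecoveryTargets): boolean {
  const total = steps.reduce((sum, s) => sum + s.minutes, 0);
  return total <= targets.rtoMinutes;
}
```

A check like `rpoAtRisk` would typically feed an alert, while `rtoMet` is more useful as a pass/fail record for periodic restore drills.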
6. Security posture
Authentication & authorisation
| Control | How | Where |
|---|---|---|
| Password hashing | bcrypt (cost factor configurable in env) | apps/api/src/auth/auth.service.ts |
| JWT access + refresh | HS256-signed; refresh rotation; httpOnly secure cookies | apps/api/src/auth/cookies.ts, strategies/jwt-refresh-strategy.ts |
| 2FA (TOTP) | otplib; OtpGuard mandatory on SuperAdmin routes | apps/api/src/auth/guards/ |
| Role-based access | RolesGuard + PermissionGuard('<key>') | libs/auth/, apps/api/src/auth/guards/ |
| Session store | Redis (cache instance) — short-lived JWT denylist + presence | apps/api/src/auth/session/session.queue-producer.ts |
| Rate limiting | Per-route sliding-window (Lua on cache Redis); per-IP WS cap | apps/api/src/captcha/, libs/ws-throttler/ |
| Captcha | GeeTest v4 + reCAPTCHA fallback | apps/api/src/captcha/ |
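
As an illustration of the sliding-window idea behind the per-route rate limit, here is a minimal in-memory sketch. The platform is described as implementing this in Lua on the cache Redis; the version below is a single-process stand-in for explanation only, and the class name `SlidingWindowLimiter`, its parameters, and the log-of-timestamps approach are assumptions.

```typescript
// Hypothetical in-memory sliding-window limiter (timestamp-log variant).
// Not the Redis/Lua implementation referenced in the table above.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if a request for `key` is allowed at time `nowMs`,
  // recording it when allowed. Only requests newer than
  // nowMs - windowMs count towards the limit.
  allow(key: string, nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(key, recent);
    return allowed;
  }
}
```

The key would usually combine the route with a client identifier such as the IP or user ID. Passing the clock in as `nowMs` keeps the sketch deterministic and easy to test; a shared store such as Redis is what makes the same logic hold across multiple API replicas.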
Encryption
| Layer | Status | Notes |
|---|---|---|
| In transit (player ↔ Evospin) | HTTPS / WSS required at the load balancer | Compose ships HTTP for local dev; production deployment is responsible for TLS termination (typically at ALB / Nginx). Customer responsibility |
| In transit (Evospin ↔ Postgres / Redis / external APIs) | TLS supported by all clients (Prisma, ioredis); operator-config | {{TBD per deployment}} |
| At rest (Postgres) | Operator-config (cloud-disk encryption / pgcrypto column-level) | Not enforced in the application layer |
| At rest (Redis) | Operator-config (redis password set; AOF persistence configurable) | Compose sets cache / bot passwords |
| Secrets | Doppler for dev (run_local.sh); production uses customer's secret manager (AWS Secrets Manager, Vault) | .env is gitignored; templates are .example.env / .local.env / .test.env |
Audit & traceability
- Every state-changing endpoint is wrapped in OTel server spans → Jaeger (see ../e2e-trace-demo.md: a single bet emits 69 spans).
- Structured pino logs to Loki with trace IDs for trace↔log pivot.
- Admin actions surface in apps/api/src/user/admin/notes/ (admin-notes table) for human-readable audit trail.
- Sentry captures runtime errors with source maps for all three apps.
Known security findings
Tracked in ../security-register.md and ../security-findings/. Three findings filed at last review (sf-001, sf-002, sf-003); see register for current status.
7. Compliance
Operator-licensable, not ebit-licensable. Evospin ships the technical surfaces; the operator owns the gambling licence in their jurisdiction.
| Surface | What Evospin provides | What the operator owns |
|---|---|---|
| KYC | Sumsub integration end-to-end (apps/api/src/kyc/sumsub/): applicant lifecycle, doc upload, webhook-driven status state machine | Sumsub commercial relationship + jurisdiction-specific tier mapping |
| AML | Audit logs, transaction history, ledger immutability | Suspicious-transaction reporting workflow, regulator filings |
| Responsible gambling | users-limits/ self-exclusion + deposit/loss-limit framework (⚠ TBD coverage) | Jurisdiction-specific limits (UK GamStop, etc.) |
| Provably-fair fairness | apps/api/src/provably-fair/ server-seed rotation; EOS-anchored RNG for speed-roulette | Public verification page (operator-hosted), TBD |
| Geo restriction | Country allow / deny list (apps/api/src/country/) | Jurisdiction-by-jurisdiction restrictions list |
| Data residency | Single-region Postgres by default | EU / US / regional deployment topology — operator picks |
| GDPR / data subject rights | User profile read endpoint, account-delete pathway (⚠ TBD verify completeness) | Privacy notice, data-controller registration |
Jurisdiction support: Evospin is jurisdiction-agnostic at the code level. The operator chooses where to deploy and licence; Evospin's KYC / geo / responsible-gambling surfaces support the integration pattern but do not pre-configure for any one regulator.
8. Operational visibility (observability)
Already shipping:
| Signal | Tool | Where |
|---|---|---|
| Distributed tracing | Jaeger v2 (port 16686) | observability/, ../observability.md |
| Metrics (incl. spanmetrics-derived RED metrics) | Prometheus (port 9090) | same |
| Logs | Loki (port 3100) bridged from pino | same |
| Dashboards | Grafana (port 3003) with provisioned datasources + dashboards | same |
| Errors | Sentry (per-repo sentry.*.config.ts) | each repo |
| OTel ingress | otel-collector (gRPC :4317 / HTTP :4318) | observability/otel-collector.yml |
This stack is the source of truth for SLA compliance reporting. See ../e2e-trace-demo.md for proof-of-life.
9. Maintenance windows
| Activity | Cadence | Player-visible impact |
|---|---|---|
| Dependency / security patches | Monthly | Rolling restart per service — zero-downtime if API has ≥2 replicas; 1–2 min impact on single-replica deploys |
| Prisma migrations | Per release | Most are backward-compatible; destructive migrations require a maintenance window |
| Postgres major-version upgrade | Annual | 30–60 min maintenance window |
| Redis upgrade | Annual | Brief failover window if HA, else 5–10 min |
{{TBD: customer-specific maintenance contract — none ship in code today}}
10. Capacity sizing (starter)
Proposed starter sizing for a single-region deployment supporting 1,000 concurrent players at peak:
| Tier | Spec | Rationale |
|---|---|---|
| API replicas (api) | 2 × 4 vCPU / 8 GB | Stateless; horizontal scale |
| RT replica (rt) | 1 × 2 vCPU / 4 GB | Sticky sessions; scale up on socket count |
| Postgres | 1 × 8 vCPU / 32 GB / 500 GB SSD with HA standby | Single source of truth; HA gives failover |
| Redis cache | 1 × 2 vCPU / 8 GB with HA | BullMQ + sessions + rate limit + pub/sub |
| Redis bot | 1 × 1 vCPU / 2 GB | Bot-fleet isolation only; can be tiny |
| RabbitMQ | 1 × 1 vCPU / 2 GB | Stub today; reserved for FastTrack roadmap |
| Game servers (bj, bo, speed-roulette) | 1 × 2 vCPU / 4 GB each | Light; speed-roulette concurrency=1 per round |
| Observability stack | 1 × 4 vCPU / 16 GB | Jaeger + Prom + Loki + Grafana co-hosted; separate from app |
Scaling to 10k concurrent: roughly linear in api and rt replica count; Postgres / Redis sized vertically. {{TBD: validate against #66 stepped-ramp run.}}
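The "roughly linear" statement can be turned into a first-pass replica estimate. The sketch below extrapolates from the starter sizing (2 api replicas and 1 rt replica per 1,000 concurrent players); the linear model is an unvalidated assumption pending the #66 run, and the names `ReplicaPlan` and `scaleReplicas` are hypothetical.

```typescript
// Replica counts for the horizontally scaled services only.
// Postgres and Redis are sized vertically and are not modelled here.
interface ReplicaPlan {
  api: number;
  rt: number;
}

// Starter sizing from the table above.
const starterSizing = { concurrentPlayers: 1000, api: 2, rt: 1 };

// Linear extrapolation from the starter sizing, rounded up, never
// dropping below the starter replica counts. Unvalidated assumption.
function scaleReplicas(targetPlayers: number): ReplicaPlan {
  if (targetPlayers <= 0) throw new RangeError("targetPlayers must be positive");
  const factor = targetPlayers / starterSizing.concurrentPlayers;
  return {
    api: Math.max(starterSizing.api, Math.ceil(starterSizing.api * factor)),
    rt: Math.max(starterSizing.rt, Math.ceil(starterSizing.rt * factor)),
  };
}

// 10,000 concurrent players: 20 api replicas and 10 rt replicas.
const peakPlan = scaleReplicas(10000);
```

Treat the output as a starting point for load testing rather than a commitment; the first SLO breach point from the stepped-ramp run is what should ultimately set the per-replica capacity.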
Cross-links
- value-proposition.md: why these targets are achievable.
- integration-options.md: how each integration affects SLA.
- roadmap.md: when SLA targets are validated.
- responsibilities.md: who owns SLA monitoring + breach response.
- ../performance-testing.md: full methodology, k6 profiles, SLO definitions.
- ../performance-test-report.md: measured numbers from the latest run.
- ../security-register.md: security findings register.
- ../observability.md: full observability runbook.