Non-Functional Requirements & SLA Targets

Audience: Customer's CTO / SRE lead / commercial team. Read with ../performance-testing.md and ../performance-test-report.md open.

Status of numbers: Latency targets are defined in the performance methodology and measured in the performance test report. Availability, RTO, RPO, and throughput-at-scale figures are proposed starting SLAs; values marked {{TBD}} stay open until the next stepped-ramp run finishes (issue #66 / #67).

1. Latency targets (per endpoint class)

The table below lists the platform's SLO targets as defined in docs/performance-testing.md §4. These are the contractual numbers we propose for an SLA.

Endpoint class	Method	p95 target	Notes
`POST /auth/sign-in`	POST	150 ms	bcrypt is irreducible at ~60–80 ms per hash; budget allows for DB lookup + session creation
`POST /casino/games/house/dice/bet` (and other house-game bets)	POST	100 ms	Baseline at 1 VU is 108 ms, so the endpoint does not currently meet the SLO at any load (already noted in `../performance-testing.md`). Optimisation work is required before we can commit to this number contractually. {{TBD: target revision after #66 run}}
`GET /bets` (paginated bet history)	GET	50 ms	Read-mostly; cache-friendly
`GET /accounting/balances`	GET	50 ms	Cached balance lookup
WebSocket handshake (rt `/events`)	WS	200 ms	Time from TCP connect → `AuthSuccess` event

p50 / p99: not formally targeted in the methodology today; we recommend treating p99 as 2 × p95 for SLA carve-outs. {{TBD: explicit p50 / p99 numbers per endpoint after stepped-ramp run completes}}

System-level error budget: < 0.1 % error rate per endpoint across the test window.
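The p95 targets and the error budget above can be checked mechanically against a set of latency samples. The following is a minimal sketch, assuming a nearest-rank percentile definition; `percentile` and `evaluateSlo` are hypothetical helper names, not functions from the codebase or from the k6 tooling, and the perf methodology may compute percentiles differently.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p % of
// all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("percentile of an empty sample set is undefined");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) {
    throw new Error("percentile rank out of range");
  }
  return value;
}

interface SloResult {
  p95Ms: number;
  errorRate: number;
  meetsLatency: boolean;
  meetsErrorBudget: boolean;
}

// Hypothetical helper: compares one endpoint's samples with its p95 target
// and with the < 0.1 % error budget stated in this section.
function evaluateSlo(
  samplesMs: number[],
  errorCount: number,
  totalRequests: number,
  p95TargetMs: number,
): SloResult {
  if (totalRequests <= 0) {
    throw new Error("totalRequests must be positive");
  }
  const p95Ms = percentile(samplesMs, 95);
  const errorRate = errorCount / totalRequests;
  return {
    p95Ms,
    errorRate,
    meetsLatency: p95Ms <= p95TargetMs,
    meetsErrorBudget: errorRate < 0.001,
  };
}
```

For the dice-bet endpoint, a p95 of 108 ms against the 100 ms target yields `meetsLatency: false`, which matches the note in the table.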

2. Availability target

Tier	Proposed SLA	Basis
Initial commercial SLA	99.5 % monthly uptime (~3.6 h downtime / month)	Conservative starting position pending production telemetry from a customer install
Aspirational target	99.9 % monthly uptime (~43 min / month)	Achievable once HA Postgres + Redis + load-balanced API replicas are deployed; not guaranteed by single-node compose
Game-server availability	99.5 % for `apps/speed-roulette`	Hard dependency on EOS RPC (`../architecture.md` §1) — outage of EOS stalls round queue. Budget reflects external-dependency risk

The 99.5 % figure is a proposal, not a measured value. Final SLA depends on the customer's deployment topology (single-region vs multi-region, HA database, hot-standby Redis). See roadmap.md Phase 5.
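The downtime figures in the table follow directly from the uptime percentage. A short sketch of the arithmetic, assuming a 30-day month (the function name is illustrative only):

```typescript
// Allowed downtime, in minutes, for a given uptime percentage over a
// window of `days` days. A 30-day month is an assumption of this sketch.
function downtimeBudgetMinutes(uptimePercent: number, days: number = 30): number {
  const windowMinutes = days * 24 * 60;
  return (windowMinutes * (100 - uptimePercent)) / 100;
}

const initialSlaMinutes = downtimeBudgetMinutes(99.5); // about 216 min, i.e. 3.6 h
const aspirationalMinutes = downtimeBudgetMinutes(99.9); // about 43.2 min
```

A 31-day month would raise both budgets by roughly 3 %, so the contract should state which measurement window applies.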

Cost note: the 99.9 % aspirational tier presumes AWS Business Support (greater of $100/mo or 3–10 % of monthly AWS spend), multi-AZ Postgres + ElastiCache, paid Sentry, and PagerDuty. Costed in infrastructure-cost.md §2 (AWS) and §6 (hidden costs).

3. Throughput

The system was designed and is being load-tested for the following stepped ramp (../performance-testing.md §5):

Stage	Target VUs	Duration
1 (warmup)	50	2 min
2	1,000	5 min
3	2,500	5 min
4	5,000	5 min
5	7,500	5 min
6 (peak)	10,000	5 min

Peak target: 10,000 concurrent virtual users, reached via the stepped ramp above.
Sustained QPS at peak: {{TBD}}, to be filled from performance-test-report.md §"Per-Endpoint SLO Scorecard" once #66 runs.
First SLO breach point: {{TBD}} VUs on {{TBD endpoint}} (../performance-test-report.md).
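For orientation, the ramp table could be expressed in the `stages` shape that k6 accepts in its `options` object (`duration` plus `target` VUs). This is a sketch under that assumption only; the real profile lives in ../performance-testing.md and may add hold periods or a ramp-down stage that the table does not show. `Stage`, `steppedRamp`, and the helpers are illustrative names.

```typescript
// Shape of one entry in a k6 `options.stages` array.
interface Stage {
  duration: string; // e.g. "5m"
  target: number; // VU count to ramp to by the end of the stage
}

// Values copied from the stepped-ramp table in this section.
const steppedRamp: Stage[] = [
  { duration: "2m", target: 50 }, // warmup
  { duration: "5m", target: 1000 },
  { duration: "5m", target: 2500 },
  { duration: "5m", target: 5000 },
  { duration: "5m", target: 7500 },
  { duration: "5m", target: 10000 }, // peak
];

// Parses durations of the form "<n>m" only; other k6 duration formats
// (such as "30s" or "1h") are deliberately out of scope for this sketch.
function stageMinutes(stage: Stage): number {
  const match = /^(\d+)m$/.exec(stage.duration);
  if (!match || match[1] === undefined) {
    throw new Error(`unsupported duration: ${stage.duration}`);
  }
  return Number(match[1]);
}

const totalRampMinutes = steppedRamp.reduce((sum, s) => sum + stageMinutes(s), 0);
const peakVus = Math.max(...steppedRamp.map((s) => s.target));
```

With the table as written, the full ramp takes 27 minutes and peaks at 10,000 VUs. In a k6 script the array would be passed as `export const options = { stages: steppedRamp };`.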

4. Concurrency (websocket)

Limit	Value	Source
Max websocket connections per source IP	10 (configurable via `MAX_CONNECTIONS_PER_IP`)	`libs/ws-throttler/src/const.ts:3-4`; default in `.example.env:111`, `.local.env:144`, `.test.env:121`
WebSocket transport	websocket-only (polling explicitly disabled)	`apps/rt/src/main.ts` boots without HTTP polling fallback (`../architecture.md` §2)
Per-instance socket count	bounded by RAM (in-process Map; no Redis socket-state adapter shipped)	`../performance-testing.md` §6 — known observability gap; sticky sessions required if running multiple `rt` replicas

Practical implication: a customer expecting more than 10k concurrent player sessions should not rely on a single rt instance and should plan for horizontal scale (multiple rt replicas with sticky sessions). See ../architecture.md §2 port contract for the rt service shape.
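To illustrate why the per-IP cap and the in-process Map imply sticky sessions, here is a minimal sketch of a per-IP connection counter. It is not the code in `libs/ws-throttler`; the class and method names are hypothetical, and only the default of 10 (the `MAX_CONNECTIONS_PER_IP` value) comes from the table above.

```typescript
// Hypothetical in-process limiter: counts live connections per source IP.
// State lives in this process only, so two rt replicas would each allow
// up to `maxPerIp` connections for the same IP unless a shared store or
// sticky routing is added.
class PerIpConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerIp: number;

  constructor(maxPerIp: number = 10) {
    this.maxPerIp = maxPerIp;
  }

  // Call on connect; returns false when the IP is already at its cap.
  tryAcquire(ip: string): boolean {
    const current = this.counts.get(ip) ?? 0;
    if (current >= this.maxPerIp) {
      return false;
    }
    this.counts.set(ip, current + 1);
    return true;
  }

  // Call on disconnect; removes the entry once the count reaches zero
  // so the Map does not grow without bound.
  release(ip: string): void {
    const current = this.counts.get(ip) ?? 0;
    if (current <= 1) {
      this.counts.delete(ip);
    } else {
      this.counts.set(ip, current - 1);
    }
  }

  activeFor(ip: string): number {
    return this.counts.get(ip) ?? 0;
  }
}
```

Because the counter is local to one process, the effective cap across N replicas without sticky routing would be up to N times the configured value.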

5. Recovery objectives

Objective	Proposed target	Basis
RTO (Recovery Time Objective)	1 hour	Single Postgres restore from the latest snapshot plus service redeploy in ≤ 60 min on an appropriately sized instance. Multi-region failover is not in scope of the starter SLA. {{TBD: customer-specific based on their backup tooling}}
RPO (Recovery Point Objective)	5 minutes	Postgres WAL archived to object storage every 5 min in the proposed deployment topology. {{TBD: actual cadence is operator-deployment-specific; Evospin ships compose configs only}}
Disaster scenarios covered	Single-AZ failure, single-VM loss, accidental DROP, ransomware	Multi-region failover and regional outage are roadmap items

Today's docker-compose.local.yml ships no automated backup. RTO/RPO numbers assume the customer (or ebit-team in a managed-deployment contract) wires Postgres WAL archival + scheduled snapshots in their infra.
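Once WAL archival is wired, the 5-minute RPO can be monitored by comparing the timestamp of the most recently archived WAL segment with the current time. The sketch below assumes the operator's tooling can supply that timestamp; both function names are hypothetical and nothing equivalent ships in the repository today.

```typescript
// Minutes elapsed since the last WAL segment reached object storage.
function archiveLagMinutes(lastArchivedAt: Date, now: Date): number {
  return (now.getTime() - lastArchivedAt.getTime()) / 60000;
}

// True while a failure at `now` would lose at most `rpoMinutes` of writes.
// The default of 5 mirrors the proposed RPO in this section.
function isWithinRpo(lastArchivedAt: Date, now: Date, rpoMinutes: number = 5): boolean {
  return archiveLagMinutes(lastArchivedAt, now) <= rpoMinutes;
}
```

An alert that fires when `isWithinRpo` returns false gives early warning that the archival job has stalled, before a restore is ever needed.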

6. Security posture

Authentication & authorisation

Control	How	Where
Password hashing	bcrypt (cost factor configurable in env)	`apps/api/src/auth/auth.service.ts`
JWT access + refresh	HS256-signed; refresh rotation; httpOnly secure cookies	`apps/api/src/auth/cookies.ts`, `strategies/jwt-refresh-strategy.ts`
2FA (TOTP)	otplib; `OtpGuard` mandatory on SuperAdmin routes	`apps/api/src/auth/guards/`
Role-based access	`RolesGuard` + `PermissionGuard('<key>')`	`libs/auth/`, `apps/api/src/auth/guards/`
Session store	Redis (cache instance) — short-lived JWT denylist + presence	`apps/api/src/auth/session/session.queue-producer.ts`
Rate limiting	Per-route sliding-window (Lua on cache Redis); per-IP WS cap	`apps/api/src/captcha/`, `libs/ws-throttler/`
Captcha	GeeTest v4 + reCAPTCHA fallback	`apps/api/src/captcha/`
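The per-route sliding-window limit in the table runs as a Lua script on the cache Redis. As an illustration of the technique only, the following in-memory sliding-window log limiter shows the same accept/reject logic; the class name, limits, and keys are hypothetical, and the production script may use a different variant (for example a weighted counter).

```typescript
// Sliding-window log limiter: a request is allowed when fewer than
// `limit` requests for the same key fall inside the trailing window.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `nowMs` is passed in explicitly so the logic stays deterministic
  // and easy to test; a caller would normally pass Date.now().
  allow(key: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) {
      recent.push(nowMs);
    }
    this.hits.set(key, recent);
    return allowed;
  }
}
```

A key would typically combine the route and the client identity, such as `"POST /auth/sign-in:203.0.113.7"`, so that one noisy client does not exhaust the budget of another.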

Encryption

Layer	Status	Notes
In transit (player ↔ Evospin)	HTTPS / WSS required at the load balancer	Compose ships HTTP for local dev; production deployment is responsible for TLS termination (typically at ALB / Nginx). Customer responsibility
In transit (Evospin ↔ Postgres / Redis / external APIs)	TLS supported by all clients (Prisma, ioredis); operator-config	{{TBD per deployment}}
At rest (Postgres)	Operator-config (cloud-disk encryption / pgcrypto column-level)	Not enforced in the application layer
At rest (Redis)	Operator-config (redis password set; AOF persistence configurable)	Compose sets `cache` / `bot` passwords
Secrets	Doppler for dev (`run_local.sh`); production uses customer's secret manager (AWS Secrets Manager, Vault)	`.env` is gitignored; templates are `.example.env` / `.local.env` / `.test.env`

Audit & traceability

Every state-changing endpoint is wrapped in OTel server spans → Jaeger (see ../e2e-trace-demo.md: a single bet emits 69 spans).
Structured pino logs to Loki with trace IDs for trace↔log pivot.
Admin actions surface in apps/api/src/user/admin/notes/ (admin-notes table) for human-readable audit trail.
Sentry captures runtime errors with source maps for all three apps.

Known security findings

Tracked in ../security-register.md and ../security-findings/. Three findings filed at last review (sf-001, sf-002, sf-003); see register for current status.

7. Compliance

Operator-licensable, not ebit-licensable. Evospin ships the technical surfaces; the operator owns the gambling licence in their jurisdiction.

Surface	What Evospin provides	What the operator owns
KYC	Sumsub integration end-to-end (`apps/api/src/kyc/sumsub/`) — applicant lifecycle, doc upload, webhook-driven status state machine	Sumsub commercial relationship + jurisdiction-specific tier mapping
AML	Audit logs, transaction history, ledger immutability	Suspicious-transaction reporting workflow, regulator filings
Responsible gambling	`users-limits/` self-exclusion + deposit/loss-limit framework (⚠ TBD coverage)	Jurisdiction-specific limits (UK GamStop, etc.)
Provably fair gaming	`apps/api/src/provably-fair/` server-seed rotation; EOS-anchored RNG for speed-roulette	Public verification page (operator-hosted): TBD
Geo restriction	Country allow / deny list (`apps/api/src/country/`)	Jurisdiction-by-jurisdiction restrictions list
Data residency	Single-region Postgres by default	EU / US / regional deployment topology — operator picks
GDPR / data subject rights	User profile read endpoint, account-delete pathway (⚠ TBD verify completeness)	Privacy notice, data-controller registration

Jurisdiction support: Evospin is jurisdiction-agnostic at the code level. The operator chooses where to deploy and licence; Evospin's KYC / geo / responsible-gambling surfaces support the integration pattern but do not pre-configure for any one regulator.

8. Operational visibility (observability)

Already shipping:

Signal	Tool	Where
Distributed tracing	Jaeger v2 (port 16686)	`observability/`, `../observability.md`
Metrics (incl. spanmetrics-derived RED metrics)	Prometheus (port 9090)	same
Logs	Loki (port 3100) bridged from pino	same
Dashboards	Grafana (port 3003) with provisioned datasources + dashboards	same
Errors	Sentry (per-repo `sentry.*.config.ts`)	each repo
OTel ingress	otel-collector (gRPC :4317 / HTTP :4318)	`observability/otel-collector.yml`

This stack is the source of truth for SLA compliance reporting. See ../e2e-trace-demo.md for proof-of-life.

9. Maintenance windows

Activity	Cadence	Player-visible impact
Dependency / security patches	Monthly	Rolling restart per service — zero-downtime if API has ≥2 replicas; 1–2 min impact on single-replica deploys
Prisma migrations	Per release	Most are backward-compatible; destructive migrations require a maintenance window
Postgres major-version upgrade	Annual	30–60 min maintenance window
Redis upgrade	Annual	Brief failover window if HA, else 5–10 min

{{TBD: customer-specific maintenance contract — none ship in code today}}

10. Capacity sizing (starter)

Proposed starter sizing for a single-region deployment supporting 1,000 concurrent players at peak:

Tier	Spec	Rationale
API replicas (`api`)	2 × 4 vCPU / 8 GB	Stateless; horizontal scale
RT replica (`rt`)	1 × 2 vCPU / 4 GB	Sticky sessions; scale up on socket count
Postgres	1 × 8 vCPU / 32 GB / 500 GB SSD with HA standby	Single source of truth; HA gives failover
Redis cache	1 × 2 vCPU / 8 GB with HA	BullMQ + sessions + rate limit + pub/sub
Redis bot	1 × 1 vCPU / 2 GB	Bot-fleet isolation only; can be tiny
RabbitMQ	1 × 1 vCPU / 2 GB	Stub today; reserved for FastTrack roadmap
Game servers (`bj`, `bo`, `speed-roulette`)	1 × 2 vCPU / 4 GB each	Light; speed-roulette concurrency=1 per round
Observability stack	1 × 4 vCPU / 16 GB	Jaeger + Prom + Loki + Grafana co-hosted; separate from app

Scaling to 10k concurrent: roughly linear in api and rt replica count; Postgres / Redis sized vertically. {{TBD: validate against #66 stepped-ramp run.}}
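The "roughly linear" assumption can be turned into a first planning estimate. This sketch simply extrapolates from the starter sizing (2 `api` replicas and 1 `rt` replica for 1,000 concurrent players); the function name is illustrative and the resulting numbers are unvalidated until the #66 stepped-ramp run reports real saturation points.

```typescript
// Linear extrapolation of replica count from a known baseline.
// Rounds up, since a fractional replica cannot be deployed.
function replicasFor(
  targetPlayers: number,
  baselinePlayers: number,
  baselineReplicas: number,
): number {
  if (baselinePlayers <= 0 || baselineReplicas <= 0) {
    throw new Error("baseline values must be positive");
  }
  return Math.ceil((targetPlayers / baselinePlayers) * baselineReplicas);
}

// Starter sizing from the table above: 1,000 players on 2 api + 1 rt.
const apiReplicasAt10k = replicasFor(10000, 1000, 2);
const rtReplicasAt10k = replicasFor(10000, 1000, 1);
```

Under this assumption, 10k concurrent players would need about 20 `api` replicas and 10 `rt` replicas. Treat these as upper-level planning inputs: Postgres or Redis may become the bottleneck first, in which case adding replicas would not help.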

Cross-links

value-proposition.md — why these targets are achievable.
integration-options.md — how each integration affects SLA.
roadmap.md — when SLA targets are validated.
responsibilities.md — who owns SLA monitoring + breach response.
../performance-testing.md — full methodology, k6 profiles, SLO definitions.
../performance-test-report.md — measured numbers from the latest run.
../security-register.md — security findings register.
../observability.md — full observability runbook.