On-call Readiness Gate
Goal: drive a simulated P2 incident solo end-to-end, author one new runbook, read the security register and API changelog, and sign off readiness for on-call rotation. This is the final gate before joining the on-call rotation — previously labelled "Onboarding day 14 / end of week 2".
This page builds on ../onboarding/curriculum.md — it assumes you have completed the 2-week curriculum, read all 15 flow docs, know the Grafana dashboards by name, and completed at least one outage drill.
1. Drive a simulated P2 incident solo
Pick one of the scenarios below (or coordinate with your team lead for a scenario they want you to walk through). Drive it end-to-end without ebit-team intervention — read the oncall-runbook.md once before you start, then run.
Suggested scenarios (each maps to a real production failure mode):
- BullMQ back-pressure: start the smoke profile, then docker compose stop ebit-api for 60 seconds and bring it back. Watch bet_settled_queue accumulate, then drain. Practice the "is the depth normal?" judgment call.
- Redis cache flush: under steady traffic, run docker compose exec ebit-redis-cache redis-cli -a cache FLUSHDB. Observe the cold-cache latency spike, the BullMQ side-effects (jobs lost or replayed), and recovery time.
- WS handshake storm: spin up 200 socket.io clients in a tight loop (or use tests-perf/k6/scenarios/ws-storm.js), observe the throttler's behavior, decide whether to raise the limit or block the source.
- Sign-in failure spike: set ADMIN_DEFAULT_PASSWORD to something wrong via env override and restart ebit-api. Watch the auth error rate climb, drive it back to zero by reverting.
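The "is the depth normal?" call in the BullMQ scenario comes down to whether the backlog is growing or draining. A minimal sketch, assuming you sample the depth of bet_settled_queue at a fixed interval by some other means; `classifyDepthTrend` and the tolerance of 5 jobs are hypothetical, not values from the runbook:

```typescript
// Hypothetical helper: classify queue-depth samples taken at a fixed interval.
// The tolerance is illustrative; tune it to what "normal" looks like locally.
type DepthTrend = "growing" | "draining" | "flat";

function classifyDepthTrend(samples: number[], tolerance = 5): DepthTrend {
  if (samples.length < 2) {
    return "flat"; // not enough data to call a trend
  }
  // Compare the newest sample with the oldest one in the window.
  const delta = samples[samples.length - 1] - samples[0];
  if (delta > tolerance) return "growing";
  if (delta < -tolerance) return "draining";
  return "flat";
}

// Invented sample data: depth while ebit-api is stopped, then after it is back.
const depthWhileStopped = [0, 40, 95, 160];
const depthAfterRestart = [160, 110, 45, 0];

console.log(classifyDepthTrend(depthWhileStopped)); // "growing"
console.log(classifyDepthTrend(depthAfterRestart)); // "draining"
```

A first-to-last comparison is deliberately crude. It answers "which direction?" quickly, which is usually all the drill needs.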
For each scenario, follow the first-five-minute checklist from oncall-runbook.md:
- Acknowledge in the (mock) incident channel.
- Open the ebit-perf-test and perf-system Grafana dashboards.
- Query Jaeger for the last 15 min of error spans.
- Check Loki for error logs.
- Identify whether this is a recurrence: search ../runbooks/ and your scenario notes.
- (Skip the page-the-on-call-lead step in the drill, but say out loud "if this were P0/P1 I would page now".)
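For the Loki step, it can help to have the query ready before the drill starts. The sketch below builds a URL for Loki's `query_range` HTTP endpoint, using the same 15-minute window as the Jaeger step. The base URL, the `container` label, and the plain `"error"` line filter are all assumptions; check your stack's labels before relying on it.

```typescript
// Build a Loki query_range URL for error lines in the last `minutes` minutes.
// Loki accepts start/end as Unix epoch nanoseconds, so six zeros are appended
// to the millisecond timestamps.
function lokiErrorQueryUrl(
  baseUrl: string,
  selector: string,
  now: Date,
  minutes = 15
): string {
  const endMs = now.getTime();
  const startMs = endMs - minutes * 60 * 1000;
  const logQl = `${selector} |= "error"`; // LogQL line filter
  const params = [
    `query=${encodeURIComponent(logQl)}`,
    `start=${startMs}000000`,
    `end=${endMs}000000`,
    "limit=100",
  ];
  return `${baseUrl}/loki/api/v1/query_range?${params.join("&")}`;
}

// Hypothetical local endpoint and label selector.
const lokiUrl = lokiErrorQueryUrl(
  "http://localhost:3100",
  '{container="ebit-api"}',
  new Date()
);
console.log(lokiUrl);
```

The same query can be pasted into Grafana's Explore view instead; the URL form is mainly useful for scripting or for sharing in the incident channel.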
Time yourself. Aim for symptom → root cause hypothesis in under 10 minutes. Write up the timeline as you go (paste-into-Slack format) — this is the artifact you'd post to a real incident channel.
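One way to keep the timeline paste-ready is to record entries as you go and render them at the end. This is a sketch only: the `TimelineEntry` shape and the line format are hypothetical, not a format the runbook mandates, and the sample entries are invented.

```typescript
// Hypothetical shape for one drill timeline entry.
interface TimelineEntry {
  at: Date;
  note: string;
}

// Render entries as "HH:MM:SS UTC  note" lines, oldest first.
function formatTimeline(entries: TimelineEntry[]): string {
  return entries
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.at.getTime() - b.at.getTime())
    .map((e) => `${e.at.toISOString().slice(11, 19)} UTC  ${e.note}`)
    .join("\n");
}

// Minutes elapsed between two timestamps.
function minutesBetween(start: Date, end: Date): number {
  return (end.getTime() - start.getTime()) / 60000;
}

// Invented entries for the sign-in failure scenario.
const drillTimeline: TimelineEntry[] = [
  { at: new Date("2024-01-01T10:00:00Z"), note: "Symptom: auth error rate climbing" },
  { at: new Date("2024-01-01T10:02:30Z"), note: "Dashboards open, acknowledged in mock channel" },
  { at: new Date("2024-01-01T10:07:00Z"), note: "Hypothesis: bad ADMIN_DEFAULT_PASSWORD override" },
];

console.log(formatTimeline(drillTimeline));
const minutesToHypothesis = minutesBetween(drillTimeline[0].at, drillTimeline[2].at);
console.log(`symptom -> hypothesis: ${minutesToHypothesis} min (target: under 10)`);
```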
2. Author one new runbook
Pick the gap you noted on day 7. Write a runbook for it in ../runbooks/ using the template documented in ../runbooks/README.md:
- Symptom: what the operator sees (one paragraph).
- Likely causes: bulleted list, ordered by frequency.
- Diagnosis: copy-pasteable commands. Include both the local-stack form (docker compose exec …) and a deployed-environment form (kubectl exec / aws ecs execute-command / whatever applies).
- Fix: step by step, with rollback if applicable.
- Prevention: what monitoring or code change avoids recurrence.
Open a PR. Have a senior engineer review. Update ../runbooks/README.md's index table. The PR is your week-2 deliverable.
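Before opening the PR, a quick check that the draft covers all five template sections can save a review round trip. The sketch below assumes each section appears as a Markdown heading line in the runbook file; that layout is an assumption, so defer to ../runbooks/README.md if it differs. `missingSections` is a hypothetical helper, and the draft shown is invented.

```typescript
// Section names taken from the template list above.
const REQUIRED_SECTIONS = ["Symptom", "Likely causes", "Diagnosis", "Fix", "Prevention"];

// Return the required sections that never appear as a heading line.
// Assumes headings start with one or more "#" characters.
function missingSections(runbookText: string): string[] {
  const headings = runbookText
    .split("\n")
    .filter((line) => line.startsWith("#"))
    .map((line) => line.replace(/^#+\s*/, "").trim().toLowerCase());
  return REQUIRED_SECTIONS.filter(
    (section) => headings.indexOf(section.toLowerCase()) === -1
  );
}

// Invented draft that is missing its last section.
const runbookDraft = [
  "# Postgres connection pool exhaustion",
  "## Symptom",
  "## Likely causes",
  "## Diagnosis",
  "## Fix",
].join("\n");

console.log(missingSections(runbookDraft)); // [ 'Prevention' ]
```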
Common candidate gaps (any of these is fair game):
- Postgres connection pool exhaustion
- DB-down recovery (you ran the drill in week 1)
- Captcha provider outage (see ../flows/dropbet-sign-up.md)
- WS adapter scale-out failure (@socket.io/redis-adapter not installed; known weakness AF-3)
- Affiliate webhook backlog
- Speed-roulette state-queue deadlock (concurrency: 1 design; see ../flows/dropbet-speed-roulette.md)
Pick one, write it, ship it.
3. Review the security register
Read ../security-register.md cover to cover. It catalogs every known finding (SF-001 through the latest), each scored, each linked to the source file and line. The customer-safe finding bodies live under ../security-findings/ (three are written; the rest are summarized in the register itself).
Pay attention to the categories that affect on-call:
- Auth — SF-001 (timing oracle), SF-002 (lockout counter resets), SF-029 (SuperAdmin MFA bypass)
- Bet pipeline — SF-006 (no DB-level overdraft check), SF-007 (settle side-effects fire-and-forget)
- Wallet — SF-013 (toVault has no overdraft guard, balance can go negative)
- Admin — SF-008 (commented-out JwtGuard on bet detail), SF-015 (transactions only over WS)
If any finding has a mitigation status of pending and you can see a clean fix path, raise it with the team lead. Fresh eyes are at their most valuable in your first month.
4. Review the API changelog
Read the API changelog at ../api-reference/ (the version log is at ../api-reference/index.md plus any per-version snapshots; the dedicated api/changelog.md page is {{TBD: not yet authored — see task #12 in the docs portal}}). For each version delta, note:
- Which endpoints were added, deprecated, or removed.
- Which DTOs changed shape (those are the high-risk ones for client compatibility).
- Whether the change required a Prisma migration (cross-reference with ebit-api/prisma/migrations/).
This is the surface the customer team will need to track every release. The on-call expectation is that you can correlate "this endpoint started 502'ing today" with "we shipped X yesterday".
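The correlation described above can be sketched as a lookup over your changelog notes: given an endpoint and the time it started failing, list the recent releases that touched it. The `ReleaseRecord` shape, the 48-hour window, and all sample values are hypothetical; nothing here reflects the actual changelog format.

```typescript
// Hypothetical record distilled from one changelog entry.
interface ReleaseRecord {
  version: string;
  deployedAt: Date;
  changedEndpoints: string[];
}

// Releases deployed within `windowHours` before the incident that touched
// the failing endpoint, newest first.
function suspectReleases(
  releases: ReleaseRecord[],
  endpoint: string,
  incidentAt: Date,
  windowHours = 48
): ReleaseRecord[] {
  const windowMs = windowHours * 60 * 60 * 1000;
  return releases
    .filter((r) => {
      const ageMs = incidentAt.getTime() - r.deployedAt.getTime();
      const inWindow = ageMs >= 0 && ageMs <= windowMs;
      return inWindow && r.changedEndpoints.indexOf(endpoint) !== -1;
    })
    .sort((a, b) => b.deployedAt.getTime() - a.deployedAt.getTime());
}

// Invented data: release B shipped the day before the incident.
const sampleReleases: ReleaseRecord[] = [
  { version: "A", deployedAt: new Date("2024-03-01T09:00:00Z"), changedEndpoints: ["/example/one"] },
  { version: "B", deployedAt: new Date("2024-03-04T15:00:00Z"), changedEndpoints: ["/example/two"] },
  { version: "C", deployedAt: new Date("2024-03-04T16:00:00Z"), changedEndpoints: ["/example/one"] },
];
const sampleIncidentAt = new Date("2024-03-05T11:00:00Z");

const suspects = suspectReleases(sampleReleases, "/example/two", sampleIncidentAt);
console.log(suspects.map((r) => r.version)); // [ 'B' ]
```

An empty result does not clear the release train. It only means no noted delta touched that endpoint, so a shared dependency or a migration may still be the cause.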
5. Validate the escalation matrix
Open escalation-matrix.md. Walk every row top to bottom.
For each severity tier, confirm:
- Names and contact paths are filled in (no {{TBD}}).
- The PagerDuty schedule resolves to a real on-call human at the current minute.
- The Slack channel exists and you can post to it.
- The video bridge URL works.
- The customer-team Tier 3 escalation path to the Evospin team is documented and tested (a single test ping per channel, ack confirmed).
Anything that doesn't validate goes on the on-call go/no-go list as a blocker.
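The placeholder check is the one item on this list that is easy to automate. A minimal sketch, assuming the matrix is a text file you have already read into a string; `findTbdLines` is a hypothetical helper and the sample rows are invented, not the real matrix:

```typescript
// Return 1-based line numbers that still contain a {{TBD}} or {{TBD: ...}}
// placeholder.
function findTbdLines(matrixText: string): number[] {
  const hits: number[] = [];
  matrixText.split("\n").forEach((line, index) => {
    if (/\{\{\s*TBD[^}]*\}\}/.test(line)) {
      hits.push(index + 1);
    }
  });
  return hits;
}

// Invented rows for illustration only.
const sampleMatrix = [
  "| Tier 1 | on-call engineer | pager |",
  "| Tier 2 | {{TBD}} | pager |",
  "| Tier 3 | vendor contact | {{TBD: confirm channel}} |",
].join("\n");

console.log(findTbdLines(sampleMatrix)); // [ 2, 3 ]
```

The other checks (schedule resolves, channel is postable, bridge works, test ping acknowledged) still need a human to confirm them live.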
6. Sign off readiness: shadow + reverse-shadow
The last gate before you take the pager solo:
- Shadow shift — sit alongside an experienced on-call for one full shift (whatever your rotation cadence is — typically 12 or 24 hours). Watch them triage, write down questions, ask after each incident.
- Reverse-shadow shift — take the pager yourself, with the experienced on-call sitting alongside ready to step in. They observe; you drive. You only get help if you explicitly ask, or if you're about to take a destructive action they think is wrong.
After both shifts, both engineers sign the go/no-go below. Either signing "no" sends you back for another rotation.
End-of-week-2 checklist (on-call go/no-go)
Every box must be checked for the engineer to be added to the active on-call rotation.
- [ ] One simulated P2 incident driven end-to-end, with timeline write-up archived
- [ ] One new runbook authored, reviewed, and merged into ../runbooks/
- [ ] Security register read; pending findings discussed with team lead
- [ ] API changelog reviewed; recent deltas understood
- [ ] Escalation matrix validated end-to-end (every channel, every name)
- [ ] Shadow shift complete; observations logged
- [ ] Reverse-shadow shift complete; experienced on-call signs off
- [ ] You have access to: PagerDuty, the on-call Slack channel, Grafana, Jaeger, Loki, the production AWS account (read-only at minimum), and the runbooks repo
- [ ] You can answer "what's the first thing you do when the pager fires?" without thinking (acknowledge → open dashboards → search Jaeger errors → identify recurrence)
- [ ] You and the experienced on-call have agreed go/no-go in writing
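Because every box must be checked, the go/no-go decision reduces to "are there any unchecked items left?". The sketch below parses a checklist in the same `- [ ]` / `- [x]` form used above; `uncheckedItems` and `isGo` are hypothetical helpers, and the sample text is abbreviated.

```typescript
// Return the text of every unchecked "- [ ]" item.
function uncheckedItems(checklist: string): string[] {
  return checklist
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("- [ ]"))
    .map((line) => line.slice(5).trim());
}

// "Go" only if the checklist has at least one box and none are unchecked.
function isGo(checklist: string): boolean {
  const hasAnyBox = /- \[[ xX]\]/.test(checklist);
  return hasAnyBox && uncheckedItems(checklist).length === 0;
}

// Abbreviated sample: one blocker remains.
const sampleChecklist = [
  "- [x] One simulated P2 incident driven end-to-end",
  "- [x] Shadow shift complete; observations logged",
  "- [ ] Reverse-shadow shift complete; experienced on-call signs off",
].join("\n");

console.log(uncheckedItems(sampleChecklist));
// [ 'Reverse-shadow shift complete; experienced on-call signs off' ]
console.log(isGo(sampleChecklist)); // false
```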
After sign-off, you're on the rotation. For the first month, your solo shifts are backed by a Tier 2 senior who is available within ~1 hour; see support-model.md for the ramp-down schedule.