On-call Readiness Gate

Goal: drive a simulated P2 incident solo end-to-end, author one new runbook, read the security register and API changelog, validate the escalation matrix, and sign off readiness for the on-call rotation. This is the final gate before joining the rotation; it was previously labelled "Onboarding day 14 / end of week 2".

This page builds on ../onboarding/curriculum.md — it assumes you have completed the 2-week curriculum, read all 15 flow docs, know the Grafana dashboards by name, and completed at least one outage drill.

1. Drive a simulated P2 incident solo

Pick one of the scenarios below (or coordinate with your team lead for a scenario they want you to walk through). Drive it end-to-end without ebit-team intervention — read the oncall-runbook.md once before you start, then run.

Suggested scenarios (each maps to a real production failure mode):

BullMQ back-pressure — start the smoke profile, then docker compose stop ebit-api for 60 seconds and bring it back. Watch bet_settled_queue accumulate, then drain. Practice the "is the depth normal?" judgment call.
Redis cache flush — under steady traffic, docker compose exec ebit-redis-cache redis-cli -a cache FLUSHDB. Observe the cold-cache latency spike, the BullMQ side-effects (jobs lost or replayed), and recovery time.
WS handshake storm — spin up 200 socket.io clients in a tight loop (or use tests-perf/k6/scenarios/ws-storm.js), observe the throttler's behavior, decide whether to raise the limit or block the source.
Sign-in failure spike — set ADMIN_DEFAULT_PASSWORD to something wrong via env override and restart ebit-api. Watch the auth error rate climb, drive it back to zero by reverting.
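
For the BullMQ back-pressure scenario, the "is the depth normal?" call can be rehearsed against a handful of depth readings. The sketch below is one hedged way to frame it. It assumes you have noted the depth of bet_settled_queue by hand at a fixed interval; the function name, the readings, and the ceiling of 50 are hypothetical and are not taken from ebit-api.

```typescript
// Hypothetical helper: classify a series of queue-depth readings taken at a
// fixed interval. The ceiling is an illustrative assumption, not a value
// from ebit-api.
type DepthVerdict = "normal" | "draining" | "accumulating";

function classifyQueueDepth(depths: number[], normalCeiling: number): DepthVerdict {
  if (depths.length === 0) {
    throw new Error("need at least one depth reading");
  }
  const latest = depths[depths.length - 1];
  // At or under the assumed steady-state ceiling: nothing to do.
  if (latest <= normalCeiling) {
    return "normal";
  }
  // A single reading above the ceiling cannot show a trend, so it is
  // treated as accumulating (the cautious answer).
  if (depths.length < 2) {
    return "accumulating";
  }
  // Otherwise compare with the previous reading to see which way the
  // backlog is moving.
  const previous = depths[depths.length - 2];
  return latest < previous ? "draining" : "accumulating";
}

// Made-up readings: ebit-api stopped, then brought back.
const drillReadings = [12, 340, 910, 1480, 1100, 420, 15];
console.log(classifyQueueDepth(drillReadings.slice(0, 4), 50)); // accumulating
console.log(classifyQueueDepth(drillReadings.slice(0, 6), 50)); // draining
console.log(classifyQueueDepth(drillReadings, 50)); // normal
```

The point of the exercise is the judgment, not the function: a backlog that is above baseline but shrinking usually needs watching, while one that keeps growing after the consumer is back needs action.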

For each scenario, follow the first-five-minute checklist from oncall-runbook.md:

Acknowledge in the (mock) incident channel.
Open the ebit-perf-test and perf-system Grafana dashboards.
Query Jaeger for the last 15 min of error spans.
Check Loki for error logs.
Identify whether this is a recurrence — search ../runbooks/ and your scenario notes.
(Skip the page-the-on-call-lead step in the drill — but say out loud "if this were P0/P1 I would page now".)

Time yourself. Aim for symptom → root cause hypothesis in under 10 minutes. Write up the timeline as you go (paste-into-Slack format) — this is the artifact you'd post to a real incident channel.
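
A timeline in paste-into-Slack form can be as simple as one line per event with the elapsed minutes in front. The sketch below shows one possible layout and a check against the 10-minute target. The "T+NNm" prefix, the type, and the function names are assumptions for illustration; use whatever format your incident channel already expects.

```typescript
// Hypothetical timeline helper for the drill write-up. Timestamps are passed
// in explicitly so the output is reproducible.
interface TimelineEntry {
  at: Date;
  note: string;
}

// Whole minutes elapsed between two instants, rounded down.
function minutesBetween(start: Date, end: Date): number {
  return Math.floor((end.getTime() - start.getTime()) / 60000);
}

// Render entries as plain lines ("T+07m  note") that paste cleanly into Slack.
function formatTimeline(entries: TimelineEntry[]): string {
  if (entries.length === 0) {
    return "";
  }
  const start = entries[0].at;
  const rendered: string[] = [];
  for (const entry of entries) {
    const mins = minutesBetween(start, entry.at);
    const label = mins < 10 ? "0" + mins : String(mins);
    rendered.push("T+" + label + "m  " + entry.note);
  }
  return rendered.join("\n");
}

// True when the hypothesis came strictly under the target (10 minutes here).
function metTarget(symptomAt: Date, hypothesisAt: Date, targetMinutes: number): boolean {
  return hypothesisAt.getTime() - symptomAt.getTime() < targetMinutes * 60000;
}

// Made-up drill entries.
const drillTimeline: TimelineEntry[] = [
  { at: new Date("2024-01-01T10:00:00Z"), note: "Symptom: bet_settled_queue depth climbing" },
  { at: new Date("2024-01-01T10:02:00Z"), note: "Acked in mock incident channel, dashboards open" },
  { at: new Date("2024-01-01T10:07:00Z"), note: "Hypothesis: ebit-api stopped, no consumers" },
];
console.log(formatTimeline(drillTimeline));
console.log(metTarget(drillTimeline[0].at, drillTimeline[2].at, 10)); // true
```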

2. Author one new runbook

Pick the gap you noted on day 7. Write a runbook for it in ../runbooks/ using the template documented in ../runbooks/README.md:

Symptom — what the operator sees (one paragraph).
Likely causes — bulleted list, ordered by frequency.
Diagnosis — copy-pasteable commands. Include both the local-stack form (docker compose exec …) and a deployed-environment form (kubectl exec / aws ecs execute-command / whatever applies).
Fix — step by step, with rollback if applicable.
Prevention — what monitoring or code change avoids recurrence.
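
Before opening the PR, a quick self-check that the draft covers all five template sections can save a review round. The sketch below is a minimal example of such a check. The section names come from the list above; the assumption that each appears as a Markdown heading line (one or more "#" marks followed by the name) is mine, so confirm the real layout in ../runbooks/README.md.

```typescript
// Section names from the template list. How they appear in a runbook file is
// an assumption: a line of "#" marks followed by the section name.
const REQUIRED_SECTIONS = ["Symptom", "Likely causes", "Diagnosis", "Fix", "Prevention"];

// Return the template sections that have no matching heading in the draft.
function missingRunbookSections(runbookText: string): string[] {
  const headings: string[] = [];
  for (const line of runbookText.split("\n")) {
    const headingMatch = /^#+\s*(.*)$/.exec(line);
    if (headingMatch !== null) {
      headings.push(headingMatch[1].trim().toLowerCase());
    }
  }
  const missing: string[] = [];
  for (const section of REQUIRED_SECTIONS) {
    const wanted = section.toLowerCase();
    // A heading counts if it starts with the section name.
    const found = headings.some((h) => h.indexOf(wanted) === 0);
    if (!found) {
      missing.push(section);
    }
  }
  return missing;
}

// Made-up draft with two sections still to write.
const draftRunbook = [
  "# Postgres connection pool exhaustion",
  "## Symptom",
  "Requests hang, then time out.",
  "## Diagnosis",
  "## Fix",
].join("\n");
console.log(missingRunbookSections(draftRunbook)); // Likely causes, Prevention
```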

Open a PR that includes the runbook and the updated index table in ../runbooks/README.md, and have a senior engineer review it. The PR is your week-2 deliverable.

Common candidate gaps (any of these is fair game):

Postgres connection pool exhaustion
DB-down recovery (you ran the drill in week 1)
Captcha provider outage (see ../flows/dropbet-sign-up.md)
WS adapter scale-out failure (@socket.io/redis-adapter not installed — known weakness AF-3)
Affiliate webhook backlog
Speed-roulette state-queue deadlock (concurrency: 1 design — see ../flows/dropbet-speed-roulette.md)

Pick one, write it, ship it.

3. Review the security register

Read ../security-register.md cover to cover. It catalogs every known finding (SF-001 through the latest), each scored, each linked to the source file and line. The customer-safe finding bodies live under ../security-findings/ (three are written; the rest are summarized in the register itself).

Pay attention to the categories that affect on-call:

Auth — SF-001 (timing oracle), SF-002 (lockout counter resets), SF-029 (SuperAdmin MFA bypass)
Bet pipeline — SF-006 (no DB-level overdraft check), SF-007 (settle side-effects fire-and-forget)
Wallet — SF-013 (toVault has no overdraft guard, balance can go negative)
Admin — SF-008 (commented-out JwtGuard on bet detail), SF-015 (transactions only over WS)

If any finding has a mitigation status of pending and you can see a clean fix path, raise it with the team lead. Your fresh eyes are at their most valuable in your first month.

4. Review the API changelog

Read the API changelog at ../api-reference/ (the version log is at ../api-reference/index.md plus any per-version snapshots; the dedicated api/changelog.md page is {{TBD: not yet authored — see task #12 in the docs portal}}). For each version delta, note:

Which endpoints were added, deprecated, or removed.
Which DTOs changed shape (those are the high-risk ones for client compatibility).
Whether the change required a Prisma migration (cross-reference with ebit-api/prisma/migrations/).

This is the surface the customer team will need to track every release. The on-call expectation is that you can correlate "this endpoint started 502'ing today" with "we shipped X yesterday".
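
The correlation itself is mechanical once you have notes per release: filter to releases that touched the failing endpoint and shipped shortly before the first error. The sketch below illustrates that reasoning only. The record shape, the version labels, and the /example/... paths are hypothetical; the real changelog is prose, not structured data.

```typescript
// Hypothetical record shape for notes taken while reading the changelog.
interface ReleaseNote {
  version: string;
  shippedAt: Date;
  touchedEndpoints: string[];
}

// Releases that touched the endpoint and shipped within `lookbackHours`
// before the first observed error, most recent first.
function suspectReleases(
  releases: ReleaseNote[],
  endpoint: string,
  firstErrorAt: Date,
  lookbackHours: number
): ReleaseNote[] {
  const windowEnd = firstErrorAt.getTime();
  const windowStart = windowEnd - lookbackHours * 3600000;
  return releases
    .filter((r) => {
      const shipped = r.shippedAt.getTime();
      const inWindow = shipped >= windowStart && shipped <= windowEnd;
      return inWindow && r.touchedEndpoints.indexOf(endpoint) !== -1;
    })
    .sort((a, b) => b.shippedAt.getTime() - a.shippedAt.getTime());
}

// Made-up releases and endpoint paths.
const recentReleases: ReleaseNote[] = [
  { version: "v-old", shippedAt: new Date("2024-01-01T09:00:00Z"), touchedEndpoints: ["/example/bets"] },
  { version: "v-yesterday", shippedAt: new Date("2024-01-09T15:00:00Z"), touchedEndpoints: ["/example/bets", "/example/wallet"] },
  { version: "v-unrelated", shippedAt: new Date("2024-01-09T18:00:00Z"), touchedEndpoints: ["/example/affiliates"] },
];
const suspects = suspectReleases(recentReleases, "/example/bets", new Date("2024-01-10T08:00:00Z"), 48);
console.log(suspects.map((r) => r.version)); // only "v-yesterday"
```

A match is a lead, not a verdict: a release that changed a DTO or required a Prisma migration is simply the first place to look.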

5. Validate the escalation matrix

Open escalation-matrix.md. Walk every row top to bottom.

For each severity tier, confirm:

Names and contact paths are filled in (no {{TBD}}).
The PagerDuty schedule resolves to a real on-call human at the current minute.
The Slack channel exists and you can post to it.
The video bridge URL works.
The customer-team Tier 3 escalation path to the Evospin team is documented and tested (a single test ping per channel, ack confirmed).

Anything that doesn't validate goes on the on-call go/no-go list as a blocker.
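
The "no {{TBD}}" check is the one item here that can be automated; the rest need a human to ping a channel or open a link. The sketch below scans text for leftover placeholders. It assumes placeholders look like {{TBD}} or {{TBD: note}}, as they do elsewhere in these docs; the sample rows and names are made up and do not reflect the real escalation-matrix.md.

```typescript
interface PlaceholderHit {
  line: number; // 1-based line number in the scanned text
  content: string;
}

// Find lines that still carry a {{TBD}} or {{TBD: ...}} placeholder.
function findTbdPlaceholders(text: string): PlaceholderHit[] {
  const hits: PlaceholderHit[] = [];
  const rows = text.split("\n");
  for (let i = 0; i < rows.length; i++) {
    if (/\{\{\s*TBD[^}]*\}\}/.test(rows[i])) {
      hits.push({ line: i + 1, content: rows[i].trim() });
    }
  }
  return hits;
}

// Made-up rows for illustration only.
const sampleMatrix = [
  "P0 | primary: {{TBD}} | channel: #example-p0",
  "P1 | primary: A. Example | channel: #example-p1",
  "P2 | primary: {{TBD: awaiting rota}} | channel: #example-p2",
].join("\n");
console.log(findTbdPlaceholders(sampleMatrix).map((h) => h.line)); // lines 1 and 3
```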

6. Sign off readiness: shadow + reverse-shadow

The last gate before you take the pager solo:

Shadow shift — sit alongside an experienced on-call for one full shift (whatever your rotation cadence is — typically 12 or 24 hours). Watch them triage, write down questions, ask after each incident.
Reverse-shadow shift — take the pager yourself, with the experienced on-call sitting alongside ready to step in. They observe; you drive. You only get help if you explicitly ask, or if you're about to take a destructive action they think is wrong.

After both shifts, both engineers sign the go/no-go below. If either of you signs "no", you go back for another rotation.

End-of-week-2 checklist (on-call go/no-go)

Every box must be checked for the engineer to be added to the active on-call rotation.

[ ] One simulated P2 incident driven end-to-end, with timeline write-up archived
[ ] One new runbook authored, reviewed, and merged into ../runbooks/
[ ] Security register read; pending findings discussed with team lead
[ ] API changelog reviewed; recent deltas understood
[ ] Escalation matrix validated end-to-end (every channel, every name)
[ ] Shadow shift complete; observations logged
[ ] Reverse-shadow shift complete; experienced on-call signs off
[ ] You have access to: PagerDuty, the on-call Slack channel, Grafana, Jaeger, Loki, the production AWS account (read-only at minimum), and the runbooks repo
[ ] You can answer "what's the first thing you do when the pager fires?" without thinking (acknowledge → open dashboards → search Jaeger errors → check Loki logs → identify recurrence)
[ ] You and the experienced on-call have agreed go/no-go in writing
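
Because every box must be checked, the gate reduces to "are there any open items?". The sketch below shows that rule applied to a copy of the checklist. It assumes a checked box is written "[x]", which is a common convention but not one this page specifies, and the function names are hypothetical.

```typescript
// List checklist items that are still open ("[ ]"). Checked items are assumed
// to be written "[x]" or "[X]".
function openChecklistItems(checklist: string): string[] {
  const unchecked: string[] = [];
  for (const row of checklist.split("\n")) {
    const m = /^\s*\[( |x|X)\]\s*(.+)$/.exec(row);
    if (m !== null && m[1] === " ") {
      unchecked.push(m[2].trim());
    }
  }
  return unchecked;
}

// "go" only when the text contains at least one checklist item and none is open.
function goNoGo(checklist: string): "go" | "no-go" {
  const totalItems = checklist.split("\n").filter((row) => /^\s*\[( |x|X)\]/.test(row)).length;
  if (totalItems === 0) {
    return "no-go";
  }
  return openChecklistItems(checklist).length === 0 ? "go" : "no-go";
}

// Three items copied from the checklist, one still open.
const sampleChecklist = [
  "[x] One simulated P2 incident driven end-to-end, with timeline write-up archived",
  "[ ] One new runbook authored, reviewed, and merged into ../runbooks/",
  "[x] Security register read; pending findings discussed with team lead",
].join("\n");
console.log(goNoGo(sampleChecklist)); // no-go
console.log(openChecklistItems(sampleChecklist));
```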

After sign-off, you're on the rotation. For the first month, your solo shifts are backed by a Tier 2 senior available within ~1 hour; see support-model.md for the ramp-down schedule.