
On-call Readiness Gate

Goal: drive a simulated P2 incident solo end-to-end, author one new runbook, read the security register and API changelog, validate the escalation matrix, and sign off readiness for the on-call rotation. This is the final gate before joining the rotation; it was previously labelled "Onboarding day 14 / end of week 2".

This page builds on ../onboarding/curriculum.md — it assumes you have completed the 2-week curriculum, read all 15 flow docs, know the Grafana dashboards by name, and completed at least one outage drill.


1. Drive a simulated P2 incident solo

Pick one of the scenarios below (or coordinate with your team lead for a scenario they want you to walk through). Drive it end-to-end without ebit-team intervention — read the oncall-runbook.md once before you start, then run.

Suggested scenarios (each maps to a real production failure mode):

  • BullMQ back-pressure — start the smoke profile, then docker compose stop ebit-api for 60 seconds and bring it back. Watch bet_settled_queue accumulate, then drain. Practice the "is the depth normal?" judgment call.
  • Redis cache flush — under steady traffic, docker compose exec ebit-redis-cache redis-cli -a cache FLUSHDB. Observe the cold-cache latency spike, the BullMQ side-effects (jobs lost or replayed), and recovery time.
  • WS handshake storm — spin up 200 socket.io clients in a tight loop (or use tests-perf/k6/scenarios/ws-storm.js), observe the throttler's behavior, decide whether to raise the limit or block the source.
  • Sign-in failure spike — set ADMIN_DEFAULT_PASSWORD to something wrong via env override and restart ebit-api. Watch the auth error rate climb, drive it back to zero by reverting.
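One way to practice the "is the depth normal?" call in the BullMQ scenario is to jot down a few depth readings for bet_settled_queue and classify the trend. The sketch below is illustrative only: the function name, the tolerance value, and the idea of sampling readings by hand from a dashboard are assumptions, not part of the ebit tooling.

```python
from typing import Sequence


def classify_depth_trend(samples: Sequence[int], tolerance: int = 5) -> str:
    """Classify queue-depth readings (oldest first) taken at a fixed interval.

    `tolerance` is a hypothetical noise band: changes smaller than this
    between the first and last reading are treated as "stable".
    """
    if len(samples) < 2:
        return "unknown"  # one reading cannot show a trend
    if samples[-1] == 0:
        return "drained"  # the backlog has been fully consumed
    delta = samples[-1] - samples[0]
    if delta > tolerance:
        return "accumulating"
    if delta < -tolerance:
        return "draining"
    return "stable"


if __name__ == "__main__":
    # Hypothetical readings: while ebit-api is stopped, then after it returns.
    print(classify_depth_trend([0, 40, 120, 300]))
    print(classify_depth_trend([300, 180, 60, 0]))
```

A depth that keeps accumulating after ebit-api is back is the signal worth escalating in the write-up; a draining queue usually only needs watching until it reaches zero.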

For each scenario, follow the first-five-minute checklist from oncall-runbook.md:

  1. Acknowledge in the (mock) incident channel.
  2. Open the ebit-perf-test and perf-system Grafana dashboards.
  3. Query Jaeger for the last 15 min of error spans.
  4. Check Loki for error logs.
  5. Identify whether this is a recurrence — search ../runbooks/ and your scenario notes.
  6. (Skip the page-the-on-call-lead step in the drill — but say out loud "if this were P0/P1 I would page now".)
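For step 4, it can help to have the Loki query ready before the drill starts. The sketch below only builds a URL for Loki's query_range HTTP endpoint covering the last 15 minutes. The base URL, the `service="ebit-api"` label selector, and the function name are assumptions; check the label names your Loki instance actually uses.

```python
import time
from typing import Optional
from urllib.parse import urlencode

NS_PER_MINUTE = 60 * 1_000_000_000


def loki_error_query_url(
    base_url: str,
    selector: str,
    lookback_minutes: int = 15,
    now_ns: Optional[int] = None,
    limit: int = 200,
) -> str:
    """Build a query_range URL for lines matching "error" (case-insensitive)."""
    end_ns = now_ns if now_ns is not None else time.time_ns()
    start_ns = end_ns - lookback_minutes * NS_PER_MINUTE
    params = {
        # LogQL: stream selector followed by a regex line filter.
        "query": f'{selector} |~ "(?i)error"',
        "start": start_ns,  # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local endpoint and label; adjust both to your stack.
    print(loki_error_query_url("http://localhost:3100", '{service="ebit-api"}'))
```

The printed URL can be passed to curl or opened in a browser; in practice most people run the same LogQL expression from Grafana's Explore view.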

Time yourself. Aim for symptom → root cause hypothesis in under 10 minutes. Write up the timeline as you go (paste-into-Slack format) — this is the artifact you'd post to a real incident channel.
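A small helper can keep the timeline and the timing honest while you work. This is a minimal sketch, not an existing team tool: the `Timeline` class, its method names, and the output layout are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class Timeline:
    title: str
    entries: List[Tuple[datetime, str]] = field(default_factory=list)

    def note(self, text: str, at: Optional[datetime] = None) -> None:
        """Record an entry, timestamped now (UTC) unless `at` is given."""
        self.entries.append((at or datetime.now(timezone.utc), text))

    def minutes_to(self, marker: str) -> Optional[float]:
        """Minutes from the earliest entry to the first entry containing `marker`."""
        ordered = sorted(self.entries, key=lambda e: e[0])
        for ts, text in ordered:
            if marker.lower() in text.lower():
                return (ts - ordered[0][0]).total_seconds() / 60
        return None

    def render(self) -> str:
        """Plain-text block, one line per entry, suitable for pasting into Slack."""
        lines = [f"*{self.title}*"]
        for ts, text in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{ts:%H:%M:%S}Z  {text}")
        return "\n".join(lines)


if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    tl = Timeline("P2 drill: Redis cache flush")
    tl.note("Symptom: latency spike on dashboards", at=start)
    tl.note("Hypothesis: cold cache after FLUSHDB", at=start + timedelta(minutes=6))
    print(tl.render())
    print("minutes to hypothesis:", tl.minutes_to("hypothesis"))
```

Prefixing the root-cause entry with a fixed word such as "Hypothesis" makes the symptom-to-hypothesis time easy to compute against the 10-minute target.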


2. Author one new runbook

Pick the gap you noted on day 7. Write a runbook for it in ../runbooks/ using the template documented in ../runbooks/README.md:

  1. Symptom — what the operator sees (one paragraph).
  2. Likely causes — bulleted list, ordered by frequency.
  3. Diagnosis — copy-pasteable commands. Include both the local-stack form (docker compose exec …) and a deployed-environment form (kubectl exec / aws ecs execute-command / whatever applies).
  4. Fix — step by step, with rollback if applicable.
  5. Prevention — what monitoring or code change avoids recurrence.
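To avoid starting from a blank file, the five sections above can be stamped out as a skeleton. The sketch below is an assumption about layout only: the heading levels and comment hints are illustrative, and the template in ../runbooks/README.md remains the authority on the actual format.

```python
# Section names and hints mirror the five-part template described above.
SECTIONS = [
    ("Symptom", "What the operator sees (one paragraph)."),
    ("Likely causes", "Bulleted list, ordered by frequency."),
    ("Diagnosis", "Copy-pasteable commands: local-stack and deployed-environment forms."),
    ("Fix", "Step by step, with rollback if applicable."),
    ("Prevention", "What monitoring or code change avoids recurrence."),
]


def runbook_skeleton(title: str) -> str:
    """Return a markdown skeleton with one numbered heading per template section."""
    lines = [f"# {title}", ""]
    for number, (name, hint) in enumerate(SECTIONS, start=1):
        lines += [f"## {number}. {name}", "", f"<!-- {hint} -->", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(runbook_skeleton("Postgres connection pool exhaustion"))
```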

Open a PR that also updates the index table in ../runbooks/README.md, and have a senior engineer review it. The PR is your week-2 deliverable.

Common candidate gaps (any of these is fair game):

  • Postgres connection pool exhaustion
  • DB-down recovery (you ran the drill in week 1)
  • Captcha provider outage (see ../flows/dropbet-sign-up.md)
  • WS adapter scale-out failure (@socket.io/redis-adapter not installed — known weakness AF-3)
  • Affiliate webhook backlog
  • Speed-roulette state-queue deadlock (concurrency: 1 design — see ../flows/dropbet-speed-roulette.md)

Pick one, write it, ship it.


3. Review the security register

Read ../security-register.md cover to cover. It catalogs every known finding (SF-001 through the latest), each scored, each linked to the source file and line. The customer-safe finding bodies live under ../security-findings/ (three are written; the rest are summarized in the register itself).

Pay attention to the categories that affect on-call:

  • Auth — SF-001 (timing oracle), SF-002 (lockout counter resets), SF-029 (SuperAdmin MFA bypass)
  • Bet pipeline — SF-006 (no DB-level overdraft check), SF-007 (settle side-effects fire-and-forget)
  • Wallet — SF-013 (toVault has no overdraft guard, balance can go negative)
  • Admin — SF-008 (commented-out JwtGuard on bet detail), SF-015 (transactions only over WS)

If any finding has a mitigation status of pending and you can see a clean fix path, raise it with the team lead. Your fresh eyes are at their most valuable in the first month.


4. Review the API changelog

Read the API version log under ../api-reference/: start with ../api-reference/index.md, then any per-version snapshots. The dedicated api/changelog.md page is {{TBD: not yet authored — see task #12 in the docs portal}}. For each version delta, note:

  • Which endpoints were added, deprecated, or removed.
  • Which DTOs changed shape (those are the high-risk ones for client compatibility).
  • Whether the change required a Prisma migration (cross-reference with ebit-api/prisma/migrations/).

This is the surface the customer team will need to track every release. The on-call expectation is that you can correlate "this endpoint started 502'ing today" with "we shipped X yesterday".
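The correlation itself is simple: given the time errors started, find the most recent release that shipped before it. A minimal sketch follows; the function name, the 48-hour window, and the version labels in the example are hypothetical, and the release times would come from whatever deploy record your team keeps.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Release = Tuple[str, datetime]  # (version label, time it shipped)


def last_release_before(
    onset: datetime,
    releases: Iterable[Release],
    window: timedelta = timedelta(hours=48),
) -> Optional[Release]:
    """Most recent release shipped at or before `onset`, no older than `window`."""
    candidates = [
        (version, shipped_at)
        for version, shipped_at in releases
        if timedelta(0) <= onset - shipped_at <= window
    ]
    return max(candidates, key=lambda r: r[1], default=None)


if __name__ == "__main__":
    # Hypothetical data for illustration only.
    releases = [
        ("release-a", datetime(2024, 1, 1, 10, 0)),
        ("release-b", datetime(2024, 1, 2, 9, 0)),
    ]
    print(last_release_before(datetime(2024, 1, 2, 15, 0), releases))
```

A match is only a starting hypothesis: confirm it against the changelog entry for that version (changed DTOs, removed endpoints, migrations) before rolling anything back.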


5. Validate the escalation matrix

Open escalation-matrix.md. Walk every row top to bottom.

For each severity tier, confirm:

  • Names and contact paths are filled in (no {{TBD}}).
  • The PagerDuty schedule resolves to a real on-call human at the current minute.
  • The Slack channel exists and you can post to it.
  • The video bridge URL works.
  • The customer-team Tier 3 escalation path to the Evospin team is documented and tested (a single test ping per channel, ack confirmed).

Anything that doesn't validate goes on the on-call go/no-go list as a blocker.
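The first check, unfilled {{TBD}} placeholders, is easy to automate. The sketch below scans text for `{{TBD}}` and `{{TBD: ...}}` markers and reports line numbers; the function name is hypothetical, and the remaining checks (PagerDuty schedule, Slack channel, video bridge, test pings) still have to be done by hand.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Matches {{TBD}} as well as {{TBD: some note}}.
TBD_PATTERN = re.compile(r"\{\{\s*TBD[^}]*\}\}")


def find_placeholders(text: str) -> List[Tuple[int, str]]:
    """Return (line number, placeholder) for every TBD marker in `text`."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in TBD_PATTERN.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


if __name__ == "__main__":
    matrix = Path("escalation-matrix.md")  # path relative to this page
    if matrix.exists():
        for line_no, marker in find_placeholders(matrix.read_text(encoding="utf-8")):
            print(f"{matrix}:{line_no}: {marker}")
```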


6. Sign off readiness — shadow + reverse-shadow

The last gate before you take the pager solo:

  1. Shadow shift — sit alongside an experienced on-call for one full shift (whatever your rotation cadence is — typically 12 or 24 hours). Watch them triage, write down questions, ask after each incident.
  2. Reverse-shadow shift — take the pager yourself, with the experienced on-call sitting alongside ready to step in. They observe; you drive. You only get help if you explicitly ask, or if you're about to take a destructive action they think is wrong.

After both shifts, both engineers sign the go/no-go below. A "no" from either engineer sends you back for another rotation.


End-of-week-2 checklist (on-call go/no-go)

Every box must be checked for the engineer to be added to the active on-call rotation.

  • [ ] One simulated P2 incident driven end-to-end, with timeline write-up archived
  • [ ] One new runbook authored, reviewed, and merged into ../runbooks/
  • [ ] Security register read; pending findings discussed with team lead
  • [ ] API changelog reviewed; recent deltas understood
  • [ ] Escalation matrix validated end-to-end (every channel, every name)
  • [ ] Shadow shift complete; observations logged
  • [ ] Reverse-shadow shift complete; experienced on-call signs off
  • [ ] You have access to: PagerDuty, the on-call Slack channel, Grafana, Jaeger, Loki, the production AWS account (read-only at minimum), and the runbooks repo
  • [ ] You can answer "what's the first thing you do when the pager fires?" without thinking (acknowledge → open dashboards → search Jaeger errors → identify recurrence)
  • [ ] You and the experienced on-call have agreed go/no-go in writing
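Because every box must be checked, the go/no-go decision reduces to "are there any unchecked boxes?". The following sketch parses a checklist in the `[ ]` / `[x]` style used above; the function name and the rule that an empty checklist counts as no-go are assumptions.

```python
import re
from typing import List, Tuple

# Optional bullet (-, * or •), then "[ ]" or "[x]", then the item text.
BOX = re.compile(r"^\s*(?:[-*•]\s+)?\[([ xX])\]\s+(.*)$")


def go_no_go(checklist: str) -> Tuple[str, List[str]]:
    """Return ("go" | "no-go", list of unchecked item texts)."""
    boxes = [m for m in (BOX.match(line) for line in checklist.splitlines()) if m]
    unchecked = [m.group(2).strip() for m in boxes if m.group(1) == " "]
    decision = "go" if boxes and not unchecked else "no-go"
    return decision, unchecked


if __name__ == "__main__":
    sample = "\n".join([
        "• [x] Shadow shift complete; observations logged",
        "• [ ] Reverse-shadow shift complete; experienced on-call signs off",
    ])
    decision, blockers = go_no_go(sample)
    print(decision)
    for item in blockers:
        print("blocker:", item)
```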

After sign-off, you're on the rotation. For the first month, your solo shifts are backed by a Tier 2 senior who is available within ~1 hour; see support-model.md for the ramp-down schedule.