Runbook: Captcha provider down — break-glass¶
Symptom¶
Sign-up and sign-in 4xx rate spikes; players can't authenticate. Visible as one or more of:
- `POST /auth/sign-up`, `POST /auth/sign-in`, `POST /auth/forgot-password` returning `AUTH_INVALID_CAPTCHA` at a high rate.
- Loki: `{service_name="ebit-api"} |= "Recaptcha unexpected error"` or `|= "AUTH_INVALID_CAPTCHA"`.
- api logs: requests to `https://www.google.com/recaptcha/api/siteverify` failing or timing out. The call is made server-side, so it is logged but never appears in the browser's DevTools → Network tab.
- Status page for Google reCAPTCHA shows an incident: https://status.cloud.google.com/ (search "reCAPTCHA").
Important: the provider is Google reCAPTCHA, not GeeTest. The implementation at `apps/api/src/captcha/google/recaptcha.service.ts:50-54` calls `https://www.google.com/recaptcha/api/siteverify` directly with no provider abstraction. The handover-kit prompt mentioning GeeTest was a hint that didn't match the live code, so do not page GeeTest support.
Likely causes¶
- Google reCAPTCHA outage: provider-side incident; we have zero levers to pull on the upstream.
- Network egress blocked: `ebit-api` can't reach `www.google.com`; firewall change or DNS issue.
- Misconfigured `RECAPTCHA_SECRET`: secret rotated upstream but Doppler not updated; every verification returns `success: false`.
- Backend regression: `RecaptchaService` exception not handled at the call site; new endpoint missing the bypass for legitimate retries.
- Token replay rate-limit: the `@IdempotencyLock` at `recaptcha.service.ts:46` rejects the same token twice within 5s; legitimate retries can hit this on a slow client.
Detection¶
- Grafana: `service-overview`, `ebit-api` 4xx rate panel; filter by route `/auth/sign-up` and `/auth/sign-in`.
- Loki: `{service_name="ebit-api"} |= "AUTH_INVALID_CAPTCHA"` rate; baseline ~0–1/min, alert at >10/min.
- External: https://status.cloud.google.com/ for the upstream provider; correlate timeline.
First-response — confirm the issue is upstream, not us¶
Run all four checks in parallel; ~30 seconds.
1. Hit the provider directly from inside the api container¶
docker exec ebit-api curl -sSI --max-time 5 https://www.google.com/recaptcha/api/siteverify | head -1
# -S keeps curl's error line (e.g. "curl: (28) ...") visible; plain -s would suppress it.
| Output | Diagnosis |
|---|---|
| `HTTP/2 405` (or 400) | Provider is reachable; api-side issue. Skip to §Likely causes, items 3–5. |
| `HTTP/2 5xx` | Provider impaired; go to §Break-glass. |
| `curl: (28) Operation timed out` (or `curl: (6) Could not resolve host`) | Egress / DNS broken; not an upstream issue. Check network before break-glass. |
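The table above can be read as a small decision function. The sketch below is illustrative only: `diagnoseProbe` and the `ProbeDiagnosis` labels are hypothetical names, not part of the codebase, and it assumes the input is the first line printed by the curl probe.

```typescript
// Hypothetical helper mirroring the diagnosis table; not part of ebit-api.
type ProbeDiagnosis = "api-side" | "provider-impaired" | "egress-or-dns" | "unknown";

function diagnoseProbe(firstLine: string): ProbeDiagnosis {
  const line = firstLine.trim();

  // curl failures look like "curl: (28) Operation timed out after ...".
  const curlError = /^curl: \((\d+)\)/.exec(line);
  if (curlError) {
    const exitCode = Number(curlError[1]);
    // 6 = could not resolve host, 7 = failed to connect, 28 = timeout.
    if (exitCode === 6 || exitCode === 7 || exitCode === 28) {
      return "egress-or-dns";
    }
    return "unknown";
  }

  // Status lines look like "HTTP/2 405" or "HTTP/1.1 405 Method Not Allowed".
  const status = /^HTTP\/[\d.]+ (\d{3})/.exec(line);
  if (!status) {
    return "unknown";
  }
  const code = Number(status[1]);
  if (code >= 500) {
    return "provider-impaired";
  }
  if (code === 400 || code === 405) {
    return "api-side";
  }
  return "unknown";
}
```

Anything that lands in `unknown` (empty output, an unexpected status) is worth pasting into the incident channel rather than guessing at.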
2. Check the upstream status page¶
curl -s https://status.cloud.google.com/incidents.json | jq '.[] | select(.service_name | contains("reCAPTCHA"))' | head -20
If a reCAPTCHA-tagged incident is open, we are downstream of it. Confirm in #oncall before applying break-glass.
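The jq filter above matches on `service_name` only, so it may also return incidents that have already ended. A minimal sketch of an "open incidents only" filter follows. Only `service_name` comes from the command above; the `id`, `begin`, and `end` fields are assumptions about the feed's shape and should be checked against a real response before relying on them.

```typescript
// Assumed shape of one incidents.json entry. Only `service_name` is taken
// from the jq filter in this runbook; the other fields are assumptions.
interface StatusIncident {
  id: string;
  service_name: string;
  begin: string;
  end?: string | null; // assumed to be missing or null while the incident is open
}

// Hypothetical helper: keep reCAPTCHA incidents that have no end time yet.
function openRecaptchaIncidents(incidents: StatusIncident[]): StatusIncident[] {
  return incidents.filter(function (incident) {
    const isRecaptcha = incident.service_name.toLowerCase().indexOf("recaptcha") !== -1;
    const isOpen = incident.end === undefined || incident.end === null;
    return isRecaptcha && isOpen;
  });
}
```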
3. Verify our secret hasn't been clobbered¶
docker exec ebit-api sh -c 'echo "secret length: ${#RECAPTCHA_SECRET}"'
# Expected: 40 (Google secret keys are 40 chars)
If it's 0, the secret didn't load — Doppler or env-cmd issue.
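The same length check could also run at api startup so that a missing secret fails loudly instead of surfacing as captcha errors. This is a sketch under that assumption; `checkRecaptchaSecret` and the status labels are hypothetical, and the expected length of 40 is taken from the note above.

```typescript
type SecretStatus = "ok" | "missing" | "unexpected-length";

// Expected length per the "40 chars" note in this runbook.
const EXPECTED_SECRET_LENGTH = 40;

// Hypothetical helper: classifies the secret without ever logging its value.
function checkRecaptchaSecret(secret: string | undefined): SecretStatus {
  if (secret === undefined || secret.length === 0) {
    return "missing";
  }
  // No trimming on purpose: stray whitespace or a trailing newline from a bad
  // paste changes the length and is reported as "unexpected-length".
  return secret.length === EXPECTED_SECRET_LENGTH ? "ok" : "unexpected-length";
}
```

In the service this would be called with `process.env.RECAPTCHA_SECRET`; a `missing` result points at Doppler or env-cmd, while `unexpected-length` suggests a truncated or padded value.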
4. Confirm the local-dev bypass is not active in the affected environment¶
isLocal (in apps/api/src/captcha/google/recaptcha.service.ts:28) is true only when NODE_ENV=local; the bypass if (isLocal && token === 'pass') return; lets the literal token pass through. This must be false in any deployed environment. If a non-local environment shows NODE_ENV=local, that is the actual incident — fix the env, not the captcha.
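To make the failure mode concrete, the sketch below restates the bypass condition as a pure function and adds a startup guard that the current code does not have. `isCaptchaBypassed` and `assertNotLocalInDeployment` are hypothetical names, and `DEPLOY_ENV` is an invented marker variable standing in for whatever signal identifies a deployed environment.

```typescript
type Env = Record<string, string | undefined>;

// Restates the documented bypass: only the literal token 'pass', only when NODE_ENV=local.
function isCaptchaBypassed(nodeEnv: string | undefined, token: string): boolean {
  const isLocal = nodeEnv === "local";
  return isLocal && token === "pass";
}

// Hypothetical startup guard (not in the current code). DEPLOY_ENV is an
// invented variable name used here to mean "this process runs in a deployed env".
function assertNotLocalInDeployment(env: Env): void {
  const deployEnv = env["DEPLOY_ENV"];
  const isDeployed = deployEnv !== undefined && deployEnv !== "";
  if (isDeployed && env["NODE_ENV"] === "local") {
    throw new Error("NODE_ENV=local in a deployed environment: captcha bypass would be active");
  }
}
```

A guard of this kind would turn a misconfigured `NODE_ENV` into a failed boot instead of a silent captcha bypass.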
Break-glass options¶
Apply at most one. Document the choice and the start time in the incident channel before flipping any switch.
Option A (preferred, if available) — fail over to a backup provider¶
The current code path is single-provider, single-URL hardcoded (recaptcha.service.ts:51). There is no CAPTCHA_PROVIDER env var, no backup provider configured. This option is therefore not currently available — it is the right design but it has not been implemented. {{TBD: implement provider abstraction with at least one backup (hCaptcha or Cloudflare Turnstile). Tracked as ADR candidate.}}.
If you have already shipped the abstraction by the time you're reading this, flip via Doppler:
doppler secrets set --project ebit --config <env> CAPTCHA_PROVIDER=hcaptcha
docker compose restart ebit-api
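Since Option A is not implemented, the following is only a sketch of what the selection step of a provider abstraction might look like. Google reCAPTCHA, hCaptcha, and Cloudflare Turnstile each expose a siteverify endpoint, so a config table keyed by `CAPTCHA_PROVIDER` is one plausible shape. The function name and the backup secret variable names (`HCAPTCHA_SECRET`, `TURNSTILE_SECRET`) are hypothetical.

```typescript
type CaptchaProviderName = "google" | "hcaptcha" | "turnstile";

interface CaptchaProviderConfig {
  name: CaptchaProviderName;
  verifyUrl: string;
  secretEnvVar: string; // HCAPTCHA_SECRET and TURNSTILE_SECRET are hypothetical names
}

const CAPTCHA_PROVIDERS: Record<CaptchaProviderName, CaptchaProviderConfig> = {
  google: {
    name: "google",
    verifyUrl: "https://www.google.com/recaptcha/api/siteverify",
    secretEnvVar: "RECAPTCHA_SECRET"
  },
  hcaptcha: {
    name: "hcaptcha",
    verifyUrl: "https://api.hcaptcha.com/siteverify",
    secretEnvVar: "HCAPTCHA_SECRET"
  },
  turnstile: {
    name: "turnstile",
    verifyUrl: "https://challenges.cloudflare.com/turnstile/v0/siteverify",
    secretEnvVar: "TURNSTILE_SECRET"
  }
};

// Unset or empty CAPTCHA_PROVIDER keeps today's behaviour (Google).
// An unrecognised value throws, so a typo during an incident fails loudly.
function selectCaptchaProvider(env: Record<string, string | undefined>): CaptchaProviderConfig {
  const requested = (env["CAPTCHA_PROVIDER"] || "google").trim().toLowerCase();
  if (requested === "google" || requested === "hcaptcha" || requested === "turnstile") {
    return CAPTCHA_PROVIDERS[requested];
  }
  throw new Error("Unknown CAPTCHA_PROVIDER: " + requested);
}
```

A failover also needs the frontend to render the matching widget and site key, which this sketch does not cover.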
Option B (extreme) — disable captcha verification entirely¶
This option is not currently implemented in code — there is no DISABLE_CAPTCHA env var. The only bypass is isLocal && token === 'pass' (recaptcha.service.ts:28), which is local-only.
To genuinely disable in production, you would either:
- Patch + deploy a hotfix that adds `if (process.env.DISABLE_CAPTCHA === 'true') return;` to `validateRecaptcha` (5-line change, requires a deploy, so not break-glass-fast). {{TBD: ship this as a code change so it's available the next time}}, or
- Temporarily flip `NODE_ENV=local` on the api service and have the FE submit `token=pass`. This is a hack; it disables several other `isLocal`-gated guards too. Don't do this in production unless the alternative is a full sign-in outage > 30 min and IC + security have signed off.
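As a rough illustration of the first bullet, the kill switch can be expressed as a pure decision made before any call to the provider. `decideCaptchaHandling` is a hypothetical name and this is not the shipped code; the only detail taken from the text is the strict comparison against the string `'true'`.

```typescript
type CaptchaHandling = "skip" | "verify";

// Hypothetical sketch of the DISABLE_CAPTCHA check described above.
// Only the exact string "true" disables verification, so values such as
// "TRUE", "1", or "false" leave captcha enforcement on.
function decideCaptchaHandling(env: Record<string, string | undefined>): CaptchaHandling {
  if (env["DISABLE_CAPTCHA"] === "true") {
    return "skip";
  }
  return "verify";
}
```

Inside `validateRecaptcha` this would be called with `process.env`, returning early on `skip`. Logging a warning on every skipped verification would make the disabled window easy to find in Loki afterwards.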
If the hotfix from the first bullet has already shipped, the procedure is:
doppler secrets set --project ebit --config <env> DISABLE_CAPTCHA=true
docker compose restart ebit-api
# Hard 30-minute timer. Set a calendar alarm. Flip back the moment provider recovers.
Risks while disabled:
- Bot-driven sign-up surge — manifests as a flood of new accounts.
- Brute-force sign-in surge — the lockout counter (SF-002) still applies but its bug means it resets after TTL.
- Password-reset abuse: `forgot-password` becomes an enumeration oracle for valid emails.
Mitigations during the window: rate-limit at the edge (CDN / WAF) on /auth/* paths; keep the lockout TTL short; monitor sign-up + sign-in rates and stop the moment they look abusive.
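Edge rate limits are configured in the CDN / WAF, not in application code, but the behaviour being asked for is roughly a fixed-window counter per client on `/auth/*`. The sketch below only illustrates that behaviour; the class, the function, and the limits used in the usage note are hypothetical.

```typescript
interface WindowState {
  windowStart: number;
  count: number;
}

// Illustrative fixed-window limiter; real enforcement belongs at the edge.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows: Record<string, WindowState> = {};

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it should be rejected (e.g. HTTP 429).
  allow(key: string, nowMs: number): boolean {
    const state = this.windows[key];
    if (state === undefined || nowMs - state.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows[key] = { windowStart: nowMs, count: 1 };
      return true;
    }
    if (state.count >= this.limit) {
      return false;
    }
    state.count += 1;
    return true;
  }
}

// Only /auth and /auth/* are in scope for the rule.
function isAuthPath(path: string): boolean {
  return path === "/auth" || path.indexOf("/auth/") === 0;
}
```

For example, `new FixedWindowLimiter(10, 60000)` keyed by client IP would allow ten `/auth/*` requests per minute per address; the right numbers depend on normal traffic and are not specified in this runbook.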
Option C (controlled) — extend token TTL / loosen idempotency¶
Not directly supported by the current code either. The idempotency lock at recaptcha.service.ts:46 uses lockTtl: 5000 (hardcoded). You could:
- Patch + deploy `lockTtl` to read from env (`process.env.RECAPTCHA_IDEMPOTENCY_TTL`), default 5000. This buys clients a longer reuse window for the same token if Google is intermittently 5xx-ing, so fewer re-issuances are needed. Modest mitigation; not a real break-glass.
- For end-user retries during a partial outage, a longer FE token cache is more effective than backend changes.
{{TBD: FE-side reCAPTCHA token caching policy — ebit-fe currently fetches a fresh token per submit attempt}}.
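If the `lockTtl` patch is pursued, the env parsing deserves some care, because a malformed value should not silently change lock behaviour. A minimal sketch, assuming the variable name from the bullet above and a hypothetical helper name `readLockTtlMs`:

```typescript
// Matches the hardcoded lockTtl: 5000 mentioned above.
const DEFAULT_LOCK_TTL_MS = 5000;

// Hypothetical helper: reads RECAPTCHA_IDEMPOTENCY_TTL and falls back to the
// default when the value is unset, blank, non-numeric, fractional, or not positive.
function readLockTtlMs(env: Record<string, string | undefined>): number {
  const raw = env["RECAPTCHA_IDEMPOTENCY_TTL"];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_LOCK_TTL_MS;
  }
  const parsed = Number(raw);
  const isPositiveInteger = isFinite(parsed) && parsed > 0 && Math.floor(parsed) === parsed;
  return isPositiveInteger ? parsed : DEFAULT_LOCK_TTL_MS;
}
```

Whether the value can be passed to the `@IdempotencyLock` decorator at runtime depends on how that decorator reads its options, which this runbook does not describe.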
Verification¶
After provider recovery (or break-glass):
- `AUTH_INVALID_CAPTCHA` rate drops back to baseline (~0–1/min) in Loki.
- `POST /auth/sign-in` 200 rate returns to typical: hit `/auth/sign-in` from a real browser session and confirm the cookie is set.
- Provider URL responds: `docker exec ebit-api curl -sSI --max-time 5 https://www.google.com/recaptcha/api/siteverify` returns 405 (the expected method-not-allowed for HEAD), i.e., reachable.
- If you flipped a break-glass setting: revert it. Confirm the revert via `doppler secrets get --project ebit --config <env> <var>` and then `docker compose restart ebit-api`.
- End-to-end: sign up a fresh account; confirm the captcha-gated path runs cleanly.
Postmortem checklist¶
The incident is your prompt to fix the gaps that made it harder than it had to be.
- Did Grafana have a panel for `AUTH_INVALID_CAPTCHA` rate? If not, add one to the `service-overview` dashboard.
- Was the upstream status page being polled (alert on Google reCAPTCHA incidents)? If not, file a follow-up to wire it.
- Was the runbook (this file) accurate when consulted? Note any deviation and update.
- Is Option A still unimplemented? If yes, this is the second time we noticed; raise priority.
- File the RCA per `../incidents/0000-template.md`.
Prevention¶
- Provider abstraction: ship `CAPTCHA_PROVIDER` env var + at least one backup (hCaptcha or Cloudflare Turnstile). The pluggable shape is in the ADR candidate noted under §Break-glass A.
- Status-page poller: a tiny scheduled job that hits `https://status.cloud.google.com/incidents.json` every 5 min and posts to `#oncall` on a reCAPTCHA-tagged incident. Cheap and removes minutes from detection.
- Edge rate-limit: every `/auth/*` route should be rate-limited at the edge (CDN / WAF) regardless of captcha state. Captcha is a quality gate, not the only abuse defense.
- Audit `isLocal` checks: `apps/api/src/captcha/google/recaptcha.service.ts:28` is one of several. Any production-critical guard that bypasses on `isLocal` is a potential foot-gun if `NODE_ENV` is misconfigured. List them in the security register; review at every onboarding rotation.
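The `isLocal` audit in the last item can be partly automated. The sketch below scans source text that has already been loaded and reports each line mentioning `isLocal` as a candidate for the security register. `findIsLocalGuards` and its types are hypothetical, and reading files from disk is deliberately left out.

```typescript
interface SourceFile {
  path: string;
  content: string;
}

interface GuardHit {
  path: string;
  line: number; // 1-based line number
  text: string;
}

// Hypothetical audit helper: lists every line that references `isLocal`.
function findIsLocalGuards(files: SourceFile[]): GuardHit[] {
  const hits: GuardHit[] = [];
  files.forEach(function (file) {
    file.content.split("\n").forEach(function (text, index) {
      // \b keeps identifiers such as `isLocalized` from matching.
      if (/\bisLocal\b/.test(text)) {
        hits.push({ path: file.path, line: index + 1, text: text.trim() });
      }
    });
  });
  return hits;
}
```

Each hit still needs a human decision on whether the guard is production-critical; the scan only produces the list to review.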
Cross-references¶
- `recaptcha-fails-locally.md`: the local-dev counterpart (`token=pass` bypass, NODE_ENV gating)
- `login-fails-bcrypt.md`: adjacent auth failure mode
- `../flows/dropbet-sign-up.md`: the full sign-up flow that depends on captcha
- `../security-register.md`: auth findings (SF-001, SF-002 lockout) that interact with captcha-disabled mode
- `../handover/oncall-runbook.md` §3: first-response procedure that brought you here