Runbook: Captcha provider down — break-glass

Symptom

Sign-up and sign-in 4xx rate spikes; players can't authenticate. Visible as one or more of:

  • POST /auth/sign-up, POST /auth/sign-in, POST /auth/forgot-password returning AUTH_INVALID_CAPTCHA at high rate.
  • Loki: {service_name="ebit-api"} |= "Recaptcha unexpected error" or |= "AUTH_INVALID_CAPTCHA".
  • Server logs: calls to https://www.google.com/recaptcha/api/siteverify failing or timing out. The call is made server-side, so it is logged by ebit-api and does not appear in the browser's DevTools → Network tab.
  • Status page for Google reCAPTCHA shows incident: https://status.cloud.google.com/ (search "reCAPTCHA").

Important — provider is Google reCAPTCHA, not GeeTest. The implementation at apps/api/src/captcha/google/recaptcha.service.ts:50-54 calls https://www.google.com/recaptcha/api/siteverify directly with no provider abstraction. The handover-kit prompt mentioning GeeTest was a hint that didn't match the live code — do not page GeeTest support.

Likely causes

  1. Google reCAPTCHA outage — provider-side incident; we have zero levers to pull on the upstream.
  2. Network egress blocked — ebit-api can't reach www.google.com; firewall change or DNS issue.
  3. Misconfigured RECAPTCHA_SECRET — secret rotated upstream but Doppler not updated; every verification returns success: false.
  4. Backend regression — RecaptchaService exception not handled at the call site; new endpoint missing the bypass for legitimate retries.
  5. Token replay rate-limit — the @IdempotencyLock at recaptcha.service.ts:46 rejects the same token twice within 5s; legitimate retries can hit this on a slow client.
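
Cause 5 is easier to reason about with a toy model. The sketch below is a hypothetical in-memory stand-in for the behaviour described above (the same token is rejected when reused within 5 s). It is not the real @IdempotencyLock decorator; the class name, method name, storage and injected clock are all assumptions made for illustration.

```typescript
// Hypothetical model of a per-token idempotency lock (NOT the real decorator).
class TokenLock {
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = 5000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns true if the token may proceed, false if it was used within ttlMs.
  tryAcquire(token: string): boolean {
    const current = this.now();
    const lockedAt = this.seen.get(token);
    if (lockedAt !== undefined && current - lockedAt < this.ttlMs) {
      return false; // replay inside the lock window: rejected
    }
    this.seen.set(token, current);
    return true;
  }
}
```

Under this model, a slow client that resubmits the same token 3 s after its first attempt is rejected even though the token itself is valid, which is the failure mode cause 5 describes.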

Detection

  • Grafana — service-overview: ebit-api 4xx rate panel; filter by route /auth/sign-up and /auth/sign-in.
  • Loki: {service_name="ebit-api"} |= "AUTH_INVALID_CAPTCHA" rate; baseline ~0–1/min, alert at >10/min.
  • External: https://status.cloud.google.com/ for the upstream provider; correlate timeline.

First-response — confirm the issue is upstream, not us

Run all four checks in parallel; ~30 seconds.

1. Hit the provider directly from inside the api container

docker exec ebit-api curl -sSI --max-time 5 https://www.google.com/recaptcha/api/siteverify | head -1

| Output | Diagnosis |
| --- | --- |
| HTTP/2 405 (or 400) | Provider is reachable; api-side issue. Skip to §Likely causes 3–5. |
| HTTP/2 5xx | Provider impaired; go to §Break-glass options. |
| curl: (28) Operation timed out | Egress / DNS broken; not an upstream issue. Check network before break-glass. |
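
If this check is ever automated, the mapping above could be encoded roughly as follows. This is a minimal sketch: the function name and the diagnosis labels are hypothetical, and any output not listed in the table is reported as "unknown" rather than guessed at.

```typescript
type ProbeDiagnosis = "api-side" | "provider-impaired" | "egress-or-dns" | "unknown";

// Maps the first line of the curl probe output to the diagnosis table.
function classifyProbe(firstLine: string): ProbeDiagnosis {
  const line = firstLine.trim();
  if (/^curl: \(28\)/.test(line)) return "egress-or-dns"; // curl timeout
  const match = /^HTTP\/[\d.]+\s+(\d{3})/.exec(line);
  if (!match) return "unknown";
  const status = Number(match[1]);
  if (status >= 500) return "provider-impaired";
  if (status === 405 || status === 400) return "api-side";
  return "unknown"; // not covered by the table; investigate manually
}
```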

2. Check the upstream status page

curl -s https://status.cloud.google.com/incidents.json | jq '.[] | select(.service_name | contains("reCAPTCHA"))' | head -20

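The filter returns resolved incidents as well as ongoing ones, so check whether the matching entry is still open. If a reCAPTCHA-tagged incident is open, we are downstream of it. Confirm in #oncall before applying break-glass.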
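
Filtering for incidents that are still open could look roughly like the sketch below. It assumes each feed entry has a `service_name` string (as the jq filter above does) and an `end` field that is absent or null while the incident is ongoing; the `end` field is an assumption about the feed's schema and should be verified against the live JSON. The function name is hypothetical.

```typescript
// Assumed (partial) shape of one incidents.json entry; verify before relying on it.
interface Incident {
  service_name?: string | null;
  end?: string | null; // assumption: missing or null while the incident is ongoing
}

// Keeps only reCAPTCHA-tagged incidents that have no end time yet.
function openRecaptchaIncidents(incidents: Incident[]): Incident[] {
  return incidents.filter(
    (incident) =>
      (incident.service_name ?? "").toLowerCase().includes("recaptcha") &&
      (incident.end === undefined || incident.end === null),
  );
}
```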

3. Verify our secret hasn't been clobbered

docker exec ebit-api sh -c 'echo "secret length: ${#RECAPTCHA_SECRET}"'
# Expected: 40 (Google secret keys are 40 chars)

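If it's 0, the secret didn't load; this is a Doppler or env-cmd issue. A length of 40 only shows that a secret is present, not that it is current: a secret rotated upstream (Likely causes 3) still has the right length but fails every verification.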
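
To tell a bad secret apart from a bad token, the `error-codes` array in the siteverify response is more informative than the secret's length. The sketch below classifies a response body using the error codes Google documents for siteverify; the function name and the diagnosis labels are hypothetical, and whether ebit-api logs the response body is not confirmed by this runbook.

```typescript
// Subset of the siteverify JSON response body.
interface SiteverifyBody {
  success: boolean;
  "error-codes"?: string[];
}

type SiteverifyDiagnosis = "ok" | "secret-problem" | "token-problem" | "other";

const SECRET_CODES = ["missing-input-secret", "invalid-input-secret"];
const TOKEN_CODES = ["missing-input-response", "invalid-input-response", "timeout-or-duplicate"];

// Secret errors are checked first because they affect every request.
function diagnoseSiteverify(body: SiteverifyBody): SiteverifyDiagnosis {
  if (body.success) return "ok";
  const codes = body["error-codes"] ?? [];
  if (codes.some((code) => SECRET_CODES.includes(code))) return "secret-problem";
  if (codes.some((code) => TOKEN_CODES.includes(code))) return "token-problem";
  return "other";
}
```

A steady stream of "secret-problem" results across all users points at Likely causes 3; scattered "token-problem" results are normal client noise.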

4. Confirm the local-dev bypass is not active in the affected environment

docker exec ebit-api sh -c 'echo "NODE_ENV: $NODE_ENV"'

isLocal (in apps/api/src/captcha/google/recaptcha.service.ts:28) is true only when NODE_ENV=local; the bypass if (isLocal && token === 'pass') return; lets the literal token pass through. This must be false in any deployed environment. If a non-local environment shows NODE_ENV=local, that is the actual incident — fix the env, not the captcha.
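
The bypass rule quoted above can be restated as a small pure function, which makes it clear that both conditions must hold. This is an illustrative restatement only: the function name is hypothetical and the real code at recaptcha.service.ts:28 may be structured differently.

```typescript
// Restates the documented rule: bypass only when NODE_ENV is "local"
// AND the client sends the literal token "pass".
function isCaptchaBypassed(nodeEnv: string | undefined, token: string): boolean {
  const isLocal = nodeEnv === "local";
  return isLocal && token === "pass";
}
```

With NODE_ENV=local in a deployed environment, any client that sends token=pass skips verification entirely, which is why a misconfigured NODE_ENV is the incident in its own right.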

Break-glass options

Apply at most one. Document the choice and the start time in the incident channel before flipping any switch.

Option A (preferred, if available) — fail over to a backup provider

The current code path is single-provider, single-URL hardcoded (recaptcha.service.ts:51). There is no CAPTCHA_PROVIDER env var, no backup provider configured. This option is therefore not currently available — it is the right design but it has not been implemented. {{TBD: implement provider abstraction with at least one backup (hCaptcha or Cloudflare Turnstile). Tracked as ADR candidate.}}.

If you have already shipped the abstraction by the time you're reading this, flip via Doppler:

doppler secrets set --project ebit --config <env> CAPTCHA_PROVIDER=hcaptcha
docker compose restart ebit-api

Option B (extreme) — disable captcha verification entirely

This option is not currently implemented in code — there is no DISABLE_CAPTCHA env var. The only bypass is isLocal && token === 'pass' (recaptcha.service.ts:28), which is local-only.

To genuinely disable in production, you would either:

  • Patch + deploy a hotfix that adds if (process.env.DISABLE_CAPTCHA === 'true') return; to validateRecaptcha (5-line change, requires a deploy — not break-glass-fast). {{TBD: ship this as a code change so it's available the next time}}, or
  • Temporarily flip NODE_ENV=local on the api service and have the FE submit token=pass. This is a hack; it disables several other isLocal-gated guards too. Don't do this in production unless the alternative is a full sign-in outage > 30 min and IC + security have signed off.
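
The hotfix in the first bullet might take roughly the following shape. This is a sketch under stated assumptions, not the actual validateRecaptcha: the signature is hypothetical, the environment is passed in as a plain object instead of reading process.env so the guard is easy to test, and `verify` stands in for the real siteverify call.

```typescript
type Env = Record<string, string | undefined>;
type Verifier = (token: string) => Promise<boolean>;

// Hypothetical hotfix shape: skip verification only when DISABLE_CAPTCHA
// is exactly "true"; any other value leaves verification on.
async function validateRecaptcha(token: string, env: Env, verify: Verifier): Promise<void> {
  if (env["DISABLE_CAPTCHA"] === "true") return; // break-glass: no provider call
  const ok = await verify(token);
  if (!ok) throw new Error("AUTH_INVALID_CAPTCHA");
}
```

Comparing against the exact string "true" is deliberate: a typo or an empty value in Doppler then fails safe, with verification still enabled.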

If you have the patch from the prior bullet, the procedure is:

doppler secrets set --project ebit --config <env> DISABLE_CAPTCHA=true
docker compose restart ebit-api
# Hard 30-minute timer. Set a calendar alarm. Flip back the moment provider recovers.

Risks while disabled:

  • Bot-driven sign-up surge — manifests as a flood of new accounts.
  • Brute-force sign-in surge — the lockout counter (SF-002) still applies but its bug means it resets after TTL.
  • Password-reset abuse — forgot-password becomes an enumeration oracle for valid emails.

Mitigations during the window: rate-limit at the edge (CDN / WAF) on /auth/* paths; keep the lockout TTL short; monitor sign-up + sign-in rates and stop the moment they look abusive.

Option C (controlled) — extend token TTL / loosen idempotency

Not directly supported by the current code either. The idempotency lock at recaptcha.service.ts:46 uses lockTtl: 5000 (hardcoded). You could:

  • Patch + deploy lockTtl to read from env (process.env.RECAPTCHA_IDEMPOTENCY_TTL), default 5000. This buys clients a longer reuse window for the same token if Google is intermittently 5xx-ing — fewer re-issuances needed. Modest mitigation; not a real break-glass.
  • For end-user retries during partial outage, a longer FE token cache is more effective than backend changes. {{TBD: FE-side reCAPTCHA token caching policy — ebit-fe currently fetches a fresh token per submit attempt}}.
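
For the first bullet, reading the TTL from the environment needs defensive parsing so that a bad value cannot silently disable or break the lock. A minimal sketch, assuming the value is in milliseconds and that the raw string is passed in by the caller; the function and constant names are hypothetical.

```typescript
const DEFAULT_LOCK_TTL_MS = 5000; // the value currently hardcoded

// Parses a RECAPTCHA_IDEMPOTENCY_TTL-style value (milliseconds).
// Falls back to the default on missing, non-integer, or non-positive input.
function resolveLockTtl(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_LOCK_TTL_MS;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) return DEFAULT_LOCK_TTL_MS;
  return parsed;
}
```

If the lock is configured through a decorator argument, that argument is evaluated once when the class is defined, so a changed value would only take effect after the api process restarts.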

Verification

After provider recovery (or break-glass):

  1. AUTH_INVALID_CAPTCHA rate drops back to baseline (~0–1/min) in Loki.
  2. POST /auth/sign-in 200 rate returns to typical: hit /auth/sign-in from a real browser session and confirm the cookie is set.
  3. Provider URL responds: docker exec ebit-api curl -sI --max-time 5 https://www.google.com/recaptcha/api/siteverify returns 405 (the expected method-not-allowed for HEAD) — i.e., reachable.
  4. If you flipped a break-glass setting: revert it. Confirm the revert via doppler secrets get --project ebit --config <env> <var> and then docker compose restart ebit-api.
  5. End-to-end: sign up a fresh account; confirm captcha-gated path runs cleanly.

Postmortem checklist

The incident is your prompt to fix the gaps that made it harder than it had to be.

  • Did Grafana have a panel for AUTH_INVALID_CAPTCHA rate? If not, add one to the service-overview dashboard.
  • Was the upstream status page being polled (alert on Google reCAPTCHA incidents)? If not, file a follow-up to wire it.
  • Was the runbook (this file) accurate when consulted? Note any deviation and update.
  • Is Option A still unimplemented? If yes, this is the second time we noticed; raise priority.
  • File the RCA per ../incidents/0000-template.md.

Prevention

  • Provider abstraction: ship CAPTCHA_PROVIDER env var + at least one backup (hCaptcha or Cloudflare Turnstile). The pluggable shape is in the ADR candidate noted under §Break-glass A.
  • Status-page poller: a tiny scheduled job that hits https://status.cloud.google.com/incidents.json every 5 min and posts to #oncall on a reCAPTCHA-tagged incident. Cheap and removes minutes from detection.
  • Edge rate-limit: every /auth/* route should be rate-limited at the edge (CDN / WAF) regardless of captcha state — captcha is a quality gate, not the only abuse defense.
  • Audit isLocal checks: apps/api/src/captcha/google/recaptcha.service.ts:28 is one of several. Any production-critical guard that bypasses on isLocal is a potential foot-gun if NODE_ENV is misconfigured. List them in the security register; review at every onboarding rotation.

Cross-references