Template: Incident Progress Update

Recurring updates during an active P0 / P1, posted every 30 minutes (P0) or 60 minutes (P1) until resolution. Pick the variant that matches the current investigation state.

Approval gate: P0 needs on-call-lead sign-off; P1 needs Tier 2 senior sign-off. See README.md §"Approval workflow".
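
If the cadence and approval gate are enforced by tooling, the two rules above reduce to a pair of lookups. The following is a minimal sketch; the function names and the example timestamp are hypothetical, and only the intervals and approver roles come from this template.

```python
from datetime import datetime, timedelta, timezone

# Cadence and approver per severity, as stated in this template.
UPDATE_INTERVAL = {"P0": timedelta(minutes=30), "P1": timedelta(minutes=60)}
APPROVER = {"P0": "on-call lead", "P1": "Tier 2 senior"}


def next_update_by(severity: str, last_update: datetime) -> datetime:
    """Return the latest time by which the next update must be posted."""
    return last_update + UPDATE_INTERVAL[severity]


def required_approver(severity: str) -> str:
    """Return the role that must sign off before the update is published."""
    return APPROVER[severity]


# Illustrative timestamp only.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_update_by("P0", last).isoformat())
print(required_approver("P1"))
```

An unknown severity raises `KeyError`, which is deliberate in this sketch: this template only covers P0 and P1.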


When to use which variant

The four variants below correspond to the phases of an incident investigation. Within one incident, updates advance forward through the phases, never sideways or backward. The only exception is a detected re-occurrence (see Variant D), and even then a "we've identified" update never reverts to "still investigating" without a fresh public statement explaining why.

| Variant | Use when | Rules |
|---|---|---|
| A. Still investigating | Cause not yet confirmed | No speculation. State only what's been ruled out. |
| B. Cause identified | Specific subsystem confirmed | Customer-language description. No internal service names. |
| C. Mitigation in progress | Fix is being deployed / actioned | Don't promise an exact duration; give a range. |
| D. Mitigation complete, monitoring | Service appears recovered; watching for re-occurrence | Stay in this state ≥ 15 minutes (P0) or ≥ 30 minutes (P1) before declaring resolved. |
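
The forward-only rule can be checked mechanically before an update is posted. The sketch below is one possible encoding, not a prescribed implementation: it assumes that repeating the same variant in consecutive updates is allowed (updates recur on a timer) and that skipping ahead is allowed, and it treats D → A as legal only when a re-occurrence has been detected, as described under Variant D. The name `can_transition` is hypothetical.

```python
# Phase order taken from the variant table.
ORDER = ["A", "B", "C", "D"]


def can_transition(current: str, nxt: str, reoccurrence: bool = False) -> bool:
    """Return True if an update may move from variant `current` to `nxt`.

    Assumption: staying in place or moving forward is always allowed.
    The only backward move is D -> A, and only after a re-occurrence.
    """
    if current == "D" and nxt == "A":
        return reoccurrence
    return ORDER.index(nxt) >= ORDER.index(current)


print(can_transition("A", "B"))
print(can_transition("B", "A"))
print(can_transition("D", "A", reoccurrence=True))
```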

Variant A — Still investigating

Public — status page

Update: We are continuing to investigate {IMPACTED_SERVICES}.

Current status: investigating
What we know so far: {WHAT_WE_KNOW}
What we have ruled out: {WHAT_WE_RULED_OUT}
Next update by: {NEXT_UPDATE_BY}

{WHAT_WE_KNOW} is allowed to be empty ("we are still gathering information") for the first update. By the second, it must contain at least one specific signal — even if that signal is "the issue is intermittent" or "the issue affects a subset of users".
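
A pre-publish check for this rule might look like the sketch below. The function name is hypothetical, and "specific signal" is approximated as "non-empty and not the gathering-information placeholder"; a human reviewer still has to judge whether the signal is actually specific.

```python
# Placeholder wording quoted in this template for an empty first update.
GATHERING = "we are still gathering information"


def what_we_know_is_acceptable(update_number: int, what_we_know: str) -> bool:
    """The first update may be empty; later updates need some concrete content."""
    if update_number <= 1:
        return True
    text = what_we_know.strip()
    return bool(text) and text.lower() != GATHERING


print(what_we_know_is_acceptable(1, ""))
print(what_we_know_is_acceptable(2, ""))
print(what_we_know_is_acceptable(2, "The issue is intermittent."))
```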

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] still investigating
- ruled out: {WHAT_WE_RULED_OUT}
- working hypothesis: {HYPOTHESIS}
- next experiment: {NEXT_DIAG_STEP}
- next public update: {NEXT_UPDATE_BY}

Working hypotheses are fine internally. They never appear on the status page until confirmed.


Variant B — Cause identified

Public — status page

Update: We have identified the cause of the issue affecting {IMPACTED_SERVICES}.

Current status: identified
Summary: {ROOT_CAUSE_SHORT}
Mitigation: {MITIGATION_SUMMARY}
Next update by: {NEXT_UPDATE_BY}

{ROOT_CAUSE_SHORT} is one customer-language sentence. Examples:

  • ✅ "A configuration error caused the sign-in service to reject valid credentials."
  • ✅ "Increased database load caused some bet placements to time out."
  • ❌ "A bug in auth.service.ts:142 broke bcrypt comparison." (internal)
  • ❌ "The Postgres connection pool was exhausted by the BullMQ worker." (internal)
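
A rough lint can catch the most obvious internal leaks before sign-off. The sketch below flags `file:line` references and any term on a denylist. The denylist contents and the function name are hypothetical (seeded from the ❌ examples above); a real list would need to be maintained by the team, and a clean result does not prove the sentence is customer-safe.

```python
import re

# Matches source references such as "auth.service.ts:142".
FILE_LINE = re.compile(r"\b[\w./-]+\.\w+:\d+\b")

# Hypothetical denylist of internal technology names.
INTERNAL_TERMS = ("postgres", "bullmq", "bcrypt")


def internal_leaks(sentence: str) -> list[str]:
    """Return file:line references and denylisted terms found in the sentence."""
    found = FILE_LINE.findall(sentence)
    lowered = sentence.lower()
    found += [term for term in INTERNAL_TERMS if term in lowered]
    return found


print(internal_leaks("Increased database load caused some bet placements to time out."))
print(internal_leaks("A bug in auth.service.ts:142 broke bcrypt comparison."))
```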

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] cause identified
- root cause: {ROOT_CAUSE_INTERNAL}
- file:line: {SOURCE_REFERENCE}
- mitigation: {MITIGATION_INTERNAL}
- ETA to mitigation deploy: {ETA_RANGE}
- next public update: {NEXT_UPDATE_BY}

Variant C — Mitigation in progress

Public — status page

Update: A fix has been identified for the issue affecting {IMPACTED_SERVICES} and is being deployed.

Current status: monitoring (mitigation in progress)
Estimated time to recovery: {ETA_RANGE}
Next update by: {NEXT_UPDATE_BY}

{ETA_RANGE} is always a range with an upper bound. Examples:

  • ✅ "10–30 minutes"
  • ✅ "less than 1 hour"
  • ❌ "shortly" (meaningless)
  • ❌ "by 14:00 UTC" (single point — overshoot looks like deception)
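
The range rule is simple enough to validate automatically. The sketch below accepts only the two shapes shown in the ✅ examples ("N–M minutes/hours" and "less than N minutes/hours"); the accepted grammar and the function name are assumptions, and other phrasings with an upper bound would need to be added explicitly.

```python
import re

# "10–30 minutes" or "10-30 minutes": lower and upper bound plus a unit.
RANGE = re.compile(r"^(\d+)\s*[–-]\s*(\d+)\s+(?:minutes?|hours?)$")
# "less than 1 hour": an upper bound only.
UPPER_BOUND = re.compile(r"^less than\s+\d+\s+(?:minutes?|hours?)$")


def is_valid_eta(eta: str) -> bool:
    """Return True if the ETA is a range with an explicit upper bound."""
    text = eta.strip().lower()
    match = RANGE.match(text)
    if match:
        # Reject inverted or zero-width ranges such as "30-10 minutes".
        return int(match.group(1)) < int(match.group(2))
    return UPPER_BOUND.match(text) is not None


for candidate in ["10–30 minutes", "less than 1 hour", "shortly", "by 14:00 UTC"]:
    print(candidate, is_valid_eta(candidate))
```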

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] mitigation in progress
- action: {DEPLOY_OR_RUNBOOK_STEP}
- owner: {ENGINEER}
- watch: {METRIC_OR_LOG_QUERY}
- ETA: {ETA_RANGE}
- next public update: {NEXT_UPDATE_BY}

Variant D — Mitigation complete, monitoring

Public — status page

Update: Mitigation has been deployed and {IMPACTED_SERVICES} appears to be recovering. We are continuing to monitor.

Current status: monitoring
What you can expect: most users should see service restored. We will confirm full resolution after a {MONITOR_DURATION} monitoring window.
Next update by: {NEXT_UPDATE_BY}

Stay in this variant for at least 15 minutes (P0) or 30 minutes (P1) before transitioning to incident-resolved.md. If a re-occurrence is detected, return to Variant A and post a new "investigating" update — do not silently regress.
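
The minimum monitoring window can be expressed as a guard on the transition to resolved. This is a sketch under one assumption: the window is measured from the moment the first Variant D update was posted (the template says "stay in this variant", so that reading seems closest). The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Minimum time in Variant D per severity, as stated above.
MONITOR_WINDOW = {"P0": timedelta(minutes=15), "P1": timedelta(minutes=30)}


def may_declare_resolved(
    severity: str,
    entered_monitoring: datetime,
    now: datetime,
    reoccurred: bool = False,
) -> bool:
    """Return True once the monitoring window has elapsed without re-occurrence."""
    if reoccurred:
        # A re-occurrence means going back to Variant A, not to resolved.
        return False
    return now - entered_monitoring >= MONITOR_WINDOW[severity]


# Illustrative timestamps only.
entered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(may_declare_resolved("P0", entered, entered + timedelta(minutes=15)))
print(may_declare_resolved("P1", entered, entered + timedelta(minutes=15)))
```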

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] monitoring
- mitigation landed at {TIMESTAMP_UTC}
- error rate: {VALUE_NOW} vs. baseline {VALUE_BASELINE}
- p95 latency: {VALUE_NOW} vs. baseline {VALUE_BASELINE}
- monitoring window ends: {MONITOR_END}
- next public update: {NEXT_UPDATE_BY}

Variables to fill

| Variable | Notes |
|---|---|
| {IMPACTED_SERVICES} | Same wording as in incident-acknowledgement.md. Don't change between updates without a reason. |
| {NEXT_UPDATE_BY} | +30 min P0; +60 min P1. Always land before the promised time. Posting 5 minutes late is worse than posting 1 minute early with nothing new to say. |
| {WHAT_WE_KNOW} / {WHAT_WE_RULED_OUT} | One short bullet each. Customer language. |
| {ROOT_CAUSE_SHORT} | One sentence, customer language. Variant B onwards. |
| {MITIGATION_SUMMARY} | One sentence. "We are deploying a fix" / "We have rolled back the recent change" / "We have scaled additional capacity". |
| {ETA_RANGE} | Always a range with an upper bound. Public in Variant C only; also used in the internal threads for Variants B and C. |
| {MONITOR_DURATION} | "15 minutes" P0; "30 minutes" P1. |
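
When the templates are filled by a script rather than by hand, it helps to fail loudly on any variable left unfilled so that a raw `{PLACEHOLDER}` never reaches the status page. The sketch below assumes placeholders are always upper-case names in braces, as in this file; the `render` function and the example values are hypothetical.

```python
import re

# Placeholders in this template look like {IMPACTED_SERVICES} or {N}.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def render(template: str, values: dict[str, str]) -> str:
    """Fill every placeholder, or raise KeyError naming the ones left unfilled."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in values})
    if missing:
        raise KeyError("unfilled variables: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)


template = (
    "Update: We are continuing to investigate {IMPACTED_SERVICES}.\n"
    "Next update by: {NEXT_UPDATE_BY}"
)
# Illustrative values only.
print(render(template, {"IMPACTED_SERVICES": "sign-in", "NEXT_UPDATE_BY": "12:30 UTC"}))
```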

Cross-references