Template: Incident Progress Update

Recurring updates during an active P0 or P1 incident, posted every 30 minutes (P0) or 60 minutes (P1) until resolution. Pick the variant that matches the current investigation state.

Approval gate: P0 needs on-call-lead sign-off; P1 needs Tier 2 senior sign-off. See README.md §"Approval workflow".

When to use which variant

The four variants below correspond to the phases of an incident investigation. Within a single incident you advance forward only, never sideways or backward, with one exception: a regression that is announced publicly. A "we've identified" update never reverts to "still investigating" without a fresh public statement explaining why.

Variant	Use when	Rules
A. Still investigating	Cause not yet confirmed	No speculation. State only what's been ruled out.
B. Cause identified	Specific subsystem confirmed	Customer-language description. No internal service names.
C. Mitigation in progress	Fix is being deployed / actioned	Don't promise duration; give a range.
D. Mitigation complete, monitoring	Service appears recovered; watching for re-occurrence	Stay in this state ≥ 15 minutes (P0) or ≥ 30 minutes (P1) before declaring resolved.
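
The progression rule above can be sketched as a small transition check. This is a minimal illustration, not part of any existing tooling: the `Variant` enum, `can_transition`, and the `public_statement_posted` flag are hypothetical names, and the sketch assumes that repeating the current variant (a routine recurring update) and skipping ahead are both acceptable.

```python
from enum import IntEnum


class Variant(IntEnum):
    """Investigation phases, ordered as in the table above."""
    INVESTIGATING = 1  # Variant A
    IDENTIFIED = 2     # Variant B
    MITIGATING = 3     # Variant C
    MONITORING = 4     # Variant D


def can_transition(current: Variant, nxt: Variant,
                   public_statement_posted: bool = False) -> bool:
    """Return True if moving from `current` to `nxt` follows the rules.

    Assumption: staying in the same variant or moving forward is always fine.
    Moving backward is allowed only when a fresh public statement explaining
    the regression has been posted.
    """
    if nxt >= current:
        return True
    return public_statement_posted
```

For example, `can_transition(Variant.IDENTIFIED, Variant.INVESTIGATING)` is rejected unless `public_statement_posted=True` is passed.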

Variant A — Still investigating

Public — status page

Update: We are continuing to investigate {IMPACTED_SERVICES}.

Current status: investigating
What we know so far: {WHAT_WE_KNOW}
What we have ruled out: {WHAT_WE_RULED_OUT}
Next update by: {NEXT_UPDATE_BY}

{WHAT_WE_KNOW} is allowed to be empty ("we are still gathering information") for the first update. By the second, it must contain at least one specific signal — even if that signal is "the issue is intermittent" or "the issue affects a subset of users".
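
The first-update exception can be enforced with a small guard. The sketch below is illustrative only; `validate_what_we_know` and its `update_number` parameter are hypothetical names, and the fallback sentence is taken from the rule above.

```python
def validate_what_we_know(update_number: int, what_we_know: str) -> str:
    """Return the text to publish for {WHAT_WE_KNOW}, or raise if the rule is broken.

    update_number is 1 for the first progress update, 2 for the second, etc.
    """
    text = what_we_know.strip()
    if text:
        return text
    if update_number <= 1:
        # Only the first update may go out without a specific signal.
        return "We are still gathering information."
    raise ValueError(
        "WHAT_WE_KNOW needs at least one specific signal from the second update onwards"
    )
```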

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] still investigating
- ruled out: {WHAT_WE_RULED_OUT}
- working hypothesis: {HYPOTHESIS}
- next experiment: {NEXT_DIAG_STEP}
- next public update: {NEXT_UPDATE_BY}

Working hypotheses are fine internally. They never appear on the status page until confirmed.

Variant B — Cause identified

Public — status page

Update: We have identified the cause of the issue affecting {IMPACTED_SERVICES}.

Current status: identified
Summary: {ROOT_CAUSE_SHORT}
Mitigation: {MITIGATION_SUMMARY}
Next update by: {NEXT_UPDATE_BY}

{ROOT_CAUSE_SHORT} is one customer-language sentence. Examples:

✅ "A configuration error caused the sign-in service to reject valid credentials."
✅ "Increased database load caused some bet placements to time out."
❌ "A bug in auth.service.ts:142 broke bcrypt comparison." (internal)
❌ "The Postgres connection pool was exhausted by the BullMQ worker." (internal)
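
A pre-publish lint can catch the most obvious internal leaks before sign-off. The following is a rough heuristic sketch, not a complete filter: `find_internal_terms` is a hypothetical helper, and the pattern list is an assumption seeded only from the rejected examples above. A real deployment would maintain its own list of internal service names.

```python
import re

# Hypothetical patterns, seeded from the rejected examples above.
INTERNAL_PATTERNS = [
    # file:line references such as "auth.service.ts:142"
    re.compile(r"\b[\w./-]+\.(?:ts|js|py|go|java):\d+\b"),
    # technology names that customers should not need to know
    re.compile(r"\b(?:postgres|bullmq|bcrypt)\b", re.IGNORECASE),
]


def find_internal_terms(text: str) -> list[str]:
    """Return every internal-looking term found in a public summary."""
    hits: list[str] = []
    for pattern in INTERNAL_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits
```

An empty result does not prove the sentence is customer-safe; it only means none of the known patterns matched, so human review is still needed.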

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] cause identified
- root cause: {ROOT_CAUSE_INTERNAL}
- file:line: {SOURCE_REFERENCE}
- mitigation: {MITIGATION_INTERNAL}
- ETA to mitigation deploy: {ETA_RANGE}
- next public update: {NEXT_UPDATE_BY}

Variant C — Mitigation in progress

Public — status page

Update: A fix has been identified for the issue affecting {IMPACTED_SERVICES} and is being deployed.

Current status: monitoring (mitigation in progress)
Estimated time to recovery: {ETA_RANGE}
Next update by: {NEXT_UPDATE_BY}

{ETA_RANGE} is always a range with an upper bound. Examples:

✅ "10–30 minutes"
✅ "less than 1 hour"
❌ "shortly" (meaningless)
❌ "by 14:00 UTC" (single point — overshoot looks like deception)
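
The range rule can be checked mechanically before an update goes out. This is a minimal sketch under stated assumptions: `is_valid_eta_range` is a hypothetical helper, it accepts only the two shapes shown in the examples above ("N–M minutes/hours" and an upper bound), and the "under" and "up to" phrasings are assumed equivalents of "less than".

```python
import re

# "10–30 minutes" or "10-30 minutes" (en dash or hyphen between the bounds).
RANGE = re.compile(r"^(\d+)\s*[–-]\s*(\d+)\s+(?:minutes?|hours?)$")
# "less than 1 hour"; "under" and "up to" are assumed variants.
UPPER_BOUND = re.compile(r"^(?:less than|under|up to)\s+\d+\s+(?:minutes?|hours?)$")


def is_valid_eta_range(text: str) -> bool:
    """Return True if the text is a range with an explicit upper bound."""
    candidate = text.strip().lower()
    match = RANGE.match(candidate)
    if match:
        # The lower bound must be strictly below the upper bound.
        return int(match.group(1)) < int(match.group(2))
    return bool(UPPER_BOUND.match(candidate))
```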

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] mitigation in progress
- action: {DEPLOY_OR_RUNBOOK_STEP}
- owner: {ENGINEER}
- watch: {METRIC_OR_LOG_QUERY}
- ETA: {ETA_RANGE}
- next public update: {NEXT_UPDATE_BY}

Variant D — Mitigation complete, monitoring

Public — status page

Update: Mitigation has been deployed and {IMPACTED_SERVICES} appears to be recovering. We are continuing to monitor.

Current status: monitoring
What you can expect: most users should see service restored. We will confirm full resolution after a {MONITOR_DURATION} monitoring window.
Next update by: {NEXT_UPDATE_BY}

Stay in this variant for at least 15 minutes (P0) or 30 minutes (P1) before transitioning to incident-resolved.md. If a re-occurrence is detected, return to Variant A and post a new "investigating" update — do not silently regress.
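
The minimum monitoring window can be computed as shown below. This is a sketch only: the function names are hypothetical, and it assumes the window starts when the first Variant D update is posted (`monitoring_started_at`). If your team counts from the moment the mitigation landed, pass that timestamp instead.

```python
from datetime import datetime, timedelta, timezone

# Minimum time to stay in Variant D, per severity (from the rule above).
MIN_MONITORING = {"P0": timedelta(minutes=15), "P1": timedelta(minutes=30)}


def earliest_resolution(severity: str, monitoring_started_at: datetime) -> datetime:
    """Return the earliest time the incident may be declared resolved."""
    return monitoring_started_at + MIN_MONITORING[severity]


def may_declare_resolved(severity: str, monitoring_started_at: datetime,
                         now: datetime, recurrence_detected: bool = False) -> bool:
    """Return True once the window has elapsed with no re-occurrence."""
    if recurrence_detected:
        # A re-occurrence means going back to Variant A with a new public update.
        return False
    return now >= earliest_resolution(severity, monitoring_started_at)
```

The value returned by `earliest_resolution` is a natural candidate for `{MONITOR_END}` in the internal update.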

Internal — Slack thread

[update {N} @ {TIMESTAMP_UTC}] monitoring
- mitigation landed at {TIMESTAMP_UTC}
- error rate: {VALUE_NOW} vs. baseline {VALUE_BASELINE}
- p95 latency: {VALUE_NOW} vs. baseline {VALUE_BASELINE}
- monitoring window ends: {MONITOR_END}
- next public update: {NEXT_UPDATE_BY}

Variables to fill

Variable	Notes
`{IMPACTED_SERVICES}`	Same wording as in `incident-acknowledgement.md`. Don't change between updates without a reason.
`{NEXT_UPDATE_BY}`	+30 min P0; +60 min P1. Always post before the promised time. An update posted 5 minutes late is worse than one posted 1 minute early with nothing new to say.
`{WHAT_WE_KNOW}` / `{WHAT_WE_RULED_OUT}`	One short bullet each. Customer language.
`{ROOT_CAUSE_SHORT}`	One sentence, customer language. Variant B onwards.
`{MITIGATION_SUMMARY}`	One sentence. "We are deploying a fix" / "We have rolled back the recent change" / "We have scaled additional capacity".
`{ETA_RANGE}`	Always a range with an upper bound. Public in Variant C only; internally it also appears in the Variant B Slack update.
`{MONITOR_DURATION}`	"15 minutes" P0; "30 minutes" P1.
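
Filling the variables can be scripted so that no `{PLACEHOLDER}` reaches the status page unfilled. The sketch below is an illustration, not existing tooling: `next_update_by` and `render` are hypothetical helpers, the `HH:MM UTC` output format is an assumption, and `posted_at` is expected to be a timezone-aware datetime.

```python
import re
from datetime import datetime, timedelta, timezone

# Update cadence per severity (from the table above).
UPDATE_INTERVAL = {"P0": timedelta(minutes=30), "P1": timedelta(minutes=60)}
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def next_update_by(severity: str, posted_at: datetime) -> str:
    """Return the {NEXT_UPDATE_BY} promise for an update posted at `posted_at`."""
    deadline = (posted_at + UPDATE_INTERVAL[severity]).astimezone(timezone.utc)
    return deadline.strftime("%H:%M UTC")


def render(template: str, values: dict[str, str]) -> str:
    """Substitute every {VARIABLE}; raise if any variable has no value."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if name not in values})
    if missing:
        raise KeyError(f"unfilled variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)
```

Raising on a missing variable, rather than leaving the braces in place, keeps a half-filled template from being published under time pressure.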

Cross-references

incident-acknowledgement.md — the previous message in the sequence
incident-resolved.md — the next message after Variant D's monitoring window
../oncall-runbook.md §4 — the comms-template-aware first-response procedure
README.md — decision tree, channel matrix, approval workflow