Template: Incident Progress Update
Recurring updates during an active P0 / P1, fired every 30 minutes (P0) or 60 minutes (P1) until resolution. Pick the variant that matches the current investigation state.
Approval gate: P0 needs on-call-lead sign-off; P1 needs Tier 2 senior sign-off. See README.md § "Approval workflow".
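The cadence and approval rules above can be sketched as a small lookup. This is an illustration only, not part of any existing tooling: the names `UPDATE_INTERVAL_MINUTES`, `REQUIRED_APPROVER`, and `next_update_by` are hypothetical, and the timestamp is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone

# Cadence and approver per severity, as stated in the rules above.
UPDATE_INTERVAL_MINUTES = {"P0": 30, "P1": 60}
REQUIRED_APPROVER = {"P0": "on-call lead", "P1": "Tier 2 senior"}


def next_update_by(severity: str, posted_at: datetime) -> datetime:
    """Return the deadline for the next progress update after `posted_at`."""
    if severity not in UPDATE_INTERVAL_MINUTES:
        raise ValueError(f"progress updates apply to P0/P1 only, got {severity!r}")
    return posted_at + timedelta(minutes=UPDATE_INTERVAL_MINUTES[severity])


posted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
print(next_update_by("P0", posted).isoformat())  # 2024-01-01T12:30:00+00:00
print(REQUIRED_APPROVER["P1"])                   # Tier 2 senior
```

The returned value is a deadline, so the update itself should be posted before it.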
When to use which variant
The four variants below correspond to the phases of an incident investigation. Within one incident, updates advance forward through the variants, never sideways and never silently backward. A "we've identified" update never reverts to "still investigating" without a fresh public statement explaining why (for example, a re-occurrence detected while monitoring in Variant D).
| Variant | Use when | Rules |
|---|---|---|
| A. Still investigating | Cause not yet confirmed | No speculation. State only what's been ruled out. |
| B. Cause identified | Specific subsystem confirmed | Customer-language description. No internal service names. |
| C. Mitigation in progress | Fix is being deployed / actioned | Don't promise duration; give a range. |
| D. Mitigation complete, monitoring | Service appears recovered; watching for re-occurrence | Stay in this state ≥ 15 minutes (P0) or ≥ 30 minutes (P1) before declaring resolved. |
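The forward-only rule can be expressed as a transition check. The sketch below is hypothetical (`VARIANTS` and `is_allowed_transition` are illustrative names) and makes two assumptions the text does not spell out: repeating the same variant for a recurring update counts as allowed, and skipping forward (for example A to C) is treated as forward movement.

```python
VARIANTS = ["A", "B", "C", "D"]  # investigation phases in order


def is_allowed_transition(current: str, new: str, public_explanation: bool = False) -> bool:
    """Check a variant change against the forward-only rule.

    Repeating the current variant or moving forward is allowed.
    Moving backward is allowed only when a fresh public statement
    explains the regression.
    """
    cur, nxt = VARIANTS.index(current), VARIANTS.index(new)
    if nxt >= cur:
        return True
    return public_explanation


print(is_allowed_transition("A", "B"))                           # True
print(is_allowed_transition("B", "A"))                           # False
print(is_allowed_transition("D", "A", public_explanation=True))  # True
```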
Variant A: Still investigating
Public: status page
Update: We are continuing to investigate {IMPACTED_SERVICES}.
Current status: investigating
What we know so far: {WHAT_WE_KNOW}
What we have ruled out: {WHAT_WE_RULED_OUT}
Next update by: {NEXT_UPDATE_BY}
{WHAT_WE_KNOW} is allowed to be empty ("we are still gathering information") for the first update. By the second, it must contain at least one specific signal — even if that signal is "the issue is intermittent" or "the issue affects a subset of users".
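A minimal sketch of that rule as a pre-publish check. The function name `validate_what_we_know` is hypothetical, and it assumes updates are numbered from 1, with 1 being the first progress update.

```python
def validate_what_we_know(update_number: int, what_we_know: str) -> str:
    """Return the text to publish for {WHAT_WE_KNOW}, or raise if the rule is broken."""
    text = what_we_know.strip()
    if text:
        return text
    if update_number == 1:
        # Only the first update may fall back to the placeholder wording.
        return "We are still gathering information."
    raise ValueError(
        "from the second update, {WHAT_WE_KNOW} needs at least one specific signal"
    )


print(validate_what_we_know(1, ""))                             # We are still gathering information.
print(validate_what_we_know(2, "The issue is intermittent."))  # The issue is intermittent.
```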
Internal: Slack thread
[update {N} @ {TIMESTAMP_UTC}] still investigating
- ruled out: {WHAT_WE_RULED_OUT}
- working hypothesis: {HYPOTHESIS}
- next experiment: {NEXT_DIAG_STEP}
- next public update: {NEXT_UPDATE_BY}
Working hypotheses are fine internally. They never appear on the status page until confirmed.
Variant B: Cause identified
Public: status page
Update: We have identified the cause of the issue affecting {IMPACTED_SERVICES}.
Current status: identified
Summary: {ROOT_CAUSE_SHORT}
Mitigation: {MITIGATION_SUMMARY}
Next update by: {NEXT_UPDATE_BY}
{ROOT_CAUSE_SHORT} is one customer-language sentence. Examples:
- ✅ "A configuration error caused the sign-in service to reject valid credentials."
- ✅ "Increased database load caused some bet placements to time out."
- ❌ "A bug in auth.service.ts:142 broke bcrypt comparison." (internal)
- ❌ "The Postgres connection pool was exhausted by the BullMQ worker." (internal)
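A rough lint for internal language could catch the two failure modes shown in the examples above: source references and internal technology names. This is a heuristic sketch, not a substitute for review. `INTERNAL_TERMS` is a hypothetical denylist seeded only from the ❌ examples; a real list would come from the team's own service catalogue.

```python
import re

# Hypothetical denylist, seeded from the examples above.
INTERNAL_TERMS = ("bcrypt", "postgres", "bullmq", "connection pool")
# Matches source references such as "auth.service.ts:142".
FILE_LINE = re.compile(r"\b[\w./-]+\.\w+:\d+\b")


def internal_language_hits(sentence: str) -> list[str]:
    """Return internal-looking fragments found in a public root-cause sentence."""
    hits = FILE_LINE.findall(sentence)
    lowered = sentence.lower()
    hits += [term for term in INTERNAL_TERMS if term in lowered]
    return hits


print(internal_language_hits("A bug in auth.service.ts:142 broke bcrypt comparison."))
# ['auth.service.ts:142', 'bcrypt']
print(internal_language_hits("Increased database load caused some bet placements to time out."))
# []
```

An empty result means only that nothing on the denylist matched; the sentence still needs a human read for customer language.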
Internal: Slack thread
[update {N} @ {TIMESTAMP_UTC}] cause identified
- root cause: {ROOT_CAUSE_INTERNAL}
- file:line: {SOURCE_REFERENCE}
- mitigation: {MITIGATION_INTERNAL}
- ETA to mitigation deploy: {ETA_RANGE}
- next public update: {NEXT_UPDATE_BY}
Variant C: Mitigation in progress
Public: status page
Update: A fix has been identified for the issue affecting {IMPACTED_SERVICES} and is being deployed.
Current status: monitoring (mitigation in progress)
Estimated time to recovery: {ETA_RANGE}
Next update by: {NEXT_UPDATE_BY}
{ETA_RANGE} is always a range with an upper bound. Examples:
- ✅ "10–30 minutes"
- ✅ "less than 1 hour"
- ❌ "shortly" (meaningless)
- ❌ "by 14:00 UTC" (single point — overshoot looks like deception)
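One way to make vague or single-point ETAs impossible is to build the string from numbers instead of typing it freehand. The helper below is a hypothetical sketch (`format_eta_range` is not an existing function); it requires an upper bound and produces the two ✅ shapes shown above.

```python
from typing import Optional


def format_eta_range(upper_minutes: int, lower_minutes: Optional[int] = None) -> str:
    """Format an ETA as a range that always has an upper bound."""
    if upper_minutes <= 0:
        raise ValueError("upper bound must be positive")
    if lower_minutes is None:
        # Open lower bound: "less than N hours/minutes".
        if upper_minutes % 60 == 0:
            hours = upper_minutes // 60
            return f"less than {hours} hour{'s' if hours != 1 else ''}"
        return f"less than {upper_minutes} minute{'s' if upper_minutes != 1 else ''}"
    if not 0 <= lower_minutes < upper_minutes:
        raise ValueError("lower bound must be non-negative and below the upper bound")
    return f"{lower_minutes}–{upper_minutes} minutes"


print(format_eta_range(30, 10))  # 10–30 minutes
print(format_eta_range(60))      # less than 1 hour
```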
Internal: Slack thread
[update {N} @ {TIMESTAMP_UTC}] mitigation in progress
- action: {DEPLOY_OR_RUNBOOK_STEP}
- owner: {ENGINEER}
- watch: {METRIC_OR_LOG_QUERY}
- ETA: {ETA_RANGE}
- next public update: {NEXT_UPDATE_BY}
Variant D: Mitigation complete, monitoring
Public: status page
Update: Mitigation has been deployed and {IMPACTED_SERVICES} appears to be recovering. We are continuing to monitor.
Current status: monitoring
What you can expect: most users should see service restored. We will confirm full resolution after a {MONITOR_DURATION} monitoring window.
Next update by: {NEXT_UPDATE_BY}
Stay in this variant for at least 15 minutes (P0) or 30 minutes (P1) before transitioning to incident-resolved.md. If a re-occurrence is detected, return to Variant A and post a new "investigating" update — do not silently regress.
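The monitoring rule above reduces to a three-way decision. In this hypothetical sketch (`MONITOR_MINUTES` and `next_step` are illustrative names), the window is assumed to start when Variant D was first posted, and the returned labels are placeholders rather than real status values.

```python
from datetime import datetime, timedelta, timezone

MONITOR_MINUTES = {"P0": 15, "P1": 30}  # minimum time to stay in Variant D


def next_step(severity: str, entered_variant_d_at: datetime, now: datetime,
              recurrence_detected: bool) -> str:
    """Decide what follows a Variant D update."""
    if recurrence_detected:
        # Regress openly: post a new "investigating" update (Variant A).
        return "variant-a"
    window = timedelta(minutes=MONITOR_MINUTES[severity])
    if now - entered_variant_d_at >= window:
        return "resolved"
    return "keep-monitoring"


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
print(next_step("P0", start, start + timedelta(minutes=10), False))  # keep-monitoring
print(next_step("P0", start, start + timedelta(minutes=15), False))  # resolved
print(next_step("P1", start, start + timedelta(minutes=20), True))   # variant-a
```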
Internal: Slack thread
[update {N} @ {TIMESTAMP_UTC}] monitoring
- mitigation landed at {TIMESTAMP_UTC}
- error rate: {VALUE_NOW} vs. baseline {VALUE_BASELINE}
- p95 latency: {VALUE_NOW} vs. baseline {VALUE_BASELINE}
- monitoring window ends: {MONITOR_END}
- next public update: {NEXT_UPDATE_BY}
Variables to fill
| Variable | Notes |
|---|---|
| {IMPACTED_SERVICES} | Same wording as in incident-acknowledgement.md. Don't change between updates without a reason. |
| {NEXT_UPDATE_BY} | +30 min P0; +60 min P1. Always land before the promised time. Re-promising 5 minutes late is worse than re-promising 1 minute early with nothing new to say. |
| {WHAT_WE_KNOW} / {WHAT_WE_RULED_OUT} | One short bullet each. Customer language. |
| {ROOT_CAUSE_SHORT} | One sentence, customer language. Variant B onwards. |
| {MITIGATION_SUMMARY} | One sentence. "We are deploying a fix" / "We have rolled back the recent change" / "We have scaled additional capacity". |
| {ETA_RANGE} | Always a range with an upper bound. Public in Variant C only; the internal thread also uses it in Variant B. |
| {MONITOR_DURATION} | "15 minutes" P0; "30 minutes" P1. |
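Filling the `{NAME}` placeholders can be automated so that a half-filled message is never posted. The sketch below is an assumption about how such a renderer might look, not an existing tool; `PLACEHOLDER` and `render` are hypothetical names, and the sample values are made up.

```python
import re

# Placeholders look like {IMPACTED_SERVICES}: upper-case letters and underscores.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def render(template: str, values: dict[str, str]) -> str:
    """Substitute {NAME} placeholders; refuse to return a partially filled message."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in values})
    if missing:
        raise KeyError(f"unfilled variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


template = (
    "Update: We are continuing to investigate {IMPACTED_SERVICES}.\n"
    "Next update by: {NEXT_UPDATE_BY}"
)
print(render(template, {"IMPACTED_SERVICES": "sign-in", "NEXT_UPDATE_BY": "12:30 UTC"}))
```

Raising on a missing variable is a deliberate choice here: a literal `{NEXT_UPDATE_BY}` on the status page is worse than a delayed post.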
Cross-references
- incident-acknowledgement.md: the previous message in the sequence
- incident-resolved.md: the next message after Variant D's monitoring window
- ../oncall-runbook.md §4: the comms-template-aware first-response procedure
- README.md: decision tree, channel matrix, approval workflow