Template: Incident Resolved
The wrap-up message after the monitoring window in incident-progress-update.md (Variant D) closes cleanly. Send it immediately (P0) or within 1 hour (P1) of confirmed full recovery.
Approval gate: P0 needs on-call-lead sign-off; P1 needs Tier 2 senior sign-off. See README.md § "Approval workflow".
Public version — status page (resolution post)
Resolved: {IMPACTED_SERVICES}
The issue affecting {IMPACTED_SERVICES} has been resolved. All services are operating normally as of {TIME_RESOLVED}.
Summary
- Detected: {TIME_DETECTED}
- Mitigated: {TIME_MITIGATED}
- Resolved: {TIME_RESOLVED}
- Total duration: {DURATION}
- Root cause: {ROOT_CAUSE_SHORT}
- Corrective action: {CORRECTIVE_ACTION_SHORT}
We will publish a full incident review (RCA) within {RCA_TIMELINE_DAYS} business days at {STATUS_PAGE_URL}.
Thank you for your patience.
{ROOT_CAUSE_SHORT} and {CORRECTIVE_ACTION_SHORT} are each one customer-language sentence:
- ✅ "Increased database load caused some bet placements to time out. We have added capacity and are reviewing the load patterns to prevent recurrence."
- ❌ "Postgres connection pool was exhausted because the BullMQ worker held a transaction across an external RPC call." (save for the RCA)
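Filling the status-page template by hand makes it easy to publish a stray `{PLACEHOLDER}`. The following is a minimal sketch of a fill step that refuses to produce a message while any placeholder is unfilled. The names `PLACEHOLDER` and `fill_template` are hypothetical and not part of any existing tooling; the sketch assumes placeholders are always uppercase letters and underscores inside braces, as in the templates on this page.

```python
import re

# Assumption: placeholders look like {UPPER_SNAKE_CASE}, as in the templates here.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Substitute every {NAME} in `template`, failing if any value is missing."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if name not in values})
    if missing:
        # Refuse to emit a customer-facing message with raw placeholders in it.
        raise KeyError("unfilled placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)
```

For example, `fill_template("Resolved: {IMPACTED_SERVICES}", {"IMPACTED_SERVICES": "Bet placement"})` returns `"Resolved: Bet placement"`, while omitting the value raises `KeyError`.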
Public version — customer email (SLA-bound customers)
For customers under SLA contract, send within 1 hour of resolution. Mirror the status-page wording; add the contractual escalation path.
Subject: [Incident {INCIDENT_ID} resolved] {IMPACTED_SERVICES}
Hello,
Incident {INCIDENT_ID} affecting {IMPACTED_SERVICES} on {CUSTOMER_NAME} has been resolved.
Timeline (UTC)
- Detected: {TIME_DETECTED}
- Mitigated: {TIME_MITIGATED}
- Resolved: {TIME_RESOLVED}
- Total duration: {DURATION}
Impact summary
{IMPACT_SUMMARY}
Root cause (summary)
{ROOT_CAUSE_SHORT}
Corrective action committed
{CORRECTIVE_ACTION_SHORT}
We will publish a full incident review within {RCA_TIMELINE_DAYS} business days. If you have questions or wish to discuss the SLA implications of this incident, please reply to this email or contact {SUPPORT_EMAIL}.
Regards,
{CUSTOMER_NAME} Operations
{IMPACT_SUMMARY} is two to three sentences naming what users could and could not do during the window. Use numbers if you have them ("approximately {COUNT_AFFECTED} users were unable to sign in"); use ranges if you don't ("a subset of users in the {REGION} region experienced …").
Public version — Twitter / social
P0 only.
The earlier issue affecting {IMPACTED_SERVICES} on {CUSTOMER_NAME} has been resolved. Full timeline: {STATUS_PAGE_URL}. A full incident review will follow within {RCA_TIMELINE_DAYS} business days.
Internal version — Slack #oncall
The IC posts in the channel root, then unpins the topic.
:white_check_mark: P{SEVERITY} resolved — {IMPACTED_SERVICES}
INC: {INCIDENT_ID}
Detected: {TIME_DETECTED}
Mitigated: {TIME_MITIGATED}
Resolved: {TIME_RESOLVED}
Duration: {DURATION}
IC: {IC_NAME}
Root cause: {ROOT_CAUSE_INTERNAL}
Mitigation: {MITIGATION_INTERNAL}
RCA owner: {RCA_OWNER}
RCA due: {RCA_DUE_DATE}
RCA filename: {RCA_FILENAME}
{RCA_FILENAME} is the path the IC will create per ../../incidents/0000-template.md — date-prefixed, slug = the failure mode.
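The naming rule can be sketched in a few lines. This is an illustration only: `rca_filename` is a hypothetical helper, the `incidents/<YYYY-MM-DD>-<slug>.md` shape is taken from the variables table on this page, and which date supplies the prefix (detection or resolution) is an assumption left to the caller, since the template does not say.

```python
import re
from datetime import date


def rca_filename(incident_date: date, failure_mode: str) -> str:
    """Build incidents/<YYYY-MM-DD>-<slug>.md from a date and a failure mode."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", failure_mode.lower()).strip("-")
    if not slug:
        raise ValueError("failure mode must contain at least one letter or digit")
    return f"incidents/{incident_date.isoformat()}-{slug}.md"
```

With a date of 2024-03-01 and the failure mode "Postgres connection pool exhausted", this yields `incidents/2024-03-01-postgres-connection-pool-exhausted.md`.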
RCA timeline (when to expect public RCA)
| Severity | RCA published within |
|---|---|
| P0 | 5 business days |
| P1 | 14 business days |
| P2 | not published publicly; internal RCA only when warranted |
| P3 | no RCA |
The internal RCA template is ../../incidents/0000-template.md. The public RCA, when published, is a customer-language summary derived from the internal version — same timeline, same root cause, same corrective actions, but redacted to remove internal service names, file:line citations, and any vendor-contractual detail.
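The publication deadline implied by the table can be computed mechanically. The following is a minimal sketch, assuming "business days" means Monday to Friday and that public holidays are ignored (this page does not say how holidays are treated). `RCA_BUSINESS_DAYS`, `add_business_days`, and `rca_due_date` are hypothetical names.

```python
from datetime import date, timedelta
from typing import Optional

# Business-day budgets from the severity table; P2 and P3 have no public RCA.
RCA_BUSINESS_DAYS = {"P0": 5, "P1": 14}


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays (Mon-Fri). Holidays are NOT skipped."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def rca_due_date(resolved_on: date, severity: str) -> Optional[date]:
    """Return the public RCA due date, or None when no public RCA is owed."""
    budget = RCA_BUSINESS_DAYS.get(severity)
    if budget is None:
        return None
    return add_business_days(resolved_on, budget)
```

For a P0 resolved on Friday 2024-03-01, the sketch returns Friday 2024-03-08; a P1 resolved the same day is due Thursday 2024-03-21.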
Variables to fill
| Variable | Notes |
|---|---|
| {INCIDENT_ID} | Same as incident-acknowledgement.md |
| {TIME_DETECTED} / {TIME_MITIGATED} / {TIME_RESOLVED} | All ISO 8601 UTC. mitigated is when error rate returned to baseline; resolved is end of monitoring window. |
| {DURATION} | TIME_RESOLVED - TIME_DETECTED, formatted as Hh Mm (e.g., 1h 23m) |
| {ROOT_CAUSE_SHORT} | One customer-language sentence — same wording as the final progress-update Variant B if you posted one |
| {CORRECTIVE_ACTION_SHORT} | One customer-language sentence; what you committed to changing |
| {IMPACT_SUMMARY} | 2–3 sentences. Customer email only. |
| {RCA_TIMELINE_DAYS} | 5 (P0) or 14 (P1) |
| {RCA_OWNER} | IC by default; reassigned at retro |
| {RCA_DUE_DATE} | TIME_RESOLVED + RCA_TIMELINE_DAYS business days |
| {RCA_FILENAME} | incidents/<YYYY-MM-DD>-<slug>.md |
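The {DURATION} rule is simple arithmetic, but the formatting is easy to get wrong under pressure. Here is a minimal sketch, assuming timezone-aware UTC datetimes and that leftover seconds are dropped rather than rounded (the table does not specify rounding). `format_duration` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def format_duration(detected: datetime, resolved: datetime) -> str:
    """Format TIME_RESOLVED - TIME_DETECTED as 'Hh Mm', e.g. '1h 23m'."""
    if resolved < detected:
        raise ValueError("resolved must not precede detected")
    # Assumption: partial minutes are truncated, not rounded.
    total_minutes = int((resolved - detected).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


detected = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
resolved = datetime(2024, 3, 1, 11, 23, tzinfo=timezone.utc)

print(resolved.isoformat())                 # 2024-03-01T11:23:00+00:00
print(format_duration(detected, resolved))  # 1h 23m
```

Hours are not folded into days, so an incident lasting 26 hours and 5 minutes is reported as `26h 5m`.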
Don'ts
- Don't publish a resolution before the monitoring window in incident-progress-update.md Variant D has elapsed cleanly.
- Don't skip the RCA promise. Even if the cause is trivial, the public RCA timeline commitment is part of the contract.
- Don't disclose vendor names in the public version. "A configuration issue with one of our infrastructure providers" is the right framing if a vendor was at fault.
- Don't commit to corrective actions you can't actually ship. Be specific only about actions already in flight; otherwise use the framing "we are reviewing the load patterns to prevent recurrence."
Cross-references
- incident-progress-update.md — Variant D (monitoring) is the predecessor of this template
- ../../incidents/0000-template.md — internal RCA template
- ../oncall-runbook.md §5 — post-incident workflow (RCA, blameless retro, action-item tracking)
- README.md — decision tree, channel matrix, approval workflow