
Template: Incident Resolved

The wrap-up message, sent once the monitoring window in incident-progress-update.md (Variant D) closes cleanly. Send it immediately (P0) or within 1 hour (P1) of confirmed full recovery.

Approval gate: P0 needs on-call-lead sign-off; P1 needs Tier 2 senior sign-off. See README.md §"Approval workflow".
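
The send deadline and approval gate above can be written down as a small lookup, which is handy if a bot or checklist script enforces them. This is a minimal sketch: the dict layout, the role strings, and the function names `send_deadline` and `may_send` are illustrative assumptions, not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Deadlines and approvers as stated in this template.
# The structure and role strings are hypothetical.
RESOLUTION_POLICY = {
    "P0": {"send_within": timedelta(0), "approver": "on-call lead"},
    "P1": {"send_within": timedelta(hours=1), "approver": "Tier 2 senior"},
}


def send_deadline(severity: str, recovered_at: datetime) -> datetime:
    """Latest time the resolution message should go out."""
    return recovered_at + RESOLUTION_POLICY[severity]["send_within"]


def may_send(severity: str, approved_by_role: Optional[str]) -> bool:
    """True only if the sign-off came from the role this severity requires."""
    return approved_by_role == RESOLUTION_POLICY[severity]["approver"]


recovered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(send_deadline("P1", recovered).isoformat())
print(may_send("P0", "on-call lead"))
```

Severities other than P0 and P1 raise `KeyError` here, since this template states no send deadline or approver for them.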


Public version — status page (resolution post)

Resolved: {IMPACTED_SERVICES}

The issue affecting {IMPACTED_SERVICES} has been resolved. All services are operating normally as of {TIME_RESOLVED}.

Summary
- Detected: {TIME_DETECTED}
- Mitigated: {TIME_MITIGATED}
- Resolved: {TIME_RESOLVED}
- Total duration: {DURATION}
- Root cause: {ROOT_CAUSE_SHORT}
- Corrective action: {CORRECTIVE_ACTION_SHORT}

We will publish a full incident review (RCA) within {RCA_TIMELINE_DAYS} business days at {STATUS_PAGE_URL}.

Thank you for your patience.

{ROOT_CAUSE_SHORT} and {CORRECTIVE_ACTION_SHORT} are each one customer-language sentence:

  • ✅ "Increased database load caused some bet placements to time out. We have added capacity and are reviewing the load patterns to prevent recurrence."
  • ❌ "Postgres connection pool was exhausted because the BullMQ worker held a transaction across an external RPC call." (save for the RCA)
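
Before posting, it can help to confirm that no `{PLACEHOLDER}` survives in the outgoing text. The sketch below fills a template from a dict and refuses to render if any variable is missing. The function name `render` and the placeholder regex are assumptions made for illustration; the template does not prescribe a rendering tool.

```python
import re

# Matches placeholders such as {IMPACTED_SERVICES} or {TIME_RESOLVED}.
PLACEHOLDER = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def render(template: str, values: dict) -> str:
    """Fill every {NAME} placeholder, failing loudly if any is unfilled."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if name not in values})
    if missing:
        raise ValueError("unfilled placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


template = "The issue affecting {IMPACTED_SERVICES} has been resolved as of {TIME_RESOLVED}."
print(render(template, {
    "IMPACTED_SERVICES": "bet placement",
    "TIME_RESOLVED": "2024-01-01T12:00:00Z",
}))
```

Failing on a missing value, rather than leaving the braces in place, keeps a half-filled message from reaching the status page.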

Public version — customer email (SLA-bound customers)

For customers under SLA contract, send within 1 hour of resolution. Mirror the status-page wording; add the contractual escalation path.

Subject: [Incident {INCIDENT_ID} resolved] {IMPACTED_SERVICES}

Hello,

Incident {INCIDENT_ID} affecting {IMPACTED_SERVICES} on {CUSTOMER_NAME} has been resolved.

Timeline (UTC)
- Detected: {TIME_DETECTED}
- Mitigated: {TIME_MITIGATED}
- Resolved: {TIME_RESOLVED}
- Total duration: {DURATION}

Impact summary
{IMPACT_SUMMARY}

Root cause (summary)
{ROOT_CAUSE_SHORT}

Corrective action committed
{CORRECTIVE_ACTION_SHORT}

We will publish a full incident review within {RCA_TIMELINE_DAYS} business days. If you have questions or wish to discuss the SLA implications of this incident, please reply to this email or contact {SUPPORT_EMAIL}.

Regards,
{CUSTOMER_NAME} Operations

{IMPACT_SUMMARY} is two to three sentences naming what users could and could not do during the window. Use numbers if you have them ("approximately {COUNT_AFFECTED} users were unable to sign in"); use ranges if you don't ("a subset of users in the {REGION} region experienced …").


Public version — Twitter / social

P0 only.

The earlier issue affecting {IMPACTED_SERVICES} on {CUSTOMER_NAME} has been resolved. Full timeline: {STATUS_PAGE_URL}. A full incident review will follow within {RCA_TIMELINE_DAYS} business days.

Internal version — Slack #oncall

The IC posts in the channel root, then unpins the topic.

:white_check_mark: P{SEVERITY} resolved — {IMPACTED_SERVICES}

INC: {INCIDENT_ID}
Detected:  {TIME_DETECTED}
Mitigated: {TIME_MITIGATED}
Resolved:  {TIME_RESOLVED}
Duration:  {DURATION}
IC: {IC_NAME}

Root cause: {ROOT_CAUSE_INTERNAL}
Mitigation: {MITIGATION_INTERNAL}
RCA owner: {RCA_OWNER}
RCA due: {RCA_DUE_DATE}
RCA filename: {RCA_FILENAME}

{RCA_FILENAME} is the path of the file the IC will create from ../../incidents/0000-template.md: date-prefixed, with a slug that names the failure mode.


RCA timeline (when to expect public RCA)

| Severity | RCA published within |
| --- | --- |
| P0 | 5 business days |
| P1 | 14 business days |
| P2 | not published publicly; internal RCA only when warranted |
| P3 | no RCA |

The internal RCA template is ../../incidents/0000-template.md. The public RCA, when published, is a customer-language summary derived from the internal version — same timeline, same root cause, same corrective actions, but redacted to remove internal service names, file:line citations, and any vendor-contractual detail.
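
The publication deadline follows from the RCA timeline table in this section by counting business days forward from the resolution date. The sketch below assumes that business days are Monday to Friday, that counting starts on the day after resolution, and that public holidays are ignored; none of these rules is stated in this template, so adjust them to your contract. The names `add_business_days` and `rca_due_date` are illustrative.

```python
from datetime import date, timedelta
from typing import Optional

# Business days until the public RCA, per the RCA timeline table.
# P2 and P3 have no public RCA, so they are absent.
RCA_TIMELINE_DAYS = {"P0": 5, "P1": 14}


def add_business_days(start: date, days: int) -> date:
    """Step forward day by day, counting only Monday to Friday (assumption)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def rca_due_date(severity: str, resolved_on: date) -> Optional[date]:
    """Public RCA due date, or None when no public RCA is owed."""
    days = RCA_TIMELINE_DAYS.get(severity)
    return add_business_days(resolved_on, days) if days is not None else None


# Resolved on Wednesday 2024-01-03
print(rca_due_date("P0", date(2024, 1, 3)))  # 2024-01-10
print(rca_due_date("P1", date(2024, 1, 3)))  # 2024-01-23
```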


Variables to fill

| Variable | Notes |
| --- | --- |
| {INCIDENT_ID} | Same as incident-acknowledgement.md |
| {TIME_DETECTED} / {TIME_MITIGATED} / {TIME_RESOLVED} | All ISO 8601 UTC. Mitigated is when the error rate returned to baseline; resolved is the end of the monitoring window. |
| {DURATION} | TIME_RESOLVED - TIME_DETECTED, formatted as Hh Mm (e.g., 1h 23m) |
| {ROOT_CAUSE_SHORT} | One customer-language sentence; same wording as the final progress-update Variant B if you posted one |
| {CORRECTIVE_ACTION_SHORT} | One customer-language sentence; what you committed to changing |
| {IMPACT_SUMMARY} | 2–3 sentences. Customer email only. |
| {RCA_TIMELINE_DAYS} | 5 (P0) or 14 (P1) |
| {RCA_OWNER} | IC by default; reassigned at retro |
| {RCA_DUE_DATE} | TIME_RESOLVED + RCA_TIMELINE_DAYS business days |
| {RCA_FILENAME} | incidents/<YYYY-MM-DD>-<slug>.md |
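
Two of the derived variables, {DURATION} and {RCA_FILENAME}, are mechanical enough to compute rather than type by hand. The sketch below is one way to do it. The helper names are hypothetical, the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption, and the template does not say which date prefixes the filename, so the caller passes it explicitly.

```python
import re
from datetime import date, datetime


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp; older Pythons reject a trailing 'Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def format_duration(detected: str, resolved: str) -> str:
    """TIME_RESOLVED - TIME_DETECTED as 'Hh Mm', e.g. '1h 23m'."""
    delta = parse_utc(resolved) - parse_utc(detected)
    if delta.total_seconds() < 0:
        raise ValueError("resolved time precedes detected time")
    hours, minutes = divmod(int(delta.total_seconds() // 60), 60)
    return f"{hours}h {minutes}m"


def rca_filename(incident_date: date, failure_mode: str) -> str:
    """Build incidents/<YYYY-MM-DD>-<slug>.md from a failure-mode phrase."""
    slug = re.sub(r"[^a-z0-9]+", "-", failure_mode.lower()).strip("-")
    return f"incidents/{incident_date.isoformat()}-{slug}.md"


print(format_duration("2024-01-01T10:00:00Z", "2024-01-01T11:23:00Z"))  # 1h 23m
print(rca_filename(date(2024, 1, 1), "Connection pool exhausted"))
```

Seconds are truncated rather than rounded, so a 59-second remainder does not push the reported duration up by a minute.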

Don'ts

  • Don't publish a resolution before the monitoring window in incident-progress-update.md Variant D has elapsed cleanly.
  • Don't skip the RCA promise. Even if the cause is trivial, the public RCA timeline commitment is part of the contract.
  • Don't disclose vendor names in the public version. "A configuration issue with one of our infrastructure providers" is the right framing if a vendor was at fault.
  • Don't commit to corrective actions you can't actually ship. Be specific only about actions already in flight; otherwise use the framing "we are reviewing the load patterns to prevent recurrence."

Cross-references