Template: Service Degradation (P2)

For partial degradations that aren't full outages — a single feature is slow or unreliable, a subset of users sees errors, a non-critical endpoint is failing. This is the P2 comms template per the README.md decision tree.

Approval gate: P2 templates send without sign-off. The customer team's first responder posts directly. Escalate to P1 (and use incident-acknowledgement.md) if the scope grows during the incident.


When to use this template (vs. P0/P1 templates)

| Situation | Template |
|---|---|
| Feature is slow but functional for everyone | This template (single notice; update only if scope grows) |
| Feature fails for a subset of users (fewer than 10%) | This template |
| Feature is unavailable for everyone | incident-acknowledgement.md (P1) |
| Single user reports a bug | No public comm; ticket reply only (P3) |
| Cosmetic issue, no user-visible impact | No public comm |

The line between P2 and P1 is scope, not symptom. If 5% of users can't sign in, that's P2. If 100% of users can't sign in, that's P1 — promote and switch templates immediately.
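The decision table above can be sketched as a small function. This is only an illustration, not part of any existing tooling: the names `Comm`, `choose_comm`, and the parameters are hypothetical. The 10% boundary comes from the table; the table does not say how to classify failures that affect between 10% and 100% of users, so treating everything at or above 10% as P1 is an assumption of this sketch.

```python
from enum import Enum


class Comm(Enum):
    NO_PUBLIC_COMM = "no public comm"
    TICKET_REPLY = "ticket reply only (P3)"
    P2_TEMPLATE = "service degradation template (P2)"
    P1_TEMPLATE = "incident-acknowledgement.md (P1)"


# Boundary taken from the table ("fewer than 10% of users"). Treating
# everything at or above it as P1 is an assumption of this sketch.
P2_MAX_FAILING_FRACTION = 0.10


def choose_comm(user_visible: bool, reporting_users: int,
                failing_fraction: float) -> Comm:
    """Map incident scope to a comms path.

    failing_fraction is the share of users for whom the feature fails
    outright (0.0 when it is slow but still functional).
    """
    if not user_visible:
        return Comm.NO_PUBLIC_COMM
    # Scope, not symptom, decides P1 vs. P2, so check it first.
    if failing_fraction >= P2_MAX_FAILING_FRACTION:
        return Comm.P1_TEMPLATE
    if reporting_users <= 1:
        return Comm.TICKET_REPLY
    return Comm.P2_TEMPLATE
```

For example, `choose_comm(True, 5000, 0.05)` returns `Comm.P2_TEMPLATE`, while the same call with `failing_fraction=1.0` returns `Comm.P1_TEMPLATE`.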


Public version — status page

Title: Investigating — degraded performance on {IMPACTED_FEATURE}

We are investigating reports of degraded performance affecting {IMPACTED_FEATURE} on {CUSTOMER_NAME}.

Current status: investigating (degraded service)
Some users may experience: {USER_VISIBLE_SYMPTOM}
Workaround (if available): {WORKAROUND_OR_NONE}
Estimated time to resolution: {ETA_RANGE}
Next update: when status changes, or by {NEXT_UPDATE_BY} if no change

Most users are not affected. We will update this page when we have more information.

{USER_VISIBLE_SYMPTOM} is one customer-language sentence describing what an affected user sees:

  • ✅ "longer-than-usual loading times when viewing bet history"
  • ✅ "intermittent failures when transferring funds to the vault"
  • ✅ "delayed leaderboard updates after placing bets"
  • ❌ "5xx error rate elevated on /bets" (internal)

{WORKAROUND_OR_NONE} — be honest. If there isn't one, "no workaround is currently available; the issue does not affect new bets" is fine.

{ETA_RANGE} — always a range, with an upper bound. For P2 the range can be wider than P0/P1 ("1–4 hours" is acceptable; "by end of business day" is acceptable).
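If the status page text is filled in by a script rather than by hand, a guard against posting a notice with an unfilled placeholder is useful. The following is a minimal sketch under that assumption: the `render` helper and the abbreviated `STATUS_BODY` string are hypothetical, and only the placeholder names come from the template above.

```python
import re

# Matches placeholders such as {IMPACTED_FEATURE}; names are upper-case
# letters and underscores, as in the template.
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")

# Abbreviated excerpt of the status page body, for illustration only.
STATUS_BODY = (
    "We are investigating reports of degraded performance affecting "
    "{IMPACTED_FEATURE} on {CUSTOMER_NAME}.\n"
    "Some users may experience: {USER_VISIBLE_SYMPTOM}\n"
    "Workaround (if available): {WORKAROUND_OR_NONE}\n"
    "Estimated time to resolution: {ETA_RANGE}\n"
)


def render(template: str, values: dict[str, str]) -> str:
    """Fill every placeholder, refusing to render if any is missing or blank."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if not values.get(name, "").strip()})
    if missing:
        raise ValueError("unfilled placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)
```

Calling `render(STATUS_BODY, values)` with a `values` dict that lacks `ETA_RANGE` raises `ValueError` instead of publishing a literal `{ETA_RANGE}`.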


Public version — customer email (SLA-bound only, optional)

For SLA-bound customers, P2 generally doesn't require email — the status page is sufficient. Send email only if:

  • The customer has a contractual right to email notification at P2 ({{TBD: contractual specifics, customer-team to confirm}}), or
  • The degradation directly affects the customer's day-to-day operation in a way that the status page wouldn't surface.

Subject: [P2 service degradation] {IMPACTED_FEATURE}

Hello,

We are investigating a partial degradation of {IMPACTED_FEATURE} on {CUSTOMER_NAME}.

Some users may experience: {USER_VISIBLE_SYMPTOM}
Workaround: {WORKAROUND_OR_NONE}
Estimated time to resolution: {ETA_RANGE}

Most users are not affected. Live updates: {STATUS_PAGE_URL}.

Regards,
{CUSTOMER_NAME} Operations

Internal version — Slack #oncall

P2 degradation — {IMPACTED_FEATURE}

INC: {INCIDENT_ID}
Detected: {TIME_DETECTED}
Owner (Tier 1): {RESPONDER}
Status page: posted ({STATUS_PAGE_URL})
Symptom: {SYMPTOM_INTERNAL}
Hypothesis: {HYPOTHESIS_OR_NONE}
Working ticket: {TICKET_LINK}

Watch for scope growth — promote to P1 if {PROMOTION_TRIGGER}.
Investigation thread :point_down:

{PROMOTION_TRIGGER} is the explicit upgrade condition the responder commits to watching. Examples: "if error rate crosses 10%", "if affected user count exceeds 1k", "if duration exceeds 4 hours". Don't leave it unstated — without a trigger, P2 incidents drift.
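One way to keep the trigger explicit is to write it down as data and check it against current numbers. The sketch below is hypothetical: `PromotionTrigger` and `promotion_reasons` are illustrative names, and the default thresholds simply mirror the three examples above rather than any agreed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionTrigger:
    # Defaults mirror the example triggers in the text; real values are
    # whatever the responder commits to in the Slack post.
    max_error_rate: float = 0.10
    max_affected_users: int = 1_000
    max_duration_hours: float = 4.0


def promotion_reasons(trigger: PromotionTrigger, error_rate: float,
                      affected_users: int, duration_hours: float) -> list[str]:
    """Return the trigger conditions that have fired; empty means stay at P2."""
    reasons = []
    if error_rate > trigger.max_error_rate:
        reasons.append(f"error rate {error_rate:.0%} crossed "
                       f"{trigger.max_error_rate:.0%}")
    if affected_users > trigger.max_affected_users:
        reasons.append(f"affected users {affected_users} exceed "
                       f"{trigger.max_affected_users}")
    if duration_hours > trigger.max_duration_hours:
        reasons.append(f"duration {duration_hours}h exceeds "
                       f"{trigger.max_duration_hours}h")
    return reasons
```

Any non-empty result means the incident should be promoted to P1 and the template switched.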


Workaround communication

If a workaround exists, communicate it carefully — workarounds that are wrong cause more harm than no workaround.

| Workaround pattern | Wording template |
|---|---|
| Retry-after-N-seconds | "If you encounter the issue, please retry after {N} seconds." |
| Use-different-path | "Affected users can use {ALTERNATIVE} as an alternative until the issue is resolved." |
| Wait-it-out | "No action is required; the issue will resolve automatically." |
| Clear-cache / re-login | "If you experience the issue, signing out and signing back in resolves it for most users." |

Don't suggest a workaround you haven't validated. Confirm with the on-call engineer that the workaround actually works before publishing.


ETA framing — what's allowed

P2 incidents often last hours. The ETA framing rules:

  • ✅ "Estimated time to resolution: 1–4 hours."
  • ✅ "We expect resolution by end of business day."
  • ✅ "We are evaluating multiple fix paths; we will provide an updated ETA when we have one."
  • ❌ "We expect resolution within 30 minutes." (a short, concrete promise; overshooting it looks bad)
  • ❌ "The issue will be fixed soon." (meaningless)

If it becomes clear that the issue will not be resolved within the stated ETA, post an updated estimate before the original deadline passes, never after.


When to update the status page

The status page entry stays in investigating (degraded service) state until:

  • Resolved: the fix is confirmed to have landed; transition to a single resolution post (use the incident-resolved.md format, but lighter: no formal RCA timeline for P2 unless the customer requests it).
  • Promoted to P1: scope grew. Close this status entry with "scope expanded — see incident {NEW_INCIDENT_ID}" and use incident-acknowledgement.md for the new entry.
  • Re-classified as expected behavior: rare, but possible — what looked like a degradation is actually a known limitation. Close with one sentence linking to the relevant doc.

Don't let a P2 entry sit on the status page for more than 24 hours without an update. Either it has been resolved, or it warrants a fresh statement of what's still being investigated.
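The 24-hour rule is easy to check mechanically, for instance from a scheduled job that scans open P2 entries. The function below is a sketch only; its name and the idea of running it from a job are assumptions, and it expects timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone

# Longest a P2 entry may go without an update, per the rule above.
MAX_SILENCE = timedelta(hours=24)


def needs_fresh_update(last_update: datetime, now: datetime) -> bool:
    """True when an open P2 entry has gone more than 24 hours without an update."""
    return now - last_update > MAX_SILENCE


# Example: an entry last updated 25 hours ago is overdue.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
overdue = needs_fresh_update(now - timedelta(hours=25), now)
```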


Variables to fill

| Variable | Notes |
|---|---|
| {INCIDENT_ID} | Internal incident ID |
| {IMPACTED_FEATURE} | Customer-language feature name. "Bet history", "Wallet transfers", "Leaderboards". Not the internal service or endpoint. |
| {USER_VISIBLE_SYMPTOM} | One customer-language sentence: the user-side observation, not the engineering signal |
| {WORKAROUND_OR_NONE} | Specific workaround or "no workaround is currently available" |
| {ETA_RANGE} | Always a range; wider than P0/P1 is OK |
| {NEXT_UPDATE_BY} | Default +4 hours for P2 |
| {PROMOTION_TRIGGER} | Internal only; the explicit condition that escalates this to P1 |
| {TICKET_LINK} | Working ticket in the team tracker |

Cross-references