
The Incident Response Playbook: From Detection to Post-Mortem

January 18, 2025 · By Steve Winter · 19 min read

A battle-tested framework for handling production incidents—from the first alert to the blameless post-mortem. Includes severity classification, escalation playbooks, communication templates, and lessons from real outages.

When Production Breaks at 2 AM

Your phone vibrates. PagerDuty alert: "API Error Rate above 10%". Your first thought: "How bad is this?" Your second: "Who do I wake up?"

By the time you assess severity, identify the root cause, and coordinate a fix, an hour has passed. Customers are angry. The sales team is panicking. The CEO is asking questions you can't answer yet.

You need a playbook—not just for fixing incidents, but for managing the chaos that comes with them.

Why Most Incident Response Fails

The technical fix is usually straightforward: roll back the deploy, restart the service, scale up capacity. The hard part is everything else:

  • Severity assessment takes 20 minutes because there's no clear definition
  • Escalation is ad-hoc ("Should I wake the CTO?")
  • Communication is fragmented (Slack, email, PagerDuty, calls)
  • Customer updates are delayed or confusing
  • Post-mortems never happen or turn into blame sessions

Without a framework, every incident becomes a crisis.

The Complete Incident Response Framework

Part 1: Severity Classification (The Foundation)

Every incident needs a severity level within 5 minutes of detection. This drives everything else: who gets paged, how fast you respond, and whether you wake the CEO.

The Severity Matrix

SEV1: Critical

  • Impact: Complete outage or data loss
  • Customers Affected: Over 50% or all enterprise customers
  • Revenue Impact: Direct revenue loss
  • Response Time: Immediate
  • Escalation: Page on-call + EM + CTO
  • Communication: Every 30 minutes

Examples:

  • Login is down (nobody can access the product)
  • Payment processing failures
  • Data breach or security incident
  • Complete service outage

SEV2: High

  • Impact: Major feature broken or severe degradation
  • Customers Affected: 10-50% or multiple key accounts
  • Revenue Impact: Indirect (churn risk)
  • Response Time: 15 minutes
  • Escalation: Page on-call + EM
  • Communication: Hourly updates

Examples:

  • Search is down but core product works
  • API latency 10x normal
  • Email delivery delayed 2+ hours
  • Database read replica failed (redundancy lost)

SEV3: Medium

  • Impact: Minor feature broken or performance issue
  • Customers Affected: Under 10% or specific segment
  • Revenue Impact: None immediate
  • Response Time: 1 hour
  • Escalation: Notify on-call (no page)
  • Communication: Initial + resolution updates only

Examples:

  • Export to PDF broken
  • Dashboard loading slowly
  • Non-critical API endpoints erroring
  • Scheduled job delayed

SEV4: Low

  • Impact: Cosmetic or minimal functionality
  • Customers Affected: Few or none
  • Revenue Impact: None
  • Response Time: Next business day
  • Escalation: Create ticket
  • Communication: None (internal only)

Examples:

  • Typo on landing page
  • Broken link in documentation
  • Internal dashboard bug
  • Non-customer-facing error

Quick Severity Decision Tree

Is the product completely unusable?
  YES -> SEV1
  NO:

Can customers complete their core workflow?
  NO -> SEV2
  YES:

Is customer-facing functionality degraded (beyond a cosmetic issue)?
  YES -> SEV3
  NO -> SEV4
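
To remove ambiguity under pressure, the tree is simple enough to encode directly. A minimal Python sketch of the same three questions (function and parameter names are illustrative, not taken from any existing tool):

def classify_severity(completely_unusable: bool,
                      core_workflow_works: bool,
                      customer_facing_degraded: bool) -> str:
    """Map the decision tree above to a severity label.

    The three flags mirror the three questions; wording and thresholds
    are illustrative rather than tied to a specific monitoring setup.
    """
    if completely_unusable:
        return "SEV1"
    if not core_workflow_works:
        return "SEV2"
    if customer_facing_degraded:
        return "SEV3"
    return "SEV4"

# Core workflow works, but a customer-facing feature is degraded -> SEV3
print(classify_severity(False, True, True))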

Template for Slack:

⚠️ INCIDENT DECLARED - SEV2
Service: API
Impact: 500 errors on /api/orders
Customers: ~30% seeing errors
IC: @alice
War Room: #incident-2025-01-18-001
Status Page: Updated
Next Update: 11:30 AM
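
Declaring the incident can be scripted so nobody types the template from scratch at 2 AM. A sketch that posts the declaration through a Slack incoming webhook (the webhook URL is a placeholder you would create in your own workspace):

import requests

# Placeholder: create a real incoming webhook in your Slack workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def declare_incident(severity: str, service: str, impact: str, customers: str,
                     ic: str, war_room: str, next_update: str) -> None:
    """Post the incident-declaration template to a Slack channel."""
    text = (
        f":warning: INCIDENT DECLARED - {severity}\n"
        f"Service: {service}\n"
        f"Impact: {impact}\n"
        f"Customers: {customers}\n"
        f"IC: {ic}\n"
        f"War Room: {war_room}\n"
        f"Status Page: Updated\n"
        f"Next Update: {next_update}"
    )
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()

declare_incident("SEV2", "API", "500 errors on /api/orders",
                 "~30% seeing errors", "@alice",
                 "#incident-2025-01-18-001", "11:30 AM")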

Part 2: The Incident Commander System

Key Principle: One person owns the incident, even if they're not the most technical.

Incident Commander (IC) Responsibilities

NOT:

  • ❌ Fixing the incident (that's the engineer's job)
  • ❌ Finding root cause (happens later)

YES:

  • ✅ Coordinating response
  • ✅ Making executive decisions (rollback now vs investigate)
  • ✅ Managing communication
  • ✅ Timekeeping and documentation

The IC Rotation

Who Can Be IC:

  • Senior engineers (familiar with systems)
  • Engineering managers
  • Technical product managers
  • NOT: Junior engineers (in their first 6 months)

Rotation Schedule:

Week 1: Alice (IC Primary), Bob (IC Shadow)
Week 2: Bob (IC Primary), Carol (IC Shadow)
Week 3: Carol (IC Primary), David (IC Shadow)
...

Shadow IC: Observes, takes notes, learns—becomes primary next week

IC Checklist (First 15 Minutes)

  • [ ] Declare severity level
  • [ ] Create war room (#incident-YYYY-MM-DD-NNN)
  • [ ] Assign roles:
    • [ ] IC (you)
    • [ ] Tech Lead (debugging)
    • [ ] Comms Lead (customer updates)
    • [ ] Scribe (timeline notes)
  • [ ] Update status page
  • [ ] Page additional responders if needed
  • [ ] Set timer for next update (30 min for SEV1, 60 min for SEV2)

Part 3: The War Room Protocol

All incident coordination happens in ONE place: The war room channel

War Room Setup (Template)

Channel Name: #incident-2025-01-18-001 (date + sequence)

Channel Topic:

SEV2 | API Errors | IC: @alice | Started: 10:15 AM | Customers: ~30%

Pinned Message (Living Document):

## INCIDENT SUMMARY

**Severity**: SEV2
**Status**: INVESTIGATING
**Started**: 2025-01-18 10:15 AM UTC
**Service**: API (orders endpoint)
**Impact**: 30% of order requests failing with 500 errors

## ROLES
- IC: @alice
- Tech Lead: @bob (debugging)
- Comms: @carol (status page, support)
- Scribe: @david (timeline)

## TIMELINE
10:15 - Alert triggered (error rate above 10%)
10:17 - IC declared SEV2
10:20 - War room created
10:25 - Initial investigation: DB connection pool exhausted
10:35 - Scaled DB connections from 100 → 200
10:40 - Error rate dropped to 2%
10:45 - RESOLVED

## CUSTOMER IMPACT
- 2,450 failed order attempts
- Peak error rate: 35%
- Duration: 30 minutes

## NEXT STEPS
- [ ] Post-mortem scheduled: 2025-01-19 2 PM
- [ ] Root cause: Connection pool sizing
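
The channel creation, topic, and pinned summary stub can also be scripted. A sketch using the Slack Web API via slack_sdk, assuming a bot token with channel-management, chat, and pins scopes; the channel name follows the convention above and the pinned stub is just a starting point for the Scribe to keep updated:

import os
from datetime import date
from slack_sdk import WebClient  # pip install slack_sdk

# Assumes SLACK_BOT_TOKEN has scopes to create channels, post, and pin.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_war_room(sequence: int, severity: str, summary: str, ic: str) -> str:
    """Create the war room channel, set its topic, and pin a summary stub."""
    name = f"incident-{date.today().isoformat()}-{sequence:03d}"
    channel = client.conversations_create(name=name)["channel"]["id"]
    client.conversations_setTopic(
        channel=channel,
        topic=f"{severity} | {summary} | IC: {ic}",
    )
    stub = client.chat_postMessage(
        channel=channel,
        text=f"INCIDENT SUMMARY\nSeverity: {severity}\nStatus: INVESTIGATING\nIC: {ic}",
    )
    client.pins_add(channel=channel, timestamp=stub["ts"])
    return channel

# open_war_room(1, "SEV2", "API Errors", "@alice")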

Communication Rules

DO:

  • ✅ All updates in war room thread
  • ✅ Tag people for specific asks
  • ✅ Use threaded replies (keeps main channel clean)

DON'T:

  • ❌ DM the IC (they won't see it)
  • ❌ Discuss in multiple channels
  • ❌ Debate solutions (try it or defer to post-mortem)

Part 4: Customer Communication Templates

Status Page Update Template

Initial Update (Within 5 minutes of SEV1/SEV2):

🔴 Investigating - API Errors

We are investigating an increase in API errors affecting order processing.
Some customers may experience failed transactions.

Our team is actively working on a resolution.

Started: 10:15 AM UTC
Next update: 10:45 AM UTC

Progress Update:

🟡 Identified - API Errors

We've identified the root cause as a database connection issue and are
implementing a fix. Error rates have decreased from 35% to 10%.

Current impact: Orders may still fail intermittently.

Next update: 11:15 AM UTC or when resolved.

Resolution Update:

🟢 Resolved - API Errors

The incident has been resolved. All services are operating normally.

Summary:
- Duration: 30 minutes (10:15 AM - 10:45 AM UTC)
- Cause: Database connection pool exhausted during traffic spike
- Resolution: Increased connection pool capacity
- Impact: ~2,450 failed order attempts

We apologize for the disruption and have implemented additional monitoring
to prevent recurrence.
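
Posting these updates can be scripted so the 5-minute target holds even mid-incident. A sketch assuming Atlassian Statuspage; the endpoint and payload shape follow its public v1 incidents API, but verify both against your own provider's documentation:

import os
import requests

PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]
API_KEY = os.environ["STATUSPAGE_API_KEY"]

def post_incident_update(name: str, status: str, body: str) -> None:
    """Open a status page incident (status: investigating, identified,
    monitoring, or resolved)."""
    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={"incident": {"name": name, "status": status, "body": body}},
        timeout=10,
    )
    resp.raise_for_status()

post_incident_update(
    "API Errors",
    "investigating",
    "We are investigating an increase in API errors affecting order processing.",
)

Note that this sketch only opens the incident; progress and resolution updates are separate calls against the created incident's ID in the same API.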

Support Team Template

Subject: INCIDENT ALERT - SEV2 - API Errors

Priority: HIGH
Status: INVESTIGATING

**What customers will see:**
- Failed order submissions
- "Something went wrong" error message
- Orders not appearing in dashboard

**What to tell customers:**
- "We're aware of an issue affecting order processing and are working on it"
- "No data has been lost - your order will complete once resolved"
- ETA: Within 1 hour

**Do NOT say:**
- Database is down (too technical)
- We don't know what's wrong (creates panic)

**Escalation:**
If customer is enterprise or threatening to churn → Alert @sales

**Updates:**
War room: #incident-2025-01-18-001
Next update: 11:00 AM

Part 5: The Decision Framework

Incidents require fast decisions. Use this framework:

Rollback vs Fix Forward

Rollback IF:

  • ✅ Deploy was in last 2 hours
  • ✅ Previous version was stable
  • ✅ Rollback takes under 5 minutes
  • ✅ You're not 100% sure of the fix

Fix Forward IF:

  • ✅ Root cause is clear and simple
  • ✅ Fix can be deployed in under 10 minutes
  • ✅ Rollback would cause data loss
  • ✅ Issue existed before last deploy

Example:

  • Bug introduced in deploy 30 minutes ago → Rollback
  • Database migration failing → Fix forward (can't undo migration)
  • Config change caused issue → Rollback to previous config
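
The heuristic is easy to encode as a quick checklist the IC can run through. A sketch with illustrative thresholds lifted from the lists above (the 2-hour deploy window and 5-minute rollback):

def should_rollback(minutes_since_deploy: float,
                    previous_version_stable: bool,
                    rollback_minutes: float,
                    fix_is_certain: bool,
                    rollback_loses_data: bool,
                    issue_predates_deploy: bool) -> bool:
    """Return True to roll back, False to fix forward.

    Thresholds mirror the checklist above; tune them to your release process.
    """
    if rollback_loses_data or issue_predates_deploy:
        return False  # fix forward
    return (minutes_since_deploy <= 120
            and previous_version_stable
            and rollback_minutes <= 5
            and not fix_is_certain)

# Bug introduced by a deploy 30 minutes ago, stable previous version:
print(should_rollback(30, True, 3, False, False, False))  # True -> roll back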

Scale Up vs Debug

Scale Up First IF:

  • Traffic spike overwhelming system
  • Error rate above 20%
  • Scaling takes under 2 minutes (horizontal pod autoscaling)

Debug First IF:

  • Error rate under 10%
  • Scaling won't help (logic bug)
  • Already at max capacity

Parallel Approach (SEV1):

  • IC: "Bob, start scaling up API pods"
  • IC: "Alice, debug root cause in parallel"
  • IC: "Whoever finishes first, we go with that solution"

Escalation Decision Points

Page the CTO if:

  • SEV1 and not resolved in 30 minutes
  • Security breach or data loss
  • Media/PR implications
  • Customer demanding executive attention

Page CEO if:

  • Multi-hour SEV1 outage
  • Security breach with data exfiltration
  • Legal/regulatory notification required

Part 6: The Blameless Post-Mortem

When: Within 48 hours of resolution (while memory is fresh)

Who Attends:

  • IC
  • All incident responders
  • Engineering leads
  • Product/design if customer-facing

Duration: 60 minutes max

The 5-Why Root Cause Analysis

Example Incident: API errors due to DB connection exhaustion

Why did the API fail?
→ Database connection pool was exhausted

Why was the connection pool exhausted?
→ Traffic spiked 3x normal levels

Why didn't the connection pool scale?
→ It was hard-coded to 100 connections

Why was it hard-coded?
→ Original configuration from 2 years ago, never updated

Why wasn't it updated?
→ No monitoring alert for connection pool saturation

ROOT CAUSE: Missing monitoring + static configuration

Post-Mortem Template

# Post-Mortem: API Errors - 2025-01-18

**Severity**: SEV2
**Duration**: 30 minutes (10:15 AM - 10:45 AM UTC)
**Impact**: 2,450 failed orders, ~$12K revenue loss

## Summary

A traffic spike exhausted our database connection pool, causing 35% of
API requests to fail. The incident was resolved by increasing the pool
size; recurrence is being prevented by moving to dynamic pool scaling.

## Timeline

- 10:15 - Monitoring alert: API error rate above 10%
- 10:17 - IC declared SEV2, created war room
- 10:20 - Status page updated
- 10:25 - Root cause identified: DB connection pool
- 10:35 - Scaled pool from 100 → 200 connections
- 10:40 - Error rate dropped to normal (under 2%)
- 10:45 - Incident resolved

## Root Cause (5 Whys)

[See above example]

## What Went Well

- ✅ Fast detection (2 minutes from spike to alert)
- ✅ Clear IC leadership
- ✅ Good war room discipline
- ✅ Quick fix once root cause identified

## What Went Wrong

- ❌ No proactive connection pool monitoring
- ❌ Static configuration not reviewed in 2 years
- ❌ Delay in initial status page update (10 minutes)

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool monitoring | @bob | 2025-01-20 | P0 |
| Implement dynamic pool scaling | @alice | 2025-01-25 | P0 |
| Audit all static configs | @team | 2025-02-01 | P1 |
| Automate status page updates | @carol | 2025-01-22 | P1 |

## Lessons Learned

1. **Monitoring gap**: We monitor error rates but not resource saturation
2. **Configuration debt**: Static configs become stale over time
3. **Communication**: War room worked well, but status page was slow

## Prevention

- Dynamic connection pooling based on traffic
- Weekly config review process
- Added to incident response checklist: Update status page within 5 min

Key: Focus on systems, not people. Say "the configuration was static," not "Bob forgot to update the config."


Incident Response Metrics

Track these to improve over time:

Response Metrics

  • MTTD (Mean Time to Detect): Issue onset to alert fired
  • MTTA (Mean Time to Acknowledge): Alert fired to human acknowledgment (IC assigned)
  • MTTI (Mean Time to Investigate): Acknowledgment to root cause identified
  • MTTR (Mean Time to Resolve): Root cause identified to full resolution

Targets:

  • MTTD: Under 5 minutes (automated alerting)
  • MTTA: Under 10 minutes (on-call response)
  • MTTI: Under 30 minutes for SEV1, under 2 hours for SEV2
  • MTTR: Under 1 hour for SEV1, under 4 hours for SEV2
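
All four metrics fall straight out of the war room timeline if the Scribe records timestamps consistently. A sketch that computes them from per-incident records (field names are illustrative; adapt them to whatever your incident tracker exports):

from datetime import datetime
from statistics import mean

# One record per incident; timestamps match the metric boundaries above.
incidents = [
    {
        "started":      datetime(2025, 1, 18, 10, 13),
        "alerted":      datetime(2025, 1, 18, 10, 15),
        "acknowledged": datetime(2025, 1, 18, 10, 17),
        "root_cause":   datetime(2025, 1, 18, 10, 25),
        "resolved":     datetime(2025, 1, 18, 10, 45),
    },
]

def mean_minutes(start_field: str, end_field: str) -> float:
    """Average gap between two timestamps across all incidents, in minutes."""
    return mean(
        (i[end_field] - i[start_field]).total_seconds() / 60 for i in incidents
    )

print(f"MTTD: {mean_minutes('started', 'alerted'):.1f} min")
print(f"MTTA: {mean_minutes('alerted', 'acknowledged'):.1f} min")
print(f"MTTI: {mean_minutes('acknowledged', 'root_cause'):.1f} min")
print(f"MTTR: {mean_minutes('root_cause', 'resolved'):.1f} min")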

Quality Metrics

  • Repeat Incidents: Same root cause within 90 days (under 10%)
  • Action Item Completion: Post-mortem tasks done on time (over 90%)
  • False Positive Rate: Alerts that aren't real incidents (under 20%)

On-Call Best Practices

Rotation Structure

Primary + Secondary:

  • Primary on-call: First responder (pages immediately)
  • Secondary on-call: Backup if primary doesn't ack in 5 minutes
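
The primary-then-secondary escalation is just a timed loop. A sketch with placeholder paging and acknowledgment hooks; nothing here is tied to a specific vendor, so swap in your paging provider's real API calls:

import time

def send_page(person: str, message: str) -> None:
    """Placeholder: call your paging provider here."""
    print(f"Paging {person}: {message}")

def acked(person: str) -> bool:
    """Placeholder: poll your paging provider for an acknowledgment."""
    return False

def page_with_escalation(primary: str, secondary: str, message: str,
                         ack_timeout_minutes: int = 5) -> None:
    """Page the primary on-call; escalate to the secondary if no ack in time."""
    send_page(primary, message)
    deadline = time.time() + ack_timeout_minutes * 60
    while time.time() < deadline:
        if acked(primary):
            return
        time.sleep(30)  # poll every 30 seconds
    send_page(secondary, f"ESCALATED (primary did not ack): {message}")

# page_with_escalation("@alice", "@bob", "SEV2: API error rate above 10%")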

Duration: 1 week shifts (Monday 9 AM to Monday 9 AM)

Compensation:

  • $X per on-call shift (even if no incidents)
  • $Y per hour during incident response
  • Or: Time-in-lieu (1.5x hours worked)

On-Call Handoff Template

Hey @next-person, you're on-call starting Monday 9 AM.

**Recent Incidents**:
- 2025-01-18: SEV2 API errors (resolved, post-mortem done)
- 2025-01-15: SEV3 slow dashboard (still investigating)

**Ongoing Issues**:
- Database replica lag higher than normal (monitoring)
- Memory leak in worker service (fix deploying Friday)

**Deploy Schedule**:
- Tuesday 2 PM: API v2.1.0
- Thursday 10 AM: Database migration

**Escalation Contacts**:
- IC: me (until Monday 9 AM), then you
- CTO: @cto (SEV1 over 30min or security)
- Database: @dba (if DB-related)

**Good luck! Last week was quiet, hopefully this one is too.**

Download: Incident Response Pack

[Coming soon: Complete template bundle]

Includes:

  • Severity classification matrix
  • IC runbook (checklist format)
  • War room setup script
  • Customer communication templates
  • Post-mortem template
  • On-call handoff template
  • Action item tracker

Remember: The goal isn't zero incidents—it's fast resolution and continuous improvement. Every incident is a learning opportunity. Document, learn, improve.