
The Incident Response Playbook: From Detection to Post-Mortem

January 18, 2025 · By Steve Winter · 19 min read

A battle-tested framework for handling production incidents—from the first alert to the blameless post-mortem. Includes severity classification, escalation playbooks, communication templates, and lessons from real outages.

When Production Breaks at 2 AM

Your phone vibrates. PagerDuty alert: "API Error Rate above 10%". Your first thought: "How bad is this?" Your second: "Who do I wake up?"

By the time you assess severity, identify the root cause, and coordinate a fix, an hour has passed. Customers are angry. The sales team is panicking. The CEO is asking questions you can't answer yet.

You need a playbook—not just for fixing incidents, but for managing the chaos that comes with them.

Why Most Incident Response Fails

The technical fix is usually straightforward: roll back the deploy, restart the service, scale up capacity. The hard part is everything else:

  • Severity assessment takes 20 minutes because there's no clear definition
  • Escalation is ad-hoc ("Should I wake the CTO?")
  • Communication is fragmented (Slack, email, PagerDuty, calls)
  • Customer updates are delayed or confusing
  • Post-mortems never happen or turn into blame sessions

Without a framework, every incident becomes a crisis.

The Complete Incident Response Framework

Part 1: Severity Classification (The Foundation)

Every incident needs a severity level within 5 minutes of detection. This drives everything else: who gets paged, how fast you respond, and whether you wake the CEO.

The Severity Matrix

SEV1: Critical

  • Impact: Complete outage or data loss
  • Customers Affected: Over 50% or all enterprise customers
  • Revenue Impact: Direct revenue loss
  • Response Time: Immediate
  • Escalation: Page on-call + EM + CTO
  • Communication: Every 30 minutes

Examples:

  • Login is down (nobody can access the product)
  • Payment processing failures
  • Data breach or security incident
  • Complete service outage

SEV2: High

  • Impact: Major feature broken or severe degradation
  • Customers Affected: 10-50% or multiple key accounts
  • Revenue Impact: Indirect (churn risk)
  • Response Time: 15 minutes
  • Escalation: Page on-call + EM
  • Communication: Hourly updates

Examples:

  • Search is down but core product works
  • API latency 10x normal
  • Email delivery delayed 2+ hours
  • Database read replica failed (redundancy lost)

SEV3: Medium

  • Impact: Minor feature broken or performance issue
  • Customers Affected: Under 10% or specific segment
  • Revenue Impact: None immediate
  • Response Time: 1 hour
  • Escalation: Notify on-call (no page)
  • Communication: Initial + resolution updates only

Examples:

  • Export to PDF broken
  • Dashboard loading slowly
  • Non-critical API endpoints erroring
  • Scheduled job delayed

SEV4: Low

  • Impact: Cosmetic or minimal functionality
  • Customers Affected: Few or none
  • Revenue Impact: None
  • Response Time: Next business day
  • Escalation: Create ticket
  • Communication: None (internal only)

Examples:

  • Typo on landing page
  • Broken link in documentation
  • Internal dashboard bug
  • Non-customer-facing error

Quick Severity Decision Tree

Is the product completely unusable?
  YES -> SEV1
  NO:

Can customers complete their core workflow?
  NO -> SEV2
  YES:

Is customer-facing functionality degraded (beyond a cosmetic issue)?
  YES -> SEV3
  NO -> SEV4
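
To remove ambiguity under pressure, the tree is simple enough to encode directly. A minimal Python sketch of the same three questions (function and parameter names are illustrative, not taken from any existing tool):

def classify_severity(completely_unusable: bool,
                      core_workflow_works: bool,
                      customer_facing_degraded: bool) -> str:
    """Map the decision tree above to a severity label.

    The three flags mirror the three questions; wording and thresholds
    are illustrative rather than tied to a specific monitoring setup.
    """
    if completely_unusable:
        return "SEV1"
    if not core_workflow_works:
        return "SEV2"
    if customer_facing_degraded:
        return "SEV3"
    return "SEV4"

# Core workflow works, but a customer-facing feature is degraded -> SEV3
print(classify_severity(False, True, True))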

Template for Slack:

⚠️ INCIDENT DECLARED - SEV2
Service: API
Impact: 500 errors on /api/orders
Customers: ~30% seeing errors
IC: @alice
War Room: #incident-2025-01-18-001
Status Page: Updated
Next Update: 11:30 AM
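
Declaring the incident can be scripted so nobody types the template from scratch at 2 AM. A sketch that posts the declaration through a Slack incoming webhook (the webhook URL is a placeholder you would create in your own workspace):

import requests

# Placeholder: create a real incoming webhook in your Slack workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def declare_incident(severity: str, service: str, impact: str, customers: str,
                     ic: str, war_room: str, next_update: str) -> None:
    """Post the incident-declaration template to a Slack channel."""
    text = (
        f":warning: INCIDENT DECLARED - {severity}\n"
        f"Service: {service}\n"
        f"Impact: {impact}\n"
        f"Customers: {customers}\n"
        f"IC: {ic}\n"
        f"War Room: {war_room}\n"
        f"Status Page: Updated\n"
        f"Next Update: {next_update}"
    )
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()

declare_incident("SEV2", "API", "500 errors on /api/orders",
                 "~30% seeing errors", "@alice",
                 "#incident-2025-01-18-001", "11:30 AM")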

Part 2: The Incident Commander System

Key Principle: One person owns the incident, even if they're not the most technical.

Incident Commander (IC) Responsibilities

NOT:

  • ❌ Fixing the incident (that's the engineer's job)
  • ❌ Finding root cause (happens later)

YES:

  • ✅ Coordinating response
  • ✅ Making executive decisions (rollback now vs investigate)
  • ✅ Managing communication
  • ✅ Timekeeping and documentation

The IC Rotation

Who Can Be IC:

  • Senior engineers (familiar with systems)
  • Engineering managers
  • Technical product managers
  • NOT: Junior engineers (in their first 6 months)

Rotation Schedule:

Week 1: Alice (IC Primary), Bob (IC Shadow)
Week 2: Bob (IC Primary), Carol (IC Shadow)
Week 3: Carol (IC Primary), David (IC Shadow)
...

Shadow IC: Observes, takes notes, learns—becomes primary next week

IC Checklist (First 15 Minutes)

  • [ ] Declare severity level
  • [ ] Create war room (#incident-YYYY-MM-DD-NNN)
  • [ ] Assign roles:
    • [ ] IC (you)
    • [ ] Tech Lead (debugging)
    • [ ] Comms Lead (customer updates)
    • [ ] Scribe (timeline notes)
  • [ ] Update status page
  • [ ] Page additional responders if needed
  • [ ] Set timer for next update (30 min for SEV1, 60 min for SEV2)

Part 3: The War Room Protocol

All incident coordination happens in ONE place: The war room channel

War Room Setup (Template)

Channel Name: #incident-2025-01-18-001 (date + sequence)

Channel Topic:

SEV2 | API Errors | IC: @alice | Started: 10:15 AM | Customers: ~30%

Pinned Message (Living Document):

## INCIDENT SUMMARY

**Severity**: SEV2
**Status**: INVESTIGATING
**Started**: 2025-01-18 10:15 AM UTC
**Service**: API (orders endpoint)
**Impact**: 30% of order requests failing with 500 errors

## ROLES
- IC: @alice
- Tech Lead: @bob (debugging)
- Comms: @carol (status page, support)
- Scribe: @david (timeline)

## TIMELINE
10:15 - Alert triggered (error rate above 10%)
10:17 - IC declared SEV2
10:20 - War room created
10:25 - Initial investigation: DB connection pool exhausted
10:35 - Scaled DB connections from 100 → 200
10:40 - Error rate dropped to 2%
10:45 - RESOLVED

## CUSTOMER IMPACT
- 2,450 failed order attempts
- Peak error rate: 35%
- Duration: 30 minutes

## NEXT STEPS
- [ ] Post-mortem scheduled: 2025-01-19 2 PM
- [ ] Root cause: Connection pool sizing
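
The channel creation, topic, and pinned summary stub can also be scripted. A sketch using the Slack Web API via slack_sdk, assuming a bot token with channel-management, chat, and pins scopes; the channel name follows the convention above and the pinned stub is just a starting point for the Scribe to keep updated:

import os
from datetime import date
from slack_sdk import WebClient  # pip install slack_sdk

# Assumes SLACK_BOT_TOKEN has scopes to create channels, post, and pin.
client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_war_room(sequence: int, severity: str, summary: str, ic: str) -> str:
    """Create the war room channel, set its topic, and pin a summary stub."""
    name = f"incident-{date.today().isoformat()}-{sequence:03d}"
    channel = client.conversations_create(name=name)["channel"]["id"]
    client.conversations_setTopic(
        channel=channel,
        topic=f"{severity} | {summary} | IC: {ic}",
    )
    stub = client.chat_postMessage(
        channel=channel,
        text=f"INCIDENT SUMMARY\nSeverity: {severity}\nStatus: INVESTIGATING\nIC: {ic}",
    )
    client.pins_add(channel=channel, timestamp=stub["ts"])
    return channel

# open_war_room(1, "SEV2", "API Errors", "@alice")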

Communication Rules

DO:

  • ✅ All updates in war room thread
  • ✅ Tag people for specific asks
  • ✅ Use threaded replies (keeps main channel clean)

DON'T:

  • ❌ DM the IC (they won't see it)
  • ❌ Discuss in multiple channels
  • ❌ Debate solutions (try it or defer to post-mortem)

Part 4: Customer Communication Templates

Status Page Update Template

Initial Update (Within 5 minutes of SEV1/SEV2):

🔴 Investigating - API Errors

We are investigating an increase in API errors affecting order processing.
Some customers may experience failed transactions.

Our team is actively working on a resolution.

Started: 10:15 AM UTC
Next update: 10:45 AM UTC

Progress Update:

🟡 Identified - API Errors

We've identified the root cause as a database connection issue and are
implementing a fix. Error rates have decreased from 35% to 10%.

Current impact: Orders may still fail intermittently.

Next update: 11:15 AM UTC or when resolved.

Resolution Update:

🟢 Resolved - API Errors

The incident has been resolved. All services are operating normally.

Summary:
- Duration: 30 minutes (10:15 AM - 10:45 AM UTC)
- Cause: Database connection pool exhausted during traffic spike
- Resolution: Increased connection pool capacity
- Impact: ~2,450 failed order attempts

We apologize for the disruption and have implemented additional monitoring
to prevent recurrence.
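
Posting these updates can be scripted so the 5-minute target holds even mid-incident. A sketch assuming Atlassian Statuspage; the endpoint and payload shape follow its public v1 incidents API, but verify both against your own provider's documentation:

import os
import requests

PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]
API_KEY = os.environ["STATUSPAGE_API_KEY"]

def post_incident_update(name: str, status: str, body: str) -> None:
    """Open a status page incident (status: investigating, identified,
    monitoring, or resolved)."""
    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={"incident": {"name": name, "status": status, "body": body}},
        timeout=10,
    )
    resp.raise_for_status()

post_incident_update(
    "API Errors",
    "investigating",
    "We are investigating an increase in API errors affecting order processing.",
)

Note that this sketch only opens the incident; progress and resolution updates are separate calls against the created incident's ID in the same API.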

Support Team Template

Subject: INCIDENT ALERT - SEV2 - API Errors

Priority: HIGH
Status: INVESTIGATING

**What customers will see:**
- Failed order submissions
- "Something went wrong" error message
- Orders not appearing in dashboard

**What to tell customers:**
- "We're aware of an issue affecting order processing and are working on it"
- "No data has been lost - your order will complete once resolved"
- ETA: Within 1 hour

**Do NOT say:**
- Database is down (too technical)
- We don't know what's wrong (creates panic)

**Escalation:**
If customer is enterprise or threatening to churn → Alert @sales

**Updates:**
War room: #incident-2025-01-18-001
Next update: 11:00 AM

Part 5: The Decision Framework

Incidents require fast decisions. Use this framework:

Rollback vs Fix Forward

Rollback IF:

  • ✅ Deploy was in last 2 hours
  • ✅ Previous version was stable
  • ✅ Rollback takes under 5 minutes
  • ✅ You're not 100% sure of the fix

Fix Forward IF:

  • ✅ Root cause is clear and simple
  • ✅ Fix can be deployed in under 10 minutes
  • ✅ Rollback would cause data loss
  • ✅ Issue existed before last deploy

Example:

  • Bug introduced in deploy 30 minutes ago → Rollback
  • Database migration failing → Fix forward (can't undo migration)
  • Config change caused issue → Rollback to previous config
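
The heuristic is easy to encode as a quick checklist the IC can run through. A sketch with illustrative thresholds lifted from the lists above (the 2-hour deploy window and 5-minute rollback):

def should_rollback(minutes_since_deploy: float,
                    previous_version_stable: bool,
                    rollback_minutes: float,
                    fix_is_certain: bool,
                    rollback_loses_data: bool,
                    issue_predates_deploy: bool) -> bool:
    """Return True to roll back, False to fix forward.

    Thresholds mirror the checklist above; tune them to your release process.
    """
    if rollback_loses_data or issue_predates_deploy:
        return False  # fix forward
    return (minutes_since_deploy <= 120
            and previous_version_stable
            and rollback_minutes <= 5
            and not fix_is_certain)

# Bug introduced by a deploy 30 minutes ago, stable previous version:
print(should_rollback(30, True, 3, False, False, False))  # True -> roll back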

Scale Up vs Debug

Scale Up First IF:

  • Traffic spike overwhelming system
  • Error rate above 20%
  • Scaling takes under 2 minutes (horizontal pod autoscaling)

Debug First IF:

  • Error rate under 10%
  • Scaling won't help (logic bug)
  • Already at max capacity

Parallel Approach (SEV1):

  • IC: "Bob, start scaling up API pods"
  • IC: "Alice, debug root cause in parallel"
  • IC: "Whoever finishes first, we go with that solution"

Escalation Decision Points

Page the CTO if:

  • SEV1 and not resolved in 30 minutes
  • Security breach or data loss
  • Media/PR implications
  • Customer demanding executive attention

Page CEO if:

  • Multi-hour SEV1 outage
  • Security breach with data exfiltration
  • Legal/regulatory notification required

Part 6: The Blameless Post-Mortem

When: Within 48 hours of resolution (while memory is fresh)

Who Attends:

  • IC
  • All incident responders
  • Engineering leads
  • Product/design if customer-facing

Duration: 60 minutes max

The 5-Why Root Cause Analysis

Example Incident: API errors due to DB connection exhaustion

Why did the API fail?
→ Database connection pool was exhausted

Why was the connection pool exhausted?
→ Traffic spiked 3x normal levels

Why didn't the connection pool scale?
→ It was hard-coded to 100 connections

Why was it hard-coded?
→ Original configuration from 2 years ago, never updated

Why wasn't it updated?
→ No monitoring alert for connection pool saturation

ROOT CAUSE: Missing monitoring + static configuration

Post-Mortem Template

# Post-Mortem: API Errors - 2025-01-18

**Severity**: SEV2
**Duration**: 30 minutes (10:15 AM - 10:45 AM UTC)
**Impact**: 2,450 failed orders, ~$12K revenue loss

## Summary

A traffic spike exhausted our database connection pool, causing 35% of
API requests to fail. The incident was resolved by increasing the pool
size; recurrence is being prevented by moving to dynamic pool scaling.

## Timeline

- 10:15 - Monitoring alert: API error rate above 10%
- 10:17 - IC declared SEV2, created war room
- 10:20 - Status page updated
- 10:25 - Root cause identified: DB connection pool
- 10:35 - Scaled pool from 100 → 200 connections
- 10:40 - Error rate dropped to normal (under 2%)
- 10:45 - Incident resolved

## Root Cause (5 Whys)

[See above example]

## What Went Well

- ✅ Fast detection (2 minutes from spike to alert)
- ✅ Clear IC leadership
- ✅ Good war room discipline
- ✅ Quick fix once root cause identified

## What Went Wrong

- ❌ No proactive connection pool monitoring
- ❌ Static configuration not reviewed in 2 years
- ❌ Delay in initial status page update (10 minutes)

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool monitoring | @bob | 2025-01-20 | P0 |
| Implement dynamic pool scaling | @alice | 2025-01-25 | P0 |
| Audit all static configs | @team | 2025-02-01 | P1 |
| Automate status page updates | @carol | 2025-01-22 | P1 |

## Lessons Learned

1. **Monitoring gap**: We monitor error rates but not resource saturation
2. **Configuration debt**: Static configs become stale over time
3. **Communication**: War room worked well, but status page was slow

## Prevention

- Dynamic connection pooling based on traffic
- Weekly config review process
- Added to incident response checklist: Update status page within 5 min

Key: Focus on systems, not people. Say "the configuration was static," not "Bob forgot to update the config."


Incident Response Metrics

Track these to improve over time:

Response Metrics

  • MTTD (Mean Time to Detect): Issue onset to alert fired
  • MTTA (Mean Time to Acknowledge): Alert fired to human acknowledgment (IC assigned)
  • MTTI (Mean Time to Investigate): Acknowledgment to root cause identified
  • MTTR (Mean Time to Resolve): Root cause identified to full resolution

Targets:

  • MTTD: Under 5 minutes (automated alerting)
  • MTTA: Under 10 minutes (on-call response)
  • MTTI: Under 30 minutes for SEV1, under 2 hours for SEV2
  • MTTR: Under 1 hour for SEV1, under 4 hours for SEV2
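
All four metrics fall straight out of the war room timeline if the Scribe records timestamps consistently. A sketch that computes them from per-incident records (field names are illustrative; adapt them to whatever your incident tracker exports):

from datetime import datetime
from statistics import mean

# One record per incident; timestamps match the metric boundaries above.
incidents = [
    {
        "started":      datetime(2025, 1, 18, 10, 13),
        "alerted":      datetime(2025, 1, 18, 10, 15),
        "acknowledged": datetime(2025, 1, 18, 10, 17),
        "root_cause":   datetime(2025, 1, 18, 10, 25),
        "resolved":     datetime(2025, 1, 18, 10, 45),
    },
]

def mean_minutes(start_field: str, end_field: str) -> float:
    """Average gap between two timestamps across all incidents, in minutes."""
    return mean(
        (i[end_field] - i[start_field]).total_seconds() / 60 for i in incidents
    )

print(f"MTTD: {mean_minutes('started', 'alerted'):.1f} min")
print(f"MTTA: {mean_minutes('alerted', 'acknowledged'):.1f} min")
print(f"MTTI: {mean_minutes('acknowledged', 'root_cause'):.1f} min")
print(f"MTTR: {mean_minutes('root_cause', 'resolved'):.1f} min")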

Quality Metrics

  • Repeat Incidents: Same root cause within 90 days (under 10%)
  • Action Item Completion: Post-mortem tasks done on time (over 90%)
  • False Positive Rate: Alerts that aren't real incidents (under 20%)

On-Call Best Practices

Rotation Structure

Primary + Secondary:

  • Primary on-call: First responder (pages immediately)
  • Secondary on-call: Backup if primary doesn't ack in 5 minutes
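
The primary-then-secondary escalation is just a timed loop. A sketch with placeholder paging and acknowledgment hooks; nothing here is tied to a specific vendor, so swap in your paging provider's real API calls:

import time

def send_page(person: str, message: str) -> None:
    """Placeholder: call your paging provider here."""
    print(f"Paging {person}: {message}")

def acked(person: str) -> bool:
    """Placeholder: poll your paging provider for an acknowledgment."""
    return False

def page_with_escalation(primary: str, secondary: str, message: str,
                         ack_timeout_minutes: int = 5) -> None:
    """Page the primary on-call; escalate to the secondary if no ack in time."""
    send_page(primary, message)
    deadline = time.time() + ack_timeout_minutes * 60
    while time.time() < deadline:
        if acked(primary):
            return
        time.sleep(30)  # poll every 30 seconds
    send_page(secondary, f"ESCALATED (primary did not ack): {message}")

# page_with_escalation("@alice", "@bob", "SEV2: API error rate above 10%")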

Duration: 1 week shifts (Monday 9 AM to Monday 9 AM)

Compensation:

  • $X per on-call shift (even if no incidents)
  • $Y per hour during incident response
  • Or: Time-in-lieu (1.5x hours worked)

On-Call Handoff Template

Hey @next-person, you're on-call starting Monday 9 AM.

**Recent Incidents**:
- 2025-01-18: SEV2 API errors (resolved, post-mortem done)
- 2025-01-15: SEV3 slow dashboard (still investigating)

**Ongoing Issues**:
- Database replica lag higher than normal (monitoring)
- Memory leak in worker service (fix deploying Friday)

**Deploy Schedule**:
- Tuesday 2 PM: API v2.1.0
- Thursday 10 AM: Database migration

**Escalation Contacts**:
- IC: me (until Monday 9 AM), then you
- CTO: @cto (SEV1 over 30min or security)
- Database: @dba (if DB-related)

**Good luck! Last week was quiet, hopefully this one is too.**

Download: Incident Response Pack

[Coming soon: Complete template bundle]

Includes:

  • Severity classification matrix
  • IC runbook (checklist format)
  • War room setup script
  • Customer communication templates
  • Post-mortem template
  • On-call handoff template
  • Action item tracker

Remember: The goal isn't zero incidents—it's fast resolution and continuous improvement. Every incident is a learning opportunity. Document, learn, improve.