The Incident Response Playbook: From Detection to Post-Mortem
A battle-tested framework for handling production incidents—from the first alert to the blameless post-mortem. Includes severity classification, escalation playbooks, communication templates, and lessons from real outages.
When Production Breaks at 2 AM
Your phone vibrates. PagerDuty alert: "API Error Rate above 10%". Your first thought: "How bad is this?" Your second: "Who do I wake up?"
By the time you assess severity, identify the root cause, and coordinate a fix, an hour has passed. Customers are angry. The sales team is panicking. The CEO is asking questions you can't answer yet.
You need a playbook—not just for fixing incidents, but for managing the chaos that comes with them.
Why Most Incident Response Fails
The technical fix is usually straightforward: roll back the deploy, restart the service, scale up capacity. The hard part is everything else:
- Severity assessment takes 20 minutes because there's no clear definition
- Escalation is ad-hoc ("Should I wake the CTO?")
- Communication is fragmented (Slack, email, PagerDuty, calls)
- Customer updates are delayed or confusing
- Post-mortems never happen or turn into blame sessions
Without a framework, every incident becomes a crisis.
The Complete Incident Response Framework
Part 1: Severity Classification (The Foundation)
Every incident needs a severity level within 5 minutes of detection. This drives everything else: who gets paged, how fast you respond, and whether you wake the CEO.
The Severity Matrix
SEV1: Critical
- Impact: Complete outage or data loss
- Customers Affected: Over 50% or all enterprise customers
- Revenue Impact: Direct revenue loss
- Response Time: Immediate
- Escalation: Page on-call + EM + CTO
- Communication: Every 30 minutes
Examples:
- Login is down (nobody can access the product)
- Payment processing failures
- Data breach or security incident
- Complete service outage
SEV2: High
- Impact: Major feature broken or severe degradation
- Customers Affected: 10-50% or multiple key accounts
- Revenue Impact: Indirect (churn risk)
- Response Time: 15 minutes
- Escalation: Page on-call + EM
- Communication: Hourly updates
Examples:
- Search is down but core product works
- API latency 10x normal
- Email delivery delayed 2+ hours
- Database read replica failed (redundancy lost)
SEV3: Medium
- Impact: Minor feature broken or performance issue
- Customers Affected: Under 10% or a specific segment
- Revenue Impact: None immediate
- Response Time: 1 hour
- Escalation: Notify on-call (no page)
- Communication: Initial + resolution updates only
Examples:
- Export to PDF broken
- Dashboard loading slowly
- Non-critical API endpoints erroring
- Scheduled job delayed
SEV4: Low
- Impact: Cosmetic or minimal functionality
- Customers Affected: Few or none
- Revenue Impact: None
- Response Time: Next business day
- Escalation: Create ticket
- Communication: None (internal only)
Examples:
- Typo on landing page
- Broken link in documentation
- Internal dashboard bug
- Non-customer-facing error
Quick Severity Decision Tree
Is the product completely unusable?
  YES -> SEV1
  NO -> Can customers complete their core workflow?
    NO -> SEV2
    YES -> Are more than 10% of customers affected?
      YES -> SEV3
      NO -> SEV4
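If you declare severity through a chat bot or a small CLI, the same tree can live in code so whoever is on call gets a consistent answer under stress. Here is a minimal sketch in Python; the function simply mirrors the tree above, and the thresholds would need adjusting to your own matrix.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "High"
    SEV3 = "Medium"
    SEV4 = "Low"


def classify(product_unusable: bool,
             core_workflow_works: bool,
             pct_customers_affected: float) -> Severity:
    """Mirror of the quick decision tree above. pct_customers_affected is 0-100."""
    if product_unusable:
        return Severity.SEV1
    if not core_workflow_works:
        return Severity.SEV2
    if pct_customers_affected > 10:
        return Severity.SEV3
    return Severity.SEV4


# Example: search is broken for ~30% of customers, but the core workflow still works.
print(classify(product_unusable=False,
               core_workflow_works=True,
               pct_customers_affected=30))  # Severity.SEV3
```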
Template for Slack:
⚠️ INCIDENT DECLARED - SEV2
Service: API
Impact: 500 errors on /api/orders
Customers: ~30% seeing errors
IC: @alice
War Room: #incident-2025-01-18-001
Status Page: Updated
Next Update: 11:30 AM
Part 2: The Incident Commander System
Key Principle: One person owns the incident, even if they're not the most technical.
Incident Commander (IC) Responsibilities
NOT:
- ❌ Fixing the incident (that's the engineer's job)
- ❌ Finding root cause (happens later)
YES:
- ✅ Coordinating response
- ✅ Making executive decisions (rollback now vs investigate)
- ✅ Managing communication
- ✅ Timekeeping and documentation
The IC Rotation
Who Can Be IC:
- Senior engineers (familiar with systems)
- Engineering managers
- Technical product managers
- NOT: Junior engineers (in first 6 months)
Rotation Schedule:
Week 1: Alice (IC Primary), Bob (IC Shadow)
Week 2: Bob (IC Primary), Carol (IC Shadow)
Week 3: Carol (IC Primary), David (IC Shadow)
...
Shadow IC: Observes, takes notes, learns—becomes primary next week
IC Checklist (First 15 Minutes)
- [ ] Declare severity level
- [ ] Create war room (#incident-YYYY-MM-DD-NNN)
- [ ] Assign roles:
- [ ] IC (you)
- [ ] Tech Lead (debugging)
- [ ] Comms Lead (customer updates)
- [ ] Scribe (timeline notes)
- [ ] Update status page
- [ ] Page additional responders if needed
- [ ] Set timer for next update (30 min for SEV1, 60 min for SEV2)
Part 3: The War Room Protocol
All incident coordination happens in ONE place: the war room channel.
War Room Setup (Template)
Channel Name: #incident-2025-01-18-001 (date + sequence)
Channel Topic:
SEV2 | API Errors | IC: @alice | Started: 10:15 AM | Customers: ~30%
Pinned Message (Living Document):
## INCIDENT SUMMARY
**Severity**: SEV2
**Status**: INVESTIGATING
**Started**: 2025-01-18 10:15 AM UTC
**Service**: API (orders endpoint)
**Impact**: 30% of order requests failing with 500 errors
## ROLES
- IC: @alice
- Tech Lead: @bob (debugging)
- Comms: @carol (status page, support)
- Scribe: @david (timeline)
## TIMELINE
10:15 - Alert triggered (error rate above 10%)
10:17 - IC declared SEV2
10:20 - War room created
10:25 - Initial investigation: DB connection pool exhausted
10:35 - Scaled DB connections from 100 → 200
10:40 - Error rate dropped to 2%
10:45 - RESOLVED
## CUSTOMER IMPACT
- 2,450 failed order attempts
- Peak error rate: 35%
- Duration: 30 minutes
## NEXT STEPS
- [ ] Post-mortem scheduled: 2025-01-19 2 PM
- [ ] Root cause: Connection pool sizing
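Creating the channel, setting the topic, and pinning the summary are mechanical steps, so they are worth scripting. A rough sketch using the slack_sdk Python package (an assumption; any chat platform's API works the same way), with a bot token that has the channel-create, topic, and pin scopes:

```python
import datetime

from slack_sdk import WebClient  # assumes `pip install slack_sdk` and a bot token

SUMMARY_TEMPLATE = """\
## INCIDENT SUMMARY
**Severity**: {severity}
**Status**: INVESTIGATING
**Started**: {started} UTC
**Service**: {service}
**Impact**: {impact}
"""


def open_war_room(client: WebClient, severity: str, service: str,
                  impact: str, ic: str, sequence: int) -> str:
    """Create the #incident-YYYY-MM-DD-NNN channel, set its topic,
    and pin the living incident summary. Returns the channel ID."""
    now = datetime.datetime.now(datetime.timezone.utc)
    name = f"incident-{now:%Y-%m-%d}-{sequence:03d}"

    channel = client.conversations_create(name=name)["channel"]["id"]
    client.conversations_setTopic(
        channel=channel,
        topic=f"{severity} | {service} | IC: {ic} | Started: {now:%H:%M} UTC",
    )
    summary = client.chat_postMessage(
        channel=channel,
        text=SUMMARY_TEMPLATE.format(
            severity=severity, started=f"{now:%Y-%m-%d %H:%M}",
            service=service, impact=impact,
        ),
    )
    client.pins_add(channel=channel, timestamp=summary["ts"])
    return channel


# Usage (token and values are illustrative):
# client = WebClient(token="xoxb-...")
# open_war_room(client, "SEV2", "API (orders endpoint)",
#               "30% of order requests failing with 500 errors", "@alice", 1)
```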
Communication Rules
DO:
- ✅ All updates in war room thread
- ✅ Tag people for specific asks
- ✅ Use threaded replies (keeps main channel clean)
DON'T:
- ❌ DM the IC (they won't see it)
- ❌ Discuss in multiple channels
- ❌ Debate solutions (try it or defer to post-mortem)
Part 4: Customer Communication Templates
Status Page Update Template
Initial Update (Within 5 minutes of SEV1/SEV2):
🔴 Investigating - API Errors
We are investigating an increase in API errors affecting order processing.
Some customers may experience failed transactions.
Our team is actively working on a resolution.
Started: 10:15 AM UTC
Next update: 10:45 AM UTC
Progress Update:
🟡 Identified - API Errors
We've identified the root cause as a database connection issue and are
implementing a fix. Error rates have decreased from 35% to 10%.
Current impact: Orders may still fail intermittently.
Next update: 11:15 AM UTC or when resolved.
Resolution Update:
🟢 Resolved - API Errors
The incident has been resolved. All services are operating normally.
Summary:
- Duration: 30 minutes (10:15 AM - 10:45 AM UTC)
- Cause: Database connection pool exhausted during traffic spike
- Resolution: Increased connection pool capacity
- Impact: ~2,450 failed order attempts
We apologize for the disruption and have implemented additional monitoring
to prevent recurrence.
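If you run a hosted status page, the comms lead can post the initial update with one call instead of clicking through a dashboard. A sketch against Atlassian Statuspage's incidents endpoint; the page ID and API key are placeholders, the field names should be verified against the current Statuspage docs, and other providers have equivalent APIs.

```python
import requests  # assumes `pip install requests`

STATUSPAGE_API = "https://api.statuspage.io/v1"
PAGE_ID = "your-page-id"  # placeholder
API_KEY = "your-api-key"  # placeholder


def post_incident_update(name: str, status: str, body: str) -> dict:
    """Create a status page incident. `status` is one of:
    investigating, identified, monitoring, resolved."""
    resp = requests.post(
        f"{STATUSPAGE_API}/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={"incident": {"name": name, "status": status, "body": body}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


# Initial SEV2 update, mirroring the template above (needs real credentials):
post_incident_update(
    name="API Errors",
    status="investigating",
    body="We are investigating an increase in API errors affecting order "
         "processing. Some customers may experience failed transactions. "
         "Next update: 10:45 AM UTC.",
)
```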
Support Team Template
Subject: INCIDENT ALERT - SEV2 - API Errors
Priority: HIGH
Status: INVESTIGATING
**What customers will see:**
- Failed order submissions
- "Something went wrong" error message
- Orders not appearing in dashboard
**What to tell customers:**
- "We're aware of an issue affecting order processing and are working on it"
- "No data has been lost - your order will complete once resolved"
- ETA: Within 1 hour
**Do NOT say:**
- Database is down (too technical)
- We don't know what's wrong (creates panic)
**Escalation:**
If customer is enterprise or threatening to churn → Alert @sales
**Updates:**
War room: #incident-2025-01-18-001
Next update: 11:00 AM
Part 5: The Decision Framework
Incidents require fast decisions. Use this framework:
Rollback vs Fix Forward
Rollback IF:
- ✅ Deploy was in last 2 hours
- ✅ Previous version was stable
- ✅ Rollback takes under 5 minutes
- ✅ You're not 100% sure of the fix
Fix Forward IF:
- ✅ Root cause is clear and simple
- ✅ Fix can be deployed in under 10 minutes
- ✅ Rollback would cause data loss
- ✅ Issue existed before last deploy
Example:
- Bug introduced in deploy 30 minutes ago → Rollback
- Database migration failing → Fix forward (can't undo migration)
- Config change caused issue → Rollback to previous config
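Whichever way the decision goes, the rollback path should be one rehearsed command, not something worked out live at 2 AM. A sketch assuming the service runs on Kubernetes and the on-call machine has kubectl access (both assumptions; substitute your deploy tool's equivalent):

```python
import subprocess


def rollback(deployment: str, namespace: str = "production") -> None:
    """Roll a deployment back to its previous revision and wait for it to settle."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    # Block until the rollout completes so the IC knows when to re-check error rates.
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )


# Example: bug introduced in the API deploy 30 minutes ago.
rollback("api")
```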
Scale Up vs Debug
Scale Up First IF:
- Traffic spike overwhelming system
- Error rate above 20%
- Scaling takes under 2 minutes (horizontal pod autoscaling)
Debug First IF:
- Error rate under 10%
- Scaling won't help (logic bug)
- Already at max capacity
Parallel Approach (SEV1):
- IC: "Bob, start scaling up API pods"
- IC: "Alice, debug root cause in parallel"
- IC: "Whoever finishes first, we go with that solution"
Escalation Decision Points
Page the CTO if:
- SEV1 and not resolved in 30 minutes
- Security breach or data loss
- Media/PR implications
- Customer demanding executive attention
Page CEO if:
- Multi-hour SEV1 outage
- Security breach with data exfiltration
- Legal/regulatory notification required
Part 6: The Blameless Post-Mortem
When: Within 48 hours of resolution (while memory is fresh)
Who Attends:
- IC
- All incident responders
- Engineering leads
- Product/design if customer-facing
Duration: 60 minutes max
The 5-Why Root Cause Analysis
Example Incident: API errors due to DB connection exhaustion
Why did the API fail?
→ Database connection pool was exhausted
Why was the connection pool exhausted?
→ Traffic spiked 3x normal levels
Why didn't the connection pool scale?
→ It was hard-coded to 100 connections
Why was it hard-coded?
→ Original configuration from 2 years ago, never updated
Why wasn't it updated?
→ No monitoring alert for connection pool saturation
ROOT CAUSE: Missing monitoring + static configuration
Post-Mortem Template
# Post-Mortem: API Errors - 2025-01-18
**Severity**: SEV2
**Duration**: 30 minutes (10:15 AM - 10:45 AM UTC)
**Impact**: 2,450 failed orders, ~$12K revenue loss
## Summary
A traffic spike exhausted our database connection pool, causing up to 35% of
API requests to fail. The incident was resolved by increasing the pool size;
recurrence is being prevented by moving to dynamic pool scaling.
## Timeline
- 10:15 - Monitoring alert: API error rate above 10%
- 10:17 - IC declared SEV2, created war room
- 10:20 - Status page updated
- 10:25 - Root cause identified: DB connection pool
- 10:35 - Scaled pool from 100 → 200 connections
- 10:40 - Error rate dropped to normal (under 2%)
- 10:45 - Incident resolved
## Root Cause (5 Whys)
[See above example]
## What Went Well
- ✅ Fast detection (2 minutes from spike to alert)
- ✅ Clear IC leadership
- ✅ Good war room discipline
- ✅ Quick fix once root cause identified
## What Went Wrong
- ❌ No proactive connection pool monitoring
- ❌ Static configuration not reviewed in 2 years
- ❌ Delay in initial status page update (10 minutes)
## Action Items
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add connection pool monitoring | @bob | 2025-01-20 | P0 |
| Implement dynamic pool scaling | @alice | 2025-01-25 | P0 |
| Audit all static configs | @team | 2025-02-01 | P1 |
| Automate status page updates | @carol | 2025-01-22 | P1 |
## Lessons Learned
1. **Monitoring gap**: We monitor error rates but not resource saturation
2. **Configuration debt**: Static configs become stale over time
3. **Communication**: War room worked well, but status page was slow
## Prevention
- Dynamic connection pooling based on traffic
- Weekly config review process
- Added to incident response checklist: Update status page within 5 min
Key: Focus on systems, not people. "The configuration was static," not "Bob forgot to update the config."
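To make the "dynamic connection pooling" and stale-config lessons concrete: pool limits should come from configuration that can change without a code deploy, and saturation should be observable. A sketch assuming a SQLAlchemy-based service (hypothetical for this stack), with environment-driven pool sizing:

```python
import os

from sqlalchemy import create_engine  # assumes `pip install sqlalchemy`

# Pool limits come from the environment so they can be changed (and reviewed)
# without editing code -- unlike the hard-coded 100 connections in the incident.
engine = create_engine(
    os.environ["DATABASE_URL"],
    pool_size=int(os.environ.get("DB_POOL_SIZE", "20")),             # steady-state connections
    max_overflow=int(os.environ.get("DB_POOL_MAX_OVERFLOW", "30")),  # burst headroom
    pool_timeout=5,      # fail fast when saturated instead of queueing forever
    pool_pre_ping=True,  # drop dead connections before handing them out
)

# For the monitoring action item: export these as gauges so pool saturation
# alerts before error rates do (metric plumbing depends on your stack).
checked_out = engine.pool.checkedout()
pool_capacity = engine.pool.size()
```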
Incident Response Metrics
Track these to improve over time:
Response Metrics
- MTTD (Mean Time to Detect): Issue start to alert firing
- MTTA (Mean Time to Acknowledge): Alert to on-call acknowledgment (IC assigned)
- MTTI (Mean Time to Investigate): Acknowledgment to root cause identified
- MTTR (Mean Time to Resolve): Detection to full resolution
Targets:
- MTTD: Under 5 minutes (automated alerting)
- MTTA: Under 10 minutes (on-call response)
- MTTI: Under 30 minutes for SEV1, under 2 hours for SEV2
- MTTR: Under 1 hour for SEV1, under 4 hours for SEV2
Quality Metrics
- Repeat Incidents: Same root cause recurring within 90 days (target: under 10%)
- Action Item Completion: Post-mortem tasks done on time (target: over 90%)
- False Positive Rate: Alerts that aren't real incidents (target: under 20%)
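All four response metrics fall out of timestamps you should already be capturing in the war room timeline. A sketch, assuming incident records exported (from PagerDuty, a spreadsheet, wherever) into objects like these; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime           # when the issue actually began
    detected: datetime          # alert fired
    acknowledged: datetime      # on-call acked / IC assigned
    root_cause_found: datetime  # investigation complete
    resolved: datetime          # full resolution


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def response_metrics(incidents: list[Incident]) -> dict[str, float]:
    """Mean response times in minutes, matching the definitions above."""
    return {
        "MTTD": mean(_minutes(i.started, i.detected) for i in incidents),
        "MTTA": mean(_minutes(i.detected, i.acknowledged) for i in incidents),
        "MTTI": mean(_minutes(i.acknowledged, i.root_cause_found) for i in incidents),
        "MTTR": mean(_minutes(i.detected, i.resolved) for i in incidents),
    }


# The example incident from Part 3 (spike began ~2 minutes before the alert):
api_errors = Incident(
    started=datetime(2025, 1, 18, 10, 13),
    detected=datetime(2025, 1, 18, 10, 15),
    acknowledged=datetime(2025, 1, 18, 10, 17),
    root_cause_found=datetime(2025, 1, 18, 10, 25),
    resolved=datetime(2025, 1, 18, 10, 45),
)
print(response_metrics([api_errors]))  # MTTD 2, MTTA 2, MTTI 8, MTTR 30 minutes
```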
On-Call Best Practices
Rotation Structure
Primary + Secondary:
- Primary on-call: First responder (pages immediately)
- Secondary on-call: Backup if primary doesn't ack in 5 minutes
Duration: 1 week shifts (Monday 9 AM to Monday 9 AM)
Compensation:
- $X per on-call shift (even if no incidents)
- $Y per hour during incident response
- Or: Time-in-lieu (1.5x hours worked)
On-Call Handoff Template
Hey @next-person, you're on-call starting Monday 9 AM.
**Recent Incidents**:
- 2025-01-18: SEV2 API errors (resolved, post-mortem done)
- 2025-01-15: SEV3 slow dashboard (still investigating)
**Ongoing Issues**:
- Database replica lag higher than normal (monitoring)
- Memory leak in worker service (fix deploying Friday)
**Deploy Schedule**:
- Tuesday 2 PM: API v2.1.0
- Thursday 10 AM: Database migration
**Escalation Contacts**:
- IC: me (until Monday 9 AM), then you
- CTO: @cto (SEV1 over 30min or security)
- Database: @dba (if DB-related)
**Good luck! Last week was quiet, hopefully this one is too.**
Download: Incident Response Pack
[Coming soon: Complete template bundle]
Includes:
- Severity classification matrix
- IC runbook (checklist format)
- War room setup script
- Customer communication templates
- Post-mortem template
- On-call handoff template
- Action item tracker
Remember: The goal isn't zero incidents—it's fast resolution and continuous improvement. Every incident is a learning opportunity. Document, learn, improve.