Mean Time to Recovery (MTTR)
Measure how quickly your team restores service after an incident. A key DORA metric that indicates your organization's resilience.
Overview
Mean Time to Recovery (MTTR), also called Mean Time to Restore, measures how long it takes to restore service after a production incident. It's one of the four key DORA metrics and a critical indicator of your organization's resilience and incident response capabilities.
Why It Matters
- Customer impact: Faster recovery means less downtime
- Revenue protection: Every minute of downtime costs money
- Team stress: Quick recovery reduces firefighting burnout
- Resilience indicator: Shows your ability to handle failures
- Competitive advantage: Reliable systems build customer trust
- Innovation enabler: Fast recovery enables bolder experiments
The MTTR Formula
MTTR = Total Recovery Time ÷ Number of Incidents
Example:
- Incident 1: 30 minutes
- Incident 2: 2 hours
- Incident 3: 45 minutes
- Total: 3 hours 15 minutes (195 minutes)
- MTTR: 195 ÷ 3 = 65 minutes
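The same calculation in code (durations in minutes; a trivial sketch of the formula above):
recovery_minutes = [30, 120, 45]          # incidents 1-3 from the example above
mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes")        # MTTR: 65 minutes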
Incident Lifecycle
Understanding what to measure:
Timeline:
Incident start → Detection → Acknowledgment → Response → Resolution → Verification
 (issue begins)    Alert        On-call       Investigation    Fix        Confirm
                   fires        responds         begins      deployed     working

|←──── MTTD ────→|
|←───────────────────────── MTTR (total incident duration) ──────────────────────→|
Key Timestamps:
- Incident Start: When issue begins (may be before detection)
- Detection: When monitoring/alerts fire (MTTD)
- Acknowledgment: When on-call engineer responds
- Resolution Start: When fix is deployed
- Incident End: When service is fully restored
Recommended Visualizations
1. Histogram (Distribution)
Best for: Understanding recovery time patterns
X-axis: MTTR buckets (0-15m, 15-30m, 30m-1h, 1-4h, 4h+)
Y-axis: Number of incidents
Insight: Reveals if most incidents are quick vs. some very long
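For example, pandas can bucket per-incident recovery times into those ranges (a sketch with sample data; in practice, feed it your own incident export):
import pandas as pd

mttr_minutes = pd.Series([12, 25, 48, 65, 190, 310])    # sample recovery times in minutes
buckets = pd.cut(
    mttr_minutes,
    bins=[0, 15, 30, 60, 240, float('inf')],
    labels=['0-15m', '15-30m', '30m-1h', '1-4h', '4h+'],
)
print(buckets.value_counts().sort_index())               # incident count per bucket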
2. Percentile Chart (P50, P75, P95)
Best for: Tracking consistency
Y-axis: MTTR (minutes)
X-axis: Time (weeks/months)
Lines: P50 (median), P75, P95
Target: P95 < 1 hour for elite performance
⏱️ MTTR Improvement Over Time
Sample data showing consistent MTTR improvement. Median recovery time improved from 85 minutes to 52 minutes. Green line shows Elite threshold (< 60 minutes).
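To compute the weekly percentile lines from raw incident data (a pandas sketch; it assumes the incidents DataFrame and mttr column from the DIY Approach section below):
import pandas as pd

# Weekly P50/P75/P95 of per-incident recovery time (mttr in minutes)
weekly = (
    incidents
    .groupby(pd.Grouper(key='started_at', freq='W'))['mttr']
    .quantile([0.50, 0.75, 0.95])
    .unstack()        # one column per percentile
)
print(weekly.tail())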
3. Gauge (Current Status)
Best for: Executive dashboards
Gauge ranges:
- Elite: < 1 hour (green)
- High: 1 hour - 1 day (blue)
- Medium: 1 day - 1 week (yellow)
- Low: > 1 week (red)
Display: Last 30 days MTTR
🎯 Current MTTR Performance
Mean Time to Recovery
Current MTTR of 42 minutes places the team in the Elite category (< 60 minutes). This indicates strong incident response capabilities.
4. Breakdown Chart (Components)
Best for: Identifying bottlenecks
Stacked bar showing:
- Detection time (MTTD)
- Response time (acknowledgment)
- Investigation time
- Fix implementation time
- Verification time
🔍 MTTR Component Breakdown
Investigation is the largest component at 15 minutes (36% of MTTR). Improving observability and runbooks can reduce this further.
Target Ranges (DORA Benchmarks)
| Performance Level | MTTR |
|-------------------|------|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Between one day and one week |
| Low | More than one week |
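Given these thresholds, a small helper can map a computed MTTR to its DORA performance tier (a minimal sketch; the function name and threshold constants are illustrative, not from a specific library):
def dora_mttr_tier(mttr_minutes: float) -> str:
    """Map an MTTR value (in minutes) to its DORA performance tier."""
    if mttr_minutes < 60:          # less than one hour
        return "Elite"
    if mttr_minutes < 24 * 60:     # less than one day
        return "High"
    if mttr_minutes < 7 * 24 * 60: # less than one week
        return "Medium"
    return "Low"                   # more than one week

print(dora_mttr_tier(42))          # "Elite"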
By Severity
| Severity | Target MTTR | Acceptable Max |
|----------|-------------|----------------|
| P0 (Critical) | < 30 minutes | < 1 hour |
| P1 (High) | < 2 hours | < 4 hours |
| P2 (Medium) | < 1 day | < 3 days |
| P3 (Low) | < 1 week | < 2 weeks |
By System
- Payment systems: < 15 minutes
- Core API: < 30 minutes
- User-facing features: < 1 hour
- Background jobs: < 4 hours
- Analytics: < 1 day
How to Improve MTTR
1. Improve Detection (Reduce MTTD)
Monitoring:
- Comprehensive metrics (RED: Rate, Errors, Duration)
- Real user monitoring (RUM)
- Synthetic monitoring for critical paths
- Log aggregation and analysis
Alerting:
- Smart alerts (reduce noise)
- Alert on symptoms, not causes
- Escalation policies
- Multiple notification channels
Example Alert (Prometheus-style alerting rule):
- alert: HighErrorRate
  expr: error_rate > 5        # assumes error_rate is exported as a percentage
  for: 2m
  labels:
    severity: P1
  annotations:
    description: "{{ $labels.service }} error rate at {{ $value }}%"
    runbook: https://wiki.company.com/runbooks/high-error-rate
2. Improve Response
On-Call:
- Clear on-call schedules
- Sufficient on-call rotation (avoid burnout)
- Fair on-call compensation
- Defined response SLAs (P0: 5 min, P1: 15 min, P2: 1 hour)
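Those SLAs can live in a small, version-controlled policy that paging scripts or bots read (an illustrative Python sketch; the timings and team handles are examples, not a specific tool's format):
# Illustrative response-SLA policy keyed by severity
RESPONSE_SLA = {
    'P0': {'ack_minutes': 5,  'escalate_after_minutes': 15,  'escalate_to': '@incident-commander'},
    'P1': {'ack_minutes': 15, 'escalate_after_minutes': 30,  'escalate_to': '@backend-lead'},
    'P2': {'ack_minutes': 60, 'escalate_after_minutes': 240, 'escalate_to': '@team-lead'},
}

def needs_escalation(severity: str, minutes_since_alert: float, acknowledged: bool) -> bool:
    """Escalate when an alert is still unacknowledged past its severity's window."""
    return not acknowledged and minutes_since_alert > RESPONSE_SLA[severity]['escalate_after_minutes']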
Runbooks:
# Runbook: High API Error Rate
## Symptoms
- Error rate > 5%
- Alert: "HighAPIErrorRate"
## Immediate Actions (< 5 min)
1. Check dashboard: https://...
2. Check recent deployments
3. Check dependency status
## Investigation (5-15 min)
1. Review error logs
2. Check database connection pool
3. Verify third-party services
## Resolution Options
1. If recent deployment → rollback
2. If database → increase pool size
3. If third-party → enable circuit breaker
## Escalation
If not resolved in 30 min, page @backend-lead
3. Improve Investigation
Tools:
- Centralized logging (ELK, Splunk, DataDog)
- Distributed tracing (Jaeger, Zipkin, DataDog APM)
- APM tools (New Relic, AppDynamics)
- Database query analyzers
Practices:
- Structured logging
- Correlation IDs
- Service mesh observability
- Feature flags for quick rollback
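As a sketch of the first two practices, structured (JSON) logs that carry a correlation ID make it possible to follow one request across services during an investigation (field names and values here are illustrative):
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('api')

def handle_request(incoming_correlation_id=None):
    # Reuse the caller's correlation ID if present, otherwise mint a new one
    correlation_id = incoming_correlation_id or str(uuid.uuid4())
    # Structured log line: easy to filter by correlation_id in your log aggregator
    logger.info(json.dumps({
        'event': 'payment_failed',
        'correlation_id': correlation_id,
        'service': 'checkout',
        'latency_ms': 412,
    }))
    return correlation_id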
4. Improve Fix Deployment
Fast Rollback:
# One-command rollback
./rollback.sh production

# Or automated, based on error rate (pseudocode):
# if error_rate stays above 5% for 5 minutes:
#     auto_rollback()
Fast Forward Fix:
- CI/CD pipeline optimized for hotfixes
- Skip non-critical checks in emergency mode
- Separate "hotfix" branch with fast-track deployment
5. Improve Verification
Smoke Tests:
# Automated post-deployment checks (check_* helpers are illustrative)
assert check_api_health()
assert check_error_rate() < 0.01           # error rate under 1%
assert check_latency() < baseline * 1.1    # latency within 10% of baseline
assert check_key_metrics()
Gradual Rollout:
- Deploy fix to canary first
- Monitor for 5 minutes
- Gradually increase traffic
- Full rollout only if metrics healthy
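A minimal sketch of that rollout loop (the traffic-shifting, health-check, and rollback helpers stand in for whatever your platform provides):
import time

TRAFFIC_STEPS = [1, 5, 25, 50, 100]      # percent of traffic on the fixed version

def gradual_rollout(set_traffic_percent, metrics_healthy, rollback):
    """Ramp traffic step by step; roll back on the first unhealthy check."""
    for percent in TRAFFIC_STEPS:
        set_traffic_percent(percent)
        time.sleep(5 * 60)               # monitor each step for 5 minutes
        if not metrics_healthy():        # e.g. error rate and latency vs. baseline
            rollback()
            return False
    return True                          # full rollout with healthy metrics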
Common Pitfalls
❌ Measuring Only Average
Problem: The average hides long-tail incidents
Solution: Track P50, P75, P95, and P99
❌ Starting Timer at Detection
Problem: Misses the customer impact that occurs before detection
Solution: Measure from incident start (when the issue began)
❌ Declaring Victory Too Early
Problem: Marking the incident resolved before verification
Solution: Include verification time in MTTR
❌ Not Segmenting by Severity
Problem: Slow, low-severity (P3) incidents skew the aggregate MTTR
Solution: Track MTTR separately by severity
❌ Optimizing for the Metric
Problem: Closing incidents quickly without a proper fix
Solution: Track reopen rate and customer satisfaction
❌ Hero Culture
Problem: Relying on specific individuals to resolve incidents
Solution: Document knowledge and distribute expertise
Implementation Guide
Week 1: Instrumentation
# Track incident timestamps
from datetime import datetime
from uuid import uuid4

def generate_id() -> str:
    return str(uuid4())

incident = {
    'id': generate_id(),
    'started_at': datetime.now(),   # when the issue began
    'detected_at': None,            # when the alert fired
    'acknowledged_at': None,        # when on-call responded
    'resolved_at': None,            # when the fix was deployed
    'verified_at': None,            # when confirmed working
    'severity': 'P1',
    'service': 'api',
    'root_cause': None,
}
Week 2: Baseline
- Track all incidents for 2 weeks
- Calculate current MTTR
- Break down by component (detection, response, fix, verify)
- Identify slowest incidents and why
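One way to get that component breakdown from the Week 1 timestamps (a sketch; it assumes every timestamp is filled in when the incident closes, and combines investigation and fix because the schema has no separate fix-start field):
def mttr_components(incident: dict) -> dict:
    """Minutes spent in each phase, derived from the Week 1 incident timestamps."""
    def minutes(start_key, end_key):
        return (incident[end_key] - incident[start_key]).total_seconds() / 60

    return {
        'detection': minutes('started_at', 'detected_at'),               # MTTD
        'acknowledgment': minutes('detected_at', 'acknowledged_at'),
        'investigation_and_fix': minutes('acknowledged_at', 'resolved_at'),
        'verification': minutes('resolved_at', 'verified_at'),
        'total_mttr': minutes('started_at', 'verified_at'),
    }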
Week 3: Quick Wins
- Create runbooks for top 5 incident types
- Set up one-command rollback
- Add alerting for missing monitors
- Document on-call procedures
Week 4: Long-Term Improvements
- Implement distributed tracing
- Build automated remediation for common issues
- Create incident dashboard
- Start blameless post-mortems
Dashboard Example
Executive View
┌──────────────────────────────────────────────┐
│  MTTR: 42 minutes                            │
│  ██████████████████░░░░░░░  Elite            │
│                                              │
│  Last 30 Days:                               │
│  • P50: 28 minutes                           │
│  • P95: 2.3 hours                            │
│  • Incidents: 15                             │
│                                              │
│  Trend: ↓ 15% improvement vs. last month     │
└──────────────────────────────────────────────┘
Operations View
Incident Breakdown (Last 30 Days)
───────────────────────────────────────────────────────
Severity    Count    MTTR         Longest      Shortest
───────────────────────────────────────────────────────
P0          2        18 min       25 min       12 min
P1          5        45 min       2.5 hours    15 min
P2          8        1.2 hours    4 hours      30 min
───────────────────────────────────────────────────────

Time Breakdown (Average)
───────────────────────────────────────────────────────
Detection:          8 minutes   (19%)
Acknowledgment:     3 minutes    (7%)
Investigation:     15 minutes   (36%)
Fix Deploy:        10 minutes   (24%)
Verification:       6 minutes   (14%)
───────────────────────────────────────────────────────
Total:             42 minutes
Related Metrics
The Four Horsemen of Reliability:
- MTTR: How fast you recover
- MTTD (Mean Time to Detection): How fast you notice
- MTTA (Mean Time to Acknowledgment): How fast you respond
- MTBF (Mean Time Between Failures): How often you fail
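All four can be derived from the same incident table used elsewhere on this page (a sketch; column names follow the implementation guide and DIY sections, and are assumed to be parsed as datetimes):
def mean_minutes(end, start):
    return ((end - start).dt.total_seconds() / 60).mean()

mttd = mean_minutes(incidents['detected_at'], incidents['started_at'])
mtta = mean_minutes(incidents['acknowledged_at'], incidents['detected_at'])
mttr = mean_minutes(incidents['verified_at'], incidents['started_at'])
# MTBF: average gap between consecutive incident starts, in hours
mtbf = incidents['started_at'].sort_values().diff().dt.total_seconds().div(3600).mean()

print(f"MTTD {mttd:.0f} min | MTTA {mtta:.0f} min | MTTR {mttr:.0f} min | MTBF {mtbf:.1f} h")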
Related DORA Metrics:
- Change Failure Rate: Are deployments causing incidents?
- Deployment Frequency: Does high velocity increase incidents?
- Lead Time: Can you deploy fixes quickly?
Business Metrics:
- System Uptime: Overall availability
- Customer Impact: How many users affected?
- Revenue Impact: Money lost during downtime
Tools & Integrations
Incident Management
- PagerDuty: Incident response platform
- Opsgenie: On-call and alert management
- Incident.io: Modern incident management
- FireHydrant: Incident response orchestration
- Splunk On-Call (formerly VictorOps): Incident management
Monitoring & Observability
- DataDog: Full-stack monitoring
- New Relic: Application performance
- Prometheus + Grafana: Open-source monitoring
- Sentry: Error tracking
- Honeycomb: Observability platform
Communication
- Slack: Incident channels
- MS Teams: Incident response
- Zoom: Incident war rooms
- StatusPage: Customer communication
DIY Approach
import pandas as pd

# Load incidents from your incident export; parse timestamp columns as datetimes
incidents = pd.read_csv('incidents.csv', parse_dates=['started_at', 'resolved_at'])

# Calculate MTTR in minutes for each incident
incidents['mttr'] = (incidents['resolved_at'] - incidents['started_at']).dt.total_seconds() / 60

# Overall MTTR
print(f"MTTR: {incidents['mttr'].mean():.1f} minutes")

# By severity
print("\nMTTR by Severity:")
print(incidents.groupby('severity')['mttr'].agg(['mean', 'median', 'max']))

# Percentiles
print(f"\nP50: {incidents['mttr'].quantile(0.5):.1f} min")
print(f"P95: {incidents['mttr'].quantile(0.95):.1f} min")
Questions to Ask
For Leadership
- Are we responding to incidents fast enough?
- Do we have adequate on-call coverage?
- Are certain teams or services outliers?
- Do we need better tooling or training?
For Teams
- What slows down our incident response?
- Do we have good runbooks?
- Can we automate common fixes?
- Do we learn from every incident?
For Individuals
- How confident am I responding to incidents?
- Do I know where to look for information?
- Can I deploy a fix quickly?
- Who do I escalate to?
Success Stories
SaaS Company
- Before: 4.5-hour MTTR, manual response
- After: 22-minute MTTR, automated recovery
- Changes:
- Implemented one-click rollback
- Created comprehensive runbooks
- Automated common fixes (restart services, clear cache)
- Added distributed tracing
- Impact: 92% reduction in MTTR, 75% reduction in customer complaints
E-commerce Platform
- Before: 2-hour MTTR, frequent escalations
- After: 35-minute MTTR, self-service recovery
- Changes:
- Improved monitoring (5x more metrics)
- Real-time alerting vs. 5-minute delay
- Automated traffic shifting to healthy regions
- Weekly incident response drills
- Impact: 71% reduction in MTTR, $2M annual savings from reduced downtime
Advanced Topics
Chaos Engineering
Test your recovery time with controlled failures:
# Randomly kill instances to test recovery (the kill/monitor helpers are your own tooling)
from random import random

def chaos_monkey():
    if random() < 0.01:        # 1% chance per run
        kill_random_instance() # terminate a random instance
        start_timer()          # start measuring recovery time
        monitor_recovery()     # watch metrics until the service is healthy again
Automated Remediation
# Auto-heal common issues
rules:
  - trigger: high_memory_usage
    threshold: 90%
    action: restart_service
  - trigger: database_connection_pool_exhausted
    action: scale_up_connections
  - trigger: high_error_rate
    threshold: 5%
    duration: 5min
    action: rollback_deployment
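These rules are just data; a small loop in a remediation service can evaluate them against current metrics (a sketch with hypothetical metric values and action dispatch; a real version would also require the condition to persist for the configured duration):
# Hypothetical remediation loop mirroring the rules above (thresholds as ratios)
RULES = [
    {'trigger': 'high_memory_usage', 'threshold': 0.90, 'action': 'restart_service'},
    {'trigger': 'high_error_rate', 'threshold': 0.05, 'action': 'rollback_deployment'},
]

def evaluate_rules(current_metrics: dict, run_action) -> None:
    """current_metrics maps trigger name -> current value; run_action executes an action by name."""
    for rule in RULES:
        value = current_metrics.get(rule['trigger'])
        if value is not None and value >= rule['threshold']:
            run_action(rule['action'])   # e.g. restart the service or roll back the deployment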
Game Days
Practice incident response:
- Monthly: Simulate incidents
- Rotate: Different team members lead
- Measure: Track MTTR during drills
- Improve: Refine runbooks based on learnings
Balancing MTTR with Other Metrics
Ideal State:
✅ High Deployment Frequency
✅ Low Change Failure Rate
✅ Fast Lead Time
✅ Low MTTR
Common Trade-offs:
⚖️ Fast recovery vs. thorough investigation
⚖️ Quick fix vs. root cause resolution
⚖️ Speed vs. prevention
Best Practice:
- Recover quickly (MTTR)
- Investigate thoroughly after (post-mortem)
- Implement preventive measures (reduce future incidents)
Conclusion
MTTR measures your resilience—not just your reliability. Elite teams recover from incidents in under an hour through comprehensive monitoring, clear runbooks, automated remediation, and regular practice. Focus on reducing each component: faster detection, quicker response, streamlined investigation, easy rollback, and automated verification. Remember: incidents will happen. What matters is how quickly you recover and what you learn from each one.
Start Today:
- Instrument your incident tracking
- Calculate your current MTTR
- Find your biggest bottleneck
- Implement one improvement
- Measure the impact
- Repeat