Mean Time to Recovery (MTTR)
Measure how quickly your team restores service after an incident. A key DORA metric that indicates your organization's resilience.
Overview
Mean Time to Recovery (MTTR), also called Mean Time to Restore, measures how long it takes to restore service after a production incident. It's one of the four key DORA metrics and a critical indicator of your organization's resilience and incident response capabilities.
Why It Matters
- Customer impact: Faster recovery means less downtime
- Revenue protection: Every minute of downtime costs money
- Team stress: Quick recovery reduces firefighting burnout
- Resilience indicator: Shows your ability to handle failures
- Competitive advantage: Reliable systems build customer trust
- Innovation enabler: Fast recovery enables bolder experiments
The MTTR Formula
MTTR = Total Recovery Time ÷ Number of Incidents
Example:
- Incident 1: 30 minutes
- Incident 2: 2 hours
- Incident 3: 45 minutes
- Total: 3 hours 15 minutes (195 minutes)
- MTTR: 195 ÷ 3 = 65 minutes
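The same calculation in code (durations in minutes; a trivial sketch of the formula above):
recovery_minutes = [30, 120, 45]          # incidents 1-3 from the example above
mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes")        # MTTR: 65 minutes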
Incident Lifecycle
Understanding what to measure:
Timeline:
Incident start → Detection → Acknowledgment → Response → Resolution → Verification
 (issue begins)    Alert        On-call       Investigation    Fix        Confirm
                   fires        responds         begins      deployed     working

|←──── MTTD ────→|
|←───────────────────────── MTTR (total incident duration) ──────────────────────→|
Key Timestamps:
- Incident Start: When issue begins (may be before detection)
- Detection: When monitoring/alerts fire (MTTD)
- Acknowledgment: When on-call engineer responds
- Resolution Start: When fix is deployed
- Incident End: When service is fully restored
Recommended Visualizations
1. Histogram (Distribution)
Best for: Understanding recovery time patterns
X-axis: MTTR buckets (0-15m, 15-30m, 30m-1h, 1-4h, 4h+)
Y-axis: Number of incidents
Insight: Reveals if most incidents are quick vs. some very long
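For example, pandas can bucket per-incident recovery times into those ranges (a sketch with sample data; in practice, feed it your own incident export):
import pandas as pd

mttr_minutes = pd.Series([12, 25, 48, 65, 190, 310])    # sample recovery times in minutes
buckets = pd.cut(
    mttr_minutes,
    bins=[0, 15, 30, 60, 240, float('inf')],
    labels=['0-15m', '15-30m', '30m-1h', '1-4h', '4h+'],
)
print(buckets.value_counts().sort_index())               # incident count per bucket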
2. Percentile Chart (P50, P75, P95)
Best for: Tracking consistency
Y-axis: MTTR (minutes)
X-axis: Time (weeks/months)
Lines: P50 (median), P75, P95
Target: P95 < 1 hour for elite performance
⏱️ MTTR Improvement Over Time
Sample data showing consistent MTTR improvement. Median recovery time improved from 85 minutes to 52 minutes. Green line shows Elite threshold (< 60 minutes).
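To compute the weekly percentile lines from raw incident data (a pandas sketch; it assumes the incidents DataFrame and mttr column from the DIY Approach section below):
import pandas as pd

# Weekly P50/P75/P95 of per-incident recovery time (mttr in minutes)
weekly = (
    incidents
    .groupby(pd.Grouper(key='started_at', freq='W'))['mttr']
    .quantile([0.50, 0.75, 0.95])
    .unstack()        # one column per percentile
)
print(weekly.tail())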
3. Gauge (Current Status)
Best for: Executive dashboards
Gauge ranges:
- Elite: < 1 hour (green)
- High: 1 hour - 1 day (blue)
- Medium: 1 day - 1 week (yellow)
- Low: > 1 week (red)
Display: Last 30 days MTTR
🎯 Current MTTR Performance
Mean Time to Recovery
Current MTTR of 42 minutes places the team in the Elite category (< 60 minutes). This indicates strong incident response capabilities.
4. Breakdown Chart (Components)
Best for: Identifying bottlenecks
Stacked bar showing:
- Detection time (MTTD)
- Response time (acknowledgment)
- Investigation time
- Fix implementation time
- Verification time
🔍 MTTR Component Breakdown
Investigation is the largest component at 15 minutes (36% of MTTR). Improving observability and runbooks can reduce this further.
Target Ranges (DORA Benchmarks)
| Performance Level | MTTR |
|-------------------|------|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Between one day and one week |
| Low | More than one week |
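Given these thresholds, a small helper can map a computed MTTR to its DORA performance tier (a minimal sketch; the function name and threshold constants are illustrative, not from a specific library):
def dora_mttr_tier(mttr_minutes: float) -> str:
    """Map an MTTR value (in minutes) to its DORA performance tier."""
    if mttr_minutes < 60:          # less than one hour
        return "Elite"
    if mttr_minutes < 24 * 60:     # less than one day
        return "High"
    if mttr_minutes < 7 * 24 * 60: # less than one week
        return "Medium"
    return "Low"                   # more than one week

print(dora_mttr_tier(42))          # "Elite"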
By Severity
| Severity | Target MTTR | Acceptable Max |
|----------|-------------|----------------|
| P0 (Critical) | < 30 minutes | < 1 hour |
| P1 (High) | < 2 hours | < 4 hours |
| P2 (Medium) | < 1 day | < 3 days |
| P3 (Low) | < 1 week | < 2 weeks |
By System
- Payment systems: < 15 minutes
- Core API: < 30 minutes
- User-facing features: < 1 hour
- Background jobs: < 4 hours
- Analytics: < 1 day
How to Improve MTTR
1. Improve Detection (Reduce MTTD)
Monitoring:
- Comprehensive metrics (RED: Rate, Errors, Duration)
- Real user monitoring (RUM)
- Synthetic monitoring for critical paths
- Log aggregation and analysis
Alerting:
- Smart alerts (reduce noise)
- Alert on symptoms, not causes
- Escalation policies
- Multiple notification channels
Example Alert (Prometheus-style alerting rule):
- alert: HighErrorRate
  expr: error_rate > 5        # assumes error_rate is exported as a percentage
  for: 2m
  labels:
    severity: P1
  annotations:
    description: "{{ $labels.service }} error rate at {{ $value }}%"
    runbook: https://wiki.company.com/runbooks/high-error-rate
2. Improve Response
On-Call:
- Clear on-call schedules
- Sufficient on-call rotation (avoid burnout)
- Fair on-call compensation
- Defined response SLAs (P0: 5 min, P1: 15 min, P2: 1 hour)
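Those SLAs can live in a small, version-controlled policy that paging scripts or bots read (an illustrative Python sketch; the timings and team handles are examples, not a specific tool's format):
# Illustrative response-SLA policy keyed by severity
RESPONSE_SLA = {
    'P0': {'ack_minutes': 5,  'escalate_after_minutes': 15,  'escalate_to': '@incident-commander'},
    'P1': {'ack_minutes': 15, 'escalate_after_minutes': 30,  'escalate_to': '@backend-lead'},
    'P2': {'ack_minutes': 60, 'escalate_after_minutes': 240, 'escalate_to': '@team-lead'},
}

def needs_escalation(severity: str, minutes_since_alert: float, acknowledged: bool) -> bool:
    """Escalate when an alert is still unacknowledged past its severity's window."""
    return not acknowledged and minutes_since_alert > RESPONSE_SLA[severity]['escalate_after_minutes']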
Runbooks:
# Runbook: High API Error Rate
## Symptoms
- Error rate > 5%
- Alert: "HighAPIErrorRate"
## Immediate Actions (< 5 min)
1. Check dashboard: https://...
2. Check recent deployments
3. Check dependency status
## Investigation (5-15 min)
1. Review error logs
2. Check database connection pool
3. Verify third-party services
## Resolution Options
1. If recent deployment → rollback
2. If database → increase pool size
3. If third-party → enable circuit breaker
## Escalation
If not resolved in 30 min, page @backend-lead
3. Improve Investigation
Tools:
- Centralized logging (ELK, Splunk, DataDog)
- Distributed tracing (Jaeger, Zipkin, DataDog APM)
- APM tools (New Relic, AppDynamics)
- Database query analyzers
Practices:
- Structured logging
- Correlation IDs
- Service mesh observability
- Feature flags for quick rollback
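As a sketch of the first two practices, structured (JSON) logs that carry a correlation ID make it possible to follow one request across services during an investigation (field names and values here are illustrative):
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('api')

def handle_request(incoming_correlation_id=None):
    # Reuse the caller's correlation ID if present, otherwise mint a new one
    correlation_id = incoming_correlation_id or str(uuid.uuid4())
    # Structured log line: easy to filter by correlation_id in your log aggregator
    logger.info(json.dumps({
        'event': 'payment_failed',
        'correlation_id': correlation_id,
        'service': 'checkout',
        'latency_ms': 412,
    }))
    return correlation_id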
4. Improve Fix Deployment
Fast Rollback:
# One-command rollback
./rollback.sh production

# Or automated, based on error rate (pseudocode):
# if error_rate stays above 5% for 5 minutes:
#     auto_rollback()
Fast Forward Fix:
- CI/CD pipeline optimized for hotfixes
- Skip non-critical checks in emergency mode
- Separate "hotfix" branch with fast-track deployment
5. Improve Verification
Smoke Tests:
# Automated post-deployment checks (check_* helpers are illustrative)
assert check_api_health()
assert check_error_rate() < 0.01           # error rate under 1%
assert check_latency() < baseline * 1.1    # latency within 10% of baseline
assert check_key_metrics()
Gradual Rollout:
- Deploy fix to canary first
- Monitor for 5 minutes
- Gradually increase traffic
- Full rollout only if metrics healthy
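A minimal sketch of that rollout loop (the traffic-shifting, health-check, and rollback helpers stand in for whatever your platform provides):
import time

TRAFFIC_STEPS = [1, 5, 25, 50, 100]      # percent of traffic on the fixed version

def gradual_rollout(set_traffic_percent, metrics_healthy, rollback):
    """Ramp traffic step by step; roll back on the first unhealthy check."""
    for percent in TRAFFIC_STEPS:
        set_traffic_percent(percent)
        time.sleep(5 * 60)               # monitor each step for 5 minutes
        if not metrics_healthy():        # e.g. error rate and latency vs. baseline
            rollback()
            return False
    return True                          # full rollout with healthy metrics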
Common Pitfalls
❌ Measuring Only Average
Problem: The average hides long-tail incidents
Solution: Track P50, P75, P95, and P99
❌ Starting Timer at Detection
Problem: Misses the customer impact that occurs before detection
Solution: Measure from incident start (when the issue began)
❌ Declaring Victory Too Early
Problem: Marking the incident resolved before verification
Solution: Include verification time in MTTR
❌ Not Segmenting by Severity
Problem: Slow, low-severity (P3) incidents skew the aggregate MTTR
Solution: Track MTTR separately by severity
❌ Optimizing for the Metric
Problem: Closing incidents quickly without a proper fix
Solution: Track reopen rate and customer satisfaction
❌ Hero Culture
Problem: Relying on specific individuals to resolve incidents
Solution: Document knowledge and distribute expertise
Implementation Guide
Week 1: Instrumentation
# Track incident timestamps
from datetime import datetime
from uuid import uuid4

def generate_id() -> str:
    return str(uuid4())

incident = {
    'id': generate_id(),
    'started_at': datetime.now(),   # when the issue began
    'detected_at': None,            # when the alert fired
    'acknowledged_at': None,        # when on-call responded
    'resolved_at': None,            # when the fix was deployed
    'verified_at': None,            # when confirmed working
    'severity': 'P1',
    'service': 'api',
    'root_cause': None,
}
Week 2: Baseline
- Track all incidents for 2 weeks
- Calculate current MTTR
- Break down by component (detection, response, fix, verify)
- Identify slowest incidents and why
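One way to get that component breakdown from the Week 1 timestamps (a sketch; it assumes every timestamp is filled in when the incident closes, and combines investigation and fix because the schema has no separate fix-start field):
def mttr_components(incident: dict) -> dict:
    """Minutes spent in each phase, derived from the Week 1 incident timestamps."""
    def minutes(start_key, end_key):
        return (incident[end_key] - incident[start_key]).total_seconds() / 60

    return {
        'detection': minutes('started_at', 'detected_at'),               # MTTD
        'acknowledgment': minutes('detected_at', 'acknowledged_at'),
        'investigation_and_fix': minutes('acknowledged_at', 'resolved_at'),
        'verification': minutes('resolved_at', 'verified_at'),
        'total_mttr': minutes('started_at', 'verified_at'),
    }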
Week 3: Quick Wins
- Create runbooks for top 5 incident types
- Set up one-command rollback
- Add alerting for missing monitors
- Document on-call procedures
Week 4: Long-Term Improvements
- Implement distributed tracing
- Build automated remediation for common issues
- Create incident dashboard
- Start blameless post-mortems
Dashboard Example
Executive View
┌──────────────────────────────────────────────┐
│  MTTR: 42 minutes                            │
│  ██████████████████░░░░░░░  Elite            │
│                                              │
│  Last 30 Days:                               │
│  • P50: 28 minutes                           │
│  • P95: 2.3 hours                            │
│  • Incidents: 15                             │
│                                              │
│  Trend: ↓ 15% improvement vs. last month     │
└──────────────────────────────────────────────┘
Operations View
Incident Breakdown (Last 30 Days)
───────────────────────────────────────────────────────
Severity    Count    MTTR         Longest      Shortest
───────────────────────────────────────────────────────
P0          2        18 min       25 min       12 min
P1          5        45 min       2.5 hours    15 min
P2          8        1.2 hours    4 hours      30 min
───────────────────────────────────────────────────────

Time Breakdown (Average)
───────────────────────────────────────────────────────
Detection:          8 minutes   (19%)
Acknowledgment:     3 minutes    (7%)
Investigation:     15 minutes   (36%)
Fix Deploy:        10 minutes   (24%)
Verification:       6 minutes   (14%)
───────────────────────────────────────────────────────
Total:             42 minutes
Related Metrics
The Four Horsemen of Reliability:
- MTTR: How fast you recover
- MTTD (Mean Time to Detection): How fast you notice
- MTTA (Mean Time to Acknowledgment): How fast you respond
- MTBF (Mean Time Between Failures): How often you fail
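All four can be derived from the same incident table used elsewhere on this page (a sketch; column names follow the implementation guide and DIY sections, and are assumed to be parsed as datetimes):
def mean_minutes(end, start):
    return ((end - start).dt.total_seconds() / 60).mean()

mttd = mean_minutes(incidents['detected_at'], incidents['started_at'])
mtta = mean_minutes(incidents['acknowledged_at'], incidents['detected_at'])
mttr = mean_minutes(incidents['verified_at'], incidents['started_at'])
# MTBF: average gap between consecutive incident starts, in hours
mtbf = incidents['started_at'].sort_values().diff().dt.total_seconds().div(3600).mean()

print(f"MTTD {mttd:.0f} min | MTTA {mtta:.0f} min | MTTR {mttr:.0f} min | MTBF {mtbf:.1f} h")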
Related DORA Metrics:
- Change Failure Rate: Are deployments causing incidents?
- Deployment Frequency: Does high velocity increase incidents?
- Lead Time: Can you deploy fixes quickly?
Business Metrics:
- System Uptime: Overall availability
- Customer Impact: How many users affected?
- Revenue Impact: Money lost during downtime
Tools & Integrations
Incident Management
- PagerDuty: Incident response platform
- Opsgenie: On-call and alert management
- Incident.io: Modern incident management
- FireHydrant: Incident response orchestration
- Splunk On-Call (formerly VictorOps): Incident management
Monitoring & Observability
- DataDog: Full-stack monitoring
- New Relic: Application performance
- Prometheus + Grafana: Open-source monitoring
- Sentry: Error tracking
- Honeycomb: Observability platform
Communication
- Slack: Incident channels
- MS Teams: Incident response
- Zoom: Incident war rooms
- StatusPage: Customer communication
DIY Approach
import pandas as pd

# Load incidents from your incident export; parse timestamp columns as datetimes
incidents = pd.read_csv('incidents.csv', parse_dates=['started_at', 'resolved_at'])

# Calculate MTTR in minutes for each incident
incidents['mttr'] = (incidents['resolved_at'] - incidents['started_at']).dt.total_seconds() / 60

# Overall MTTR
print(f"MTTR: {incidents['mttr'].mean():.1f} minutes")

# By severity
print("\nMTTR by Severity:")
print(incidents.groupby('severity')['mttr'].agg(['mean', 'median', 'max']))

# Percentiles
print(f"\nP50: {incidents['mttr'].quantile(0.5):.1f} min")
print(f"P95: {incidents['mttr'].quantile(0.95):.1f} min")
Questions to Ask
For Leadership
- Are we responding to incidents fast enough?
- Do we have adequate on-call coverage?
- Are certain teams or services outliers?
- Do we need better tooling or training?
For Teams
- What slows down our incident response?
- Do we have good runbooks?
- Can we automate common fixes?
- Do we learn from every incident?
For Individuals
- How confident am I responding to incidents?
- Do I know where to look for information?
- Can I deploy a fix quickly?
- Who do I escalate to?
Success Stories
SaaS Company
- Before: 4.5-hour MTTR, manual response
- After: 22-minute MTTR, automated recovery
- Changes:
- Implemented one-click rollback
- Created comprehensive runbooks
- Automated common fixes (restart services, clear cache)
- Added distributed tracing
- Impact: 92% reduction in MTTR, 75% reduction in customer complaints
E-commerce Platform
- Before: 2-hour MTTR, frequent escalations
- After: 35-minute MTTR, self-service recovery
- Changes:
- Improved monitoring (5x more metrics)
- Real-time alerting vs. 5-minute delay
- Automated traffic shifting to healthy regions
- Weekly incident response drills
- Impact: 71% reduction in MTTR, $2M annual savings from reduced downtime
Advanced Topics
Chaos Engineering
Test your recovery time with controlled failures:
# Randomly kill instances to test recovery (the kill/monitor helpers are your own tooling)
from random import random

def chaos_monkey():
    if random() < 0.01:        # 1% chance per run
        kill_random_instance() # terminate a random instance
        start_timer()          # start measuring recovery time
        monitor_recovery()     # watch metrics until the service is healthy again
Automated Remediation
# Auto-heal common issues
rules:
  - trigger: high_memory_usage
    threshold: 90%
    action: restart_service
  - trigger: database_connection_pool_exhausted
    action: scale_up_connections
  - trigger: high_error_rate
    threshold: 5%
    duration: 5min
    action: rollback_deployment
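These rules are just data; a small loop in a remediation service can evaluate them against current metrics (a sketch with hypothetical metric values and action dispatch; a real version would also require the condition to persist for the configured duration):
# Hypothetical remediation loop mirroring the rules above (thresholds as ratios)
RULES = [
    {'trigger': 'high_memory_usage', 'threshold': 0.90, 'action': 'restart_service'},
    {'trigger': 'high_error_rate', 'threshold': 0.05, 'action': 'rollback_deployment'},
]

def evaluate_rules(current_metrics: dict, run_action) -> None:
    """current_metrics maps trigger name -> current value; run_action executes an action by name."""
    for rule in RULES:
        value = current_metrics.get(rule['trigger'])
        if value is not None and value >= rule['threshold']:
            run_action(rule['action'])   # e.g. restart the service or roll back the deployment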
Game Days
Practice incident response:
- Monthly: Simulate incidents
- Rotate: Different team members lead
- Measure: Track MTTR during drills
- Improve: Refine runbooks based on learnings
Balancing MTTR with Other Metrics
Ideal State:
✅ High Deployment Frequency
✅ Low Change Failure Rate
✅ Fast Lead Time
✅ Low MTTR
Common Trade-offs:
⚖️ Fast recovery vs. thorough investigation
⚖️ Quick fix vs. root cause resolution
⚖️ Speed vs. prevention
Best Practice:
- Recover quickly (MTTR)
- Investigate thoroughly after (post-mortem)
- Implement preventive measures (reduce future incidents)
Conclusion
MTTR measures your resilience—not just your reliability. Elite teams recover from incidents in under an hour through comprehensive monitoring, clear runbooks, automated remediation, and regular practice. Focus on reducing each component: faster detection, quicker response, streamlined investigation, easy rollback, and automated verification. Remember: incidents will happen. What matters is how quickly you recover and what you learn from each one.
Start Today:
- Instrument your incident tracking
- Calculate your current MTTR
- Find your biggest bottleneck
- Implement one improvement
- Measure the impact
- Repeat