
Mean Time to Recovery (MTTR)

November 10, 2025 · By The Art of CTO · 14 min read

Measure how quickly your team restores service after an incident. A key DORA metric that indicates your organization's resilience.

Type: reliability
Tracking: real-time
Difficulty: medium
Measurement: Average time from incident start to resolution
Target Range: Elite: < 1 hour | High: < 1 day | Medium: < 1 week | Low: > 1 week
Recommended Visualizations: histogram, line-chart, percentile-chart, gauge
Data Sources: PagerDuty, Opsgenie, Incident.io, Jira Service Management

Overview

Mean Time to Recovery (MTTR), also called Mean Time to Restore, measures how long it takes to restore service after a production incident. It's one of the four key DORA metrics and a critical indicator of your organization's resilience and incident response capabilities.

Why It Matters

  • Customer impact: Faster recovery means less downtime
  • Revenue protection: Every minute of downtime costs money
  • Team stress: Quick recovery reduces firefighting burnout
  • Resilience indicator: Shows your ability to handle failures
  • Competitive advantage: Reliable systems build customer trust
  • Innovation enabler: Fast recovery enables bolder experiments

The MTTR Formula

MTTR = Total Recovery Time ÷ Number of Incidents

Example:
- Incident 1: 30 minutes
- Incident 2: 2 hours
- Incident 3: 45 minutes
- Total: 3 hours 15 minutes (195 minutes)
- MTTR: 195 ÷ 3 = 65 minutes
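
The same arithmetic as a quick Python sketch (durations are assumed to be recorded in minutes):

# Recovery durations in minutes for the three incidents above
recovery_minutes = [30, 120, 45]

# MTTR = total recovery time / number of incidents
mttr = sum(recovery_minutes) / len(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes")  # MTTR: 65 minutes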

Incident Lifecycle

Understanding what to measure:

Timeline:
─────────────────────────────────────────────────────────────
Detection → Acknowledgment → Response → Resolution → Verification
    ↓            ↓              ↓            ↓            ↓
  Alert        On-call      Investigation   Fix        Confirm
  fires        responds     begins          deployed   working
    │←─── MTTD ───→│←────── MTTR ──────────────────→│
    │←───────────── Total Incident Duration ─────────────────→│

Key Timestamps:

  1. Incident Start: When issue begins (may be before detection)
  2. Detection: When monitoring/alerts fire (MTTD)
  3. Acknowledgment: When on-call engineer responds
  4. Resolution Start: When fix is deployed
  5. Incident End: When service is fully restored

Recommended Visualizations

1. Histogram (Distribution)

Best for: Understanding recovery time patterns

X-axis: MTTR buckets (0-15m, 15-30m, 30m-1h, 1-4h, 4h+)
Y-axis: Number of incidents
Insight: Reveals if most incidents are quick vs. some very long

2. Percentile Chart (P50, P75, P95)

Best for: Tracking consistency

Y-axis: MTTR (minutes)
X-axis: Time (weeks/months)
Lines: P50 (median), P75, P95
Target: P95 < 1 hour for elite performance

⏱️ MTTR Improvement Over Time

Sample data showing consistent MTTR improvement. Median recovery time improved from 85 minutes to 52 minutes. Green line shows Elite threshold (< 60 minutes).
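
One way to derive those percentile lines with pandas, assuming the same incidents.csv layout used in the DIY Approach section below:

import pandas as pd

# Assumed columns: started_at and resolved_at timestamps per incident
incidents = pd.read_csv('incidents.csv', parse_dates=['started_at', 'resolved_at'])
incidents['mttr'] = (incidents['resolved_at'] - incidents['started_at']).dt.total_seconds() / 60

# Weekly P50 / P75 / P95 - one row per week, one column per percentile
weekly = incidents.groupby(pd.Grouper(key='resolved_at', freq='W'))['mttr']
print(weekly.quantile([0.5, 0.75, 0.95]).unstack())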

3. Gauge (Current Status)

Best for: Executive dashboards

Gauge ranges:
- Elite: < 1 hour (green)
- High: 1 hour - 1 day (blue)
- Medium: 1 day - 1 week (yellow)
- Low: > 1 week (red)

Display: Last 30 days MTTR

🎯 Current MTTR Performance

[Gauge: Mean Time to Recovery at 42 min, in the Elite band on a 0–240 min scale]

Current MTTR of 42 minutes places the team in the Elite category (< 60 minutes). This indicates strong incident response capabilities.

4. Breakdown Chart (Components)

Best for: Identifying bottlenecks

Stacked bar showing:
- Detection time (MTTD)
- Response time (acknowledgment)
- Investigation time
- Fix implementation time
- Verification time

🔍 MTTR Component Breakdown

Investigation is the largest component at 15 minutes (36% of MTTR). Improving observability and runbooks can reduce this further.
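
These components fall straight out of the incident timestamps from the lifecycle section. A minimal sketch, assuming the fields captured in the Week 1 instrumentation example later in this article (investigation and fix deploy share one bucket unless you also record when fix work started):

def minutes_between(start, end):
    return (end - start).total_seconds() / 60

def component_breakdown(incident):
    # `incident` is a dict of datetimes as in the Week 1 instrumentation example
    return {
        'detection': minutes_between(incident['started_at'], incident['detected_at']),
        'acknowledgment': minutes_between(incident['detected_at'], incident['acknowledged_at']),
        'investigation_and_fix': minutes_between(incident['acknowledged_at'], incident['resolved_at']),
        'verification': minutes_between(incident['resolved_at'], incident['verified_at']),
    }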

Target Ranges (DORA Benchmarks)

| Performance Level | MTTR |
|-------------------|------|
| Elite | Less than one hour |
| High | Less than one day |
| Medium | Between one day and one week |
| Low | More than one week |

By Severity

| Severity | Target MTTR | Acceptable Max |
|----------|-------------|----------------|
| P0 (Critical) | < 30 minutes | < 1 hour |
| P1 (High) | < 2 hours | < 4 hours |
| P2 (Medium) | < 1 day | < 3 days |
| P3 (Low) | < 1 week | < 2 weeks |
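
A small sketch of turning that table into an automated check (thresholds mirror the table above; the helper name is illustrative):

# Target and acceptable-max MTTR per severity, in minutes
MTTR_TARGETS = {
    'P0': (30, 60),
    'P1': (2 * 60, 4 * 60),
    'P2': (24 * 60, 3 * 24 * 60),
    'P3': (7 * 24 * 60, 14 * 24 * 60),
}

def grade_recovery(severity: str, mttr_minutes: float) -> str:
    """Classify a recovery as within target, acceptable, or breached."""
    target, acceptable_max = MTTR_TARGETS[severity]
    if mttr_minutes <= target:
        return 'within target'
    if mttr_minutes <= acceptable_max:
        return 'acceptable'
    return 'breached'

print(grade_recovery('P1', 95))  # within target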

By System

  • Payment systems: < 15 minutes
  • Core API: < 30 minutes
  • User-facing features: < 1 hour
  • Background jobs: < 4 hours
  • Analytics: < 1 day

How to Improve MTTR

1. Improve Detection (Reduce MTTD)

Monitoring:

  • Comprehensive metrics (RED: Rate, Errors, Duration)
  • Real user monitoring (RUM)
  • Synthetic monitoring for critical paths
  • Log aggregation and analysis

Alerting:

  • Smart alerts (reduce noise)
  • Alert on symptoms, not causes
  • Escalation policies
  • Multiple notification channels

Example Alert:

alert: HighErrorRate
expr: error_rate > 5    # assumes an `error_rate` recording rule expressed as a percentage
for: 2m
labels:
  severity: P1
annotations:
  description: "{{ $labels.service }} error rate at {{ $value }}%"
  runbook: https://wiki.company.com/runbooks/high-error-rate

2. Improve Response

On-Call:

  • Clear on-call schedules
  • Sufficient on-call rotation (avoid burnout)
  • Fair on-call compensation
  • Defined response SLAs (P0: 5 min, P1: 15 min, P2: 1 hour)

Runbooks:

# Runbook: High API Error Rate

## Symptoms
- Error rate > 5%
- Alert: "HighAPIErrorRate"

## Immediate Actions (< 5 min)
1. Check dashboard: https://...
2. Check recent deployments
3. Check dependency status

## Investigation (5-15 min)
1. Review error logs
2. Check database connection pool
3. Verify third-party services

## Resolution Options
1. If recent deployment → rollback
2. If database → increase pool size
3. If third-party → enable circuit breaker

## Escalation
If not resolved in 30 min, page @backend-lead

3. Improve Investigation

Tools:

  • Centralized logging (ELK, Splunk, DataDog)
  • Distributed tracing (Jaeger, Zipkin, DataDog APM)
  • APM tools (New Relic, AppDynamics)
  • Database query analyzers

Practices:

  • Structured logging
  • Correlation IDs
  • Service mesh observability
  • Feature flags for quick rollback
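
As a minimal illustration of structured logging with a correlation ID, using only the Python standard library (event and field names are made up for the example):

import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

def log_event(event: str, correlation_id: str, **fields):
    # One JSON line per event so incidents can be traced across services
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id, **fields}))

# Pass the same correlation_id to every downstream call for this request
correlation_id = str(uuid.uuid4())
log_event("payment.failed", correlation_id, service="checkout", status=502)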

4. Improve Fix Deployment

Fast Rollback:

# One-command rollback
./rollback.sh production

# Or automated, based on error rate (see the sketch below):
#   if error rate > 5% for 5 minutes -> auto rollback
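
A sketch of that automated path, with error_rate() and rollback() standing in for your own monitoring query and deploy command:

import time

THRESHOLD = 0.05          # 5% error rate
WINDOW_SECONDS = 5 * 60   # must stay unhealthy for 5 minutes

def watch_and_rollback(error_rate, rollback, poll_seconds=30):
    # Roll back only if the error rate stays above the threshold for the whole window
    breach_started = None
    while True:
        if error_rate() > THRESHOLD:
            breach_started = breach_started or time.monotonic()
            if time.monotonic() - breach_started >= WINDOW_SECONDS:
                rollback()
                return
        else:
            breach_started = None
        time.sleep(poll_seconds)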

Fast Forward Fix:

  • CI/CD pipeline optimized for hotfixes
  • Skip non-critical checks in emergency mode
  • Separate "hotfix" branch with fast-track deployment

5. Improve Verification

Smoke Tests:

# Automated post-deployment checks (illustrative helper functions)
assert check_api_health()
assert check_error_rate() < 0.01            # error rate under 1%
assert check_latency() < baseline * 1.1     # at most 10% slower than baseline
assert check_key_metrics()

Gradual Rollout:

  • Deploy fix to canary first
  • Monitor for 5 minutes
  • Gradually increase traffic
  • Full rollout only if metrics healthy
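
A sketch of that rollout loop, with set_traffic_percentage() and metrics_healthy() as placeholders for your load balancer and monitoring hooks:

import time

def gradual_rollout(set_traffic_percentage, metrics_healthy,
                    steps=(5, 25, 50, 100), soak_seconds=300):
    # Shift traffic to the fix in stages, backing out if metrics degrade
    for pct in steps:
        set_traffic_percentage(pct)
        time.sleep(soak_seconds)       # monitor each stage (5 minutes by default)
        if not metrics_healthy():
            set_traffic_percentage(0)  # abort: send traffic back to the previous version
            return False
    return True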

Common Pitfalls

❌ Measuring Only Average

Problem: Average hides long-tail incidents
Solution: Track P50, P75, P95, P99

❌ Starting Timer at Detection

Problem: Misses actual customer impact
Solution: Measure from incident start (when the issue began)

❌ Declaring Victory Too Early

Problem: Marking incident resolved before verification
Solution: Include verification time in MTTR

❌ Not Segmenting by Severity

Problem: P3 incidents drag down metrics
Solution: Track MTTR separately by severity

❌ Optimizing for the Metric

Problem: Closing incidents quickly without a proper fix
Solution: Track reopen rate and customer satisfaction

❌ Hero Culture

Problem: Relying on specific individuals to resolve incidents
Solution: Document knowledge, distribute expertise

Implementation Guide

Week 1: Instrumentation

# Track incident timestamps (generate_id() is a placeholder for your own ID scheme)
from datetime import datetime

incident = {
  'id': generate_id(),
  'started_at': datetime.now(),  # When issue began
  'detected_at': None,           # When alert fired
  'acknowledged_at': None,       # When on-call responded
  'resolved_at': None,           # When fix deployed
  'verified_at': None,           # When confirmed working
  'severity': 'P1',
  'service': 'api',
  'root_cause': None
}

Week 2: Baseline

  • Track all incidents for 2 weeks
  • Calculate current MTTR
  • Break down by component (detection, response, fix, verify)
  • Identify slowest incidents and why

Week 3: Quick Wins

  • Create runbooks for top 5 incident types
  • Set up one-command rollback
  • Add alerting for missing monitors
  • Document on-call procedures

Week 4: Long-Term Improvements

  • Implement distributed tracing
  • Build automated remediation for common issues
  • Create incident dashboard
  • Start blameless post-mortems

Dashboard Example

Executive View

┌──────────────────────────────────────────────┐
│ MTTR: 42 minutes                             │
│ ██████████████████░░░░░░░ Elite              │
│                                              │
│ Last 30 Days:                                │
│ • P50: 28 minutes                            │
│ • P95: 2.3 hours                             │
│ • Incidents: 15                              │
│                                              │
│ Trend: ↓ 15% improvement vs. last month     │
└──────────────────────────────────────────────┘

Operations View

Incident Breakdown (Last 30 Days)
─────────────────────────────────────────────────────
Severity  Count  MTTR      Longest     Shortest
─────────────────────────────────────────────────────
P0        2      18 min    25 min      12 min
P1        5      45 min    2.5 hours   15 min
P2        8      1.2 hours 4 hours     30 min
─────────────────────────────────────────────────────

Time Breakdown (Average)
─────────────────────────────────────────────────────
Detection:      8 minutes    (19%)
Acknowledgment: 3 minutes    (7%)
Investigation:  15 minutes   (36%)
Fix Deploy:     10 minutes   (24%)
Verification:   6 minutes    (14%)
─────────────────────────────────────────────────────
Total:          42 minutes

Related Metrics

The Four Horsemen of Reliability:

  • MTTR: How fast you recover
  • MTTD (Mean Time to Detection): How fast you notice
  • MTTA (Mean Time to Acknowledgment): How fast you respond
  • MTBF (Mean Time Between Failures): How often you fail

Related DORA Metrics:

  • Change Failure Rate: Are deployments causing incidents?
  • Deployment Frequency: Does high velocity increase incidents?
  • Lead Time: Can you deploy fixes quickly?

Business Metrics:

  • System Uptime: Overall availability
  • Customer Impact: How many users affected?
  • Revenue Impact: Money lost during downtime

Tools & Integrations

Incident Management

  • PagerDuty: Incident response platform
  • Opsgenie: On-call and alert management
  • Incident.io: Modern incident management
  • FireHydrant: Incident response orchestration
  • VictorOps (Splunk): Incident management

Monitoring & Observability

  • DataDog: Full-stack monitoring
  • New Relic: Application performance
  • Prometheus + Grafana: Open-source monitoring
  • Sentry: Error tracking
  • Honeycomb: Observability platform

Communication

  • Slack: Incident channels
  • MS Teams: Incident response
  • Zoom: Incident war rooms
  • StatusPage: Customer communication

DIY Approach

import pandas as pd

# Load incidents exported from your incident tracker (timestamps parsed as datetimes)
incidents = pd.read_csv('incidents.csv', parse_dates=['started_at', 'resolved_at'])

# Per-incident recovery time in minutes
incidents['mttr'] = (incidents['resolved_at'] - incidents['started_at']).dt.total_seconds() / 60

# Overall MTTR
print(f"MTTR: {incidents['mttr'].mean():.1f} minutes")

# By severity
print("\nMTTR by Severity:")
print(incidents.groupby('severity')['mttr'].agg(['mean', 'median', 'max']))

# Percentiles
print(f"\nP50: {incidents['mttr'].quantile(0.5):.1f} min")
print(f"P95: {incidents['mttr'].quantile(0.95):.1f} min")

Questions to Ask

For Leadership

  • Are we responding to incidents fast enough?
  • Do we have adequate on-call coverage?
  • Are certain teams or services outliers?
  • Do we need better tooling or training?

For Teams

  • What slows down our incident response?
  • Do we have good runbooks?
  • Can we automate common fixes?
  • Do we learn from every incident?

For Individuals

  • How confident am I responding to incidents?
  • Do I know where to look for information?
  • Can I deploy a fix quickly?
  • Who do I escalate to?

Success Stories

SaaS Company

  • Before: 4.5-hour MTTR, manual response
  • After: 22-minute MTTR, automated recovery
  • Changes:
    • Implemented one-click rollback
    • Created comprehensive runbooks
    • Automated common fixes (restart services, clear cache)
    • Added distributed tracing
  • Impact: 92% reduction in MTTR, 75% reduction in customer complaints

E-commerce Platform

  • Before: 2-hour MTTR, frequent escalations
  • After: 35-minute MTTR, self-service recovery
  • Changes:
    • Improved monitoring (5x more metrics)
    • Real-time alerting vs. 5-minute delay
    • Automated traffic shifting to healthy regions
    • Weekly incident response drills
  • Impact: 71% reduction in MTTR, $2M annual savings from reduced downtime

Advanced Topics

Chaos Engineering

Test your recovery time with controlled failures:

# Randomly kill services to test recovery
# (kill_random_instance, start_timer, monitor_recovery are placeholders)
from random import random

def chaos_monkey():
    if random() < 0.01:  # 1% chance
        kill_random_instance()
        start_timer()
        monitor_recovery()

Automated Remediation

# Auto-heal common issues
rules:
  - trigger: high_memory_usage
    threshold: 90%
    action: restart_service

  - trigger: database_connection_pool_exhausted
    action: scale_up_connections

  - trigger: high_error_rate
    threshold: 5%
    duration: 5min
    action: rollback_deployment

Game Days

Practice incident response:

  • Monthly: Simulate incidents
  • Rotate: Different team members lead
  • Measure: Track MTTR during drills
  • Improve: Refine runbooks based on learnings

Balancing MTTR with Other Metrics

Ideal State:
✅ High Deployment Frequency
✅ Low Change Failure Rate
✅ Fast Lead Time
✅ Low MTTR

Common Trade-offs:
⚖️ Fast recovery vs. thorough investigation
⚖️ Quick fix vs. root cause resolution
⚖️ Speed vs. prevention

Best Practice:

  1. Recover quickly (MTTR)
  2. Investigate thoroughly after (post-mortem)
  3. Implement preventive measures (reduce future incidents)

Conclusion

MTTR measures your resilience—not just your reliability. Elite teams recover from incidents in under an hour through comprehensive monitoring, clear runbooks, automated remediation, and regular practice. Focus on reducing each component: faster detection, quicker response, streamlined investigation, easy rollback, and automated verification. Remember: incidents will happen. What matters is how quickly you recover and what you learn from each one.

Start Today:

  1. Instrument your incident tracking
  2. Calculate your current MTTR
  3. Find your biggest bottleneck
  4. Implement one improvement
  5. Measure the impact
  6. Repeat