
Change Failure Rate

November 10, 2025 · By The Art of CTO · 12 min read

Track the percentage of deployments that result in failures, rollbacks, or hotfixes. Essential for balancing speed with stability.

Type: quality
Tracking: weekly
Difficulty: medium
Measurement: Failed deployments ÷ total deployments × 100
Target Range: Elite: 0-15% | High: 16-30% | Medium: 31-45% | Low: > 45%
Recommended Visualizations: line-chart, gauge, stacked-bar-chart
Data Sources: CI/CD logs, PagerDuty, incident management systems, rollback tracking

Overview

Change Failure Rate measures the percentage of deployments that cause a failure in production, requiring remediation (rollback, hotfix, incident response). It's a critical DORA metric that balances velocity with stability.

Why It Matters

  • Quality indicator: Shows whether you're shipping faster than your quality practices can support
  • Risk management: A high CFR translates directly into more production incidents
  • Customer trust: Frequent failures erode confidence
  • Cost control: Failed deployments are expensive to diagnose, roll back, and fix
  • Team morale: Constant firefighting burns out teams
  • Process health: Reveals gaps in testing, review, or release process

How to Measure

What Counts as a Failure?

Include:

  • ✅ Production incidents caused by deployment
  • ✅ Rollbacks to previous version
  • ✅ Hotfixes deployed within 24 hours
  • ✅ Degraded performance requiring intervention
  • ✅ Security vulnerabilities requiring immediate patch

Exclude:

  • ❌ Planned maintenance
  • ❌ Expected behavior changes
  • ❌ Non-production environment issues
  • ❌ Issues detected before customer impact
  • ❌ Failed deployment attempts (didn't reach production)

Calculation

Change Failure Rate = (Failed Deployments ÷ Total Deployments) × 100

Example:
- Total deployments: 100
- Failed deployments: 12
- CFR: 12%

Time Window

Define "failure" with a time window:

  • Immediate: Issues within 1 hour (most strict)
  • Same day: Issues within 24 hours (recommended)
  • Weekly: Issues within 7 days (too lenient)
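A minimal Python sketch of this calculation, applying the recommended 24-hour attribution window. The CSV files and column names (deployment_id, deployed_at, started_at) are assumptions for the example, not a required schema.

# Count a deployment as failed only if an incident starts within 24 hours of it
import pandas as pd

WINDOW = pd.Timedelta(hours=24)

deployments = pd.read_csv('deployments.csv', parse_dates=['deployed_at'])  # deployment_id, deployed_at
incidents = pd.read_csv('incidents.csv', parse_dates=['started_at'])       # incident_id, deployment_id, started_at

# Attach each incident to the deployment it references
merged = incidents.merge(deployments, on='deployment_id', how='inner')

# Keep only incidents that started within the window after the deployment
in_window = merged[(merged['started_at'] >= merged['deployed_at']) &
                   (merged['started_at'] <= merged['deployed_at'] + WINDOW)]

failed = in_window['deployment_id'].nunique()
total = len(deployments)
print(f"Change Failure Rate (24h window): {failed / total * 100:.1f}%")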

Recommended Visualizations

1. Line Chart (Trend Over Time)

Best for: Tracking improvement initiatives

Y-axis: Change Failure Rate (%)
X-axis: Time (weeks)
Target line: < 15% (Elite threshold)
Annotation: Mark major process changes

📉 Change Failure Rate Improvement

Sample data showing improvement from 28% to 12% over six months through better testing and progressive rollouts. The green line marks the Elite threshold (< 15%).
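If you build your own dashboards, a minimal matplotlib sketch like the one below reproduces this view; the weekly numbers are placeholder values, not real data.

import matplotlib.pyplot as plt

weeks = list(range(1, 9))
cfr = [28, 26, 23, 20, 18, 16, 14, 12]  # placeholder weekly CFR values (%)

plt.plot(weeks, cfr, marker='o', label='Change Failure Rate')
plt.axhline(y=15, color='green', linestyle='--', label='Elite threshold (15%)')
plt.xlabel('Week')
plt.ylabel('Change Failure Rate (%)')
plt.title('Change Failure Rate Improvement')
plt.legend()
plt.show()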

2. Gauge (Current State)

Best for: Executive dashboards

Gauge ranges:
- Elite: 0-15% (green)
- High: 16-30% (blue)
- Medium: 31-45% (yellow)
- Low: 46-100% (red)
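A small helper that maps a CFR value onto these bands (thresholds taken from the ranges above) keeps dashboards and reports consistent:

def cfr_band(cfr: float) -> str:
    """Map a Change Failure Rate (%) to its DORA performance band."""
    if cfr <= 15:
        return 'Elite'
    if cfr <= 30:
        return 'High'
    if cfr <= 45:
        return 'Medium'
    return 'Low'

print(cfr_band(12))  # Elite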

🎯 Current Performance

Gauge showing the current Change Failure Rate at 12%, in the Elite band (0-100% scale).

Current CFR of 12% places the team in the Elite category. Continue monitoring and addressing root causes to maintain this level.

3. Stacked Bar Chart (Root Cause)

Best for: Identifying failure patterns

Y-axis: Number of failures
X-axis: Time periods
Stack colors: Failure types (config, code, infrastructure, etc.)

🔍 Failure Root Causes

Configuration errors are the leading cause (42%). Focus on configuration validation and testing to reduce failures further.

4. Scatter Plot (Speed vs. Quality)

Best for: Balancing velocity and stability

X-axis: Deployment Frequency
Y-axis: Change Failure Rate
Quadrants:
- Top-right: Fast but unstable ⚠️
- Bottom-right: Fast and stable ✅
- Top-left: Slow and unstable ❌
- Bottom-left: Slow but stable 🐢

Target Ranges (DORA Benchmarks)

| Performance Level | Change Failure Rate |
|-------------------|---------------------|
| Elite             | 0-15%               |
| High              | 16-30%              |
| Medium            | 31-45%              |
| Low               | More than 45%       |

Context-Specific Targets

By Industry:

  • FinTech: < 10% (high stakes)
  • SaaS: < 15% (standard)
  • Internal tools: < 25% (more tolerance)
  • Experimental products: < 30% (learning mode)

By Service Type:

  • Critical path: < 5%
  • Core features: < 15%
  • New features: < 25%
  • Experimental: < 35%

How to Improve

1. Strengthen Testing

Pre-Deployment:

Test Pyramid:
┌─────────────┐
│   Manual    │ (5% - Critical paths only)
├─────────────┤
│  E2E Tests  │ (15% - Happy paths)
├─────────────┤
│Integration  │ (30% - Key workflows)
├─────────────┤
│ Unit Tests  │ (50% - Business logic)
└─────────────┘

Strategies:

  • Increase test coverage (aim for 80%+)
  • Add integration tests for critical paths
  • Implement contract testing for APIs
  • Chaos engineering (controlled failures)
  • Load testing before major releases
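As an example of covering a critical path end to end, a smoke check that gates promotion can be as small as the sketch below; the base URL and endpoints are placeholders for your own critical paths.

# smoke_test.py -- run with pytest against staging or the canary before promoting
import os
import requests

BASE_URL = os.environ.get('SMOKE_BASE_URL', 'https://staging.example.com')

def test_health_endpoint_is_up():
    resp = requests.get(f'{BASE_URL}/health', timeout=5)
    assert resp.status_code == 200

def test_critical_path_returns_data():
    resp = requests.get(f'{BASE_URL}/api/orders', timeout=5)
    assert resp.status_code == 200
    assert resp.json() is not None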

2. Improve Code Review

Review Checklist:

  • [ ] Tests cover new code
  • [ ] Error handling implemented
  • [ ] Database migrations are reversible
  • [ ] Feature flags for risky changes
  • [ ] Rollback plan documented
  • [ ] Monitoring and alerts configured

Process:

  • Require 2 reviewers for infrastructure changes
  • Dedicated security review for auth/payment code
  • Architecture review for major changes

3. Progressive Rollouts

Deployment Strategy:

1. Deploy to internal (5 minutes)
2. Deploy to canary (10% of users, 30 minutes)
3. Deploy to 50% of users (1 hour)
4. Deploy to 100% of users

Rollback Triggers:

  • Error rate > 1%
  • P95 latency > 2x baseline
  • Any critical error
  • Manual trigger
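The sketch below shows the shape of this logic in Python. The traffic, metrics, and rollback hooks are stubs standing in for your load balancer and monitoring tooling, not a specific vendor API.

import time

# Stub hooks -- replace with calls to your load balancer, metrics backend, and deploy tooling
def set_traffic_percentage(version: str, percent: int) -> None:
    print(f"Routing {percent}% of traffic to {version}")

def error_rate() -> float:
    return 0.002  # stub: fetch the current error rate from monitoring

def p95_latency_ms() -> float:
    return 180.0  # stub: fetch the current P95 latency from monitoring

def rollback(version: str) -> None:
    print(f"Rolling back {version}")

# (traffic %, soak time in seconds): internal, canary, 50%, 100%
STAGES = [(5, 5 * 60), (10, 30 * 60), (50, 60 * 60), (100, 0)]

def progressive_rollout(version: str) -> bool:
    baseline_p95 = p95_latency_ms()
    for percent, soak in STAGES:
        set_traffic_percentage(version, percent)
        time.sleep(soak)
        # Rollback triggers: error rate > 1% or P95 latency > 2x baseline
        if error_rate() > 0.01 or p95_latency_ms() > 2 * baseline_p95:
            rollback(version)
            return False
    return True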

4. Better Observability

Before Deployment:

  • Ensure all new code has logging
  • Add metrics for key operations
  • Create dashboards for new features
  • Set up alerts with thresholds
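The "add metrics for key operations" step can be a few lines with a client library such as prometheus_client (one option among many); the metric and function names here are illustrative.

from prometheus_client import Counter, Histogram, start_http_server

# Example metrics for one key operation -- names are illustrative
checkout_total = Counter('checkout_requests_total', 'Checkout requests', ['status'])
checkout_latency = Histogram('checkout_latency_seconds', 'Checkout request latency')

@checkout_latency.time()
def handle_checkout(order):
    try:
        # ... business logic ...
        checkout_total.labels(status='success').inc()
    except Exception:
        checkout_total.labels(status='error').inc()
        raise

if __name__ == '__main__':
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape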

After Deployment:

  • Monitor error rates for 30 minutes
  • Watch key business metrics
  • Check logs for unexpected patterns
  • Verify alerts are working

5. Learn from Failures

Post-Incident Process:

  1. Immediate: Rollback or hotfix
  2. Same day: Blameless post-mortem
  3. Within 3 days: Write-up with action items
  4. Within 2 weeks: Implement preventive measures

Post-Mortem Template:

  • What happened?
  • Why did it happen?
  • Why didn't we catch it?
  • What are we changing?
  • How will we verify the fix?

Common Pitfalls

❌ Defining "Failure" Inconsistently

Problem: Teams have different definitions of what counts as a failure.
Solution: Document clear criteria and automate detection.

❌ Fear of Deployment

Problem: A high CFR leads to less frequent deployments.
Solution: Fix the root cause; don't slow down deployments.

❌ Ignoring Root Causes

Problem: Only tracking the number, not why failures happen.
Solution: Categorize failures (code, config, infra) and track each category separately.

❌ Blaming Individuals

Problem: Using CFR to punish developers.
Solution: Build a blameless culture and focus on process improvements.

❌ Perfect is the Enemy of Good

Problem: Aiming for a 0% failure rate.
Solution: Accept that some failures are normal and focus on fast recovery.

Implementation Guide

Week 1: Define & Track

-- Create failures tracking table
CREATE TABLE deployment_failures (
  id SERIAL PRIMARY KEY,
  deployment_id VARCHAR(255),
  deployed_at TIMESTAMP,
  failed_at TIMESTAMP,
  failure_type VARCHAR(50),  -- code, config, infra, etc.
  severity VARCHAR(20),       -- critical, high, medium, low
  resolved_at TIMESTAMP,
  rollback_required BOOLEAN,
  root_cause TEXT
);
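A hedged sketch of how a deploy pipeline or incident bot might record a row into this table, assuming psycopg2 and a DATABASE_URL environment variable:

import os
import psycopg2

def record_failure(deployment_id: str, failure_type: str, severity: str,
                   rollback_required: bool, root_cause: str) -> None:
    """Insert one failure record; call this from incident tooling or the deploy script."""
    conn = psycopg2.connect(os.environ['DATABASE_URL'])
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO deployment_failures
              (deployment_id, failed_at, failure_type, severity, rollback_required, root_cause)
            VALUES (%s, now(), %s, %s, %s, %s)
            """,
            (deployment_id, failure_type, severity, rollback_required, root_cause),
        )
    conn.close()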

Week 2: Baseline

  • Track all deployments for 2 weeks
  • Document every failure with root cause
  • Calculate baseline CFR
  • Identify patterns (time of day, specific services)

Week 3: Quick Wins

  • Add smoke tests for top 3 failure types
  • Implement automated rollback for obvious failures
  • Create deployment checklist

Week 4: Long-Term Improvements

  • Start tracking CFR in team dashboards
  • Include CFR in sprint retrospectives
  • Set team-specific improvement goals

Dashboard Example

Executive View

┌──────────────────────────────────────────┐
│ Change Failure Rate: 12%                 │
│ ████████████░░░░░░░░░░░░░ Elite          │
│                                          │
│ This Month: 12 failures / 100 deploys   │
│ Last Month: 18 failures / 95 deploys    │
│ Trend: ↓ 33% improvement                │
└──────────────────────────────────────────┘

Team View - Failure Analysis

Failure Type       Count   % of Total   Avg MTTR
────────────────────────────────────────────────
Configuration      5       42%          45 min
Code bugs          3       25%          2.5 hours
Database migration 2       17%          1 hour
Infrastructure     1       8%           30 min
Dependency issue   1       8%           4 hours
────────────────────────────────────────────────
Total              12      100%         1.8 hours
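This breakdown is straightforward to generate from the failures table defined in the implementation guide; a pandas sketch (column names follow that schema, exported to CSV) could look like:

import pandas as pd

failures = pd.read_csv('deployment_failures.csv', parse_dates=['failed_at', 'resolved_at'])

# Time to restore per failure, in minutes
failures['mttr_min'] = (failures['resolved_at'] - failures['failed_at']).dt.total_seconds() / 60

summary = failures.groupby('failure_type').agg(
    count=('failure_type', 'size'),
    avg_mttr_min=('mttr_min', 'mean'),
)
summary['pct_of_total'] = (summary['count'] / summary['count'].sum() * 100).round(0)

print(summary.sort_values('count', ascending=False))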

Related Metrics

Track these metrics together for a complete picture:

  • Deployment Frequency: Are you deploying often despite failures?
  • MTTR: How quickly do you recover from failures?
  • Lead Time: Are you rushing changes?
  • Code Coverage: Is insufficient testing the cause?
  • Incident Count: Total production incidents (not just deployment-related)

Tools & Integrations

Incident Tracking

  • PagerDuty: Incident management
  • Opsgenie: Alert and on-call management
  • FireHydrant: Incident response
  • incident.io: Modern incident management

Monitoring & Alerts

  • DataDog: Application monitoring
  • New Relic: Performance monitoring
  • Sentry: Error tracking
  • Prometheus + Grafana: Self-hosted monitoring

Deployment Tracking

  • LaunchDarkly: Feature flags + deployment tracking
  • Split: Feature delivery platform
  • Sleuth: DORA metrics platform

DIY Approach

# Track CFR from deployment and incident logs
import pandas as pd

deployments = pd.read_csv('deployments.csv')  # deployment_id, timestamp
incidents = pd.read_csv('incidents.csv')      # incident_id, deployment_id

# A deployment counts as failed if at least one incident references it
failed_deployments = deployments['deployment_id'].isin(incidents['deployment_id']).sum()
total_deployments = len(deployments)
cfr = (failed_deployments / total_deployments) * 100 if total_deployments else 0.0

print(f"Change Failure Rate: {cfr:.1f}%")

Questions to Ask

For Leadership

  • Is our CFR trending in the right direction?
  • Are certain teams or services outliers?
  • Do we have adequate testing and monitoring?
  • Are we learning from failures?

For Teams

  • What are our most common failure types?
  • Why didn't our tests catch these issues?
  • Do we have the monitoring we need?
  • Are we deploying at risky times (Friday afternoon)?

For Individuals

  • Do I feel confident deploying my code?
  • What would make me more confident?
  • Have I seen my code fail in production?
  • What did I learn from those failures?

Success Stories

E-commerce Platform

  • Before: 35% CFR, frequent production incidents
  • After: 8% CFR, stable deployments
  • Changes:
    • Implemented canary deployments
    • Added comprehensive integration tests
    • Required database migration testing
    • Deployed during business hours only
  • Impact: 77% reduction in failures, customer satisfaction up 25%

SaaS Startup

  • Before: 42% CFR, team afraid to deploy
  • After: 11% CFR, multiple daily deployments
  • Changes:
    • Blameless post-mortems for every failure
    • Automated rollback based on error rate
    • Increased test coverage from 45% to 85%
    • Feature flags for all risky changes
  • Impact: 74% reduction in CFR, deployment frequency up 10x

Advanced Topics

CFR by Change Size

Small changes (< 100 lines): 5% CFR
Medium changes (100-500 lines): 15% CFR
Large changes (> 500 lines): 40% CFR

Insight: Break large changes into smaller pieces
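To reproduce this segmentation for your own deployments, a pandas sketch along these lines works, assuming each deployment record carries a lines_changed count and a failed flag (both are assumptions for the example):

import pandas as pd

deployments = pd.read_csv('deployments.csv')  # deployment_id, lines_changed, failed (bool)

bins = [0, 100, 500, float('inf')]
labels = ['small (<100 lines)', 'medium (100-500 lines)', 'large (>500 lines)']
deployments['size_bucket'] = pd.cut(deployments['lines_changed'], bins=bins, labels=labels)

cfr_by_size = deployments.groupby('size_bucket', observed=True)['failed'].mean() * 100
print(cfr_by_size.round(1))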

CFR by Time of Day

Business hours (9am-5pm): 18% CFR
Evening (5pm-11pm): 12% CFR
Night (11pm-9am): 35% CFR

Insight: Avoid late-night deployments; tired engineers make mistakes

CFR vs. Test Coverage

Correlation: -0.7 (strong negative)
< 50% coverage: 45% CFR
50-70% coverage: 25% CFR
70-90% coverage: 12% CFR
> 90% coverage: 6% CFR

Insight: Invest in test coverage

Balancing Speed and Stability

The goal is NOT to minimize CFR at all costs. The goal is to find the right balance:

Best Practices:
✅ Deploy frequently (velocity)
✅ Fail occasionally (learning)
✅ Recover quickly (resilience)
✅ Learn from failures (improvement)

Anti-Patterns:
❌ Deploy slowly to avoid failures
❌ Blame individuals for failures
❌ Hide or ignore failures
❌ Sacrifice quality for speed targets

Conclusion

Change Failure Rate is your quality check on deployment velocity. A CFR of 10-15% is healthy—it means you're moving fast while maintaining stability. Focus on fast detection and recovery rather than eliminating all failures. Implement progressive rollouts, improve testing, learn from every failure, and maintain a blameless culture. The best teams deploy often AND maintain low failure rates through automation, observability, and continuous improvement.