Change Failure Rate
Track the percentage of deployments that result in failures, rollbacks, or hotfixes. Essential for balancing speed with stability.
Overview
Change Failure Rate measures the percentage of deployments that cause a failure in production, requiring remediation (rollback, hotfix, incident response). It's a critical DORA metric that balances velocity with stability.
Why It Matters
- Quality indicator: Reveals whether you're shipping faster than your quality practices can support
- Risk management: A high CFR translates directly into more production incidents
- Customer trust: Frequent failures erode confidence
- Cost control: Failed deployments are expensive
- Team morale: Constant firefighting burns out teams
- Process health: Reveals gaps in testing or review
How to Measure
What Counts as a Failure?
Include:
- ✅ Production incidents caused by deployment
- ✅ Rollbacks to previous version
- ✅ Hotfixes deployed within 24 hours
- ✅ Degraded performance requiring intervention
- ✅ Security vulnerabilities requiring immediate patch
Exclude:
- ❌ Planned maintenance
- ❌ Expected behavior changes
- ❌ Non-production environment issues
- ❌ Issues detected before customer impact
- ❌ Failed deployment attempts (didn't reach production)
Calculation
Change Failure Rate = (Failed Deployments ÷ Total Deployments) × 100
Example:
- Total deployments: 100
- Failed deployments: 12
- CFR: 12%
Time Window
Define "failure" with a time window:
- Immediate: Issues within 1 hour (most strict)
- Same day: Issues within 24 hours (recommended)
- Weekly: Issues within 7 days (too lenient)
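Putting the failure definition and the time window together, here is a minimal pandas sketch of the measurement; the CSV files and column names are illustrative placeholders for however you export deployments and incidents:

```python
import pandas as pd

# Illustrative exports: one row per deployment and one row per deployment-caused incident
deployments = pd.read_csv("deployments.csv", parse_dates=["deployed_at"])  # deployment_id, deployed_at
incidents = pd.read_csv("incidents.csv", parse_dates=["detected_at"])      # incident_id, deployment_id, detected_at

WINDOW = pd.Timedelta(hours=24)  # "same day" failure window

# Attribute each incident to its deployment, then keep only incidents inside the window
joined = incidents.merge(deployments, on="deployment_id", how="inner")
in_window = joined[
    (joined["detected_at"] >= joined["deployed_at"])
    & (joined["detected_at"] - joined["deployed_at"] <= WINDOW)
]

failed = in_window["deployment_id"].nunique()
cfr = failed / deployments["deployment_id"].nunique() * 100
print(f"Change Failure Rate (24h window): {cfr:.1f}%")
```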
Recommended Visualizations
1. Line Chart (Trend Over Time)
Best for: Tracking improvement initiatives
Y-axis: Change Failure Rate (%)
X-axis: Time (weeks)
Target line: < 15% (Elite threshold)
Annotation: Mark major process changes
📉 Change Failure Rate Improvement
Sample data showing improvement from 28% to 12% over 6 months through better testing and progressive rollouts. The green line marks the Elite threshold (< 15%).
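A matplotlib sketch of this chart; the data series and the annotation are placeholders:

```python
import matplotlib.pyplot as plt

weeks = list(range(1, 27))                     # roughly 6 months of weekly data points
cfr = [28 - 16 * (w - 1) / 25 for w in weeks]  # placeholder trend: 28% down to 12%

plt.figure(figsize=(8, 4))
plt.plot(weeks, cfr, marker="o", label="Change Failure Rate")
plt.axhline(15, color="green", linestyle="--", label="Elite threshold (15%)")
plt.annotate("Process change (e.g. progressive rollouts)", xy=(13, cfr[12]),
             xytext=(14, 24), arrowprops={"arrowstyle": "->"})
plt.xlabel("Week")
plt.ylabel("Change Failure Rate (%)")
plt.title("Change Failure Rate Improvement")
plt.legend()
plt.tight_layout()
plt.show()
```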
2. Gauge (Current State)
Best for: Executive dashboards
Gauge ranges:
- Elite: 0-15% (green)
- High: 16-30% (blue)
- Medium: 31-45% (yellow)
- Low: 46-100% (red)
🎯 Current Performance
Change Failure Rate
A current CFR of 12% places the team in the Elite category. Continue monitoring and addressing root causes to maintain this level.
3. Stacked Bar Chart (Root Cause)
Best for: Identifying failure patterns
Y-axis: Number of failures
X-axis: Time periods
Stack colors: Failure types (config, code, infrastructure, etc.)
🔍 Failure Root Causes
Configuration errors are the leading cause (42%). Focus on configuration validation and testing to reduce failures further.
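A sketch of the stacked bar using pandas plotting; the monthly counts and category names are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder monthly failure counts by root-cause category
data = pd.DataFrame(
    {"Configuration": [5, 4, 5], "Code": [4, 3, 3], "Infrastructure": [2, 1, 1], "Dependency": [1, 1, 1]},
    index=["Jan", "Feb", "Mar"],
)

data.plot(kind="bar", stacked=True, figsize=(7, 4))
plt.xlabel("Month")
plt.ylabel("Number of failures")
plt.title("Failure Root Causes")
plt.tight_layout()
plt.show()
```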
4. Scatter Plot (Speed vs. Quality)
Best for: Balancing velocity and stability
X-axis: Deployment Frequency
Y-axis: Change Failure Rate
Quadrants:
- Top-right: Fast but unstable ⚠️
- Bottom-right: Fast and stable ✅
- Top-left: Slow and unstable ❌
- Bottom-left: Slow but stable 🐢
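A matplotlib sketch of the quadrant view; the per-team numbers and the threshold lines are placeholders you would replace with your own velocity and stability targets:

```python
import matplotlib.pyplot as plt

# Placeholder per-team averages: (deployments per week, CFR %)
teams = {
    "Checkout": (12, 9),
    "Search": (4, 22),
    "Billing": (1, 35),
    "Platform": (15, 13),
}

fig, ax = plt.subplots(figsize=(6, 5))
for name, (freq, cfr) in teams.items():
    ax.scatter(freq, cfr)
    ax.annotate(name, (freq, cfr), textcoords="offset points", xytext=(5, 5))

ax.axhline(15, linestyle="--", color="gray")  # stability threshold (Elite CFR)
ax.axvline(7, linestyle="--", color="gray")   # velocity threshold (roughly daily deploys)
ax.set_xlabel("Deployment Frequency (per week)")
ax.set_ylabel("Change Failure Rate (%)")
ax.set_title("Speed vs. Quality")
plt.show()
```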
Target Ranges (DORA Benchmarks)
| Performance Level | Change Failure Rate |
|-------------------|---------------------|
| Elite             | 0-15%               |
| High              | 16-30%              |
| Medium            | 31-45%              |
| Low               | More than 45%       |
Context-Specific Targets
By Industry:
- FinTech: < 10% (high stakes)
- SaaS: < 15% (standard)
- Internal tools: < 25% (more tolerance)
- Experimental products: < 30% (learning mode)
By Service Type:
- Critical path: < 5%
- Core features: < 15%
- New features: < 25%
- Experimental: < 35%
How to Improve
1. Strengthen Testing
Pre-Deployment:
Test Pyramid:
┌─────────────┐
│   Manual    │  (5% - Critical paths only)
├─────────────┤
│  E2E Tests  │  (15% - Happy paths)
├─────────────┤
│ Integration │  (30% - Key workflows)
├─────────────┤
│ Unit Tests  │  (50% - Business logic)
└─────────────┘
Strategies:
- Increase test coverage (aim for 80%+)
- Add integration tests for critical paths
- Implement contract testing for APIs
- Chaos engineering (controlled failures)
- Load testing before major releases
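As a concrete example of the upper layers of the pyramid, here is a minimal pytest-style post-deployment smoke check; the base URL and endpoints are placeholders for your own critical paths:

```python
import os

import requests

# Placeholder: point this at the environment you just deployed to
BASE_URL = os.environ.get("SMOKE_TEST_URL", "https://staging.example.com")

def test_health_endpoint_returns_ok():
    # Cheapest possible signal that the new build is up and serving traffic
    resp = requests.get(f"{BASE_URL}/healthz", timeout=5)
    assert resp.status_code == 200

def test_critical_path_login_page_loads():
    # One happy-path check per critical user journey
    resp = requests.get(f"{BASE_URL}/login", timeout=5)
    assert resp.status_code == 200
    assert "login" in resp.text.lower()
```

Run these with pytest as a pipeline step immediately after each rollout stage.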
2. Improve Code Review
Review Checklist:
- [ ] Tests cover new code
- [ ] Error handling implemented
- [ ] Database migrations are reversible
- [ ] Feature flags for risky changes
- [ ] Rollback plan documented
- [ ] Monitoring and alerts configured
Process:
- Require 2 reviewers for infrastructure changes
- Dedicated security review for auth/payment code
- Architecture review for major changes
3. Progressive Rollouts
Deployment Strategy:
1. Deploy to internal users (soak for 5 minutes)
2. Deploy to canary (10% of users, soak for 30 minutes)
3. Deploy to 50% of users (soak for 1 hour)
4. Deploy to 100% of users
Rollback Triggers:
- Error rate > 1%
- P95 latency > 2x baseline
- Any critical error
- Manual trigger
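A sketch of an automated canary watcher that enforces the rollback triggers above; the metric source and the rollback command are placeholders for whatever your monitoring and deployment platform provide:

```python
import subprocess
import time

ERROR_RATE_LIMIT = 0.01      # trigger: error rate > 1%
LATENCY_MULTIPLIER = 2.0     # trigger: P95 latency > 2x baseline

def get_canary_metrics() -> dict:
    """Placeholder: replace with a query to your monitoring system (Prometheus, DataDog, ...)."""
    return {"error_rate": 0.002, "p95_latency_ms": 180.0}

def rollback() -> None:
    """Placeholder: replace with your platform's revert command."""
    subprocess.run(["./deploy.sh", "rollback"], check=True)

def watch_canary(baseline_p95_ms: float, duration_s: int = 1800, interval_s: int = 60) -> bool:
    """Poll canary metrics for the soak period; roll back if any trigger fires."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        metrics = get_canary_metrics()
        if (metrics["error_rate"] > ERROR_RATE_LIMIT
                or metrics["p95_latency_ms"] > LATENCY_MULTIPLIER * baseline_p95_ms):
            rollback()
            return False
        time.sleep(interval_s)
    return True  # canary passed; continue widening the rollout
```

Wire watch_canary() between rollout stages so each expansion only happens after the previous stage passes.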
4. Better Observability
Before Deployment:
- Ensure all new code has logging
- Add metrics for key operations
- Create dashboards for new features
- Set up alerts with thresholds
After Deployment:
- Monitor error rates for 30 minutes
- Watch key business metrics
- Check logs for unexpected patterns
- Verify alerts are working
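One lightweight way to check logs for unexpected patterns is to diff error signatures before and after the deploy. A sketch assuming plain-text logs with a simple ERROR format; the file names and the regex are placeholders:

```python
import re
from collections import Counter

# Placeholder pattern: matches lines like "ERROR PaymentTimeout: ..."
ERROR_RE = re.compile(r"ERROR\s+(?P<signature>[\w.]+):")

def error_signatures(path: str) -> Counter:
    with open(path) as f:
        return Counter(m.group("signature") for line in f if (m := ERROR_RE.search(line)))

before = error_signatures("app-before-deploy.log")  # placeholder file names
after = error_signatures("app-after-deploy.log")

# Error signatures that only appear after the deploy are the most suspicious
new_errors = {sig: n for sig, n in after.items() if sig not in before}
for sig, count in sorted(new_errors.items(), key=lambda kv: -kv[1]):
    print(f"new error signature: {sig} ({count} occurrences)")
```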
5. Learn from Failures
Post-Incident Process:
- Immediate: Rollback or hotfix
- Same day: Blameless post-mortem
- Within 3 days: Write-up with action items
- Within 2 weeks: Implement preventive measures
Post-Mortem Template:
- What happened?
- Why did it happen?
- Why didn't we catch it?
- What are we changing?
- How will we verify the fix?
Common Pitfalls
❌ Defining "Failure" Inconsistently
Problem: Teams have different definitions of what counts.
Solution: Document clear criteria, automate detection.
❌ Fear of Deployment
Problem: High CFR leads to less frequent deployments.
Solution: Fix the root cause, don't slow down deployments.
❌ Ignoring Root Causes
Problem: Only tracking the number, not why failures happen.
Solution: Categorize failures (code, config, infra) and track separately.
❌ Blaming Individuals
Problem: Using CFR to punish developers.
Solution: Blameless culture, focus on process improvements.
❌ Perfect is the Enemy of Good
Problem: Aiming for 0% failure rate.
Solution: Accept that some failures are normal, focus on fast recovery.
Implementation Guide
Week 1: Define & Track
-- Create failures tracking table
CREATE TABLE deployment_failures (
  id SERIAL PRIMARY KEY,
  deployment_id VARCHAR(255),
  deployed_at TIMESTAMP,
  failed_at TIMESTAMP,
  failure_type VARCHAR(50),   -- code, config, infra, etc.
  severity VARCHAR(20),       -- critical, high, medium, low
  resolved_at TIMESTAMP,
  rollback_required BOOLEAN,
  root_cause TEXT
);
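To turn that table into a weekly CFR you also need the total number of deployments. Here is a sketch that assumes a companion deployments(deployment_id, deployed_at) table and a SQLAlchemy connection string, both placeholders:

```python
import pandas as pd
import sqlalchemy

# Placeholder connection string; point it at the database holding the tables above
engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/metrics")

# Assumes a deployments(deployment_id, deployed_at) table exists alongside
# deployment_failures so the denominator (total deployments) is available
query = """
SELECT date_trunc('week', d.deployed_at)  AS week,
       COUNT(DISTINCT d.deployment_id)    AS total_deployments,
       COUNT(DISTINCT f.deployment_id)    AS failed_deployments
FROM deployments d
LEFT JOIN deployment_failures f ON f.deployment_id = d.deployment_id
GROUP BY 1
ORDER BY 1
"""

weekly = pd.read_sql(query, engine)
weekly["cfr_pct"] = 100 * weekly["failed_deployments"] / weekly["total_deployments"]
print(weekly)
```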
Week 2: Baseline
- Track all deployments for 2 weeks
- Document every failure with root cause
- Calculate baseline CFR
- Identify patterns (time of day, specific services)
Week 3: Quick Wins
- Add smoke tests for top 3 failure types
- Implement automated rollback for obvious failures
- Create deployment checklist
Week 4: Long-Term Improvements
- Start tracking CFR in team dashboards
- Include CFR in sprint retrospectives
- Set team-specific improvement goals
Dashboard Example
Executive View
┌──────────────────────────────────────────┐
│  Change Failure Rate: 12%                │
│  ████████████░░░░░░░░░░░░░  Elite        │
│                                          │
│  This Month:  12 failures / 100 deploys  │
│  Last Month:  18 failures / 95 deploys   │
│  Trend:       ↓ 33% improvement          │
└──────────────────────────────────────────┘
Team View - Failure Analysis
Failure Type          Count   % of Total   Avg MTTR
───────────────────────────────────────────────────
Configuration             5          42%   45 min
Code bugs                 3          25%   2.5 hours
Database migration        2          17%   1 hour
Infrastructure            1           8%   30 min
Dependency issue          1           8%   4 hours
───────────────────────────────────────────────────
Total                    12         100%   1.8 hours
Related Metrics
Track these together for complete picture:
- Deployment Frequency: Are you deploying often despite failures?
- MTTR: How quickly do you recover from failures?
- Lead Time: Are you rushing changes?
- Code Coverage: Is insufficient testing the cause?
- Incident Count: Total production incidents (not just deployment-related)
Tools & Integrations
Incident Tracking
- PagerDuty: Incident management
- Opsgenie: Alert and on-call management
- FireHydrant: Incident response
- incident.io: Modern incident management
Monitoring & Alerts
- DataDog: Application monitoring
- New Relic: Performance monitoring
- Sentry: Error tracking
- Prometheus + Grafana: Self-hosted monitoring
Deployment Tracking
- LaunchDarkly: Feature flags + deployment tracking
- Split: Feature delivery platform
- Sleuth: DORA metrics platform
DIY Approach
# Track CFR from deployment and incident logs
import pandas as pd
deployments = pd.read_csv('deployments.csv')  # deployment_id, timestamp
incidents = pd.read_csv('incidents.csv')      # incident_id, deployment_id
# Count failures: only incidents tied to a deployment in the deployment log
failed_deployments = incidents.merge(deployments, on='deployment_id')['deployment_id'].nunique()
total_deployments = deployments['deployment_id'].nunique()
cfr = (failed_deployments / total_deployments) * 100
print(f"Change Failure Rate: {cfr:.1f}%")
Questions to Ask
For Leadership
- Is our CFR trending in the right direction?
- Are certain teams or services outliers?
- Do we have adequate testing and monitoring?
- Are we learning from failures?
For Teams
- What are our most common failure types?
- Why didn't our tests catch these issues?
- Do we have the monitoring we need?
- Are we deploying at risky times (Friday afternoon)?
For Individuals
- Do I feel confident deploying my code?
- What would make me more confident?
- Have I seen my code fail in production?
- What did I learn from those failures?
Success Stories
E-commerce Platform
- Before: 35% CFR, frequent production incidents
- After: 8% CFR, stable deployments
- Changes:
- Implemented canary deployments
- Added comprehensive integration tests
- Required database migration testing
- Deployed during business hours only
- Impact: 77% reduction in failures, customer satisfaction up 25%
SaaS Startup
- Before: 42% CFR, team afraid to deploy
- After: 11% CFR, multiple daily deployments
- Changes:
- Blameless post-mortems for every failure
- Automated rollback based on error rate
- Increased test coverage from 45% to 85%
- Feature flags for all risky changes
- Impact: 74% reduction in CFR, deployment frequency up 10x
Advanced Topics
CFR by Change Size
Small changes (< 100 lines): 5% CFR
Medium changes (100-500 lines): 15% CFR
Large changes (> 500 lines): 40% CFR
Insight: Break large changes into smaller pieces
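A sketch of that segmentation, assuming each deployment record carries a lines-changed count and a boolean failed flag (the column and file names are illustrative):

```python
import pandas as pd

# Illustrative input: one row per deployment with lines_changed and a boolean failed flag
deploys = pd.read_csv("deployments_with_outcomes.csv")

bins = [0, 100, 500, float("inf")]
labels = ["small (<100)", "medium (100-500)", "large (>500)"]
deploys["size_bucket"] = pd.cut(deploys["lines_changed"], bins=bins, labels=labels)

# Mean of the failed flag per bucket is the failure rate for that bucket
cfr_by_size = deploys.groupby("size_bucket", observed=True)["failed"].mean() * 100
print(cfr_by_size.round(1))
```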
CFR by Time of Day
Business hours (9am-5pm): 18% CFR
Evening (5pm-11pm): 12% CFR
Night (11pm-9am): 35% CFR
Insight: Avoid late-night deployments, tired engineers make mistakes
CFR vs. Test Coverage
Correlation: -0.7 (strong negative)
< 50% coverage: 45% CFR
50-70% coverage: 25% CFR
70-90% coverage: 12% CFR
> 90% coverage: 6% CFR
Insight: Invest in test coverage
Balancing Speed and Stability
The goal is NOT to minimize CFR at all costs. The goal is to find the right balance:
Best Practices:
✅ Deploy frequently (velocity)
✅ Fail occasionally (learning)
✅ Recover quickly (resilience)
✅ Learn from failures (improvement)
Anti-Patterns:
❌ Deploy slowly to avoid failures
❌ Blame individuals for failures
❌ Hide or ignore failures
❌ Sacrifice quality for speed targets
Conclusion
Change Failure Rate is your quality check on deployment velocity. A CFR of 10-15% is healthy—it means you're moving fast while maintaining stability. Focus on fast detection and recovery rather than eliminating all failures. Implement progressive rollouts, improve testing, learn from every failure, and maintain a blameless culture. The best teams deploy often AND maintain low failure rates through automation, observability, and continuous improvement.