Change Failure Rate
Track the percentage of deployments that result in failures, rollbacks, or hotfixes. Essential for balancing speed with stability.
Overview
Change Failure Rate measures the percentage of deployments that cause a failure in production, requiring remediation (rollback, hotfix, incident response). It's a critical DORA metric that balances velocity with stability.
Why It Matters
- Quality indicator: Reveals whether you're shipping faster than your quality practices can support
- Risk management: A high CFR translates directly into more production incidents
- Customer trust: Frequent failures erode confidence
- Cost control: Failed deployments are expensive
- Team morale: Constant firefighting burns out teams
- Process health: Reveals gaps in testing or review
How to Measure
What Counts as a Failure?
Include:
- ✅ Production incidents caused by deployment
- ✅ Rollbacks to previous version
- ✅ Hotfixes deployed within 24 hours
- ✅ Degraded performance requiring intervention
- ✅ Security vulnerabilities requiring immediate patch
Exclude:
- ❌ Planned maintenance
- ❌ Expected behavior changes
- ❌ Non-production environment issues
- ❌ Issues detected before customer impact
- ❌ Failed deployment attempts (didn't reach production)
Calculation
Change Failure Rate = (Failed Deployments ÷ Total Deployments) × 100
Example:
- Total deployments: 100
- Failed deployments: 12
- CFR: 12%
Time Window
Define "failure" with a time window:
- Immediate: Issues within 1 hour (most strict)
- Same day: Issues within 24 hours (recommended)
- Weekly: Issues within 7 days (too lenient)
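Putting the failure definition and the time window together, here is a minimal pandas sketch of the measurement; the CSV files and column names are illustrative placeholders for however you export deployments and incidents:

```python
import pandas as pd

# Illustrative exports: one row per deployment and one row per deployment-caused incident
deployments = pd.read_csv("deployments.csv", parse_dates=["deployed_at"])  # deployment_id, deployed_at
incidents = pd.read_csv("incidents.csv", parse_dates=["detected_at"])      # incident_id, deployment_id, detected_at

WINDOW = pd.Timedelta(hours=24)  # "same day" failure window

# Attribute each incident to its deployment, then keep only incidents inside the window
joined = incidents.merge(deployments, on="deployment_id", how="inner")
in_window = joined[
    (joined["detected_at"] >= joined["deployed_at"])
    & (joined["detected_at"] - joined["deployed_at"] <= WINDOW)
]

failed = in_window["deployment_id"].nunique()
cfr = failed / deployments["deployment_id"].nunique() * 100
print(f"Change Failure Rate (24h window): {cfr:.1f}%")
```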
Recommended Visualizations
1. Line Chart (Trend Over Time)
Best for: Tracking improvement initiatives
Y-axis: Change Failure Rate (%)
X-axis: Time (weeks)
Target line: < 15% (Elite threshold)
Annotation: Mark major process changes
📉 Change Failure Rate Improvement
Sample data showing improvement from 28% to 12% over 6 months through better testing and progressive rollouts. The green line marks the Elite threshold (< 15%).
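A matplotlib sketch of this chart; the data series and the annotation are placeholders:

```python
import matplotlib.pyplot as plt

weeks = list(range(1, 27))                     # roughly 6 months of weekly data points
cfr = [28 - 16 * (w - 1) / 25 for w in weeks]  # placeholder trend: 28% down to 12%

plt.figure(figsize=(8, 4))
plt.plot(weeks, cfr, marker="o", label="Change Failure Rate")
plt.axhline(15, color="green", linestyle="--", label="Elite threshold (15%)")
plt.annotate("Process change (e.g. progressive rollouts)", xy=(13, cfr[12]),
             xytext=(14, 24), arrowprops={"arrowstyle": "->"})
plt.xlabel("Week")
plt.ylabel("Change Failure Rate (%)")
plt.title("Change Failure Rate Improvement")
plt.legend()
plt.tight_layout()
plt.show()
```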
2. Gauge (Current State)
Best for: Executive dashboards
Gauge ranges:
- Elite: 0-15% (green)
- High: 16-30% (blue)
- Medium: 31-45% (yellow)
- Low: 46-100% (red)
🎯 Current Performance
Change Failure Rate
A current CFR of 12% places the team in the Elite category. Continue monitoring and addressing root causes to maintain this level.
3. Stacked Bar Chart (Root Cause)
Best for: Identifying failure patterns
Y-axis: Number of failures
X-axis: Time periods
Stack colors: Failure types (config, code, infrastructure, etc.)
🔍 Failure Root Causes
Configuration errors are the leading cause (42%). Focus on configuration validation and testing to reduce failures further.
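A sketch of the stacked bar using pandas plotting; the monthly counts and category names are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder monthly failure counts by root-cause category
data = pd.DataFrame(
    {"Configuration": [5, 4, 5], "Code": [4, 3, 3], "Infrastructure": [2, 1, 1], "Dependency": [1, 1, 1]},
    index=["Jan", "Feb", "Mar"],
)

data.plot(kind="bar", stacked=True, figsize=(7, 4))
plt.xlabel("Month")
plt.ylabel("Number of failures")
plt.title("Failure Root Causes")
plt.tight_layout()
plt.show()
```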
4. Scatter Plot (Speed vs. Quality)
Best for: Balancing velocity and stability
X-axis: Deployment Frequency
Y-axis: Change Failure Rate
Quadrants:
- Top-right: Fast but unstable ⚠️
- Bottom-right: Fast and stable ✅
- Top-left: Slow and unstable ❌
- Bottom-left: Slow but stable 🐢
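A matplotlib sketch of the quadrant view; the per-team numbers and the threshold lines are placeholders you would replace with your own velocity and stability targets:

```python
import matplotlib.pyplot as plt

# Placeholder per-team averages: (deployments per week, CFR %)
teams = {
    "Checkout": (12, 9),
    "Search": (4, 22),
    "Billing": (1, 35),
    "Platform": (15, 13),
}

fig, ax = plt.subplots(figsize=(6, 5))
for name, (freq, cfr) in teams.items():
    ax.scatter(freq, cfr)
    ax.annotate(name, (freq, cfr), textcoords="offset points", xytext=(5, 5))

ax.axhline(15, linestyle="--", color="gray")  # stability threshold (Elite CFR)
ax.axvline(7, linestyle="--", color="gray")   # velocity threshold (roughly daily deploys)
ax.set_xlabel("Deployment Frequency (per week)")
ax.set_ylabel("Change Failure Rate (%)")
ax.set_title("Speed vs. Quality")
plt.show()
```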
Target Ranges (DORA Benchmarks)
| Performance Level | Change Failure Rate |
|-------------------|---------------------|
| Elite             | 0-15%               |
| High              | 16-30%              |
| Medium            | 31-45%              |
| Low               | More than 45%       |
Context-Specific Targets
By Industry:
- FinTech: < 10% (high stakes)
- SaaS: < 15% (standard)
- Internal tools: < 25% (more tolerance)
- Experimental products: < 30% (learning mode)
By Service Type:
- Critical path: < 5%
- Core features: < 15%
- New features: < 25%
- Experimental: < 35%
How to Improve
1. Strengthen Testing
Pre-Deployment:
Test Pyramid:
┌─────────────┐
│   Manual    │  (5% - Critical paths only)
├─────────────┤
│  E2E Tests  │  (15% - Happy paths)
├─────────────┤
│ Integration │  (30% - Key workflows)
├─────────────┤
│ Unit Tests  │  (50% - Business logic)
└─────────────┘
Strategies:
- Increase test coverage (aim for 80%+)
- Add integration tests for critical paths
- Implement contract testing for APIs
- Chaos engineering (controlled failures)
- Load testing before major releases
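As a concrete example of the upper layers of the pyramid, here is a minimal pytest-style post-deployment smoke check; the base URL and endpoints are placeholders for your own critical paths:

```python
import os

import requests

# Placeholder: point this at the environment you just deployed to
BASE_URL = os.environ.get("SMOKE_TEST_URL", "https://staging.example.com")

def test_health_endpoint_returns_ok():
    # Cheapest possible signal that the new build is up and serving traffic
    resp = requests.get(f"{BASE_URL}/healthz", timeout=5)
    assert resp.status_code == 200

def test_critical_path_login_page_loads():
    # One happy-path check per critical user journey
    resp = requests.get(f"{BASE_URL}/login", timeout=5)
    assert resp.status_code == 200
    assert "login" in resp.text.lower()
```

Run these with pytest as a pipeline step immediately after each rollout stage.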
2. Improve Code Review
Review Checklist:
- [ ] Tests cover new code
- [ ] Error handling implemented
- [ ] Database migrations are reversible
- [ ] Feature flags for risky changes
- [ ] Rollback plan documented
- [ ] Monitoring and alerts configured
Process:
- Require 2 reviewers for infrastructure changes
- Dedicated security review for auth/payment code
- Architecture review for major changes
3. Progressive Rollouts
Deployment Strategy:
1. Deploy to internal users (soak for 5 minutes)
2. Deploy to canary (10% of users, soak for 30 minutes)
3. Deploy to 50% of users (soak for 1 hour)
4. Deploy to 100% of users
Rollback Triggers:
- Error rate > 1%
- P95 latency > 2x baseline
- Any critical error
- Manual trigger
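A sketch of an automated canary watcher that enforces the rollback triggers above; the metric source and the rollback command are placeholders for whatever your monitoring and deployment platform provide:

```python
import subprocess
import time

ERROR_RATE_LIMIT = 0.01      # trigger: error rate > 1%
LATENCY_MULTIPLIER = 2.0     # trigger: P95 latency > 2x baseline

def get_canary_metrics() -> dict:
    """Placeholder: replace with a query to your monitoring system (Prometheus, DataDog, ...)."""
    return {"error_rate": 0.002, "p95_latency_ms": 180.0}

def rollback() -> None:
    """Placeholder: replace with your platform's revert command."""
    subprocess.run(["./deploy.sh", "rollback"], check=True)

def watch_canary(baseline_p95_ms: float, duration_s: int = 1800, interval_s: int = 60) -> bool:
    """Poll canary metrics for the soak period; roll back if any trigger fires."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        metrics = get_canary_metrics()
        if (metrics["error_rate"] > ERROR_RATE_LIMIT
                or metrics["p95_latency_ms"] > LATENCY_MULTIPLIER * baseline_p95_ms):
            rollback()
            return False
        time.sleep(interval_s)
    return True  # canary passed; continue widening the rollout
```

Wire watch_canary() between rollout stages so each expansion only happens after the previous stage passes.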
4. Better Observability
Before Deployment:
- Ensure all new code has logging
- Add metrics for key operations
- Create dashboards for new features
- Set up alerts with thresholds
After Deployment:
- Monitor error rates for 30 minutes
- Watch key business metrics
- Check logs for unexpected patterns
- Verify alerts are working
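One lightweight way to check logs for unexpected patterns is to diff error signatures before and after the deploy. A sketch assuming plain-text logs with a simple ERROR format; the file names and the regex are placeholders:

```python
import re
from collections import Counter

# Placeholder pattern: matches lines like "ERROR PaymentTimeout: ..."
ERROR_RE = re.compile(r"ERROR\s+(?P<signature>[\w.]+):")

def error_signatures(path: str) -> Counter:
    with open(path) as f:
        return Counter(m.group("signature") for line in f if (m := ERROR_RE.search(line)))

before = error_signatures("app-before-deploy.log")  # placeholder file names
after = error_signatures("app-after-deploy.log")

# Error signatures that only appear after the deploy are the most suspicious
new_errors = {sig: n for sig, n in after.items() if sig not in before}
for sig, count in sorted(new_errors.items(), key=lambda kv: -kv[1]):
    print(f"new error signature: {sig} ({count} occurrences)")
```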
5. Learn from Failures
Post-Incident Process:
- Immediate: Rollback or hotfix
- Same day: Blameless post-mortem
- Within 3 days: Write-up with action items
- Within 2 weeks: Implement preventive measures
Post-Mortem Template:
- What happened?
- Why did it happen?
- Why didn't we catch it?
- What are we changing?
- How will we verify the fix?
Common Pitfalls
❌ Defining "Failure" Inconsistently
Problem: Teams have different definitions of what counts.
Solution: Document clear criteria, automate detection.
❌ Fear of Deployment
Problem: High CFR leads to less frequent deployments.
Solution: Fix the root cause, don't slow down deployments.
❌ Ignoring Root Causes
Problem: Only tracking the number, not why failures happen.
Solution: Categorize failures (code, config, infra) and track separately.
❌ Blaming Individuals
Problem: Using CFR to punish developers.
Solution: Blameless culture, focus on process improvements.
❌ Perfect is the Enemy of Good
Problem: Aiming for 0% failure rate.
Solution: Accept that some failures are normal, focus on fast recovery.
Implementation Guide
Week 1: Define & Track
-- Create failures tracking table
CREATE TABLE deployment_failures (
  id SERIAL PRIMARY KEY,
  deployment_id VARCHAR(255),
  deployed_at TIMESTAMP,
  failed_at TIMESTAMP,
  failure_type VARCHAR(50),   -- code, config, infra, etc.
  severity VARCHAR(20),       -- critical, high, medium, low
  resolved_at TIMESTAMP,
  rollback_required BOOLEAN,
  root_cause TEXT
);
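To turn that table into a weekly CFR you also need the total number of deployments. Here is a sketch that assumes a companion deployments(deployment_id, deployed_at) table and a SQLAlchemy connection string, both placeholders:

```python
import pandas as pd
import sqlalchemy

# Placeholder connection string; point it at the database holding the tables above
engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/metrics")

# Assumes a deployments(deployment_id, deployed_at) table exists alongside
# deployment_failures so the denominator (total deployments) is available
query = """
SELECT date_trunc('week', d.deployed_at)  AS week,
       COUNT(DISTINCT d.deployment_id)    AS total_deployments,
       COUNT(DISTINCT f.deployment_id)    AS failed_deployments
FROM deployments d
LEFT JOIN deployment_failures f ON f.deployment_id = d.deployment_id
GROUP BY 1
ORDER BY 1
"""

weekly = pd.read_sql(query, engine)
weekly["cfr_pct"] = 100 * weekly["failed_deployments"] / weekly["total_deployments"]
print(weekly)
```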
Week 2: Baseline
- Track all deployments for 2 weeks
- Document every failure with root cause
- Calculate baseline CFR
- Identify patterns (time of day, specific services)
Week 3: Quick Wins
- Add smoke tests for top 3 failure types
- Implement automated rollback for obvious failures
- Create deployment checklist
Week 4: Long-Term Improvements
- Start tracking CFR in team dashboards
- Include CFR in sprint retrospectives
- Set team-specific improvement goals
Dashboard Example
Executive View
┌──────────────────────────────────────────┐
│  Change Failure Rate: 12%                │
│  ████████████░░░░░░░░░░░░░  Elite        │
│                                          │
│  This Month:  12 failures / 100 deploys  │
│  Last Month:  18 failures / 95 deploys   │
│  Trend:       ↓ 33% improvement          │
└──────────────────────────────────────────┘
Team View - Failure Analysis
Failure Type          Count   % of Total   Avg MTTR
───────────────────────────────────────────────────
Configuration             5          42%   45 min
Code bugs                 3          25%   2.5 hours
Database migration        2          17%   1 hour
Infrastructure            1           8%   30 min
Dependency issue          1           8%   4 hours
───────────────────────────────────────────────────
Total                    12         100%   1.8 hours
Related Metrics
Track these together for complete picture:
- Deployment Frequency: Are you deploying often despite failures?
- MTTR: How quickly do you recover from failures?
- Lead Time: Are you rushing changes?
- Code Coverage: Is insufficient testing the cause?
- Incident Count: Total production incidents (not just deployment-related)
Tools & Integrations
Incident Tracking
- PagerDuty: Incident management
- Opsgenie: Alert and on-call management
- FireHydrant: Incident response
- incident.io: Modern incident management
Monitoring & Alerts
- DataDog: Application monitoring
- New Relic: Performance monitoring
- Sentry: Error tracking
- Prometheus + Grafana: Self-hosted monitoring
Deployment Tracking
- LaunchDarkly: Feature flags + deployment tracking
- Split: Feature delivery platform
- Sleuth: DORA metrics platform
DIY Approach
# Track CFR from deployment and incident logs
import pandas as pd
deployments = pd.read_csv('deployments.csv')  # deployment_id, timestamp
incidents = pd.read_csv('incidents.csv')      # incident_id, deployment_id
# Count failures: only incidents tied to a deployment in the deployment log
failed_deployments = incidents.merge(deployments, on='deployment_id')['deployment_id'].nunique()
total_deployments = deployments['deployment_id'].nunique()
cfr = (failed_deployments / total_deployments) * 100
print(f"Change Failure Rate: {cfr:.1f}%")
Questions to Ask
For Leadership
- Is our CFR trending in the right direction?
- Are certain teams or services outliers?
- Do we have adequate testing and monitoring?
- Are we learning from failures?
For Teams
- What are our most common failure types?
- Why didn't our tests catch these issues?
- Do we have the monitoring we need?
- Are we deploying at risky times (Friday afternoon)?
For Individuals
- Do I feel confident deploying my code?
- What would make me more confident?
- Have I seen my code fail in production?
- What did I learn from those failures?
Success Stories
E-commerce Platform
- Before: 35% CFR, frequent production incidents
- After: 8% CFR, stable deployments
- Changes:
- Implemented canary deployments
- Added comprehensive integration tests
- Required database migration testing
- Deployed during business hours only
- Impact: 77% reduction in failures, customer satisfaction up 25%
SaaS Startup
- Before: 42% CFR, team afraid to deploy
- After: 11% CFR, multiple daily deployments
- Changes:
- Blameless post-mortems for every failure
- Automated rollback based on error rate
- Increased test coverage from 45% to 85%
- Feature flags for all risky changes
- Impact: 74% reduction in CFR, deployment frequency up 10x
Advanced Topics
CFR by Change Size
Small changes (< 100 lines): 5% CFR
Medium changes (100-500 lines): 15% CFR
Large changes (> 500 lines): 40% CFR
Insight: Break large changes into smaller pieces
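A sketch of that segmentation, assuming each deployment record carries a lines-changed count and a boolean failed flag (the column and file names are illustrative):

```python
import pandas as pd

# Illustrative input: one row per deployment with lines_changed and a boolean failed flag
deploys = pd.read_csv("deployments_with_outcomes.csv")

bins = [0, 100, 500, float("inf")]
labels = ["small (<100)", "medium (100-500)", "large (>500)"]
deploys["size_bucket"] = pd.cut(deploys["lines_changed"], bins=bins, labels=labels)

# Mean of the failed flag per bucket is the failure rate for that bucket
cfr_by_size = deploys.groupby("size_bucket", observed=True)["failed"].mean() * 100
print(cfr_by_size.round(1))
```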
CFR by Time of Day
Business hours (9am-5pm): 18% CFR
Evening (5pm-11pm): 12% CFR
Night (11pm-9am): 35% CFR
Insight: Avoid late-night deployments, tired engineers make mistakes
CFR vs. Test Coverage
Correlation: -0.7 (strong negative)
< 50% coverage: 45% CFR
50-70% coverage: 25% CFR
70-90% coverage: 12% CFR
> 90% coverage: 6% CFR
Insight: Invest in test coverage
Balancing Speed and Stability
The goal is NOT to minimize CFR at all costs. The goal is to find the right balance:
Best Practices:
✅ Deploy frequently (velocity)
✅ Fail occasionally (learning)
✅ Recover quickly (resilience)
✅ Learn from failures (improvement)
Anti-Patterns:
❌ Deploy slowly to avoid failures
❌ Blame individuals for failures
❌ Hide or ignore failures
❌ Sacrifice quality for speed targets
Conclusion
Change Failure Rate is your quality check on deployment velocity. A CFR of 10-15% is healthy—it means you're moving fast while maintaining stability. Focus on fast detection and recovery rather than eliminating all failures. Implement progressive rollouts, improve testing, learn from every failure, and maintain a blameless culture. The best teams deploy often AND maintain low failure rates through automation, observability, and continuous improvement.