Error Rate
Track the percentage of failed requests. Critical for reliability, user experience, and incident detection.
Overview
Error Rate measures the percentage of requests that result in errors (4xx or 5xx HTTP status codes, exceptions, or application-level errors). It's a critical reliability metric that indicates system health, impacts user experience, and serves as an early warning system for incidents.
Why It Matters
- User experience: Errors frustrate users and cause churn
- Revenue impact: Errors in checkout/payment flows lose money
- Early warning: Spikes indicate incidents before they escalate
- Quality indicator: Reflects code quality and testing effectiveness
- SLA compliance: Error rate thresholds in service agreements
- Debugging: Helps identify and prioritize issues
Types of Errors
By HTTP Status Code
Client Errors (4xx):
- 400 Bad Request: Invalid request from client
- 401 Unauthorized: Authentication required
- 403 Forbidden: Insufficient permissions
- 404 Not Found: Resource doesn't exist
- 429 Too Many Requests: Rate limit exceeded
Server Errors (5xx):
- 500 Internal Server Error: Generic server error
- 502 Bad Gateway: Upstream service failure
- 503 Service Unavailable: Server overloaded or in maintenance
- 504 Gateway Timeout: Upstream service timeout
By Impact
User-Facing Errors:
- Failed page loads
- Failed form submissions
- Broken features
- Payment failures
Background Errors:
- Cron job failures
- Queue processing errors
- Webhook delivery failures
- Scheduled task errors
How to Measure
Calculation
Error Rate = (Error Count ÷ Total Requests) × 100
Example (1-hour window):
Total requests: 1,000,000
Failed requests: 1,250
Error Rate: 0.125%
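The calculation above as a minimal Python helper (names are illustrative):

```python
def error_rate(error_count: int, total_requests: int) -> float:
    """Return the error rate as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # avoid division by zero on idle services
    return error_count / total_requests * 100

print(error_rate(1_250, 1_000_000))  # 0.125
```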
What Counts as an Error?
Include:
- 5xx server errors (all)
- 4xx client errors (often excluding expected codes such as 404)
- Unhandled exceptions
- Timeouts
- Failed database queries
Exclude (sometimes):
- 404 Not Found (may be user error, not system error)
- 401 Unauthorized (expected for unauthenticated requests)
- 429 Rate Limited (intentional throttling)
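The include/exclude rules above can be captured in a small classifier; the excluded set is a per-service choice, not a standard:

```python
# Codes treated as expected client behavior rather than system errors.
EXCLUDED_CODES = {401, 404, 429}

def counts_as_error(status_code: int, excluded=EXCLUDED_CODES) -> bool:
    """Classify a response status per the include/exclude rules above."""
    if status_code in excluded:
        return False
    return status_code >= 400

print(counts_as_error(500))  # True: server error
print(counts_as_error(404))  # False: likely user error, excluded
```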
Segmentation
Track error rate by:
- Endpoint: Which APIs are failing?
- Status code: What types of errors?
- Geography: Regional issues?
- Client: Mobile vs web vs API
- User: Specific accounts having issues?
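A sketch of segmentation: tag each request record with its dimensions, then aggregate errors with a `Counter`. The log records here are hypothetical:

```python
from collections import Counter

# Hypothetical request log records tagged with segmentation dimensions.
requests_log = [
    {"endpoint": "/api/reports", "status": 500, "client": "web"},
    {"endpoint": "/api/search", "status": 503, "client": "mobile"},
    {"endpoint": "/api/reports", "status": 500, "client": "api"},
    {"endpoint": "/api/users", "status": 200, "client": "web"},
]

errors = [r for r in requests_log if r["status"] >= 500]
by_endpoint = Counter(r["endpoint"] for r in errors)
by_status = Counter(r["status"] for r in errors)

print(by_endpoint.most_common())  # [('/api/reports', 2), ('/api/search', 1)]
```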
Recommended Visualizations
1. Error Rate Trend
Best for: Real-time monitoring
📉 Error Rate Improvement
Sample data showing dramatic improvement from 0.85% to 0.08% error rate over 6 weeks. The green line represents the target threshold (0.1%). Error rate now consistently below target through improved error handling, input validation, and retry logic. Excellent progress.
2. Error Rate Gauge
Best for: Current status
🎯 Current Error Rate
Error Rate
Current error rate of 0.08% is Excellent. Only 800 errors per million requests. This indicates robust error handling and high reliability. Continue monitoring for spikes and addressing root causes of remaining errors.
3. Error Types Breakdown
Best for: Identifying error patterns
🔍 Error Types Distribution
500 Internal Server Errors are the most common (45%), indicating server-side issues that need investigation. 503 Service Unavailable (28%) suggests capacity or dependency problems. 504 Gateway Timeouts (15%) point to slow upstream services. Priority: fix root causes of 500 errors through better error handling and testing.
Target Ranges
By Service Type
| Service Type | Target | Acceptable | Critical |
|--------------|--------|------------|----------|
| Critical APIs | < 0.01% | < 0.1% | > 1% |
| Standard APIs | < 0.1% | < 0.5% | > 2% |
| Background jobs | < 1% | < 5% | > 10% |
| Webhooks | < 5% | < 10% | > 20% |
By Error Type
| Error Type | Target | Notes |
|------------|--------|-------|
| 5xx errors | < 0.01% | Server-side faults; keep as close to zero as possible |
| 4xx errors | < 1% | Client-side; some are expected |
| Timeouts | < 0.1% | Usually an infrastructure issue |
| Exceptions | 0% | Code bugs; should not happen |
Industry Standards
E-commerce:
- Checkout: < 0.01% (every error = lost revenue)
- Product pages: < 0.1%
- Search: < 0.5%
SaaS:
- Core features: < 0.1%
- Auth/login: < 0.01%
- API: < 0.5%
FinTech:
- Payments: < 0.001% (three zeros!)
- Trading: < 0.01%
- Account management: < 0.1%
How to Improve
1. Implement Proper Error Handling
```python
import logging
import stripe

log = logging.getLogger(__name__)

# Bad: no error handling; any Stripe failure surfaces as a 500
def process_payment(amount):
    charge = stripe.Charge.create(amount=amount)
    return charge

# Good: comprehensive error handling
def process_payment(amount):
    try:
        charge = stripe.Charge.create(amount=amount)
        return {'success': True, 'charge': charge}
    except stripe.CardError as e:
        # Card was declined
        log.warning(f'Card declined: {e}')
        return {'success': False, 'error': 'card_declined'}
    except stripe.APIConnectionError as e:
        # Network error reaching Stripe
        log.error(f'Stripe connection error: {e}')
        return {'success': False, 'error': 'service_unavailable'}
    except Exception as e:
        # Unknown error
        log.error(f'Payment failed: {e}', exc_info=True)
        return {'success': False, 'error': 'internal_error'}
```
2. Input Validation
```python
from pydantic import BaseModel, validator

class CreateUserRequest(BaseModel):
    email: str
    age: int
    username: str

    @validator('email')
    def validate_email(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email')
        return v

    @validator('age')
    def validate_age(cls, v):
        if v < 0 or v > 150:
            raise ValueError('Invalid age')
        return v

# Returns 400 Bad Request before hitting the database,
# preventing 500 errors from invalid data.
```
3. Retry Logic with Exponential Backoff
```python
import time
from functools import wraps

import requests

def retry(max_attempts=3, backoff=2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, re-raise
                    wait_time = backoff ** attempt
                    time.sleep(wait_time)
        return wrapper
    return decorator

@retry(max_attempts=3)
def call_external_api():
    return requests.get('https://api.example.com')
```
4. Circuit Breakers
```python
import requests
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def call_payment_api():
    response = requests.post('https://payments.example.com')
    if response.status_code >= 500:
        raise Exception('Service error')
    return response

# After 5 failures the circuit opens and returns an error
# immediately for 60 seconds, preventing cascading failures.
```
5. Graceful Degradation
```python
# cache, db, log, and the exception types are assumed to be
# configured elsewhere in the application.
def get_user_profile(user_id):
    try:
        profile = cache.get(f'profile:{user_id}')
        if profile:
            return profile
    except RedisError:
        # Cache failed; continue to the database anyway
        log.warning('Cache unavailable, fetching from DB')
    try:
        return db.query(User).get(user_id)
    except DatabaseError as e:
        # Database failed; return minimal data
        log.error(f'DB error: {e}')
        return {'id': user_id, 'name': 'Unknown', 'status': 'error'}

# Service stays up even if both cache and DB fail.
```
6. Monitoring and Alerting
```python
# DataDog metric
statsd.increment('api.request.error', tags=[
    f'endpoint:{request.path}',
    f'status:{response.status_code}',
    f'error_type:{error.__class__.__name__}',
])

# Alert when error rate > 1% for 5 minutes
```
7. Error Budgets
Error Budget = (1 - SLA) × Total Requests

Example:
- SLA: 99.9% uptime
- Error budget: 0.1% of requests
- For 1M requests/day: 1,000 errors/day

If you exceed the budget: freeze feature releases and focus on reliability.
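The budget arithmetic above can be wrapped in a small helper (names are illustrative):

```python
def error_budget(sla: float, total_requests: int) -> int:
    """Allowed errors for a period, given an SLA expressed as a fraction."""
    return round((1 - sla) * total_requests)

budget = error_budget(0.999, 1_000_000)
print(budget)  # 1000 errors/day at a 99.9% SLA

# Exceeding the budget is the signal to pause releases:
observed_errors = 1_250
over_budget = observed_errors > budget  # True
```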
Common Pitfalls
❌ Alerting on Every Error
Problem: Alert fatigue; important errors get missed
Solution: Alert on error rate > threshold, not on individual errors
❌ Not Categorizing Errors
Problem: Client errors (4xx) mixed with server errors (5xx)
Solution: Track them separately, with different targets for each
❌ Ignoring Context
Problem: A 1% error rate means very different things on low-traffic vs. high-traffic endpoints
Solution: Consider absolute error count and user impact, not just the percentage
❌ No Error Tracking for Background Jobs
Problem: Silent failures go unnoticed
Solution: Track and alert on background job failures
❌ Treating All 4xx as Errors
Problem: 404s and 401s inflate the error rate
Solution: Exclude expected 4xx codes from alerts
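A small sketch tying these pitfalls together: compute an alerting error rate that excludes expected client errors. The status counts here are hypothetical:

```python
# Codes excluded from alerting; tune per service.
EXPECTED_4XX = {401, 404, 429}

def alerting_error_rate(status_counts: dict) -> float:
    """Error rate (%) over a window, ignoring expected 4xx codes."""
    total = sum(status_counts.values())
    errors = sum(
        n for code, n in status_counts.items()
        if code >= 400 and code not in EXPECTED_4XX
    )
    return errors / total * 100 if total else 0.0

counts = {200: 98_000, 404: 1_500, 500: 400, 401: 100}
print(alerting_error_rate(counts))  # ~0.4: only the 500s count
```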
Implementation Guide
Week 1: Instrumentation
```javascript
// Express.js error tracking
app.use((err, req, res, next) => {
  // Log error
  console.error(err.stack);

  // Send to error tracking (Sentry)
  Sentry.captureException(err);

  // Track metric
  statsd.increment('api.error', {
    endpoint: req.path,
    status: err.status || 500
  });

  // Return error response
  res.status(err.status || 500).json({
    error: process.env.NODE_ENV === 'production'
      ? 'Internal server error'
      : err.message
  });
});
```
Week 2: Dashboards
```yaml
# DataDog monitor
name: "High Error Rate"
type: metric alert
query: "sum(last_5m):sum:api.error{*}.as_count() / sum:api.request{*}.as_count() > 0.01"
message: |
  Error rate exceeds 1% for 5 minutes
  Current rate: {{value}}%
notify:
  - "@slack-eng"
  - "@pagerduty"
```
Week 3: Error Investigation Process
- Alert fires → Error rate > 1%
- Check dashboard → Which endpoints?
- Check logs → What errors?
- Check recent changes → New deployment?
- Mitigate → Rollback or hotfix
- Post-mortem → Root cause, prevention
Week 4: Continuous Improvement
- Review top errors weekly
- Fix highest-frequency errors
- Add tests to prevent regressions
- Update error handling patterns
Dashboard Example
Operations View
```
┌──────────────────────────────────────────────┐
│ Error Rate                                   │
│ Current: 0.08%  ✓ Good                       │
│ ████████████████████████████░░  Under 0.1%   │
│                                              │
│ Last Hour:                                   │
│ • Total Requests: 1,250,000                  │
│ • Errors: 1,000                              │
│ • Error Rate: 0.08%                          │
│                                              │
│ Top Errors:                                  │
│ • 500 /api/reports    450 (45%)              │
│ • 503 /api/search     280 (28%)              │
│ • 504 /api/webhooks   150 (15%)              │
│ • 400 /api/users      120 (12%)              │
└──────────────────────────────────────────────┘
```
Detailed Breakdown
```
Error Distribution (Last 24 Hours)
────────────────────────────────────────────
Status    Count       %       Endpoint
────────────────────────────────────────────
500       8,450     (42%)     /api/reports
503       5,220     (26%)     /api/search
504       2,890     (14%)     /api/webhooks
400       2,150     (11%)     /api/users
502       1,420      (7%)     /api/payments
────────────────────────────────────────────
Total    20,130    (0.11%)

Trend: ↑ +15% vs. yesterday
Alert: ⚠️ /api/reports needs investigation
```
Related Metrics
- API Response Time: Slow requests often become errors
- System Uptime: High error rate indicates downtime
- Change Failure Rate: Deployments causing errors
- MTTR: How fast you fix errors
- User Satisfaction: Errors correlate with NPS
Tools & Integrations
Error Tracking
- Sentry: Real-time error tracking and alerting
- Rollbar: Error monitoring for all platforms
- Bugsnag: Error monitoring with release tracking
- Raygun: Error tracking + RUM
APM with Error Tracking
- DataDog: Full-stack monitoring
- New Relic: Application performance
- Dynatrace: AI-powered monitoring
- AppDynamics: Business transaction monitoring
Log Aggregation
- ELK Stack: Elasticsearch + Logstash + Kibana
- Splunk: Enterprise log management
- Sumo Logic: Cloud-native log analytics
Questions to Ask
For Operations
- What's causing the most errors?
- Are errors concentrated in specific endpoints?
- Is error rate trending up or down?
- Do errors correlate with deployments?
For Engineering
- Are we handling errors gracefully?
- Do we have proper input validation?
- Are we monitoring all error types?
- Do we retry transient failures?
For Leadership
- Is error rate impacting customer satisfaction?
- Do we need to invest in reliability?
- Are we meeting our SLAs?
- What's the business impact of errors?
Success Stories
SaaS Platform
- Before: 2.5% error rate, customer complaints
- After: 0.05% error rate, NPS +30 points
- Changes:
- Added comprehensive input validation
- Implemented circuit breakers for external services
- Proper error handling in all endpoints
- Automated error alerting and response
- Impact: 98% error reduction, customer satisfaction dramatically improved
E-commerce Site
- Before: 0.8% error rate in checkout, $100K/month lost
- After: 0.02% error rate, revenue recovered
- Changes:
- Retry logic for payment API
- Better error messages to users
- Fallback payment methods
- Real-time error monitoring
- Impact: 97.5% error reduction, $96K monthly revenue recovered
Conclusion
Error Rate is a critical reliability metric that directly impacts user experience and revenue. Target < 0.1% for production APIs, < 0.01% for critical services. Reduce errors through proper error handling, input validation, retry logic, circuit breakers, and graceful degradation. Monitor error rate in real-time, alert on spikes, and investigate root causes promptly. Remember: every error is a potential lost customer—invest in reliability.
Quick Start:
- Instrument error tracking in all endpoints
- Set up error rate dashboard
- Create alerts for > 1% error rate
- Categorize errors by type and endpoint
- Establish error investigation process
- Fix top errors systematically