Error Rate
Track the percentage of failed requests. Critical for reliability, user experience, and incident detection.
Overview
Error Rate measures the percentage of requests that result in errors (4xx or 5xx HTTP status codes, exceptions, or application-level errors). It's a critical reliability metric that indicates system health, impacts user experience, and serves as an early warning system for incidents.
Why It Matters
- User experience: Errors frustrate users and cause churn
- Revenue impact: Errors in checkout/payment flows lose money
- Early warning: Spikes indicate incidents before they escalate
- Quality indicator: Reflects code quality and testing effectiveness
- SLA compliance: Error rate thresholds in service agreements
- Debugging: Helps identify and prioritize issues
Types of Errors
By HTTP Status Code
Client Errors (4xx):
- 400 Bad Request: Invalid request from client
- 401 Unauthorized: Authentication required
- 403 Forbidden: Insufficient permissions
- 404 Not Found: Resource doesn't exist
- 429 Too Many Requests: Rate limit exceeded
Server Errors (5xx):
- 500 Internal Server Error: Generic server error
- 502 Bad Gateway: Upstream service failure
- 503 Service Unavailable: Server overloaded or in maintenance
- 504 Gateway Timeout: Upstream service timeout
By Impact
User-Facing Errors:
- Failed page loads
- Failed form submissions
- Broken features
- Payment failures
Background Errors:
- Cron job failures
- Queue processing errors
- Webhook delivery failures
- Scheduled task errors
How to Measure
Calculation
Error Rate = (Error Count ÷ Total Requests) × 100
Example (1-hour window):
Total requests: 1,000,000
Failed requests: 1,250
Error Rate: 0.125%
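The calculation above as a minimal Python helper (names are illustrative):

```python
def error_rate(error_count: int, total_requests: int) -> float:
    """Return the error rate as a percentage of total requests."""
    if total_requests == 0:
        return 0.0  # avoid division by zero on idle services
    return error_count / total_requests * 100

print(error_rate(1_250, 1_000_000))  # 0.125
```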
What Counts as an Error?
Include:
- 5xx server errors (all)
- 4xx client errors (often excluding expected codes such as 404)
- Unhandled exceptions
- Timeouts
- Failed database queries
Exclude (sometimes):
- 404 Not Found (may be user error, not system error)
- 401 Unauthorized (expected for unauthenticated requests)
- 429 Rate Limited (intentional throttling)
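The include/exclude rules above can be captured in a small classifier; the excluded set is a per-service choice, not a standard:

```python
# Codes treated as expected client behavior rather than system errors.
EXCLUDED_CODES = {401, 404, 429}

def counts_as_error(status_code: int, excluded=EXCLUDED_CODES) -> bool:
    """Classify a response status per the include/exclude rules above."""
    if status_code in excluded:
        return False
    return status_code >= 400

print(counts_as_error(500))  # True: server error
print(counts_as_error(404))  # False: likely user error, excluded
```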
Segmentation
Track error rate by:
- Endpoint: Which APIs are failing?
- Status code: What types of errors?
- Geography: Regional issues?
- Client: Mobile vs web vs API
- User: Specific accounts having issues?
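A sketch of segmentation: tag each request record with its dimensions, then aggregate errors with a `Counter`. The log records here are hypothetical:

```python
from collections import Counter

# Hypothetical request log records tagged with segmentation dimensions.
requests_log = [
    {"endpoint": "/api/reports", "status": 500, "client": "web"},
    {"endpoint": "/api/search", "status": 503, "client": "mobile"},
    {"endpoint": "/api/reports", "status": 500, "client": "api"},
    {"endpoint": "/api/users", "status": 200, "client": "web"},
]

errors = [r for r in requests_log if r["status"] >= 500]
by_endpoint = Counter(r["endpoint"] for r in errors)
by_status = Counter(r["status"] for r in errors)

print(by_endpoint.most_common())  # [('/api/reports', 2), ('/api/search', 1)]
```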
Recommended Visualizations
1. Error Rate Trend
Best for: Real-time monitoring
📉 Error Rate Improvement
Sample data showing dramatic improvement from 0.85% to 0.08% error rate over 6 weeks. The green line represents the target threshold (0.1%). Error rate now consistently below target through improved error handling, input validation, and retry logic. Excellent progress.
2. Error Rate Gauge
Best for: Current status
🎯 Current Error Rate
Error Rate
Current error rate of 0.08% is Excellent. Only 800 errors per million requests. This indicates robust error handling and high reliability. Continue monitoring for spikes and addressing root causes of remaining errors.
3. Error Types Breakdown
Best for: Identifying error patterns
🔍 Error Types Distribution
500 Internal Server Errors are the most common (45%), indicating server-side issues that need investigation. 503 Service Unavailable (28%) suggests capacity or dependency problems. 504 Gateway Timeouts (15%) point to slow upstream services. Priority: fix root causes of 500 errors through better error handling and testing.
Target Ranges
By Service Type
| Service Type | Target | Acceptable | Critical |
|--------------|--------|------------|----------|
| Critical APIs | < 0.01% | < 0.1% | > 1% |
| Standard APIs | < 0.1% | < 0.5% | > 2% |
| Background jobs | < 1% | < 5% | > 10% |
| Webhooks | < 5% | < 10% | > 20% |
By Error Type
| Error Type | Target | Notes |
|------------|--------|-------|
| 5xx errors | < 0.01% | Server-side faults; keep as close to zero as possible |
| 4xx errors | < 1% | Client-side; some are expected |
| Timeouts | < 0.1% | Usually an infrastructure issue |
| Exceptions | 0% | Code bugs; should not happen |
Industry Standards
E-commerce:
- Checkout: < 0.01% (every error = lost revenue)
- Product pages: < 0.1%
- Search: < 0.5%
SaaS:
- Core features: < 0.1%
- Auth/login: < 0.01%
- API: < 0.5%
FinTech:
- Payments: < 0.001% (three zeros!)
- Trading: < 0.01%
- Account management: < 0.1%
How to Improve
1. Implement Proper Error Handling
```python
import logging
import stripe

log = logging.getLogger(__name__)

# Bad: no error handling; any Stripe failure surfaces as a 500
def process_payment(amount):
    charge = stripe.Charge.create(amount=amount)
    return charge

# Good: comprehensive error handling
def process_payment(amount):
    try:
        charge = stripe.Charge.create(amount=amount)
        return {'success': True, 'charge': charge}
    except stripe.CardError as e:
        # Card was declined
        log.warning(f'Card declined: {e}')
        return {'success': False, 'error': 'card_declined'}
    except stripe.APIConnectionError as e:
        # Network error reaching Stripe
        log.error(f'Stripe connection error: {e}')
        return {'success': False, 'error': 'service_unavailable'}
    except Exception as e:
        # Unknown error
        log.error(f'Payment failed: {e}', exc_info=True)
        return {'success': False, 'error': 'internal_error'}
```
2. Input Validation
```python
from pydantic import BaseModel, validator

class CreateUserRequest(BaseModel):
    email: str
    age: int
    username: str

    @validator('email')
    def validate_email(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email')
        return v

    @validator('age')
    def validate_age(cls, v):
        if v < 0 or v > 150:
            raise ValueError('Invalid age')
        return v

# Returns 400 Bad Request before hitting the database,
# preventing 500 errors from invalid data.
```
3. Retry Logic with Exponential Backoff
```python
import time
from functools import wraps

import requests

def retry(max_attempts=3, backoff=2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, re-raise
                    wait_time = backoff ** attempt
                    time.sleep(wait_time)
        return wrapper
    return decorator

@retry(max_attempts=3)
def call_external_api():
    return requests.get('https://api.example.com')
```
4. Circuit Breakers
```python
import requests
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def call_payment_api():
    response = requests.post('https://payments.example.com')
    if response.status_code >= 500:
        raise Exception('Service error')
    return response

# After 5 failures the circuit opens and returns an error
# immediately for 60 seconds, preventing cascading failures.
```
5. Graceful Degradation
```python
# cache, db, log, and the exception types are assumed to be
# configured elsewhere in the application.
def get_user_profile(user_id):
    try:
        profile = cache.get(f'profile:{user_id}')
        if profile:
            return profile
    except RedisError:
        # Cache failed; continue to the database anyway
        log.warning('Cache unavailable, fetching from DB')
    try:
        return db.query(User).get(user_id)
    except DatabaseError as e:
        # Database failed; return minimal data
        log.error(f'DB error: {e}')
        return {'id': user_id, 'name': 'Unknown', 'status': 'error'}

# Service stays up even if both cache and DB fail.
```
6. Monitoring and Alerting
```python
# DataDog metric
statsd.increment('api.request.error', tags=[
    f'endpoint:{request.path}',
    f'status:{response.status_code}',
    f'error_type:{error.__class__.__name__}',
])

# Alert when error rate > 1% for 5 minutes
```
7. Error Budgets
Error Budget = (1 - SLA) × Total Requests

Example:
- SLA: 99.9% uptime
- Error budget: 0.1% of requests
- For 1M requests/day: 1,000 errors/day

If you exceed the budget: freeze feature releases and focus on reliability.
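The budget arithmetic above can be wrapped in a small helper (names are illustrative):

```python
def error_budget(sla: float, total_requests: int) -> int:
    """Allowed errors for a period, given an SLA expressed as a fraction."""
    return round((1 - sla) * total_requests)

budget = error_budget(0.999, 1_000_000)
print(budget)  # 1000 errors/day at a 99.9% SLA

# Exceeding the budget is the signal to pause releases:
observed_errors = 1_250
over_budget = observed_errors > budget  # True
```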
Common Pitfalls
❌ Alerting on Every Error
Problem: Alert fatigue; important errors get missed
Solution: Alert on error rate > threshold, not on individual errors
❌ Not Categorizing Errors
Problem: Client errors (4xx) mixed with server errors (5xx)
Solution: Track them separately, with different targets for each
❌ Ignoring Context
Problem: A 1% error rate means very different things on low-traffic vs. high-traffic endpoints
Solution: Consider absolute error count and user impact, not just the percentage
❌ No Error Tracking for Background Jobs
Problem: Silent failures go unnoticed
Solution: Track and alert on background job failures
❌ Treating All 4xx as Errors
Problem: 404s and 401s inflate the error rate
Solution: Exclude expected 4xx codes from alerts
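A small sketch tying these pitfalls together: compute an alerting error rate that excludes expected client errors. The status counts here are hypothetical:

```python
# Codes excluded from alerting; tune per service.
EXPECTED_4XX = {401, 404, 429}

def alerting_error_rate(status_counts: dict) -> float:
    """Error rate (%) over a window, ignoring expected 4xx codes."""
    total = sum(status_counts.values())
    errors = sum(
        n for code, n in status_counts.items()
        if code >= 400 and code not in EXPECTED_4XX
    )
    return errors / total * 100 if total else 0.0

counts = {200: 98_000, 404: 1_500, 500: 400, 401: 100}
print(alerting_error_rate(counts))  # ~0.4: only the 500s count
```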
Implementation Guide
Week 1: Instrumentation
```javascript
// Express.js error tracking
app.use((err, req, res, next) => {
  // Log error
  console.error(err.stack);

  // Send to error tracking (Sentry)
  Sentry.captureException(err);

  // Track metric
  statsd.increment('api.error', {
    endpoint: req.path,
    status: err.status || 500
  });

  // Return error response
  res.status(err.status || 500).json({
    error: process.env.NODE_ENV === 'production'
      ? 'Internal server error'
      : err.message
  });
});
```
Week 2: Dashboards
```yaml
# DataDog monitor
name: "High Error Rate"
type: metric alert
query: "sum(last_5m):sum:api.error{*}.as_count() / sum:api.request{*}.as_count() > 0.01"
message: |
  Error rate exceeds 1% for 5 minutes
  Current rate: {{value}}%
notify:
  - "@slack-eng"
  - "@pagerduty"
```
Week 3: Error Investigation Process
- Alert fires → Error rate > 1%
- Check dashboard → Which endpoints?
- Check logs → What errors?
- Check recent changes → New deployment?
- Mitigate → Rollback or hotfix
- Post-mortem → Root cause, prevention
Week 4: Continuous Improvement
- Review top errors weekly
- Fix highest-frequency errors
- Add tests to prevent regressions
- Update error handling patterns
Dashboard Example
Operations View
```
┌──────────────────────────────────────────────┐
│ Error Rate                                   │
│ Current: 0.08%  ✓ Good                       │
│ ████████████████████████████░░  Under 0.1%   │
│                                              │
│ Last Hour:                                   │
│ • Total Requests: 1,250,000                  │
│ • Errors: 1,000                              │
│ • Error Rate: 0.08%                          │
│                                              │
│ Top Errors:                                  │
│ • 500 /api/reports    450 (45%)              │
│ • 503 /api/search     280 (28%)              │
│ • 504 /api/webhooks   150 (15%)              │
│ • 400 /api/users      120 (12%)              │
└──────────────────────────────────────────────┘
```
Detailed Breakdown
```
Error Distribution (Last 24 Hours)
────────────────────────────────────────────
Status    Count       %       Endpoint
────────────────────────────────────────────
500       8,450     (42%)     /api/reports
503       5,220     (26%)     /api/search
504       2,890     (14%)     /api/webhooks
400       2,150     (11%)     /api/users
502       1,420      (7%)     /api/payments
────────────────────────────────────────────
Total    20,130    (0.11%)

Trend: ↑ +15% vs. yesterday
Alert: ⚠️ /api/reports needs investigation
```
Related Metrics
- API Response Time: Slow requests often become errors
- System Uptime: High error rate indicates downtime
- Change Failure Rate: Deployments causing errors
- MTTR: How fast you fix errors
- User Satisfaction: Errors correlate with NPS
Tools & Integrations
Error Tracking
- Sentry: Real-time error tracking and alerting
- Rollbar: Error monitoring for all platforms
- Bugsnag: Error monitoring with release tracking
- Raygun: Error tracking + RUM
APM with Error Tracking
- DataDog: Full-stack monitoring
- New Relic: Application performance
- Dynatrace: AI-powered monitoring
- AppDynamics: Business transaction monitoring
Log Aggregation
- ELK Stack: Elasticsearch + Logstash + Kibana
- Splunk: Enterprise log management
- Sumo Logic: Cloud-native log analytics
Questions to Ask
For Operations
- What's causing the most errors?
- Are errors concentrated in specific endpoints?
- Is error rate trending up or down?
- Do errors correlate with deployments?
For Engineering
- Are we handling errors gracefully?
- Do we have proper input validation?
- Are we monitoring all error types?
- Do we retry transient failures?
For Leadership
- Is error rate impacting customer satisfaction?
- Do we need to invest in reliability?
- Are we meeting our SLAs?
- What's the business impact of errors?
Success Stories
SaaS Platform
- Before: 2.5% error rate, customer complaints
- After: 0.05% error rate, NPS +30 points
- Changes:
- Added comprehensive input validation
- Implemented circuit breakers for external services
- Proper error handling in all endpoints
- Automated error alerting and response
- Impact: 98% error reduction, customer satisfaction dramatically improved
E-commerce Site
- Before: 0.8% error rate in checkout, $100K/month lost
- After: 0.02% error rate, revenue recovered
- Changes:
- Retry logic for payment API
- Better error messages to users
- Fallback payment methods
- Real-time error monitoring
- Impact: 97.5% error reduction, $96K monthly revenue recovered
Conclusion
Error Rate is a critical reliability metric that directly impacts user experience and revenue. Target < 0.1% for production APIs, < 0.01% for critical services. Reduce errors through proper error handling, input validation, retry logic, circuit breakers, and graceful degradation. Monitor error rate in real-time, alert on spikes, and investigate root causes promptly. Remember: every error is a potential lost customer—invest in reliability.
Quick Start:
- Instrument error tracking in all endpoints
- Set up error rate dashboard
- Create alerts for > 1% error rate
- Categorize errors by type and endpoint
- Establish error investigation process
- Fix top errors systematically