
Error Rate

November 10, 2025 · By The Art of CTO · 13 min read

Track the percentage of failed requests. Critical for reliability, user experience, and incident detection.

Type: reliability
Tracking: real-time
Difficulty: easy
Measurement: Failed requests ÷ Total requests × 100
Target Range: < 0.1% excellent | < 1% acceptable | > 5% critical
Recommended Visualizations: line-chart, gauge, stacked-bar-chart
Data Sources: Application logs, DataDog, Sentry, New Relic, CloudWatch

Overview

Error Rate measures the percentage of requests that result in errors (4xx or 5xx HTTP status codes, exceptions, or application-level errors). It's a critical reliability metric that indicates system health, impacts user experience, and serves as an early warning system for incidents.

Why It Matters

  • User experience: Errors frustrate users and cause churn
  • Revenue impact: Errors in checkout/payment flows lose money
  • Early warning: Spikes indicate incidents before they escalate
  • Quality indicator: Reflects code quality and testing effectiveness
  • SLA compliance: Error rate thresholds in service agreements
  • Debugging: Helps identify and prioritize issues

Types of Errors

By HTTP Status Code

Client Errors (4xx):

  • 400 Bad Request: Invalid request from client
  • 401 Unauthorized: Authentication required
  • 403 Forbidden: Insufficient permissions
  • 404 Not Found: Resource doesn't exist
  • 429 Too Many Requests: Rate limit exceeded

Server Errors (5xx):

  • 500 Internal Server Error: Generic server error
  • 502 Bad Gateway: Upstream service failure
  • 503 Service Unavailable: Server overloaded/maintenance
  • 504 Gateway Timeout: Upstream service timeout

By Impact

User-Facing Errors:

  • Failed page loads
  • Failed form submissions
  • Broken features
  • Payment failures

Background Errors:

  • Cron job failures
  • Queue processing errors
  • Webhook delivery failures
  • Scheduled task errors

How to Measure

Calculation

Error Rate = (Error Count ÷ Total Requests) × 100

Example (1-hour window):
Total requests:  1,000,000
Failed requests: 1,250
Error Rate:      0.125%
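
The same calculation in a few lines of Python, as a quick sanity check (a minimal sketch using the sample counts above):

def error_rate(error_count: int, total_requests: int) -> float:
    """Return the error rate as a percentage."""
    if total_requests == 0:
        return 0.0
    return 100 * error_count / total_requests

print(error_rate(1_250, 1_000_000))  # 0.125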

What Counts as an Error?

Include:

  • 5xx server errors (all)
  • 4xx client errors (except 404s, sometimes)
  • Unhandled exceptions
  • Timeouts
  • Failed database queries

Exclude (sometimes; see the classification sketch after this list):

  • 404 Not Found (may be user error, not system error)
  • 401 Unauthorized (expected for unauthenticated requests)
  • 429 Rate Limited (intentional throttling)
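
These rules translate directly into a small classification helper. The sketch below counts every 5xx and only the non-excluded 4xx codes; the excluded_4xx default is an assumption you should adjust to your own policy:

def counts_as_error(status_code: int, excluded_4xx=frozenset({401, 404, 429})) -> bool:
    """Decide whether a response counts toward the error rate."""
    if status_code >= 500:           # all server errors count
        return True
    if 400 <= status_code < 500:     # client errors, minus expected ones
        return status_code not in excluded_4xx
    return False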

Segmentation

Track error rate along these dimensions; a per-endpoint breakdown sketch follows the list:

  • Endpoint: Which APIs are failing?
  • Status code: What types of errors?
  • Geography: Regional issues?
  • Client: Mobile vs web vs API
  • User: Specific accounts having issues?
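
A minimal per-endpoint breakdown, assuming each request has already been captured as a record with endpoint and status fields (hypothetical field names; adapt to whatever your log pipeline emits):

from collections import defaultdict

def error_rate_by_endpoint(request_log):
    """Compute the error rate per endpoint from logged request records."""
    totals, errors = defaultdict(int), defaultdict(int)
    for record in request_log:
        totals[record['endpoint']] += 1
        if record['status'] >= 500:          # count server errors here
            errors[record['endpoint']] += 1
    return {
        endpoint: 100 * errors[endpoint] / total
        for endpoint, total in totals.items()
    }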

Recommended Visualizations

1. Error Rate Trend

Best for: Real-time monitoring

📉 Error Rate Improvement

Sample data showing dramatic improvement from 0.85% to 0.08% error rate over 6 weeks. The green line represents the target threshold (0.1%). Error rate now consistently below target through improved error handling, input validation, and retry logic. Excellent progress.

2. Error Rate Gauge

Best for: Current status

🎯 Current Error Rate

[Gauge: 0.08% on a 0% to 1% scale, status: Excellent]

Current error rate of 0.08% is Excellent. Only 800 errors per million requests. This indicates robust error handling and high reliability. Continue monitoring for spikes and addressing root causes of remaining errors.

3. Error Types Breakdown

Best for: Identifying error patterns

🔍 Error Types Distribution

500 Internal Server Errors are the most common (45%), indicating server-side issues that need investigation. 503 Service Unavailable (28%) suggests capacity or dependency problems. 504 Gateway Timeouts (15%) point to slow upstream services. Priority: fix root causes of 500 errors through better error handling and testing.

Target Ranges

By Service Type

| Service Type    | Target  | Acceptable | Critical |
|-----------------|---------|------------|----------|
| Critical APIs   | < 0.01% | < 0.1%     | > 1%     |
| Standard APIs   | < 0.1%  | < 0.5%     | > 2%     |
| Background jobs | < 1%    | < 5%       | > 10%    |
| Webhooks        | < 5%    | < 10%      | > 20%    |

By Error Type

| Error Type | Target  | Notes                           |
|------------|---------|---------------------------------|
| 5xx errors | < 0.01% | Server-side, never acceptable   |
| 4xx errors | < 1%    | Client-side, some expected      |
| Timeouts   | < 0.1%  | Infrastructure issue            |
| Exceptions | 0%      | Code bugs, should not happen    |

Industry Standards

E-commerce:

  • Checkout: < 0.01% (every error = lost revenue)
  • Product pages: < 0.1%
  • Search: < 0.5%

SaaS:

  • Core features: < 0.1%
  • Auth/login: < 0.01%
  • API: < 0.5%

FinTech:

  • Payments: < 0.001% (three zeros!)
  • Trading: < 0.01%
  • Account management: < 0.1%

How to Improve

1. Implement Proper Error Handling

import logging

import stripe

log = logging.getLogger(__name__)

# Bad: No error handling
def process_payment(amount):
    charge = stripe.Charge.create(amount=amount)
    return charge

# Good: Comprehensive error handling
def process_payment(amount):
    try:
        charge = stripe.Charge.create(amount=amount)
        return {'success': True, 'charge': charge}
    except stripe.CardError as e:
        # Card was declined
        log.warning(f'Card declined: {e}')
        return {'success': False, 'error': 'card_declined'}
    except stripe.APIConnectionError as e:
        # Network error reaching Stripe
        log.error(f'Stripe connection error: {e}')
        return {'success': False, 'error': 'service_unavailable'}
    except Exception as e:
        # Unknown error
        log.error(f'Payment failed: {e}', exc_info=True)
        return {'success': False, 'error': 'internal_error'}

2. Input Validation

# pydantic v1-style validators (pydantic v2 renames this to @field_validator)
from pydantic import BaseModel, validator

class CreateUserRequest(BaseModel):
    email: str
    age: int
    username: str

    @validator('email')
    def validate_email(cls, v):
        if '@' not in v:
            raise ValueError('Invalid email')
        return v

    @validator('age')
    def validate_age(cls, v):
        if v < 0 or v > 150:
            raise ValueError('Invalid age')
        return v

# Invalid input is rejected with a 4xx response before it reaches the database,
# preventing 500 errors caused by bad data

3. Retry Logic with Exponential Backoff

import time
from functools import wraps

import requests

def retry(max_attempts=3, backoff=2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, re-raise the last error
                    time.sleep(backoff ** attempt)  # wait 1s, 2s, 4s, ...
        return wrapper
    return decorator

@retry(max_attempts=3)
def call_external_api():
    return requests.get('https://api.example.com')

4. Circuit Breakers

import requests
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def call_payment_api():
    response = requests.post('https://payments.example.com')
    if response.status_code >= 500:
        raise Exception('Service error')
    return response

# After 5 consecutive failures the circuit opens:
# calls fail fast for 60 seconds instead of hitting the struggling service,
# which prevents cascading failures

5. Graceful Degradation

from redis.exceptions import RedisError
from sqlalchemy.exc import DatabaseError

# cache, db, User, and log are the application's existing handles

def get_user_profile(user_id):
    try:
        profile = cache.get(f'profile:{user_id}')
        if profile:
            return profile
    except RedisError:
        # Cache failed, continue anyway
        log.warning('Cache unavailable, fetching from DB')

    try:
        profile = db.query(User).get(user_id)
        return profile
    except DatabaseError as e:
        # Database failed, return minimal data
        log.error(f'DB error: {e}')
        return {'id': user_id, 'name': 'Unknown', 'status': 'error'}

# Service stays up even if the cache or the database fails

6. Monitoring and Alerting

# DataDog metric, emitted from the request error handler
# (request, response, and error come from the framework's handler context)
from datadog import statsd

statsd.increment('api.request.error', tags=[
    f'endpoint:{request.path}',
    f'status:{response.status_code}',
    f'error_type:{error.__class__.__name__}',
])

# Alert when error rate > 1% for 5 minutes

7. Error Budgets

Error Budget = (1 - SLA) × Total Requests

Example:
SLA: 99.9% uptime
Error Budget: 0.1%

For 1M requests/day:
Error Budget = 1,000 errors/day

If you exceed budget: Freeze feature releases,
focus on reliability
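
The same arithmetic as a small helper (a minimal sketch; the SLA is expressed as a fraction, e.g. 0.999 for 99.9%):

def error_budget(sla: float, total_requests: int) -> int:
    """Number of errors allowed before the SLA is breached."""
    return round((1 - sla) * total_requests)

# 99.9% SLA over 1,000,000 requests/day -> 1,000 errors/day
print(error_budget(0.999, 1_000_000))  # 1000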

Common Pitfalls

❌ Alerting on Every Error

Problem: Alert fatigue, important errors missed.
Solution: Alert on error rate > threshold, not on individual errors.

❌ Not Categorizing Errors

Problem: Client errors (4xx) mixed with server errors (5xx).
Solution: Track them separately, with different targets for each.

❌ Ignoring Context

Problem: A 1% error rate means very different things on a low-traffic endpoint versus a high-traffic one.
Solution: Consider absolute error count and user impact, not just the percentage.

❌ No Error Tracking for Background Jobs

Problem: Silent failures go unnoticed.
Solution: Track and alert on background job failures (a tracking sketch follows).
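
A minimal sketch of that tracking, assuming a DogStatsd-style client and a hypothetical process_queue_item job:

import logging
from functools import wraps

from datadog import statsd

log = logging.getLogger(__name__)

def track_job_errors(job_name):
    """Record success/failure metrics for a background job."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
                statsd.increment('job.success', tags=[f'job:{job_name}'])
                return result
            except Exception:
                statsd.increment('job.error', tags=[f'job:{job_name}'])
                log.exception(f'{job_name} failed')  # never fail silently
                raise
        return wrapper
    return decorator

@track_job_errors('process_queue_item')
def process_queue_item(item):
    ...  # hypothetical job body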

❌ Treating All 4xx as Errors

Problem: 404s and 401s inflate the error rate.
Solution: Exclude expected 4xx codes from alerts.

Implementation Guide

Week 1: Instrumentation

// Express.js error-handling middleware
// (the Sentry and statsd clients are assumed to be initialized elsewhere)
app.use((err, req, res, next) => {
  // Log error
  console.error(err.stack);

  // Send to error tracking (Sentry)
  Sentry.captureException(err);

  // Track metric
  statsd.increment('api.error', {
    endpoint: req.path,
    status: err.status || 500
  });

  // Return error response
  res.status(err.status || 500).json({
    error: process.env.NODE_ENV === 'production'
      ? 'Internal server error'
      : err.message
  });
});

Week 2: Dashboards

# DataDog monitor (sketch; notification handles live in the message)
name: "High Error Rate"
type: metric alert
query: "sum(last_5m):sum:api.error{*}.as_count() / sum:api.request{*}.as_count() > 0.01"
message: |
  Error rate has exceeded 1% for the last 5 minutes.
  Current value: {{value}}
  @slack-eng @pagerduty

Week 3: Error Investigation Process

  1. Alert fires → Error rate > 1%
  2. Check dashboard → Which endpoints?
  3. Check logs → What errors?
  4. Check recent changes → New deployment?
  5. Mitigate → Rollback or hotfix
  6. Post-mortem → Root cause, prevention

Week 4: Continuous Improvement

  • Review top errors weekly
  • Fix highest-frequency errors
  • Add tests to prevent regressions
  • Update error handling patterns

Dashboard Example

Operations View

┌──────────────────────────────────────────────┐
│ Error Rate                                    │
│ Current: 0.08%                ✓ Good          │
│ ████████████████████████████░░ Under 0.1%    │
│                                              │
│ Last Hour:                                   │
│ • Total Requests: 1,250,000                  │
│ • Errors: 1,000                              │
│ • Error Rate: 0.08%                          │
│                                              │
│ Top Errors:                                  │
│ • 500 /api/reports     450 (45%)            │
│ • 503 /api/search      280 (28%)            │
│ • 504 /api/webhooks    150 (15%)            │
│ • 400 /api/users       120 (12%)            │
└──────────────────────────────────────────────┘

Detailed Breakdown

Error Distribution (Last 24 Hours)
────────────────────────────────────────────
Status  Count    %      Endpoint
────────────────────────────────────────────
500     8,450   (42%)  /api/reports
503     5,220   (26%)  /api/search
504     2,890   (14%)  /api/webhooks
400     2,150   (11%)  /api/users
502     1,420   (7%)   /api/payments
────────────────────────────────────────────
Total   20,130  (0.11%)

Trend: ↑ +15% vs. yesterday
Alert: ⚠️ /api/reports needs investigation

Related Metrics

  • API Response Time: Slow requests often become errors
  • System Uptime: High error rate indicates downtime
  • Change Failure Rate: Deployments causing errors
  • MTTR: How fast you fix errors
  • User Satisfaction: Errors correlate with NPS

Tools & Integrations

Error Tracking

  • Sentry: Real-time error tracking and alerting
  • Rollbar: Error monitoring for all platforms
  • Bugsnag: Error monitoring with release tracking
  • Raygun: Error tracking + RUM

APM with Error Tracking

  • DataDog: Full-stack monitoring
  • New Relic: Application performance
  • Dynatrace: AI-powered monitoring
  • AppDynamics: Business transaction monitoring

Log Aggregation

  • ELK Stack: Elasticsearch + Logstash + Kibana
  • Splunk: Enterprise log management
  • Sumo Logic: Cloud-native log analytics

Questions to Ask

For Operations

  • What's causing the most errors?
  • Are errors concentrated in specific endpoints?
  • Is error rate trending up or down?
  • Do errors correlate with deployments?

For Engineering

  • Are we handling errors gracefully?
  • Do we have proper input validation?
  • Are we monitoring all error types?
  • Do we retry transient failures?

For Leadership

  • Is error rate impacting customer satisfaction?
  • Do we need to invest in reliability?
  • Are we meeting our SLAs?
  • What's the business impact of errors?

Success Stories

SaaS Platform

  • Before: 2.5% error rate, customer complaints
  • After: 0.05% error rate, NPS +30 points
  • Changes:
    • Added comprehensive input validation
    • Implemented circuit breakers for external services
    • Proper error handling in all endpoints
    • Automated error alerting and response
  • Impact: 98% error reduction, customer satisfaction dramatically improved

E-commerce Site

  • Before: 0.8% error rate in checkout, $100K/month lost
  • After: 0.02% error rate, revenue recovered
  • Changes:
    • Retry logic for payment API
    • Better error messages to users
    • Fallback payment methods
    • Real-time error monitoring
  • Impact: 97.5% error reduction, $96K monthly revenue recovered

Conclusion

Error Rate is a critical reliability metric that directly impacts user experience and revenue. Target < 0.1% for production APIs, < 0.01% for critical services. Reduce errors through proper error handling, input validation, retry logic, circuit breakers, and graceful degradation. Monitor error rate in real-time, alert on spikes, and investigate root causes promptly. Remember: every error is a potential lost customer—invest in reliability.

Quick Start:

  1. Instrument error tracking in all endpoints
  2. Set up error rate dashboard
  3. Create alerts for > 1% error rate
  4. Categorize errors by type and endpoint
  5. Establish error investigation process
  6. Fix top errors systematically