Observability Stack Evolution: Building vs Buying Your Monitoring
Map the evolution of observability tooling from custom scripts to SaaS platforms. Understand when to build, when to buy, and how to avoid the commodity trap.
The Observability Dilemma
Your VP of Engineering asks: "Should we build our own observability stack or pay for Datadog?"
It's not a simple question. Observability costs can spiral (Datadog bills hitting $500K+/year), but building your own means maintaining infrastructure while your competitors ship features.
This Wardley map shows you where each component sits in its evolution and helps you make better build-vs-buy decisions.
Reading This Map
The Observability Value Chain
Top (High Visibility): What users care about
- Service Reliability (0.95) - The ultimate goal
- Incident Response (0.90) - Detecting and fixing problems
- Dashboards (0.85) - Visualizing system health
Bottom (Infrastructure): What users don't see but you depend on
- Application Logs (0.35) - Raw data from your apps
- Instrumentation Libraries (0.40) - How you capture data
- Log Shipping (0.50) - Moving logs around
Evolution Insights
Commodity Territory (75-100%):
- On-Call Management: Use PagerDuty or Opsgenie, don't build
- Metrics Collection: OpenTelemetry or Telegraf are standard
- Log Shipping: Fluentd/Logstash are mature, pick one
- Application Logs: stdout/stderr is the standard
Product Phase (50-75%):
- Metrics Database: Prometheus is the winner, InfluxDB for specific use cases
- Dashboards: Grafana is standard, but consider managed (Grafana Cloud)
- Log Aggregation: Depends on scale—ELK for DIY, Datadog/Splunk for SaaS
Custom/Emerging (25-50%):
- Distributed Tracing: Sitting at the Custom/Product boundary, still evolving, and vendor lock-in risk is high
- Alert Rules: Anomaly detection is custom, thresholds are commodity
Genesis (0-25%):
- Custom Scripts: If you're still here, you're in trouble (see warning marker)
Strategic Decisions
Decision 1: The "All-in-One" vs "Best-of-Breed" Trade-off
All-in-One (Datadog, New Relic, Dynatrace):
- ✅ Unified experience, correlated data
- ✅ Less operational overhead
- ❌ Expensive at scale ($30-50/host/month × 1000 hosts = $360K-600K/year)
- ❌ Vendor lock-in
Best-of-Breed (Prometheus + Grafana + ELK):
- ✅ Lower cost (mostly operational effort; a rough cost model follows the recommendations below)
- ✅ Flexibility and control
- ❌ Integration complexity
- ❌ Team needs observability expertise
Recommendation:
- Under 50 engineers: All-in-one (Datadog, Honeycomb)
- 50-200 engineers: Hybrid (Prometheus + Grafana Cloud + Datadog for APM)
- 200+ engineers: Best-of-breed with a platform team
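To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. Every input (host count, per-host price, infra spend, engineer time, loaded salary) is an assumption to replace with your own quotes and figures, not real vendor pricing:

```python
# Illustrative-only cost model for all-in-one SaaS vs self-hosted best-of-breed.
# All numbers below are assumptions, not vendor price sheets.

def saas_annual_cost(hosts: int, per_host_per_month: float) -> float:
    """Annualized bill for simple host-based SaaS pricing."""
    return hosts * per_host_per_month * 12


def self_hosted_annual_cost(infra_per_month: float, engineer_fte: float,
                            loaded_salary: float) -> float:
    """Infrastructure spend plus the engineer time spent running the stack."""
    return infra_per_month * 12 + engineer_fte * loaded_salary


if __name__ == "__main__":
    saas = saas_annual_cost(hosts=1_000, per_host_per_month=40)   # ~$480K/yr
    diy = self_hosted_annual_cost(infra_per_month=15_000,         # assumed
                                  engineer_fte=1.5,               # assumed FTEs
                                  loaded_salary=220_000)          # assumed
    print(f"SaaS: ${saas:,.0f}/yr   Self-hosted: ${diy:,.0f}/yr")
```

With these particular assumptions the two options land surprisingly close together; the crossover is driven almost entirely by how much engineer time the self-hosted stack really consumes, which is the number teams most often underestimate.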
Decision 2: Distributed Tracing - Build, Buy, or Wait?
Notice that Distributed Tracing (0.55 evolution) has only just crossed into the "Product" phase and is still moving. This is a trap.
Why it's risky:
- Standards are still emerging (OpenTelemetry is winning but incomplete)
- Vendor lock-in is high (Honeycomb, Lightstep, Datadog all have proprietary formats)
- Cost models are unpredictable (per-span pricing can explode)
What to do:
- If you don't have tracing: Start with OpenTelemetry + Jaeger (open source; see the sketch below)
- If you have vendor tracing: Don't migrate unless the pain is severe
- If building new: Bet on OpenTelemetry, defer storage decisions
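A minimal sketch of what "start with OpenTelemetry" looks like in a Python service, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages and an OTLP-capable collector or Jaeger instance on localhost:4317; the service, span, and attribute names are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans in batches to whatever OTLP backend is listening locally.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "12345")  # business context travels with the span
    ...  # call the payment provider, etc.
```

Because the exporter speaks OTLP, the same instrumentation can later be pointed at Jaeger, Tempo, or a vendor backend without touching application code, which is exactly what "defer storage decisions" buys you.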
Decision 3: Kill Your Custom Scripts
See that red "Custom Scripts" node at (0.85, 0.15)? That's technical debt with high visibility.
Why it's bad:
- High inertia (people depend on it)
- Low evolution (reinventing the wheel)
- High visibility (critical to operations)
Migration plan:
- Month 1: Audit what the scripts do
- Month 2: Map each script to an OpenTelemetry equivalent
- Month 3: Run both in parallel
- Month 4: Sunset scripts, celebrate
Dependency Analysis
Critical Path: Logs
Service Reliability → Incident Response → Dashboards → Log Aggregation → Log Shipping → Application Logs
Risk: If log shipping breaks, incident response suffers.
Mitigation:
- Use battle-tested shippers (Fluentd, Vector)
- Buffer logs locally (don't rely on the network)
- Monitor the monitoring (meta-metrics on log lag)
Critical Path: Metrics
Service Reliability → Incident Response → Alerting → Alert Rules → Metrics Database → Metrics Collection → Instrumentation
Risk: If instrumentation is incomplete, alerts don't fire.
Mitigation:
- Standardize on OpenTelemetry (auto-instrumentation where possible)
- RED metrics (Rate, Errors, Duration) for every service, as sketched below
- SLO-based alerting (not just thresholds)
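A sketch of what that RED baseline can look like with the `prometheus_client` Python library; the metric names, labels, and service name are illustrative rather than a required convention:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors come from one counter labelled by outcome; Duration from a histogram.
REQUESTS = Counter("http_requests_total", "Requests handled",
                   ["service", "route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["service", "route"])


def handle(route: str) -> None:
    start = time.monotonic()
    status = "200"
    try:
        ...  # real request handling goes here
    except Exception:
        status = "500"
        raise
    finally:
        # Label with the route template, not raw URLs with IDs, to keep cardinality low.
        REQUESTS.labels("checkout", route, status).inc()
        LATENCY.labels("checkout", route).observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle("/pay")
```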
Cost Optimization Strategies
Strategy 1: Tiered Storage
Most observability data loses value over time. Use tiered storage:
- Hot (last 7 days): High-performance, expensive (Datadog, Prometheus)
- Warm (7-90 days): Medium performance (S3 + Athena, VictoriaMetrics)
- Cold (90+ days): Archive only (S3 Glacier)
Savings: 60-80% reduction in storage costs (one way to implement the tiering is sketched below).
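Assuming logs or metric archives are already landing in an S3 bucket (the bucket name and prefix below are hypothetical), an S3 lifecycle rule set via boto3 is one common implementation. Note that S3 won't transition objects to Standard-IA before 30 days, so the warm boundary here is 30 days rather than 7; the 7-day hot window lives in your query store (Prometheus, Datadog), not the archive:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-observability-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                {"Days": 90, "StorageClass": "GLACIER"},      # cold
            ],
            "Expiration": {"Days": 400},  # drop after roughly 13 months
        }]
    },
)
```

Prometheus-compatible long-term stores such as Thanos or Mimir apply the same idea to metric blocks kept in object storage.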
Strategy 2: Sampling
You don't need every trace or every log line.
Logs:
- DEBUG logs: Sample 1-10% in production
- INFO logs: Keep 100% but compress aggressively
- ERROR logs: Keep 100%, alert on anomalies
Traces:
- Happy path: Sample 1-5%
- Errors: Keep 100%
- Slow requests (p99+): Keep 100%
Savings: 70-90% reduction in trace ingestion costs (the keep/drop decision is sketched below).
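A minimal sketch of that decision as a simple sampler run where requests complete; the threshold and rate are assumptions to tune, and at scale a collector's tail-based sampler is the more robust way to do this:

```python
import random

# Assumed tunables: adjust to your own latency profile and budget.
SLOW_MS = 1_500         # treat anything slower than this as "p99+" for sampling
HAPPY_PATH_RATE = 0.05  # keep ~5% of ordinary, successful requests


def keep_trace(status_code: int, duration_ms: float) -> bool:
    """Decide, once a request has finished, whether to keep its trace."""
    if status_code >= 500:        # errors: always keep
        return True
    if duration_ms >= SLOW_MS:    # slow requests: always keep
        return True
    return random.random() < HAPPY_PATH_RATE  # happy path: sample


# Example: a fast 200 is usually dropped; a slow 200 or any 500 is always kept.
print(keep_trace(200, 120), keep_trace(200, 3_000), keep_trace(500, 90))
```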
Strategy 3: Open Source Where Possible
Good candidates for open source:
- Metrics: Prometheus (proven at scale)
- Dashboards: Grafana (better than most SaaS)
- Tracing storage: Jaeger or Tempo (if you have the expertise)
Bad candidates:
- Log aggregation at scale (ELK is operationally expensive)
- Anomaly detection (still emerging, vendor solutions are better)
- Cross-product correlation (Datadog's secret sauce)
Common Mistakes
Mistake 1: Building Your Own Metrics Database
Prometheus is free, battle-tested, and scales to millions of active time series per server (and further with federation or remote storage). Don't reinvent this.
Exception: You're running at Google/Meta scale and have a team of Ph.D.s. Even then, use VictoriaMetrics or M3DB (Prometheus-compatible open source alternatives).
Mistake 2: Ignoring Cardinality
High-cardinality labels (user IDs, request IDs) in metrics will kill your database.
Rule of thumb:
- Metrics: Low cardinality (under 100 unique values per label; see the sketch below)
- Logs: High cardinality is fine
- Traces: High cardinality is the point
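A sketch of the rule in practice (names are illustrative): keep bounded values like status and payment method as labels, and push unbounded identifiers into logs or trace attributes instead:

```python
import logging
from prometheus_client import Counter

log = logging.getLogger("checkout")

# Good: every label has a small, bounded set of possible values.
ORDERS = Counter("orders_total", "Orders processed", ["status", "payment_method"])

# Bad (don't do this): user_id is unbounded, so every new user mints a new
# time series and the metrics database's memory and index grow without limit.
# ORDERS_BY_USER = Counter("orders_by_user_total", "Orders per user", ["user_id"])


def record_order(user_id: str, payment_method: str, ok: bool) -> None:
    ORDERS.labels("success" if ok else "failure", payment_method).inc()
    # High-cardinality detail belongs in logs (or trace attributes), not labels.
    log.info("order processed", extra={"user_id": user_id})
```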
Mistake 3: Alert Fatigue
More alerts ≠ better reliability. Alert fatigue trains teams to ignore alerts.
Better approach:
- SLO-based alerting (only alert on SLO burn rate; a sketch follows this list)
- Escalation policies (page only after multiple failures)
- Consolidate alerts (one alert for "API is down", not 50 separate alerts for individual endpoints)
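Here is a minimal sketch of the multi-window burn-rate idea, loosely following the Google SRE workbook pattern; the 99.9% target, the short/long windows, and the 14.4x threshold are common defaults, not requirements:

```python
# Burn rate = observed error ratio / error budget. For a 99.9% SLO the budget
# is 0.1%, so a 1.44% error ratio over the last hour is a 14.4x burn.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET


def burn_rate(errors: int, total: int) -> float:
    return (errors / total) / ERROR_BUDGET if total else 0.0


def should_page(fast_window: tuple[int, int], slow_window: tuple[int, int]) -> bool:
    """Page only if both a short and a long window are burning fast.

    fast_window / slow_window are (errors, total) counts over, say, 5 minutes
    and 1 hour. Requiring both suppresses one-off blips while still catching
    sustained burns.
    """
    return burn_rate(*fast_window) > 14.4 and burn_rate(*slow_window) > 14.4


# Example: a brief spike (hot 5m window, calm 1h window) does not page.
print(should_page(fast_window=(90, 1_000), slow_window=(200, 60_000)))
```

In Prometheus this same logic is usually expressed as a couple of recording rules plus one alert rule; the Python version just makes the arithmetic explicit.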
Movement Recommendations
Next 6 Months
- Migrate to OpenTelemetry: Replace custom instrumentation
- Standardize Dashboards: Consolidate on Grafana or Datadog
- Implement Log Sampling: Reduce volume by 70%
- SLO-Based Alerting: Reduce noise by 80%
Next 12 Months
- Distributed Tracing Rollout: Start with critical services
- Tiered Storage: Move old data to cold storage
- Cost Monitoring: Track observability spend per service
- Runbook Automation: Auto-remediate common incidents
Long-Term (12+ Months)
- Unified Observability Platform: Correlate metrics, logs, traces
- AIOps: Anomaly detection and predictive alerting
- Full Automation: Self-healing systems
How to Use This Map
- Plot Your Stack: Where are your components today?
- Identify Gaps: Missing distributed tracing? No structured logs?
- Plan Migrations: Move components right (toward commodity)
- Make Buy Decisions: Use products/commodities, build only in genesis/custom
- Review Quarterly: The observability landscape evolves fast
Remember: The goal isn't perfect observability. It's good enough observability at a sustainable cost that lets you ship faster and sleep better.