Observability Stack Evolution: Building vs Buying Your Monitoring
Map the evolution of observability tooling from custom scripts to SaaS platforms. Understand when to build, when to buy, and how to avoid the commodity trap.
The Observability Dilemma
Your VP of Engineering asks: "Should we build our own observability stack or pay for Datadog?"
It's not a simple question. Observability costs can spiral (Datadog bills hitting $500K+/year), but building your own means maintaining infrastructure while your competitors ship features.
This Wardley map shows you where each component sits in its evolution and helps you make better build-vs-buy decisions.
Reading This Map
The Observability Value Chain
Top (High Visibility): What users care about
- Service Reliability (0.95) - The ultimate goal
- Incident Response (0.90) - Detecting and fixing problems
- Dashboards (0.85) - Visualizing system health
Bottom (Infrastructure): What users don't see but you depend on
- Application Logs (0.35) - Raw data from your apps
- Instrumentation Libraries (0.40) - How you capture data
- Log Shipping (0.50) - Moving logs around
Evolution Insights
Commodity Territory (75-100%):
- On-Call Management: Use PagerDuty or Opsgenie, don't build
- Metrics Collection: OpenTelemetry or Telegraf are standard
- Log Shipping: Fluentd/Logstash are mature, pick one
- Application Logs: stdout/stderr is the standard
Product Phase (50-75%):
- Metrics Database: Prometheus is the winner, InfluxDB for specific use cases
- Dashboards: Grafana is standard, but consider managed (Grafana Cloud)
- Log Aggregation: Depends on scale—ELK for DIY, Datadog/Splunk for SaaS
Custom/Emerging (25-50%):
- Distributed Tracing: Sitting at the Custom/Product boundary, still evolving, and vendor lock-in risk is high
- Alert Rules: Anomaly detection is custom, thresholds are commodity
Genesis (0-25%):
- Custom Scripts: If you're still here, you're in trouble (see warning marker)
Strategic Decisions
Decision 1: The "All-in-One" vs "Best-of-Breed" Trade-off
All-in-One (Datadog, New Relic, Dynatrace):
- ✅ Unified experience, correlated data
- ✅ Less operational overhead
- ❌ Expensive at scale ($30-50/host/month × 1000 hosts = $360K-600K/year)
- ❌ Vendor lock-in
Best-of-Breed (Prometheus + Grafana + ELK):
- ✅ Lower cost (mostly operational effort; a rough cost model follows the recommendations below)
- ✅ Flexibility and control
- ❌ Integration complexity
- ❌ Team needs observability expertise
Recommendation:
- Under 50 engineers: All-in-one (Datadog, Honeycomb)
- 50-200 engineers: Hybrid (Prometheus + Grafana Cloud + Datadog for APM)
- 200+ engineers: Best-of-breed with a platform team
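To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. Every input (host count, per-host price, infra spend, engineer time, loaded salary) is an assumption to replace with your own quotes and figures, not real vendor pricing:

```python
# Illustrative-only cost model for all-in-one SaaS vs self-hosted best-of-breed.
# All numbers below are assumptions, not vendor price sheets.

def saas_annual_cost(hosts: int, per_host_per_month: float) -> float:
    """Annualized bill for simple host-based SaaS pricing."""
    return hosts * per_host_per_month * 12


def self_hosted_annual_cost(infra_per_month: float, engineer_fte: float,
                            loaded_salary: float) -> float:
    """Infrastructure spend plus the engineer time spent running the stack."""
    return infra_per_month * 12 + engineer_fte * loaded_salary


if __name__ == "__main__":
    saas = saas_annual_cost(hosts=1_000, per_host_per_month=40)   # ~$480K/yr
    diy = self_hosted_annual_cost(infra_per_month=15_000,         # assumed
                                  engineer_fte=1.5,               # assumed FTEs
                                  loaded_salary=220_000)          # assumed
    print(f"SaaS: ${saas:,.0f}/yr   Self-hosted: ${diy:,.0f}/yr")
```

With these particular assumptions the two options land surprisingly close together; the crossover is driven almost entirely by how much engineer time the self-hosted stack really consumes, which is the number teams most often underestimate.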
Decision 2: Distributed Tracing - Build, Buy, or Wait?
Notice that Distributed Tracing (0.55 evolution) has only just crossed into the "Product" phase and is still moving. This is a trap.
Why it's risky:
- Standards are still emerging (OpenTelemetry is winning but incomplete)
- Vendor lock-in is high (Honeycomb, Lightstep, Datadog all have proprietary formats)
- Cost models are unpredictable (per-span pricing can explode)
What to do:
- If you don't have tracing: Start with OpenTelemetry + Jaeger (open source; see the sketch below)
- If you have vendor tracing: Don't migrate unless the pain is severe
- If building new: Bet on OpenTelemetry, defer storage decisions
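A minimal sketch of what "start with OpenTelemetry" looks like in a Python service, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages and an OTLP-capable collector or Jaeger instance on localhost:4317; the service, span, and attribute names are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Send spans in batches to whatever OTLP backend is listening locally.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("order.id", "12345")  # business context travels with the span
    ...  # call the payment provider, etc.
```

Because the exporter speaks OTLP, the same instrumentation can later be pointed at Jaeger, Tempo, or a vendor backend without touching application code, which is exactly what "defer storage decisions" buys you.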
Decision 3: Kill Your Custom Scripts
See that red "Custom Scripts" node at (0.85, 0.15)? That's technical debt with high visibility.
Why it's bad:
- High inertia (people depend on it)
- Low evolution (reinventing the wheel)
- High visibility (critical to operations)
Migration plan:
- Month 1: Audit what the scripts do
- Month 2: Map each script to an OpenTelemetry equivalent
- Month 3: Run both in parallel
- Month 4: Sunset scripts, celebrate
Dependency Analysis
Critical Path: Logs
Service Reliability → Incident Response → Dashboards → Log Aggregation → Log Shipping → Application Logs
Risk: If log shipping breaks, incident response suffers.
Mitigation:
- Use battle-tested shippers (Fluentd, Vector)
- Buffer logs locally (don't rely on the network)
- Monitor the monitoring (meta-metrics on log lag)
Critical Path: Metrics
Service Reliability → Incident Response → Alerting → Alert Rules → Metrics Database → Metrics Collection → Instrumentation
Risk: If instrumentation is incomplete, alerts don't fire.
Mitigation:
- Standardize on OpenTelemetry (auto-instrumentation where possible)
- RED metrics (Rate, Errors, Duration) for every service, as sketched below
- SLO-based alerting (not just thresholds)
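A sketch of what that RED baseline can look like with the `prometheus_client` Python library; the metric names, labels, and service name are illustrative rather than a required convention:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors come from one counter labelled by outcome; Duration from a histogram.
REQUESTS = Counter("http_requests_total", "Requests handled",
                   ["service", "route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    ["service", "route"])


def handle(route: str) -> None:
    start = time.monotonic()
    status = "200"
    try:
        ...  # real request handling goes here
    except Exception:
        status = "500"
        raise
    finally:
        # Label with the route template, not raw URLs with IDs, to keep cardinality low.
        REQUESTS.labels("checkout", route, status).inc()
        LATENCY.labels("checkout", route).observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    handle("/pay")
```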
Cost Optimization Strategies
Strategy 1: Tiered Storage
Most observability data loses value over time. Use tiered storage:
- Hot (last 7 days): High-performance, expensive (Datadog, Prometheus)
- Warm (7-90 days): Medium performance (S3 + Athena, VictoriaMetrics)
- Cold (90+ days): Archive only (S3 Glacier)
Savings: 60-80% reduction in storage costs (one way to implement the tiering is sketched below).
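Assuming logs or metric archives are already landing in an S3 bucket (the bucket name and prefix below are hypothetical), an S3 lifecycle rule set via boto3 is one common implementation. Note that S3 won't transition objects to Standard-IA before 30 days, so the warm boundary here is 30 days rather than 7; the 7-day hot window lives in your query store (Prometheus, Datadog), not the archive:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-observability-archive",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm
                {"Days": 90, "StorageClass": "GLACIER"},      # cold
            ],
            "Expiration": {"Days": 400},  # drop after roughly 13 months
        }]
    },
)
```

Prometheus-compatible long-term stores such as Thanos or Mimir apply the same idea to metric blocks kept in object storage.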
Strategy 2: Sampling
You don't need every trace or every log line.
Logs:
- DEBUG logs: Sample 1-10% in production
- INFO logs: Keep 100% but compress aggressively
- ERROR logs: Keep 100%, alert on anomalies
Traces:
- Happy path: Sample 1-5%
- Errors: Keep 100%
- Slow requests (p99+): Keep 100%
Savings: 70-90% reduction in trace ingestion costs (the keep/drop decision is sketched below).
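A minimal sketch of that decision as a simple sampler run where requests complete; the threshold and rate are assumptions to tune, and at scale a collector's tail-based sampler is the more robust way to do this:

```python
import random

# Assumed tunables: adjust to your own latency profile and budget.
SLOW_MS = 1_500         # treat anything slower than this as "p99+" for sampling
HAPPY_PATH_RATE = 0.05  # keep ~5% of ordinary, successful requests


def keep_trace(status_code: int, duration_ms: float) -> bool:
    """Decide, once a request has finished, whether to keep its trace."""
    if status_code >= 500:        # errors: always keep
        return True
    if duration_ms >= SLOW_MS:    # slow requests: always keep
        return True
    return random.random() < HAPPY_PATH_RATE  # happy path: sample


# Example: a fast 200 is usually dropped; a slow 200 or any 500 is always kept.
print(keep_trace(200, 120), keep_trace(200, 3_000), keep_trace(500, 90))
```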
Strategy 3: Open Source Where Possible
Good candidates for open source:
- Metrics: Prometheus (proven at scale)
- Dashboards: Grafana (better than most SaaS)
- Tracing storage: Jaeger or Tempo (if you have the expertise)
Bad candidates:
- Log aggregation at scale (ELK is operationally expensive)
- Anomaly detection (still emerging, vendor solutions are better)
- Cross-product correlation (Datadog's secret sauce)
Common Mistakes
Mistake 1: Building Your Own Metrics Database
Prometheus is free, battle-tested, and scales to millions of active time series per server (and further with federation or remote storage). Don't reinvent this.
Exception: You're running at Google/Meta scale and have a team of Ph.D.s. Even then, use VictoriaMetrics or M3DB (Prometheus-compatible open source alternatives).
Mistake 2: Ignoring Cardinality
High-cardinality labels (user IDs, request IDs) in metrics will kill your database.
Rule of thumb:
- Metrics: Low cardinality (under 100 unique values per label; see the sketch below)
- Logs: High cardinality is fine
- Traces: High cardinality is the point
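A sketch of the rule in practice (names are illustrative): keep bounded values like status and payment method as labels, and push unbounded identifiers into logs or trace attributes instead:

```python
import logging
from prometheus_client import Counter

log = logging.getLogger("checkout")

# Good: every label has a small, bounded set of possible values.
ORDERS = Counter("orders_total", "Orders processed", ["status", "payment_method"])

# Bad (don't do this): user_id is unbounded, so every new user mints a new
# time series and the metrics database's memory and index grow without limit.
# ORDERS_BY_USER = Counter("orders_by_user_total", "Orders per user", ["user_id"])


def record_order(user_id: str, payment_method: str, ok: bool) -> None:
    ORDERS.labels("success" if ok else "failure", payment_method).inc()
    # High-cardinality detail belongs in logs (or trace attributes), not labels.
    log.info("order processed", extra={"user_id": user_id})
```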
Mistake 3: Alert Fatigue
More alerts ≠ better reliability. Alert fatigue trains teams to ignore alerts.
Better approach:
- SLO-based alerting (only alert on SLO burn rate; a sketch follows this list)
- Escalation policies (page only after multiple failures)
- Consolidate alerts (one alert for "API is down", not 50 separate alerts for individual endpoints)
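Here is a minimal sketch of the multi-window burn-rate idea, loosely following the Google SRE workbook pattern; the 99.9% target, the short/long windows, and the 14.4x threshold are common defaults, not requirements:

```python
# Burn rate = observed error ratio / error budget. For a 99.9% SLO the budget
# is 0.1%, so a 1.44% error ratio over the last hour is a 14.4x burn.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET


def burn_rate(errors: int, total: int) -> float:
    return (errors / total) / ERROR_BUDGET if total else 0.0


def should_page(fast_window: tuple[int, int], slow_window: tuple[int, int]) -> bool:
    """Page only if both a short and a long window are burning fast.

    fast_window / slow_window are (errors, total) counts over, say, 5 minutes
    and 1 hour. Requiring both suppresses one-off blips while still catching
    sustained burns.
    """
    return burn_rate(*fast_window) > 14.4 and burn_rate(*slow_window) > 14.4


# Example: a brief spike (hot 5m window, calm 1h window) does not page.
print(should_page(fast_window=(90, 1_000), slow_window=(200, 60_000)))
```

In Prometheus this same logic is usually expressed as a couple of recording rules plus one alert rule; the Python version just makes the arithmetic explicit.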
Movement Recommendations
Next 6 Months
- Migrate to OpenTelemetry: Replace custom instrumentation
- Standardize Dashboards: Consolidate on Grafana or Datadog
- Implement Log Sampling: Reduce volume by 70%
- SLO-Based Alerting: Reduce noise by 80%
Next 12 Months
- Distributed Tracing Rollout: Start with critical services
- Tiered Storage: Move old data to cold storage
- Cost Monitoring: Track observability spend per service
- Runbook Automation: Auto-remediate common incidents
Long-Term (12+ Months)
- Unified Observability Platform: Correlate metrics, logs, traces
- AIOps: Anomaly detection and predictive alerting
- Full Automation: Self-healing systems
How to Use This Map
- Plot Your Stack: Where are your components today?
- Identify Gaps: Missing distributed tracing? No structured logs?
- Plan Migrations: Move components right (toward commodity)
- Make Buy Decisions: Use products/commodities, build only in genesis/custom
- Review Quarterly: The observability landscape evolves fast
Remember: The goal isn't perfect observability. It's good enough observability at a sustainable cost that lets you ship faster and sleep better.