Replacing Datadog: A Cheaper Observability Stack That Actually Works
If your Datadog bill has crossed $20-30k/month, self-hosting starts to make sense. Here's the stack I deploy and what to expect.
By Andrii Votiakov
Datadog is great. It's also expensive enough that at a certain scale, self-hosting your observability stack pays back inside a quarter. The decision is mostly about how much engineering time you have, not whether the technology is ready. Before deciding to replace, it's worth checking whether Datadog cost optimisation within the platform can close most of the gap — sometimes you can halve the bill without migrating. And if you stay on Datadog but want to reduce your CloudWatch spend alongside it, those two often overlap on the same accounts.
Quick answer
Above roughly $20-30k/month in Datadog spend, a self-hosted Grafana + Prometheus/VictoriaMetrics + Loki + Tempo + OpenTelemetry stack typically costs $3-8k/month all-in (compute + storage + one engineer's time). Below that threshold, Datadog is usually cheaper net. Above it, you're paying for convenience.
When self-hosting makes sense
Run the math honestly:
| Datadog monthly | Self-hosted estimate | Verdict |
|---|---|---|
| < $5k | $4-6k | Stay on Datadog |
| $5-15k | $4-8k | Toss-up; depends on team |
| $15-30k | $5-10k | Self-host probably wins |
| $30k+ | $7-15k | Self-host wins clearly |
Also factor in onboarding (6-12 weeks for one engineer), ongoing operations (~20% of one engineer's time), and the cost of the "I'm not Datadog support" risk.
The stack I deploy
Standard, boring, well-understood. No exotic pieces.
Metrics: Prometheus or VictoriaMetrics
Prometheus for under ~10M active series. Simple, well-documented, scales fine to mid-size.
VictoriaMetrics above that. Drop-in Prometheus replacement, dramatically cheaper at scale (single-binary single-node up to 100M+ series). Stores data with much better compression — 70-90% less disk than Prometheus.
For most companies replacing Datadog, VictoriaMetrics is the right choice. Single-binary single-instance handles enormous load.
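If you already run Prometheus, pointing it at a single-node VictoriaMetrics instance is a one-stanza change. A minimal sketch (the `victoriametrics` hostname is a placeholder for wherever you deploy it):

```yaml
# prometheus.yml (fragment): ship scraped samples to VictoriaMetrics.
# Single-node VictoriaMetrics accepts Prometheus remote_write on port 8428.
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000  # larger batches reduce request overhead
```

Grafana then queries VictoriaMetrics through its Prometheus-compatible query API, so existing dashboards keep working.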
Logs: Loki
Loki stores logs in object storage (S3/GCS/MinIO) and indexes only labels, not content. Cheaper than Elasticsearch by an order of magnitude for most workloads.
Trade-off: full-text search is slower than Elasticsearch. For most teams that's fine — you usually search by label (service, environment, level) and time range, not content.
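In practice that label-first model looks like this in LogQL: select streams by label, then optionally filter lines. Service and environment names here are illustrative:

```logql
{service="checkout", env="prod", level="error"} |= "timeout"
```

The stream selector hits only the label index; the `|= "timeout"` line filter scans just the matching chunks, which is why narrow labels plus a tight time range keep queries fast.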
Configure with sensible chunk and retention settings:

```yaml
limits_config:
  ingestion_rate_mb: 32
  retention_period: 720h  # 30 days
storage_config:
  aws:
    s3: s3://eu-west-1/loki-logs
    s3forcepathstyle: true
```
Traces: Tempo or Jaeger
Tempo stores traces in object storage like Loki. Cheap, scales well, integrates with Grafana.
Jaeger if you already run it. Either works. Tempo is the easier pairing if you're going Grafana-native.
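Tempo's storage configuration mirrors Loki's object-storage approach. A sketch, with an illustrative bucket name and a 30-day retention to match the Loki config above:

```yaml
# tempo.yaml (fragment): traces land in S3, retention via the compactor.
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces              # illustrative bucket name
      endpoint: s3.eu-west-1.amazonaws.com
      region: eu-west-1
compactor:
  compaction:
    block_retention: 720h               # keep 30 days of traces
```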
Visualisation + alerts: Grafana
Grafana ties it all together. Dashboards, alerts, exploration. Open-source, free, mature.
For small teams, Grafana Cloud's free tier (3 users, 10k metrics, 50 GB logs/traces) is a useful entry point that you can swap for self-hosted Grafana as you grow.
Collectors: OpenTelemetry
OpenTelemetry Collector is the boring right answer for receiving telemetry. It's the vendor-neutral collector that talks to all of the above and to most SaaS observability tools (so you can dual-ship during migration).
Standard receiver/processor/exporter configuration in YAML, deployable as a sidecar, daemon, or gateway.
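A sketch of that configuration, wired for the dual-ship pattern: one pipeline, two destinations, so Datadog stays authoritative while the new stack warms up. This assumes the collector-contrib distribution (the Datadog exporter lives there), and the endpoints are placeholders:

```yaml
# otel-collector config (sketch): receive OTLP, batch, fan out to
# both Datadog and the self-hosted stack during migration.
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, prometheusremotewrite]
```

Cutting over later is just deleting the `datadog` exporter from the pipeline; nothing upstream changes.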
The migration playbook
Week 1-2: Stand up the stack
Deploy Grafana, VictoriaMetrics, Loki, Tempo on Kubernetes (Helm charts) or VMs. Persistence on S3-compatible storage. Total bring-up: ~10 days for an engineer who's done it before, ~3 weeks first time.
Week 3-4: Dual-ship
Configure OpenTelemetry to send to both Datadog and your new stack. No user impact, no risk. You'll find issues with parsing, label mappings, etc — fix them while Datadog is still authoritative.
Week 5-6: Migrate dashboards and alerts
Datadog dashboards don't import directly to Grafana. Two approaches:
- Manual: recreate the 10-20 dashboards your team actually uses. Most "important" Datadog dashboards have unused panels.
- Programmatic: use a translator like `dd2grafana` (community tool) for the bulk; manually fix the rest.
Alerts: same pattern. Recreate the critical 50-100 alerts. The rest were probably noise anyway.
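Recreated alerts are plain Prometheus-style rule files, evaluated by vmalert (VictoriaMetrics) or Prometheus itself and surfaced in Grafana. A sketch; the metric name and 5% threshold are illustrative, not from any particular Datadog monitor:

```yaml
# Prometheus-style rule file (evaluated by vmalert or Prometheus).
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 5% for 10 minutes"
```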
Week 7-8: Cut over
Stop sending data to Datadog. Watch closely for missing data or alerts. Keep Datadog account in read-only for 30 days as a safety net.
Week 9: Cancel
Cancel Datadog. Buy your team coffee. Move on.
What you give up
Honest list:
- Out-of-the-box integrations. Datadog has hundreds. OpenTelemetry has most of them now but a few still need custom work.
- Watchdog / anomaly detection. Grafana ML alternatives exist but aren't as polished.
- Single pane of glass UX. Grafana is fine; not as smooth as Datadog's tightly-integrated UI.
- Vendor support. You're now responsible for the stack.
What you gain
- Predictable, capped monthly cost. No surprise bills from a new high-cardinality metric.
- Fully open standards. OpenTelemetry data is yours forever.
- No per-host or per-event pricing pressure. Add a hundred services without flinching.
- Better data sovereignty. Useful for EU/regulated workloads.
Realistic numbers
Recent client (Datadog $34k/month at peak):
- Self-hosted stack on EKS: 4 VictoriaMetrics + 3 Loki + 2 Tempo + Grafana
- Compute: ~$1,800/month (Spot for Loki/Tempo workers)
- S3 storage: ~$900/month (logs + traces, 30-day hot, IA after)
- 1 engineer ~20% time ongoing: ~$2,500/month equivalent
- Total all-in: $5,200/month
Saved: $28,800/month, $345k/year. Migration paid back in 4 weeks of operating.
If you want help running the math on whether this fits your shape — and the migration if it does — book a call.