Replacing Datadog: A Cheaper Observability Stack That Actually Works

If your Datadog bill has crossed $20-30k/month, self-hosting starts to make sense. Here's the stack I deploy and what to expect.

By Andrii Votiakov on 2026-04-08

Datadog is great. It's also expensive enough that at a certain scale, self-hosting your observability stack pays for itself inside a quarter. The decision is mostly about how much engineering time you have, not whether the technology is ready. Before deciding to replace it, check whether Datadog cost optimisation within the platform can close most of the gap; sometimes you can halve the bill without migrating. And if you stay on Datadog but also want to reduce your CloudWatch spend, the two efforts often overlap on the same accounts.

Quick answer

Above roughly $20-30k/month in Datadog spend, a self-hosted Grafana + Prometheus/VictoriaMetrics + Loki + Tempo + OpenTelemetry stack typically costs $3-8k/month all-in (compute + storage + one engineer's time). Below that threshold, Datadog is usually cheaper net. Above it, you're paying for convenience.

When self-hosting makes sense

Run the math honestly:

| Datadog monthly | Self-hosted estimate | Verdict |
| --- | --- | --- |
| < $5k | $4-6k | Stay on Datadog |
| $5-15k | $4-8k | Toss-up; depends on team |
| $15-30k | $5-10k | Self-host probably wins |
| $30k+ | $7-15k | Self-host wins clearly |

Also factor in onboarding (6-12 weeks of one engineer's time), ongoing operations (~20% of one engineer's time), and the risk cost of no longer having Datadog support behind you.

The stack I deploy

Standard, boring, well-understood. No exotic pieces.

Metrics: Prometheus or VictoriaMetrics

Prometheus for under ~10M active series. Simple, well-documented, scales fine to mid-size.

VictoriaMetrics above that. It's a drop-in Prometheus replacement that is dramatically cheaper at scale: a single-binary, single-node instance handles 100M+ active series and stores data with much better compression, typically 70-90% less disk than Prometheus.

For most companies replacing Datadog, VictoriaMetrics is the right choice.
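If you keep Prometheus as the scraper and use VictoriaMetrics for storage, the wiring is one remote_write block. A minimal sketch, assuming VictoriaMetrics is reachable at victoriametrics:8428 (its default port); the hostname and scrape target are placeholders:

# prometheus.yml: forward everything Prometheus scrapes to VictoriaMetrics
remote_write:
  - url: http://victoriametrics:8428/api/v1/write

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]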

Logs: Loki

Loki stores logs in object storage (S3/GCS/MinIO) and indexes only labels, not content. Cheaper than Elasticsearch by an order of magnitude for most workloads.

Trade-off: full-text search is slower than Elasticsearch. For most teams that's fine — you usually search by label (service, environment, level) and time range, not content.
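In query terms, you lean on label selectors and use content matching only to narrow down. A hypothetical LogQL comparison (the service, env, and level labels are assumptions about your schema):

# Cheap: streams are selected by indexed labels, then filtered
{service="checkout", env="prod", level="error"} |= "timeout"

# Expensive: forces a content scan across every prod stream in range
{env="prod"} |= "timeout"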

Configure with sensible ingestion and retention limits:

limits_config:
  ingestion_rate_mb: 32     # per-tenant ingest cap, in MB/s
  retention_period: 720h    # 30 days; applied by the compactor when retention_enabled: true

storage_config:
  aws:
    s3: s3://eu-west-1/loki-logs   # region/bucket shorthand; credentials via IAM role or environment
    s3forcepathstyle: true

Traces: Tempo or Jaeger

Tempo stores traces in object storage like Loki. Cheap, scales well, integrates with Grafana.

Jaeger if you already run it. Either works. Tempo is the easier pairing if you're going Grafana-native.
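The storage side of Tempo mirrors the Loki config above. A minimal sketch; the bucket name and endpoint are assumptions, so check the config reference for your Tempo version:

# tempo.yaml: traces land in object storage, same pattern as Loki
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.eu-west-1.amazonaws.com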

Visualisation + alerts: Grafana

Grafana ties it all together. Dashboards, alerts, exploration. Open-source, free, mature.
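Datasources can be provisioned declaratively rather than clicked together. A sketch assuming the in-cluster hostnames and default ports used elsewhere in this post:

# /etc/grafana/provisioning/datasources/stack.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus            # VictoriaMetrics speaks the Prometheus query API
    url: http://victoriametrics:8428
  - name: Loki
    type: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    url: http://tempo:3200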

For team-shared dashboards across cloud regions, the Grafana Cloud free tier handles small teams (3 users, 10k metrics, 50 GB logs/traces); it's a useful entry point that you can later migrate to self-hosted as you grow.

Collectors: OpenTelemetry

OpenTelemetry Collector is the boring right answer for receiving telemetry. It's the vendor-neutral collector that talks to all of the above and to most SaaS observability tools (so you can dual-ship during migration).

Standard receiver/processor/exporter configuration in YAML, deployable as a sidecar, daemon, or gateway.
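A minimal gateway config wired to the stack above. This is a sketch: it assumes Loki 3.x (which ingests OTLP natively at /otlp) and the hostnames used earlier:

# otel-collector.yaml: receive OTLP, fan out to the three backends
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
  otlphttp/loki:
    endpoint: http://loki:3100/otlp
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true            # in-cluster traffic; enable TLS for anything else

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]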

The migration playbook

Week 1-2: Stand up the stack

Deploy Grafana, VictoriaMetrics, Loki, Tempo on Kubernetes (Helm charts) or VMs. Persistence on S3-compatible storage. Total bring-up: ~10 days for an engineer who's done it before, ~3 weeks first time.
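For a single-VM proof of concept before the Helm rollout, something like this is enough to evaluate the stack. A sketch, not production sizing; the loki.yaml and tempo.yaml files are the configs from the sections above:

# docker-compose.yaml: evaluation bring-up on one VM
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    ports: ["8428:8428"]
  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/loki.yaml"]
    volumes: ["./loki.yaml:/etc/loki/loki.yaml"]
    ports: ["3100:3100"]
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes: ["./tempo.yaml:/etc/tempo/tempo.yaml"]
    ports: ["3200:3200"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]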

Week 3-4: Dual-ship

Configure OpenTelemetry to send to both Datadog and your new stack. No user impact, no risk. You'll find issues with parsing, label mappings, and the like; fix them while Datadog is still authoritative. The exporter change is sketched below.
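Dual-shipping is one added exporter. A sketch using the Datadog exporter from opentelemetry-collector-contrib, assuming your API key is in the DD_API_KEY environment variable (the site value depends on your account region):

# Added to the collector config above
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite, datadog]   # both backends get the same stream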

Week 5-6: Migrate dashboards and alerts

Datadog dashboards don't import directly to Grafana. Two approaches:

  1. Manual: recreate the 10-20 dashboards your team actually uses. Most "important" Datadog dashboards are full of unused panels.
  2. Programmatic: use a translator like dd2grafana (community tool) for the bulk, then manually fix the rest.

Alerts: same pattern. Recreate the critical 50-100 alerts. The rest were probably noise anyway.
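If vmalert (VictoriaMetrics' rule evaluator) handles alerting, recreated alerts are plain Prometheus-format rule files. A hypothetical example; the metric and label names are assumptions:

# alerts.yaml: Prometheus-compatible rules, evaluated by vmalert
groups:
  - name: service-health
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 5% on {{ $labels.service }}"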

Week 7-8: Cut over

Stop sending data to Datadog. Watch closely for missing data or alerts. Keep the Datadog account in read-only mode for 30 days as a safety net.

Week 9: Cancel

Cancel Datadog. Buy your team coffee. Move on.

What you give up

Honest list:

  • Out-of-the-box integrations. Datadog has hundreds. OpenTelemetry has most of them now but a few still need custom work.
  • Watchdog / anomaly detection. Grafana ML alternatives exist but aren't as polished.
  • Single pane of glass UX. Grafana is fine; not as smooth as Datadog's tightly-integrated UI.
  • Vendor support. You're now responsible for the stack.

What you gain

  • Predictable, capped monthly cost. No surprise bills from a new high-cardinality metric.
  • Fully open standards. OpenTelemetry data is yours forever.
  • No per-host or per-event pricing pressure. Add a hundred services without flinching.
  • Better data sovereignty. Useful for EU/regulated workloads.

Realistic numbers

Recent client (Datadog $34k/month at peak):

  • Self-hosted stack on EKS: 4 VictoriaMetrics + 3 Loki + 2 Tempo + Grafana
  • Compute: ~$1,800/month (Spot for Loki/Tempo workers)
  • S3 storage: ~$900/month (logs + traces, 30-day hot, IA after)
  • 1 engineer ~20% time ongoing: ~$2,500/month equivalent
  • Total all-in: $5,200/month

Saved: $28,800/month, or roughly $345k/year. The migration paid for itself within four weeks of operation.


If you want help running the math on whether this fits your shape — and the migration if it does — book a call.