Datadog Cost Optimisation: Cardinality, Logs, and Custom Metrics

Datadog bills can rival the cloud bill they're supposed to monitor. Here's where the spend actually goes and how to cut it 50-70% without losing visibility.

By Andrii Votiakov on 2026-04-04

Datadog is the observability stack that ate the world. It's also the one most likely to be your second-biggest cloud bill. After a hundred audits, the pattern is depressingly consistent: nobody understands the pricing, the team enables features by default, and the bill compounds quietly.

Quick answer

Datadog charges separately for infrastructure hosts, APM hosts, log ingestion ($0.10/GB) and log retention, custom metrics (every unique tag combination is a billable time series), synthetic tests, and RUM. Custom-metric cardinality and log volume are usually 70%+ of the bill; cut both and the rest is small.

Where the money actually goes

Common bill breakdown (typical mid-size SaaS):

Line item                        % of total
Infrastructure                   15-25%
APM                              15-25%
Logs (ingestion + retention)     25-40%
Custom metrics (cardinality)     15-30%
RUM, Synthetics, CI Visibility   5-15%

Custom metrics + logs is where 60-70% of all savings live. Start there.

Custom metrics: the cardinality trap

Datadog charges per unique time series — every distinct combination of metric name + tag values. A metric with 50 hosts × 100 endpoints × 5 status codes = 25,000 time series.
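Back-of-envelope, cardinality is just the product of per-tag value counts. A quick sketch of the math (the ~$0.05/series/month rate is an assumption for illustration; check the custom-metric rate on your own contract):

```python
from math import prod

def custom_metric_series(tag_cardinalities: dict[str, int]) -> int:
    """Distinct time series = product of distinct values per tag key."""
    return prod(tag_cardinalities.values())

def monthly_cost(series: int, rate_per_series: float = 0.05) -> float:
    # rate_per_series is an assumed list price, not a quoted Datadog rate
    return series * rate_per_series

series = custom_metric_series({"host": 50, "endpoint": 100, "status": 5})
print(series)                # 25000, matching the example above
print(monthly_cost(series))  # 1250.0 at the assumed rate
```

Add one more tag with 20 values and the same metric becomes 500,000 series; the product is why cardinality surprises people.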

Easy ways to blow up cardinality:

  • Tagging metrics with request_id, user_id, trace_id (unique per request → unbounded cardinality)
  • Tagging with high-cardinality dimensions like path on a service that exposes 10,000 unique paths
  • Histogram metrics emitting per-pod, per-status, per-route, per-method tags — multiplies fast
  • Per-pod metrics in Kubernetes with hundreds of pods
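One defensive pattern is a thin wrapper around your StatsD client that strips known-unbounded tag keys before anything is emitted. A minimal sketch (the denylist and the `emit` hook are hypothetical; wire it to your real client call, e.g. `datadog.statsd.increment`):

```python
# Tag keys that are unique (or nearly unique) per request: never metric tags
UNBOUNDED_TAG_KEYS = {"request_id", "user_id", "trace_id", "span_id"}

def safe_tags(tags: list[str]) -> list[str]:
    """Drop any key:value tag whose key has unbounded cardinality."""
    return [t for t in tags if t.split(":", 1)[0] not in UNBOUNDED_TAG_KEYS]

def increment(metric: str, tags: list[str], emit=print) -> None:
    # emit stands in for the StatsD client call (hypothetical hook)
    emit(metric, safe_tags(tags))

increment("api.requests", ["endpoint:/users", "user_id:12345", "status:200"])
```

High-cardinality dimensions like `user_id` belong in logs or traces, where a single event doesn't spawn a permanent billable time series.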

How to find offenders:

Datadog → Infrastructure → Metrics Summary → Sort by "Distinct Metrics"

Anything emitting > 10,000 distinct time series for a single metric is suspect. Drop high-cardinality tags or remove the metric.

Logs: ingestion + retention + indexing

Logs are priced as three separate line items:

  1. Ingestion: $0.10/GB
  2. Indexing (15 days hot search): $1.70/million events
  3. Retention (rehydration): tiered
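Those meters compound, so it's worth modeling the hot path before touching anything. A rough sketch using the rates above (retention/rehydration tiers vary by contract and are left out; the example volumes are illustrative):

```python
def monthly_log_cost(gb_ingested: float, events_indexed: float,
                     ingest_rate: float = 0.10,
                     index_rate_per_million: float = 1.70) -> float:
    """Ingestion + 15-day indexing only; rehydration tiers excluded."""
    return (gb_ingested * ingest_rate
            + (events_indexed / 1_000_000) * index_rate_per_million)

# e.g. 2 TB ingested and 1B events indexed in a month (assumed volumes)
print(round(monthly_log_cost(2_000, 1_000_000_000), 2))
```

Note that indexing, not ingestion, dominates at typical event sizes; that's why the fixes below attack both separately.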

The savings:

Drop noise before ingestion

Use Datadog's Log Pipelines + Exclusion Filters. Drop:

  • Health check probes (/health, /ready)
  • Successful 200 access logs (sample to 5-10%)
  • DEBUG-level lines in production
  • Kubernetes audit log noise

This is usually 30-50% of ingestion gone.
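The same drop rules can also run in your collector before anything is ingested. A minimal sketch of the predicate, assuming hypothetical field names (`path`, `level`, `env`, `status`) and a 5% sample rate; Datadog's Exclusion Filters express the same logic in the UI:

```python
import random

HEALTH_PATHS = {"/health", "/ready", "/live"}  # assumed probe paths

def keep_log(record: dict, ok_sample_rate: float = 0.05,
             rng=random.random) -> bool:
    """Return True if the log line should be shipped to Datadog."""
    if record.get("path") in HEALTH_PATHS:
        return False                       # drop probe noise entirely
    if record.get("level") == "DEBUG" and record.get("env") == "prod":
        return False                       # no DEBUG lines in production
    if record.get("status") == 200:
        return rng() < ok_sample_rate      # sample healthy access logs
    return True                            # keep errors and everything else
```

The `rng` hook makes the sampling testable; in production the default `random.random` is fine because you only care about the aggregate rate.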

Use Logging without Limits + Indexes only on important streams

Indexing is what makes logs searchable in real time. You don't need every log indexed for 15 days. Configure indexes by service:

  • Production application logs: 15 days indexed
  • Internal services: 7 days
  • Batch jobs: 3 days
  • Cold/security logs: 0 days indexed, just stored for rehydration

Forward cold logs to S3 (logs you'll never search live)

If you need long-term retention for audit, ship to S3 from your collector instead of paying Datadog's retention tier. Standard log forwarder pattern with Vector or Fluent Bit. Cuts cost by 80%+ for cold logs.
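A minimal sketch of that forwarder's core, assuming boto3 and a date-partitioned bucket layout (the bucket name, prefix, and batch format are illustrative choices, not a Datadog feature):

```python
import gzip
import json
import time
import uuid

def encode_batch(records: list[dict]) -> bytes:
    """Newline-delimited JSON, gzipped: a common S3 cold-log layout."""
    return gzip.compress("\n".join(json.dumps(r) for r in records).encode())

def batch_key(prefix: str = "cold-logs") -> str:
    # Date-partitioned keys let Athena or rehydration prune by day
    return f"{prefix}/{time.strftime('%Y/%m/%d')}/{uuid.uuid4()}.ndjson.gz"

def ship_batch(records: list[dict], bucket: str,
               prefix: str = "cold-logs") -> str:
    import boto3  # assumed dependency; any S3-compatible SDK works
    key = batch_key(prefix)
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=encode_batch(records))
    return key
```

In practice Vector or Fluent Bit does this batching for you; the sketch just shows how little machinery the cold path needs compared to Datadog's retention pricing.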

APM: traces, ingestion, and indexing

APM is billed per APM host plus ingested and indexed span volume. Sample aggressively:

  • Default head-based sampling: keep 100% of error and slow traces, sample healthy ones at 1-5%
  • Prefer Datadog's agent-side, rate-based sampling (the agent adapts to traffic) over a fixed random rate hardcoded in the client
  • Drop high-volume internal-only spans (e.g., DB-driver internal spans)
  • Watch for span count multiplication on async/messaging architectures (1 user request → 50 spans is normal but expensive at high QPS)
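Those rules reduce to a small head-based decision function. A sketch under assumed thresholds (this is not Datadog's exact sampler; the agent's rate-based logic is more adaptive, but the shape is the same):

```python
def keep_trace(trace_id: int, is_error: bool, duration_ms: float,
               slow_ms: float = 1000.0, healthy_rate: float = 0.02) -> bool:
    """Head-based sampling: keep all errors and slow traces, few healthy ones."""
    if is_error or duration_ms >= slow_ms:
        return True                                  # 100% of interesting traces
    # Deterministic per-trace sampling: a trace always gets the same verdict,
    # so all of its spans are kept or dropped together
    return (trace_id % 10_000) < healthy_rate * 10_000

print(keep_trace(42, is_error=True, duration_ms=5))  # True
```

Hashing on the trace ID rather than rolling a die per span is what keeps traces intact; dropping half the spans of a kept trace is the worst of both worlds.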

Infrastructure hosts

Datadog charges per host-hour. Two common wastes:

  1. Dev and staging on full pricing. Keep lower-priority environments on a cheaper plan tier only if the team actually uses the monitoring there; on ephemeral preview environments, drop the agent entirely.
  2. Dev clusters billing around the clock. Pause monitoring on nights and weekends for clusters that scale to zero; once the nodes are gone the agent stops reporting host-hours (kube-downscaler can handle the scale-down).

RUM, Synthetics, CI

These are smaller line items, but quick checks:

  • RUM: per-session pricing. Disable on internal admin pages and bot traffic.
  • Synthetics: per-test-run pricing. Don't run a test every minute when you only look at the result daily.
  • CI Visibility: per-test pricing. Enable selectively on important pipelines, not every PR build.

The audit checklist I run

  1. Pull the last 30 days from Usage → Cost & Usage, group by product
  2. Metrics → Sort by Distinct Metrics — find the cardinality offenders
  3. Logs → Indexes — look at index size and retention; right-size each
  4. Logs → Pipelines — find the high-volume sources, add exclusion filters
  5. APM → Sampling Rules — confirm aggressive sampling is in place
  6. Infrastructure → Hosts — check non-prod hosts; consider tier downgrade
  7. Synthetics → Tests — kill anything you don't actually look at

What I usually find

  • One service emitting a user_id tag on a counter metric → 4M distinct time series → $3-8k/month
  • DEBUG logging in prod on the busy service → $1-4k/month in ingestion
  • 30-day retention on logs nobody queries past 7 days → $1-2k/month
  • 100% APM sampling on a high-QPS API → $2-5k/month in indexed spans
  • Dev environment paying same per-host as prod → $500-2k/month

Realistic numbers

Recent SaaS client (~$28k/month Datadog):

  • Custom metric cardinality cleanup: $5,400/month
  • Log exclusion + retention by index: $6,200/month
  • APM sampling 100% → 5% on healthy traces: $2,800/month
  • Dropped Synthetic tests not in use: $600/month
  • Dev environment downgraded: $1,100/month

Final: $11,900/month, ~58% reduction.

If you decide Datadog still isn't worth what's left, the alternative is self-hosted — see the Datadog replacement post.


Want me to audit your Datadog usage on a pay-for-savings basis? Book a call.