Datadog Cost Optimisation: Cardinality, Logs, and Custom Metrics
Datadog bills can rival the cloud bill they're supposed to monitor. Here's where the spend actually goes and how to cut it 50-70% without losing visibility.
By Andrii Votiakov
Datadog is the observability stack that ate the world. It's also the one most likely to be your second-biggest cloud bill. After a hundred audits, the pattern is depressingly consistent: nobody understands the pricing, the team enables features by default, and the bill compounds quietly.
Quick answer
Datadog charges separately for infrastructure hosts, APM hosts, log ingestion ($0.10/GB) and log retention, custom metrics (every unique tag combination), synthetic tests, and RUM. Cardinality on custom metrics and log volume are usually 70%+ of the bill. Cut both and the rest is small.
Where the money actually goes
Common bill breakdown (typical mid-size SaaS):
| Line | % of total |
|---|---|
| Infrastructure | 15-25% |
| APM | 15-25% |
| Logs (ingestion + retention) | 25-40% |
| Custom Metrics (cardinality) | 15-30% |
| RUM, Synthetic, CI Visibility | 5-15% |
Custom metrics and logs are where 60-70% of all savings live. Start there.
Custom metrics: the cardinality trap
Datadog charges per unique time series — every distinct combination of metric name + tag values. A metric with 50 hosts × 100 endpoints × 5 status codes = 25,000 time series.
Easy ways to blow up cardinality:
- Tagging metrics with `request_id`, `user_id`, or `trace_id` (unique per request → unbounded cardinality)
- Tagging with high-cardinality dimensions like `path` on a service that exposes 10,000 unique paths
- Histogram metrics emitting per-pod, per-status, per-route, per-method tags — multiplies fast
- Per-pod metrics in Kubernetes with hundreds of pods
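The multiplication is worth internalizing: the series count is just the product of the distinct values each tag can take. A minimal sketch of that arithmetic (the tag names and counts are the illustrative numbers from above):

```python
from math import prod

def estimated_series(tag_value_counts: dict[str, int]) -> int:
    """Upper-bound estimate of distinct time series for one metric:
    the product of how many values each tag can take."""
    return prod(tag_value_counts.values())

# The 50 hosts x 100 endpoints x 5 status codes example above:
print(estimated_series({"host": 50, "endpoint": 100, "status": 5}))  # 25000

# Add a per-request tag and the same metric becomes unbounded:
# at 1M distinct request_ids it already exceeds 25 billion series.
print(estimated_series({"host": 50, "endpoint": 100, "status": 5,
                        "request_id": 1_000_000}))
```

Every tag you add multiplies everything that came before it, which is why one per-request tag dwarfs all other dimensions combined.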
How to find offenders:
Datadog → Infrastructure → Metrics Summary → Sort by "Distinct Metrics"
Anything emitting > 10,000 distinct time series for a single metric is suspect. Drop high-cardinality tags or remove the metric.
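If you export that Metrics Summary view, ranking offenders against the 10,000-series threshold is a few lines. A sketch only: the `name` and `distinct_series` field names are hypothetical, so map them to whatever your export actually calls them:

```python
SUSPECT_THRESHOLD = 10_000  # distinct time series per metric, per the rule of thumb above

def cardinality_offenders(metrics: list[dict]) -> list[dict]:
    """Rank metrics by distinct time series, keeping only suspects.
    'name' / 'distinct_series' are assumed export field names."""
    ranked = sorted(metrics, key=lambda m: m["distinct_series"], reverse=True)
    return [m for m in ranked if m["distinct_series"] > SUSPECT_THRESHOLD]

summary = [
    {"name": "http.request.count", "distinct_series": 4_000_000},  # user_id tag leak
    {"name": "jvm.heap.used", "distinct_series": 320},
    {"name": "checkout.latency.histogram", "distinct_series": 48_000},
]
for m in cardinality_offenders(summary):
    print(m["name"], m["distinct_series"])
```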
Logs: ingestion + retention + indexing
Three separately priced line items:
- Ingestion: $0.10/GB
- Indexing (15 days hot search): $1.70/million events
- Retention (rehydration): tiered
The savings:
Drop noise before ingestion
Use Datadog's Log Pipelines + Exclusion Filters. Drop:
- Health check probes (`/health`, `/ready`)
- Successful 200 access logs (sample to 5-10%)
- DEBUG-level lines in production
- Kubernetes audit log noise
This is usually 30-50% of ingestion gone.
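In Datadog these are exclusion filters configured on the pipeline, but the decision logic is simple enough to sketch. Assuming `path`, `status`, and `level` as the log attribute names (adjust to your source's actual attributes):

```python
import random

HEALTH_PATHS = {"/health", "/ready"}
SUCCESS_SAMPLE_RATE = 0.10  # keep ~10% of successful access logs

def keep_log(record: dict, rng: random.Random) -> bool:
    """Ingest-or-drop decision mirroring the exclusion filters above.
    Field names ('path', 'status', 'level') are assumptions."""
    if record.get("path") in HEALTH_PATHS:
        return False  # health-check probes: drop outright
    if record.get("level") == "DEBUG":
        return False  # no DEBUG in production
    if record.get("status") == 200:
        return rng.random() < SUCCESS_SAMPLE_RATE  # sample healthy access logs
    return True  # errors, warnings, non-200s always kept
```

The same rules translate one-for-one into exclusion filter queries with a sampling percentage; doing it collector-side (Vector, Fluent Bit) instead saves the ingestion fee too, since Datadog never sees the dropped lines.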
Use Logging without Limits + Indexes only on important streams
Indexing is what makes logs searchable in real time. You don't need every log indexed for 15 days. Configure indexes by service:
- Production application logs: 15 days indexed
- Internal services: 7 days
- Batch jobs: 3 days
- Cold/security logs: 0 days indexed, just stored for rehydration
Forward cold logs to S3 (Live Tail not enabled)
If you need long-term retention for audit, ship to S3 from your collector instead of paying Datadog's retention tier. Standard log forwarder pattern with Vector or Fluent Bit. Cuts cost by 80%+ for cold logs.
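A sketch of the cold-path layout, assuming date/hour-partitioned keys (the naming here is illustrative; the S3 sinks in Vector and Fluent Bit produce the same shape natively). This version writes locally; in production the same bytes go to S3:

```python
import gzip
import json
from datetime import datetime
from pathlib import Path

def cold_log_key(service: str, ts: datetime) -> str:
    """Date/hour-partitioned key so rehydration or Athena can prune by time."""
    return f"cold-logs/{service}/dt={ts:%Y-%m-%d}/hour={ts:%H}/batch.ndjson.gz"

def write_batch(records: list[dict], service: str, ts: datetime, root: Path) -> Path:
    """Gzip a batch of NDJSON records to a partitioned path; swap the local
    write for an S3 put (boto3 or your forwarder) in production."""
    path = root / cold_log_key(service, ts)
    path.parent.mkdir(parents=True, exist_ok=True)
    body = "\n".join(json.dumps(r) for r in records).encode()
    path.write_bytes(gzip.compress(body))  # gzip typically shrinks JSON logs 85-95%
    return path
```

Gzipped NDJSON in S3 Standard-IA or Glacier is pennies per GB-month versus Datadog's retention tiers, and it stays rehydratable if you ever need to search it.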
APM: traces, ingestion, and indexing
APM is billed per APM host plus ingested and indexed spans. Sample aggressively:
- Default head-based sampling: keep 100% of error and slow traces, sample healthy ones at 1-5%
- Use Datadog's dynamic sampling, where the agent decides based on throughput rather than the client sampling randomly
- Drop high-volume internal-only spans (e.g., DB-driver internal spans)
- Watch for span count multiplication on async/messaging architectures (1 user request → 50 spans is normal but expensive at high QPS)
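The first rule above, sketched as a head-based sampling decision (the 1-second "slow" threshold and 5% rate are the illustrative numbers from the list; a real agent also rate-limits per second):

```python
import random

SLOW_MS = 1_000       # traces slower than this are always kept
HEALTHY_RATE = 0.05   # sample healthy traces at 5%

def keep_trace(duration_ms: float, has_error: bool, rng: random.Random) -> bool:
    """Keep 100% of error and slow traces; sample the healthy majority."""
    if has_error or duration_ms >= SLOW_MS:
        return True
    return rng.random() < HEALTHY_RATE
```

At high QPS almost all traces are fast and healthy, so this keeps the interesting 1% in full while cutting indexed span volume by roughly 95%.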
Infrastructure hosts
Datadog charges per host-hour. Two common wastes:
- Dev and staging on full pricing. Use the Pro plan SKU on lower-priority environments only if your team needs it. Otherwise drop monitoring entirely on ephemeral preview environments.
- Pause monitoring on weekends/nights for dev clusters that scale to zero — agent should also stop reporting (kube-downscaler handles this).
RUM, Synthetics, CI
These are smaller line items, but quick checks:
- RUM: per-session pricing. Disable on internal admin pages and bot traffic.
- Synthetics: per-test-run pricing. Don't run a 1-minute test for a metric you check daily.
- CI Visibility: per-test pricing. Enable selectively on important pipelines, not every PR build.
The audit checklist I run
- Pull the last 30 days from Usage → Cost & Usage, group by product
- Metrics → Sort by Distinct Metrics — find the cardinality offenders
- Logs → Indexes — look at index size and retention; right-size each
- Logs → Pipelines — find the high-volume sources, add exclusion filters
- APM → Sampling Rules — confirm aggressive sampling is in place
- Infrastructure → Hosts — check non-prod hosts; consider tier downgrade
- Synthetics → Tests — kill anything you don't actually look at
What I usually find
- One service emitting a `user_id` tag on a counter metric → 4M distinct time series → $3-8k/month
- DEBUG logging in prod on a busy service → $1-4k/month in ingestion
- 30-day retention on logs nobody queries past 7 days → $1-2k/month
- 100% APM sampling on a high-QPS API → $2-5k/month in indexed spans
- Dev environment paying same per-host as prod → $500-2k/month
Realistic numbers
Recent SaaS client (~$28k/month Datadog):
- Custom metric cardinality cleanup: $5,400/month
- Log exclusion + retention by index: $6,200/month
- APM sampling 100% → 5% on healthy traces: $2,800/month
- Dropped Synthetic tests not in use: $600/month
- Dev environment downgraded: $1,100/month
Final: $11,900/month, ~58% reduction.
If you decide Datadog still isn't worth what's left, the alternative is self-hosted — see the Datadog replacement post.
Want me to audit your Datadog usage on a pay-for-savings basis? Book a call.