CloudWatch Cost Optimisation: Logs, Metrics, Surprises
CloudWatch is the line item that surprises every engineering team. Here's where the money actually goes and how to cut it 60-80% without losing visibility.
By Andrii Votiakov
CloudWatch surprises engineering teams the same way every quarter: a new service ships with default logging, the bill jumps a few thousand dollars, and nobody notices for two months. When someone finally investigates, it turns out 90% of the spend is on logs nobody reads. If you're considering replacing CloudWatch for metrics and logs altogether, see replacing Datadog with a cheaper observability stack — the self-hosted alternative costs a fraction for teams already spending $20k+/month on Datadog.
Quick answer
CloudWatch bills for ingestion ($0.50/GB), storage ($0.03/GB-month), and queries. The single biggest line is almost always Logs ingestion. Set retention, drop debug-level noise, and ship cold logs to S3 to query with Athena instead of paying CloudWatch storage.
What you're actually paying for
Five chargeable categories (us-east-1 pricing):
- Logs ingestion: $0.50/GB ingested
- Logs storage: $0.03/GB-month after ingestion
- Metrics: $0.30 per custom metric/month for the first 10,000 (tiered down after that), plus API call costs
- Logs Insights queries: $0.005 per GB scanned
- Alarms: $0.10 per alarm/month (small but adds up at scale)
For most teams, Logs ingestion + storage is 70-90% of the CloudWatch bill.
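To see which category dominates your own bill, Cost Explorer can group CloudWatch spend by usage type. A minimal sketch with the AWS CLI; the date range is a placeholder, and usage-type names carry a region prefix (e.g. USE1-):

```bash
# Break down one month of CloudWatch spend by usage type.
# Logs ingestion shows up as *-DataProcessing-Bytes, storage as *-TimedStorage-ByteHrs.
aws ce get-cost-and-usage \
  --time-period Start=2024-05-01,End=2024-06-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["AmazonCloudWatch"]}}' \
  --group-by Type=DIMENSION,Key=USAGE_TYPE
```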
The fixes that actually move the needle
1. Set retention on every log group
Default retention is "Never expire". A team I worked with had 14 TB of accumulated logs going back 4 years — nobody had ever queried any of it. Cleanup saved $4,200/month immediately.
```bash
# Audit log groups with no retention set
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].[logGroupName,storedBytes]' \
  --output table
```
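To fix them in bulk, something like the loop below works; the 30-day value is an assumption, so map each group against the table that follows first.

```bash
# Set 30-day retention on every log group that currently has none.
# Aged-out data is deleted for good, so review the audit output first.
for lg in $(aws logs describe-log-groups \
    --query 'logGroups[?!retentionInDays].logGroupName' --output text); do
  aws logs put-retention-policy --log-group-name "$lg" --retention-in-days 30
done
```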
Recommended retention by type:
| Log type | Retention |
|---|---|
| ALB/NLB access logs | 30 days |
| Application logs | 14-30 days |
| CloudTrail | 90 days (or send to S3 + Glacier) |
| VPC Flow Logs | 14 days (or S3 only) |
| Lambda logs | 14 days |
| Compliance/audit | per legal — usually S3, not CloudWatch |
2. Stop logging at DEBUG level in production
The single biggest ingestion cut. A typical Node app at INFO might emit 1 KB/request. At DEBUG it's 10-30 KB/request. At a million requests a day that's the difference between $15/month and $450/month — per service.
Turn it off. If you need debug for a specific incident, flip it on temporarily.
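If the level is driven by an environment variable (a common pattern; `LOG_LEVEL` is an assumed name here), flipping it on a Lambda doesn't even need a deploy:

```bash
# Temporarily raise one function to debug during an incident, then revert.
# Note: --environment replaces the whole Variables map; merge existing vars first.
aws lambda update-function-configuration \
  --function-name api-handler \
  --environment 'Variables={LOG_LEVEL=debug}'
```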
3. Filter what gets shipped, not what gets logged
Use the CloudWatch agent or Fluent Bit/Vector with filtering. Examples:
- Drop health-check log lines (`/health`, `/ready`) before they hit CloudWatch
- Drop Kubernetes liveness/readiness probe logs
- Sample HTTP 200 access logs to 10%, keep 100% of 4xx and 5xx
This is usually a 30-50% ingestion cut on web tiers alone.
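A minimal Fluent Bit sketch for the first two bullets; the tag and record key are assumptions, adjust to your pipeline. Sampling (the third bullet) usually means a Lua filter, since the grep filter has no probabilistic mode.

```ini
# Drop health-check and kubelet probe lines before they reach CloudWatch.
[FILTER]
    Name    grep
    Match   app.*
    Exclude log /health|/ready|kube-probe
```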
4. Send cold logs to S3, query with Athena
CloudWatch Logs storage is $0.03/GB-month. S3 Standard is $0.023/GB-month. S3 Standard-IA is $0.0125/GB-month. Glacier Instant Retrieval is $0.004/GB-month.
For logs you query rarely (security audits, compliance), the move is:
- 14 days in CloudWatch (hot, instant)
- After 14 days, ship to S3 (a subscription filter via Kinesis Data Firehose, or periodic export tasks) with a lifecycle to IA → Glacier
- Query from Athena when needed (~$5 per TB scanned, infrequent)
This is the difference between paying ~$30/month for a TB of cold logs in CloudWatch and ~$4/month in Glacier Instant.
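For a one-off archive of an existing group, an export task is the simplest path. A sketch; the bucket and prefix are placeholders, and the bucket policy must allow `logs.amazonaws.com` to write:

```bash
# Export the last 30 days of a log group to S3 (timestamps are epoch milliseconds).
aws logs create-export-task \
  --log-group-name /aws/app/api \
  --from $(( ($(date +%s) - 30*86400) * 1000 )) \
  --to $(( $(date +%s) * 1000 )) \
  --destination my-cold-logs-bucket \
  --destination-prefix api
```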
5. Custom metrics: kill the cardinality
Each unique combination of metric name + dimensions = one custom metric = $0.30/month. Pay attention to:
- High-cardinality dimensions (user_id, request_id) — these are bill bombs. Use logs and Logs Insights instead.
- Per-pod metrics in Kubernetes when you have hundreds of pods — aggregate first.
- Container Insights at default cardinality — beware; it's tunable via the agent config.
A single team I audited had 47,000 custom metrics, 90% of which were never queried. $14k/month in custom metrics alone.
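A quick way to see where custom metrics concentrate (requires `jq`; AWS/* namespaces are vendor metrics and don't bill as custom):

```bash
# Count metrics per custom namespace; the big numbers are the bill bombs.
aws cloudwatch list-metrics --output json \
  | jq -r '.Metrics[].Namespace' \
  | grep -v '^AWS/' | sort | uniq -c | sort -rn | head -20
```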
6. Alarms on absent data
A common pattern: alarms on services that have been deleted. Each alarm is $0.10/month. Tiny individually, but I've seen accounts with 4,000+ orphan alarms = $400/month for nothing.
```bash
aws cloudwatch describe-alarms \
  --query 'MetricAlarms[?StateValue==`INSUFFICIENT_DATA`].[AlarmName]'
```
Anything in INSUFFICIENT_DATA for over 30 days is probably dead.
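A sketch for triage; it sorts by how long each alarm has been stale, and the delete is left commented because a freshly created alarm also starts in this state:

```bash
# List INSUFFICIENT_DATA alarms, oldest state change first (requires jq).
aws cloudwatch describe-alarms --state-value INSUFFICIENT_DATA --output json \
  | jq -r '.MetricAlarms | sort_by(.StateUpdatedTimestamp)[]
           | "\(.StateUpdatedTimestamp)\t\(.AlarmName)"'
# After review:
# aws cloudwatch delete-alarms --alarm-names dead-alarm-1 dead-alarm-2
```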
7. Logs Insights queries — scan less
Each query costs $0.005/GB scanned. If you're running ad-hoc queries across full retention every day, that's a real number. Two practical moves:
- Always pin a time range. The default is 1 hour for a reason.
- Filter early, parse late: put `filter @logStream like /api/` before `parse @message`; Logs Insights respects statement ordering.
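The same ordering as a CLI sketch; the log group and parse pattern are assumptions. Fetch results afterwards with `aws logs get-query-results --query-id <id>`.

```bash
# Run an Insights query pinned to the last hour; filter comes before parse.
aws logs start-query \
  --log-group-name /aws/app/api \
  --start-time $(( $(date +%s) - 3600 )) \
  --end-time $(date +%s) \
  --query-string 'filter @logStream like /api/ | parse @message /status=(?<status>\d+)/ | stats count() by status'
```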
Common surprises
- Lambda functions logging entire event payloads — easy 5-10x ingestion increase. Strip before logging.
- CloudTrail data events for every S3 object access — hundreds of GB/day on busy accounts. Targeted, not blanket.
- Container Insights enabled cluster-wide without tuning — instant 5-figure addition to the bill on a big EKS cluster.
- Forgotten cross-account log replication — log shipping to a security account that was set up once and never reviewed.
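For the CloudTrail item, this shows whether data events are blanket or targeted; the trail name is a placeholder:

```bash
# Inspect which data events a trail records; watch for "all S3 buckets" selectors.
aws cloudtrail get-event-selectors --trail-name main-trail
```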
What I check on a real audit
- Log groups without retention set (`retentionInDays = null`)
- Top 10 log groups by ingestion (Insights gives you this)
- Custom metric count and creator (CloudWatch Metrics → Metric Streams)
- Alarms in `INSUFFICIENT_DATA` for 30+ days
- VPC Flow Logs going to CloudWatch (should be S3)
- CloudTrail data events scope
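For the ingestion check, the `IncomingBytes` metric in the `AWS/Logs` namespace is one way to get per-group numbers; the log group name and dates below are placeholders:

```bash
# Daily ingestion for one log group over a month, via CloudWatch metrics.
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value=/aws/app/api \
  --start-time 2024-05-01T00:00:00Z --end-time 2024-05-31T00:00:00Z \
  --period 86400 --statistics Sum
```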
Realistic numbers
Recent client (~$8.5k/month CloudWatch bill). Monthly savings by fix:
- Setting retention everywhere: $2,200/month
- DEBUG → INFO across 6 services: $1,400/month
- VPC Flow Logs to S3: $650/month
- Custom metrics audit (deleted 18k unused): $1,800/month
- Orphan alarm cleanup: $120/month
Final: $2,330/month, 73% reduction.
If your CloudWatch bill has a mind of its own, book a call. We usually find half the savings within the first hour.