EC2 Right-Sizing: My 14-Day Method to Cut 40%

Most EC2 fleets are 30-60% over-provisioned. Here's the exact 14-day procedure I run on every audit to cut the bill without breaking production.

By Andrii Votiakov on 2026-02-23

Right-sizing is the single biggest lever in any AWS audit. Almost every fleet I look at is over-provisioned by 30-60%, because instances were sized during launch under uncertainty and never revisited. Here's exactly how I find and fix it.

Quick answer

Pull 14 days of CPU, memory, network and disk metrics for every EC2 instance. Anything sitting under 20% average CPU and under 50% peak memory is a candidate to drop one or two sizes. Verify with a few days at the new size in staging, then roll prod. Typical fleet savings: 30-45% of compute cost.

Why 14 days

Less than 14 days misses weekly patterns (Monday spike, weekend trough). More than 14 days starts to fold in seasonal shifts and workloads that have since been retired, which blurs the signal. Two clean weeks at current load is the sweet spot.

What to measure

CloudWatch alone isn't enough — it doesn't capture memory by default. Install the CloudWatch agent everywhere, or use Compute Optimizer if your account has 14+ days of data already.

The four metrics that matter:

  • CPU: average and 95th percentile
  • Memory: max and 95th percentile (CW agent or Datadog)
  • Network: peak inbound + outbound
  • Disk: IOPS and throughput vs the EBS volume's limits
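Getting memory and per-device disk metrics into CloudWatch takes an explicit agent config. A minimal sketch of the config file (field names follow the CloudWatch agent's schema; the 60-second interval is my choice, tune to taste):

```json
{
  "metrics": {
    "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "diskio": {
        "measurement": ["reads", "writes", "read_bytes", "write_bytes"],
        "resources": ["*"],
        "metrics_collection_interval": 60
      }
    }
  }
}
```

Ship it via SSM Parameter Store so the whole fleet picks it up from one place.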

The decision rules I use

Drop two sizes if:

  • Avg CPU < 10% and Max CPU < 30% and Memory P95 < 40%

Drop one size if:

  • Avg CPU < 20% and Memory P95 < 60%

Hold if:

  • Memory P95 > 75% (memory pressure causes worse user pain than CPU)
  • The instance bursts above 90% CPU more than five times in 14 days

Switch family (don't just resize) if:

  • Memory-to-CPU ratio is way off. If you're on m5 but consistently CPU-bound, move to c7g
  • You're on x86 — strongly consider Graviton migration (m6g, c7g, r7g); typically 20-40% cheaper
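The rules above are mechanical enough to script. A minimal sketch in Python (thresholds mirror the bullets; the argument names and the `cpu_bound_on_m_family` flag are my assumptions about how you've aggregated the 14-day data):

```python
def recommend(avg_cpu, max_cpu, p95_mem, bursts_over_90=0, cpu_bound_on_m_family=False):
    """Return a right-sizing action for one instance from 14-day aggregates.

    avg_cpu / max_cpu / p95_mem are percentages; bursts_over_90 counts
    excursions above 90% CPU inside the window.
    """
    # Hold rules win: memory pressure or frequent bursts rule out a downsize.
    if p95_mem > 75 or bursts_over_90 > 5:
        return "hold"
    # Wrong memory-to-CPU ratio: switch family instead of resizing.
    if cpu_bound_on_m_family:
        return "switch-family"
    if avg_cpu < 10 and max_cpu < 30 and p95_mem < 40:
        return "drop-two-sizes"
    if avg_cpu < 20 and p95_mem < 60:
        return "drop-one-size"
    return "hold"

print(recommend(avg_cpu=8, max_cpu=25, p95_mem=35))   # -> drop-two-sizes
print(recommend(avg_cpu=15, max_cpu=70, p95_mem=55))  # -> drop-one-size
print(recommend(avg_cpu=8, max_cpu=25, p95_mem=80))   # -> hold (memory-bound)
```

Running every instance through one function also gives you the day-14 grouping for free.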

The actual procedure

Day 0: Baseline

  • Tag every instance with right-sizing-baseline=2026-02-15 so you can correlate later
  • Snapshot Cost Explorer's last 30 days of EC2 cost per instance type
  • Pull CW Insights data for the metrics above
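Tagging the whole fleet is one paginated describe plus batched create-tags calls. A sketch using boto3-style clients (the client is passed in so the function stays testable; the tag key and date come from the step above, and the 500-resource chunk size is my conservative choice to stay under the API's per-call batch limit):

```python
def tag_fleet(ec2, tag_key="right-sizing-baseline", tag_value="2026-02-15"):
    """Tag every running instance so day-28 comparisons can filter on it."""
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            ids.extend(i["InstanceId"] for i in reservation["Instances"])
    # Chunk create_tags calls to stay under the per-request resource limit.
    for start in range(0, len(ids), 500):
        ec2.create_tags(
            Resources=ids[start : start + 500],
            Tags=[{"Key": tag_key, "Value": tag_value}],
        )
    return ids

# In production: tag_fleet(boto3.client("ec2"))
```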

Days 1-14: Collect

  • Don't change anything. Let it run.
  • If you must scale, scale horizontally (add instances) so per-instance metrics stay representative.

Day 14: Analyse

Run a query like this in CloudWatch Logs Insights (or the equivalent in Athena if you export the metrics) over the metric data:

stats avg(cpu) as avg_cpu,
      max(cpu) as peak_cpu,
      pct(memory_used_percent, 95) as p95_mem
  by instanceid
| sort peak_cpu asc

Group instances into:

  • Drop two sizes (~25% of fleet on average)
  • Drop one size (~25%)
  • Switch family (~10-20%)
  • Hold (~30%)
  • Increase (rare, but always check; bad performance is also expensive)

Days 15-17: Stage

Resize one canary in staging per group. Run real load. Watch for OOM, throttling, latency regressions.

Days 18-21: Roll

Resize prod one ASG / one service at a time. Most modern AMIs handle a stop/start at smaller size cleanly. For ASGs, change the launch template and roll the group.
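The launch-template flip plus rolling refresh can be scripted per group. A sketch with boto3-style clients passed in (the 90% minimum-healthy preference is my default, not a recommendation; rehearse in staging first):

```python
def roll_asg(ec2, autoscaling, template_id, new_type, asg_name):
    """Create a launch template version with the new instance type, make it
    the default, then roll the ASG with an instance refresh."""
    created = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion="$Latest",
        LaunchTemplateData={"InstanceType": new_type},
    )
    version = str(created["LaunchTemplateVersion"]["VersionNumber"])
    ec2.modify_launch_template(LaunchTemplateId=template_id, DefaultVersion=version)
    # Rolling refresh keeps 90% of the group in service while replacing.
    autoscaling.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={"MinHealthyPercentage": 90},
    )
    return version

# In production:
# roll_asg(boto3.client("ec2"), boto3.client("autoscaling"),
#          "lt-0abc...", "m6g.large", "web-asg")
```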

Day 28: Verify and lock in

  • Confirm new metrics look healthy (CPU now in 30-60% range is ideal)
  • Buy 1-year Compute Savings Plans for the new floor
  • Schedule the next right-sizing review for 90 days out

Common traps

  • Don't right-size before you've cleared zombie instances. Half the savings often come from finding 20 instances no one remembered.
  • Memory-bound workloads look idle if you only watch CPU. A Java app at 8% CPU and 95% heap will still benefit from more memory, not less.
  • Burstable T-class metrics are different. CPU credit balance matters more than CPU percent. If credits are full and CPU is 8%, drop a size. If credits are draining, you're already too small.
  • Consider Spot for the truly stateless tier. Right-size first, then move what's safe to Spot for another 60-70% off.
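The burstable-instance rule is easy to get backwards, so here it is as code. A sketch of the heuristic (treating a drop of more than 10 percentage points in credit balance over the window as "draining" is my assumption):

```python
def t_class_verdict(credit_pct_start, credit_pct_end, avg_cpu):
    """Judge a burstable (T-class) instance by credit trend, not CPU percent.

    credit_pct_* are CPU credit balance as a fraction of the maximum at the
    start and end of the 14-day window; avg_cpu is a percentage.
    """
    draining = credit_pct_end < credit_pct_start - 0.10
    if draining:
        return "too-small"      # already borrowing against burst capacity
    if credit_pct_end > 0.95 and avg_cpu < 10:
        return "drop-one-size"  # credits pinned at max, CPU idle
    return "hold"

print(t_class_verdict(1.0, 1.0, 8))   # -> drop-one-size
print(t_class_verdict(0.9, 0.4, 8))   # -> too-small
```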

What you'll save

Indicative results from real engagements:

  • 200-instance fleet, mostly m5.xlarge: 38% reduction, $14k/month → $8.6k/month
  • 50-instance ECS fleet on c5.large: 41% reduction plus Graviton switch
  • 12-instance database tier: 22% reduction (database tiers are usually properly sized to start with)

If you want this run on your fleet without the 14-day wait, book a call — I can typically find the savings in an afternoon.