Cloud Cost Reduction: The Complete 2026 Guide
How to cut cloud costs 30-90%: a vendor-agnostic guide to reducing AWS, GCP, Azure, and SaaS bills with real audit findings and concrete next steps.
By Andrii Votiakov
AWS, GCP, and Azure bills routinely carry 30-70% in pure waste. I've audited over a hundred cloud accounts across all three platforms. The same patterns appear every time: over-provisioned compute that's never been revisited, observability stacks charging more than the infrastructure they monitor, and not a single commitment discount in sight. One recent client was spending $42,000/month. After 30 days of structured work, they were at $22,500. No re-architecture. No risk. Just removing what shouldn't have been there.
Key Takeaways
- Most cloud bills carry 30-70% waste. The biggest categories are compute over-provisioning, observability bloat, and zero commitment discounts.
- Visibility comes first. You can't cut what you can't see - tagging, Cost Explorer, and billing exports are the foundation.
- Observability tools (Datadog, CloudWatch) often account for 15-25% of total cloud spend and are routinely ignored in cost reviews. (Flexera State of the Cloud Report 2024)
- Commitment discounts (Savings Plans, Reserved Instances, CUDs) alone can cut 25-40% from the steady-state compute bill.
- A structured 30-day sequence delivers 30-50% savings without disruption if you follow the right order.
Quick answer
To cut cloud costs, work in this sequence. First, get full visibility: tag everything, pull Cost Explorer or GCP Billing by service, and find your top five spend categories. Second, delete waste: unattached volumes, idle load balancers, forgotten experiments. Third, right-size compute using two weeks of actual metrics. Fourth, fix storage tiering and observability cardinality. Fifth, buy commitment discounts only after right-sizing. Sixth, review your SaaS bills for tools you could self-host once spend crosses $10-20k/month. Done in order, this reliably cuts 30-90% from real bills.
Why cloud bills grow without anyone noticing
Most infrastructure is built under pressure. Engineers are moving fast, often on cloud credits, and cost is not the primary concern at that stage. An instance gets sized generously because nobody wants a production incident. A managed service gets chosen because it's faster to launch. Full-tier monitoring gets turned on in dev because it was already on in prod. Nobody goes back.
Then the company grows. The infrastructure scales roughly in line with revenue, but the waste scales too. A fleet costing $5,000/month felt manageable even with a high waste percentage. At $50,000/month, the same percentage is a real problem. The bill grows, but the waste share stays roughly constant until someone deliberately audits it.
There's also a visibility problem. Cloud bills are complex. On AWS alone, EC2 pricing has dozens of dimensions: instance family, generation, size, region, operating system, tenancy, billing model. Most teams look at the total and shrug. Without tagged resources and a per-service breakdown, the root causes stay invisible.
The third reason is ownership. When infrastructure is shared, nobody owns the cost line. Engineers optimise for reliability and speed. Finance sees the number but can't interpret it. There's no feedback loop.
That's the pattern I see in every audit. It's not negligence. It's how growth works.
The seven patterns of cloud waste
I catalogued the patterns I find most often in a dedicated post on cloud waste. Here's the short version.
The forgotten experiment. An engineer spun up a GPU instance or managed data platform for a one-off job. They got what they needed. The resource kept running. I've found $4,200/month EMR clusters running for 14 months with no owner.
Production-tier setup on dev and staging. Multi-AZ RDS, full Datadog agents, large instances, and 24/7 uptime. On environments that could tolerate a 4-hour outage. This is typically 10-15% of total cloud spend, wasted.
The data-transfer black hole. Cross-AZ service calls, NAT Gateway processing millions of small requests, CDN bypassed in favour of direct S3 egress. One chatty microservice pair was costing a client $11,400/month in data transfer alone.
Over-provisioned compute. Instances sized at launch, never revisited. Average CPU under 20%. Kubernetes memory requests set at 2GB for pods using 200MB. This single pattern is often 30-60% of compute spend.
Storage forever. Manual RDS snapshots from 2022, S3 versioning on with no lifecycle policy, CloudWatch log groups set to "Never expire." Quiet accumulation, routinely $2-10k/month.
Observability cardinality bloat. One metric with a user_id dimension can generate millions of unique time-series. I found a single Datadog metric costing $7,600/month. Read the full pattern breakdown here.
Commitment-discount black hole. No Savings Plans, no Reserved Instances, no Committed Use Discounts. Pure on-demand for a steady-state baseline that has been running for two years. Foregone savings: 25-40%.
Step 1: Get visibility before you touch anything
Without tagged resources and a cost-by-service breakdown, you cannot reliably cut cloud costs. Flexera's 2024 State of the Cloud Report found that 82% of respondents cite cloud cost management as a top challenge, with visibility gaps identified as the primary reason savings efforts stall. (Flexera State of the Cloud Report, 2024)
Visibility is not optional. It's the foundation. Before you right-size a single instance or delete a single snapshot, you need to know what's actually driving the bill.
On AWS, that means Cost Explorer grouped by service, then by linked account, then by resource tag. Turn on the Cost and Usage Report (CUR) and send it to S3 if you want query-level analysis. Set up cost allocation tags and make them mandatory: team, service, env at minimum. Turn on Cost Anomaly Detection today if it's not already on.
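If you want the service breakdown as data rather than console screenshots, the Cost Explorer API returns the same numbers. A minimal sketch using boto3, assuming credentials with ce:GetCostAndUsage are already configured; the dates are placeholders for your last full billing month:

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2026-01-01", "End": "2026-02-01"},  # adjust to your billing period
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = resp["ResultsByTime"][0]["Groups"]
top5 = sorted(
    groups,
    key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
    reverse=True,
)[:5]

for g in top5:
    amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{g['Keys'][0]:40s} ${amount:>12,.2f}")
```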
On GCP, it's Billing Export to BigQuery. The built-in reports are fine for a summary, but the BigQuery export gives you full query flexibility. Use labels the same way you use tags on AWS.
On Azure, Cost Management + Billing is the native tool. The subscription and resource group hierarchy already provides some structure, but consistent tagging across resource groups is still essential.
The goal at this stage: a single page showing your top five services by spend, your top five resources per service, and a tagging coverage percentage. If untagged resources exceed 20% of spend, fix tagging before doing anything else.
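Measuring tagging coverage is the same query grouped by a cost allocation tag instead of a service. A minimal sketch, assuming a team tag has already been activated as a cost allocation tag (the tag name is a placeholder); spend without the tag comes back under an empty tag value:

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2026-01-01", "End": "2026-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],  # must be an activated cost allocation tag
)

groups = resp["ResultsByTime"][0]["Groups"]
total = sum(float(g["Metrics"]["UnblendedCost"]["Amount"]) for g in groups)
# Resources without the tag show up with an empty value after the "team$" prefix;
# verify the exact key format against your own export.
untagged = sum(
    float(g["Metrics"]["UnblendedCost"]["Amount"])
    for g in groups
    if g["Keys"][0].endswith("$")
)

if total:
    print(f"Untagged share of spend: {untagged / total:.0%}")
```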
Step 2: Right-size compute
Over-provisioned compute is the single largest category of cloud waste. AWS's own Compute Optimizer data suggests that 76% of EC2 instances are over-provisioned, with median CPU utilisation below 20% (AWS re:Invent 2023 keynote data, 2023). Right-sizing alone typically cuts 15-40% from the total compute bill.
The rule is simple: pull 14 days of actual CPU and memory metrics, then downsize anything running below 20% average CPU. One size class down is usually safe with no application change. Two sizes down needs a test window. The math applies to EC2 instances, ECS tasks, Fargate definitions, Lambda memory allocations, and Kubernetes pod requests.
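A minimal sketch of that 14-day check with boto3. CloudWatch has the CPU metric by default; memory needs the CloudWatch agent, so this only covers the CPU half of the rule:

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=14)

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for res in page["Reservations"]:
        for inst in res["Instances"]:
            stats = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=86400,            # one datapoint per day
                Statistics=["Average"],
            )["Datapoints"]
            if not stats:
                continue
            avg = sum(p["Average"] for p in stats) / len(stats)
            if avg < 20:
                print(f"{inst['InstanceId']} {inst['InstanceType']:>14s} "
                      f"avg CPU {avg:5.1f}% -> downsize candidate")
```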
For EC2, the fastest path is AWS Compute Optimizer. It generates per-instance recommendations with a confidence level based on observed utilisation. I don't take every recommendation blindly, but the list is a reliable starting point. The 14-day EC2 right-sizing method covers the exact process I follow.
For Kubernetes on EKS, the problem is usually pod requests, not node sizes. A deployment may request 2GB of memory but use 200MB. Nodes get filled with reserved-but-unused space. Karpenter helps on the node side, but you also need to fix the requests. The EKS cost optimisation guide covers both.
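A minimal sketch of the request-vs-usage comparison, assuming the official kubernetes Python client and a metrics-server running in the cluster; the 4x threshold and the memory-unit parsing are simplifications for illustration:

```python
from kubernetes import client, config

def to_mib(qty: str) -> float:
    """Rough converter for Kubernetes memory quantities (Ki/Mi/Gi or plain bytes)."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if qty.endswith(suffix):
            return float(qty[: -len(suffix)]) * factor
    return float(qty) / (1024 * 1024)

config.load_kube_config()
v1 = client.CoreV1Api()
metrics = client.CustomObjectsApi().list_cluster_custom_object(
    "metrics.k8s.io", "v1beta1", "pods"
)

# Actual memory usage per (namespace, pod, container) from metrics-server.
usage = {}
for item in metrics["items"]:
    for c in item["containers"]:
        key = (item["metadata"]["namespace"], item["metadata"]["name"], c["name"])
        usage[key] = to_mib(c["usage"]["memory"])

for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        reqs = c.resources.requests if c.resources and c.resources.requests else {}
        req = reqs.get("memory")
        if not req:
            continue
        used = usage.get((pod.metadata.namespace, pod.metadata.name, c.name))
        if used is not None and to_mib(req) > 4 * used:  # requesting >4x what it uses
            print(f"{pod.metadata.namespace}/{pod.metadata.name}/{c.name}: "
                  f"requests {req}, using ~{used:.0f}Mi")
```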
For serverless: Lambda memory settings directly determine CPU allocation and cost per GB-second. Most Lambda functions are over-allocated on memory. Power Tuning (the open-source tool) finds the optimal setting in 20 minutes. The Lambda cost optimisation post covers this in detail, alongside ARM (Graviton2) migration which is typically a 20% cost cut with one deploy flag.
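A minimal sketch that lists the functions worth a Power Tuning run or a Graviton switch, using boto3; the 512MB cut-off is just an illustrative threshold, not a rule:

```python
import boto3

lam = boto3.client("lambda")

for page in lam.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        mem = fn["MemorySize"]
        arch = fn.get("Architectures", ["x86_64"])[0]
        if mem >= 512 or arch != "arm64":
            # High memory allocation: candidate for Power Tuning.
            # x86_64 architecture: candidate for a Graviton (arm64) switch.
            print(f"{fn['FunctionName']:40s} {mem:>5d}MB {arch}")
```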
For Fargate: task CPU and memory are billed as provisioned, not used. The Fargate cost guide and Azure Functions guide cover the platform-specific levers. For a cross-platform comparison of serverless costs, Cloud Run vs Lambda vs Azure Functions gives real numbers at 5M, 50M, and 500M invocations.
Step 3: Tame your databases
Databases are typically the second-largest line item on cloud bills and the one teams touch least. RDS alone can account for 20-35% of an AWS bill, with most of the cost coming from over-provisioned instance sizes and unused read replicas running 24/7 (AWS Pricing Calculator benchmarks, 2025).
Databases get sized for peak load at launch, then stay that size forever. A db.r6i.4xlarge sized for a launch traffic spike may be running at 15% memory utilisation two years later. Check buffer pool hit rates, I/O patterns, and connection counts before touching anything else.
For RDS and Aurora, the key levers are instance right-sizing, removing Multi-AZ from non-production environments, cleaning up old manual snapshots, switching from gp2 to gp3 storage (automatic 20% cost reduction with equal or better performance), and buying Reserved Instances for databases that run 24/7. The RDS cost optimisation guide covers all of these.
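A minimal sketch that flags the two cheapest RDS wins, gp2 storage and Multi-AZ outside production, using boto3; the env tag name is an assumption, use whatever your tagging standard defines:

```python
import boto3

rds = boto3.client("rds")

for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        tags = {t["Key"]: t["Value"] for t in db.get("TagList", [])}
        env = tags.get("env", "unknown")
        if db.get("StorageType") == "gp2":
            print(f"{db['DBInstanceIdentifier']}: gp2 storage -> migrate to gp3")
        if db.get("MultiAZ") and env not in ("prod", "production"):
            print(f"{db['DBInstanceIdentifier']}: Multi-AZ on {env} environment -> review")
```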
For managed Postgres across Aurora, Cloud SQL, and Azure Database, the pattern is similar but IOPS provisioning is the most common source of overspend. The managed Postgres cost guide addresses this specifically.
For MongoDB Atlas: clusters are billed by tier, and most teams provision M30+ when M10 or M20 would handle the workload. MongoDB Atlas cost optimisation covers right-sizing, backup accumulation, and Data Federation costs.
For Redis via ElastiCache: node sizes, replica counts, and backup retention are the three levers. The ElastiCache Redis guide covers them.
For analytics databases: Snowflake cost optimisation focuses on warehouse sizing and auto-suspend settings, where I routinely find 50-70% cuts. BigQuery cost optimisation covers partition pruning, slot reservations, and the SELECT * problem.
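For the Snowflake auto-suspend fix specifically, the change is one statement per warehouse. A minimal sketch using the Snowflake Python connector; the connection parameters and warehouse names are placeholders, and 60 seconds is a common starting point to tune against your query patterns:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",     # placeholders: use your own connection parameters
    user="your_user",
    password="your_password",
    role="SYSADMIN",
)
cur = conn.cursor()

# Warehouses that were left running 24/7 - placeholder names.
for wh in ("ANALYTICS_WH", "ETL_WH"):
    # Suspend after 60 seconds idle, wake automatically on the next query.
    cur.execute(f"ALTER WAREHOUSE {wh} SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE")

cur.close()
conn.close()
```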
Step 4: Fix networking and CDN
Networking charges are the line item that surprises teams most. AWS charges $0.01 per GB for cross-AZ traffic. At 100TB/month of cross-AZ data transfer, that is $1,000/month for traffic that could often be eliminated entirely. NAT Gateway adds $0.045/GB processed on top of that. (AWS EC2 Pricing, 2026)
Data transfer costs are invisible until they're not. A microservices architecture with services in different Availability Zones, or a Lambda function calling an RDS instance in a different AZ, or pods pulling container images from public ECR across AZ boundaries: each of these looks free until you see the Cost Explorer breakdown.
The most common fix is VPC endpoints. S3, DynamoDB, ECR, CloudWatch Logs, and STS all support VPC gateway or interface endpoints. Traffic that was routing through NAT Gateway goes direct instead. One client eliminated $8,200/month in NAT charges this way over two weeks.
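A minimal sketch of adding a gateway endpoint for S3 so that traffic stops routing through NAT, using boto3; the VPC ID, route table ID, and region are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",                 # S3 and DynamoDB support free gateway endpoints
    VpcId="vpc-0123456789abcdef0",             # placeholder
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],   # the route tables your private subnets use
)
```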
The NAT Gateway cost guide covers the specific configuration. The AWS data transfer charges post maps the full billing model so you can find the root causes in your own bill.
For CDN: if you're serving large assets (images, videos, large JS bundles) directly from S3 or your origin, you're paying for every byte twice - storage egress from S3 and cloud egress to users. A CDN in front cuts egress by 70-90% by caching at edge. The CDN cost showdown compares Cloudflare, CloudFront, and Fastly at real traffic volumes.
If you're already on Cloudflare, R2 (their object storage) has zero egress fees. For workloads with high read traffic, the Cloudflare R2 vs S3 comparison shows when the switch pays off.
Step 5: Storage tiering
S3 Standard costs $0.023/GB/month. S3 Glacier Instant Retrieval costs $0.004/GB/month, a savings of 83% on storage. Most teams never set a lifecycle policy, leaving all objects in Standard forever regardless of access frequency. (AWS S3 Pricing, 2026)
Storage accumulates silently. Objects get written and never deleted. Log files grow forever. Snapshot policies create new backups without expiring old ones. S3 versioning turned on without a lifecycle rule means every object version is billed forever.
The fix is lifecycle policies. Old application logs can move to S3 Intelligent-Tiering or Standard-IA after 30 days. Anything not accessed in 90 days moves to Glacier Instant Retrieval. Anything not accessed in 180 days moves to Glacier Deep Archive. Multi-part uploads that never completed (a surprisingly common billing line) can be aborted after 7 days.
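A minimal sketch of that lifecycle policy applied to a log prefix, using boto3; the bucket name and prefix are placeholders, and the day thresholds mirror the ones above:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-app-logs",                       # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-clean-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Unfinished multi-part uploads bill silently until aborted.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```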
The S3 storage class optimisation guide covers the class selection logic with real numbers, including the access frequency thresholds where each class pays off. The same tiering principle applies to Azure Blob storage tiers and GCP Cloud Storage classes.
Step 6: Cut the observability bill
Observability tools account for 15-25% of total cloud spend for most engineering teams, yet they're routinely excluded from cost review conversations. Datadog's per-host, per-log, and per-custom-metric pricing compounds: a team with 200 hosts, 50GB/day of logs, and 500 custom metrics faces a Datadog bill that can exceed $30,000/month. (Datadog Pricing, 2026)
I've found observability to be the most emotionally difficult category to optimise. Engineers feel that cutting monitoring is cutting safety. That's usually not what's happening. What's usually happening is that DEBUG logs are shipping to Datadog in production, metrics are tagged with request_id creating millions of unique series, and every dev host has a full Datadog agent. None of that is actually required for production visibility.
The first target is log volume. In production, you rarely need DEBUG logs. Sampling 200 OK responses at 1-5% instead of 100% cuts log volume by 60-80% with no loss of meaningful signal. Dropping health check pings entirely is usually safe. These changes go in your logging configuration - no application code change required.
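What that looks like in application logging config: a minimal Python sketch of a filter that drops health-check lines and samples routine 200s before they ship. The logger name, match rules, and 5% rate are illustrative; the same idea applies in any logging stack:

```python
import logging
import random

class CostAwareFilter(logging.Filter):
    """Drop health-check noise and sample routine success logs before they ship."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        if "/healthz" in msg:               # health-check pings: drop entirely
            return False
        if "200" in msg and record.levelno <= logging.INFO:
            return random.random() < 0.05   # keep ~5% of routine 200s
        return True                         # warnings, errors, everything else passes

handler = logging.StreamHandler()           # in production, this is your shipper/agent handler
handler.addFilter(CostAwareFilter())
logging.getLogger("api.access").addHandler(handler)
```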
The second target is custom metrics cardinality. A metric like api.request.duration tagged with {user_id: "abc123"} creates a unique time-series per user. With 100,000 users, that's 100,000 series for one metric. Replace high-cardinality tags with low-cardinality ones: user_tier, region, endpoint_group instead of exact IDs.
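The fix in code, shown here with the DogStatsD client as an example; the metric name, tag values, and variables are illustrative stand-ins:

```python
from datadog import statsd

duration_ms = 42.0                           # illustrative values
user_tier, region, endpoint_group = "pro", "eu-west-1", "checkout"

# Before: tags=[f"user_id:{user_id}"] creates one time-series per user.
# After: a handful of bounded values keeps cardinality (and the bill) flat.
statsd.histogram(
    "api.request.duration",
    duration_ms,
    tags=[
        f"user_tier:{user_tier}",
        f"region:{region}",
        f"endpoint_group:{endpoint_group}",
    ],
)
```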
The CloudWatch cost optimisation guide covers log retention, metric streams, and container insights. The Datadog cost optimisation guide goes deep on cardinality, log indexes, and infrastructure pruning.
If your Datadog bill has crossed $20-30k/month, self-hosting the stack starts to make economic sense. The replacing Datadog post covers the open-source stack I deploy and what to expect on the migration.
Step 7: Lock in commitment discounts
AWS Compute Savings Plans provide up to 66% off on-demand compute pricing, and EC2 Instance Savings Plans reach up to 72% for a specific instance family and region. The catch: right-size before you commit, or you lock in waste at a discount. (AWS Savings Plans Pricing, 2026)
Commitment discounts are the highest-leverage lever per hour of work. A 1-year Compute Savings Plan on a steady-state EC2 fleet saves 30-40% on those instances with almost no commitment risk: Compute Savings Plans flex across instance family, size, region, and OS. You're committing to a dollar-per-hour spend level, not to specific instances.
The decision rules I use:
- Right-size first, always. Locking in a discount before right-sizing commits you to the over-provisioned baseline; when you right-size afterwards, the excess commitment keeps billing whether you use it or not.
- 1-year before 3-year. The extra 10-15% discount on 3-year terms rarely justifies locking in for that long unless the workload is genuinely stable and the business is healthy.
- Compute Savings Plans before EC2 Instance Savings Plans. The flexibility is worth the slightly lower discount rate.
- RDS Reserved Instances for any database running 24/7. Standard Reserved Instances on RDS save 40-50% on multi-AZ instances.
The Savings Plans vs Reserved Instances guide walks through every AWS commitment type with the decision framework I use on every audit.
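Before committing, I pull AWS's own recommendation for the post-right-sizing baseline. A minimal sketch with boto3; the response fields vary slightly by plan type, so treat the printed names as illustrative:

```python
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",        # most flexible plan type
    TermInYears="ONE_YEAR",               # 1-year before 3-year
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",   # run this AFTER right-sizing has settled
)

rec = resp.get("SavingsPlansPurchaseRecommendation", {})
summary = rec.get("SavingsPlansPurchaseRecommendationSummary", {})
print("Hourly commitment:", summary.get("HourlyCommitmentToPurchase"))
print("Estimated monthly savings:", summary.get("EstimatedMonthlySavingsAmount"))
```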
For Azure, the commitment picture is different: Reserved VM Instances, Azure Savings Plans, and Azure Hybrid Benefit (if you have existing Windows or SQL licenses) all interact. The Azure reservations and savings plans guide covers the priority order.
| Discount Type | Typical Saving | Flexibility | Best For |
|---|---|---|---|
| Compute Savings Plans (AWS) | 30-66% | High - any compute | Mixed EC2/Lambda/Fargate fleets |
| EC2 Instance Savings Plans | 40-72% | Medium - fixed family/region | Stable single-family EC2 |
| RDS Reserved Instances | 40-55% | Low - specific DB | Any DB running 24/7 |
| GCP Committed Use Discounts | 28-57% | Medium - vCPU/memory | Stable Compute Engine |
| Azure Reserved VM Instances | 30-60% | Low - specific VM size | Stable VM workloads |
| Azure Savings Plans | 15-35% | High | Mixed compute workloads |
Step 8: Replace SaaS where it pays
SaaS spending grew 18% year-over-year in 2024, with the average mid-size company running 130+ SaaS tools. Build-vs-buy economics shifted significantly when AI-assisted development reduced custom build effort by 40-60% for standard integrations. (Productiv SaaS Trends Report, 2024)
SaaS replacement is the category where I see the biggest variance between clients. Some teams are paying $80,000/year for auth, $60,000/year for email, and $40,000/year for support tooling, all at volumes where self-hosting would cost $5,000/year in infrastructure. Others are at volumes where every SaaS tool is genuinely the cheapest option. The decision is about crossing specific usage thresholds, not ideology.
The build vs buy 2026 framework covers the decision model. Here are the specific replacements I run into most:
Auth: Auth0 and Okta CIAM price per monthly active user. Above 10,000 MAU, self-hosted Keycloak or Ory Kratos becomes meaningfully cheaper. Above 50,000 MAU, it's rarely even close. The Auth0 replacement guide covers the migration and what to watch for with SOC 2 compliance.
Communications: Twilio per-message and per-minute pricing adds up at scale. At 500,000+ SMS/month or meaningful voice volume, alternatives like Telnyx, Vonage, or direct operator connections pay off. See the Twilio replacement guide.
Observability: Covered in Step 6. At $20-30k+/month, self-hosting Grafana, Prometheus, and Loki is worth the ops cost. The replacing Datadog guide shows the stack.
Support tooling: Intercom's per-resolution pricing scales badly. At 5,000+ resolutions/month, the bill is usually $15,000-40,000/month. The replacing Intercom guide covers the alternatives, including AI-assisted triage that handles 60-80% of volume.
Step 9: Execute it as a 30-day plan
A structured 30-day cloud cost sprint delivers 30-50% savings when tasks are sequenced correctly. The sequencing matters: commitment discounts bought before right-sizing lock in wasted spend, and chargeback without tagging in place is nearly impossible. (reducecost.cloud audit data, 2024-2026)
Knowing what to do is one thing. Getting it done without disrupting production, without wrong-ordering the tasks, and with enough documentation for the next person is another.
The cloud cost optimisation 30-day plan is the structured playbook I run on every engagement. Days 1-7 focus on visibility and quick wins (delete zombies, set retention, fix obvious waste). Days 8-14 are right-sizing compute and storage. Days 15-21 are migrations where confidence is high and commitment discounts once the floor is known. Days 22-30 are tagging, chargeback, guardrails, and documentation.
Real output from a recent run of this plan on a $42,000/month account: $19,500 monthly saving (46% reduction) in 30 calendar days. The breakdown is in the 30-day guide.
Vendor-by-vendor playbooks
The principles of cloud cost reduction are vendor-agnostic. The tools and terminology aren't. Here are the vendor-specific guides.
Across 100+ audits I've run since 2022, the average waste percentage by platform breaks down as follows (based on clients who came in without prior cost work):
| Cloud Vendor | Typical Waste % | Top 3 Levers |
|---|---|---|
| AWS | 35-60% | EC2 right-sizing, NAT Gateway, no Savings Plans |
| GCP | 30-55% | Compute over-provisioning, BigQuery SELECT *, no CUDs |
| Azure | 30-50% | VM sizing, SQL tier thresholds, Monitor ingestion |
| Vercel/Netlify | 40-80% | Function invocation spikes, bandwidth overages, wrong plan |
| Heroku | 50-70% | Dyno sizes, add-on over-provisioning, cheaper alternatives exist |
| Snowflake | 40-70% | Warehouse sizing, always-on warehouses, no auto-suspend |
For AWS: the AWS bill reduction checklist is the fastest starting point. The GCP cost optimisation playbook covers Compute Engine, GKE, BigQuery, and the hidden charges in GCP that catch teams off-guard. The Azure cost optimisation playbook covers VM right-sizing, SQL Database tier selection, and Azure Monitor ingestion.
For smaller platforms: the Vercel and Netlify cost guide covers the plan traps and when to migrate off. The Heroku exit strategy maps the migration paths to Render, Fly.io, and AWS with realistic effort estimates. The Cloudflare billing deep dive covers Workers, R2, Stream, and Argo.
For multi-cloud setups: when multi-cloud actually pays gives an honest assessment. Spoiler - most of the time it doesn't, but there are specific workload patterns where it does.
What it looks like in practice
Three client examples. Numbers are real; details are anonymised.
Case study 1: B2B SaaS, AWS, $58k/month
A 60-person SaaS company had grown from $5k/month to $58k/month in three years. Nobody had done a cost review since the seed round.
What I found:
- EC2 fleet: 180 instances, median CPU utilisation 12%. Mostly m5.xlarge and m5.2xlarge that had been running since 2022.
- RDS: 4 Multi-AZ Aurora clusters. Two were dev/staging. Multi-AZ on staging alone: $3,200/month.
- NAT Gateway: $9,100/month. The main cause was ECS pulling container images from public ECR through NAT instead of VPC endpoints.
- Datadog: $22,000/month. 14 dev hosts with full agents, 3 metrics with a request_id tag generating 2M+ unique series.
- No Savings Plans on anything.
Result: After 30 days - $27,400/month. Saving of $30,600/month (53%). No re-architecture. The biggest single change was turning off Datadog agents on dev and fixing two metric tags.
Case study 2: Fintech, GCP + Snowflake, $34k/month
A financial data platform built on GCP Compute Engine and Snowflake.
What I found:
- Compute Engine: 22 n2-standard-16 instances (64GB RAM) for workloads using 8-12GB. The team had chosen the larger instances because they "might need it."
- Snowflake: 4 virtual warehouses, two of which had auto-suspend set to "never." Combined idle cost: $7,800/month.
- BigQuery: several scheduled queries with full table scans on 10TB tables. SELECT * on unrestricted tables, no partitioning.
- No GCP Committed Use Discounts.
Result: $16,200/month after 30 days. Saving of $17,800/month (52%). The Snowflake auto-suspend fix alone: $7,200/month in week one.
Case study 3: E-commerce, AWS, $91k/month
A retailer running a large AWS footprint with multi-region deployments.
What I found:
- Cross-region data replication running at full volume in both directions. One direction was for disaster recovery only and had never been used. $14,300/month.
- CloudWatch Logs: 380 log groups, 220 set to "Never expire." Oldest logs: 2019. Setting retention to 30 days across the board recovered $5,800/month.
- Lambda: 200+ functions, average memory set to 1024MB, average actual peak usage 140MB. Fixing memory allocation to 256MB saved $4,100/month.
- EBS volumes: $3,200/month in unattached volumes. 89 volumes, none attached, some from 2021.
Result: $51,200/month after 30 days. Saving of $39,800/month (44%). The majority came from architecture rationalisation (disabling the unused replication direction) and log retention.
Common mistakes I see
These are the things I watch out for on every audit. Most are easy to fix once you know to look.
Buying commitment discounts before right-sizing. This locks in waste at a discount. The sequence is always: right-size first, then commit. I see this at least once a month.
Setting "Never expire" on CloudWatch log groups. The default retention when you create a log group via CDK or Terraform without specifying retention is "Never expire." This is a billing trap. Always set retention explicitly.
Monitoring dev environments the same as production. Full Datadog agents, full log ingestion, full retention. Dev doesn't need this. A reduced-tier agent and 3-day log retention on dev is almost always sufficient.
Ignoring NAT Gateway data transfer. Teams see a line item called "NatGateway-Bytes" and assume it's fixed. It isn't. It's traffic-driven and almost always reducible through VPC endpoints or architecture adjustments.
Right-sizing based on peak CPU, not average. Peak CPU on an instance might be 80% for 10 minutes per day. Average might be 7%. Sizing for peak is appropriate for some workloads but is automatic over-provisioning for most. Check p99 and average, not just max.
Skipping guardrails at the end. Without budget alerts, anomaly detection, and a monthly review meeting, the bill will grow back. I've seen a 50% savings evaporate in 9 months because nobody set up guardrails.
Treating SaaS costs as fixed. SaaS contracts renew. Pricing tiers change. Usage grows. The SaaS bill that was 10% of cloud spend last year might be 25% this year. It warrants the same quarterly review as the infrastructure bill.
Realistic numbers: what cloud cost reduction actually delivers
This is a composite breakdown based on a typical $50,000/month AWS account with no prior cost work done.
| Category | Before | After | Saving | Method |
|---|---|---|---|---|
| EC2 compute | $18,000 | $9,200 | $8,800 (49%) | Right-sizing + Graviton + 1yr Savings Plans |
| RDS / databases | $11,000 | $5,800 | $5,200 (47%) | Right-sizing + RDS RIs + remove dev Multi-AZ |
| Observability (CloudWatch + Datadog) | $10,500 | $3,800 | $6,700 (64%) | Log retention + cardinality + dev agent removal |
| Data transfer / NAT | $7,200 | $1,900 | $5,300 (74%) | VPC endpoints + CDN + cross-AZ fix |
| Storage (S3, EBS, snapshots) | $3,300 | $1,600 | $1,700 (52%) | Lifecycle policies + snapshot cleanup |
| Total | $50,000 | $22,300 | $27,700 (55%) | |
This is not a best-case scenario. It's a realistic median for an account that has never been formally audited. Accounts with prior cost work done will see smaller percentages; accounts with specific problem areas (Snowflake, Datadog, no Savings Plans at all) can see more.
The 30-90% range I cite reflects real audit outcomes - the lower end for accounts with some prior work, the upper end for specific high-waste patterns like a Snowflake account with no auto-suspend or an observability setup with severe cardinality bloat.
Frequently asked questions
How long does cloud cost reduction take?
With the 30-day plan, most accounts see 30-50% savings within 30 days. Quick wins (zombie resources, log retention, dev instance schedules) typically deliver 5-10% in the first week. Right-sizing and commitment discounts take 2-3 weeks once data is collected. See the 30-day plan for the exact sequencing.
Is it safe to right-size production databases?
Yes, with the right approach. Pull 14 days of metrics first. Change one instance at a time during a maintenance window. Monitor for 48-72 hours before moving on. RDS supports online instance class changes with a brief restart - typically 2-5 minutes for Multi-AZ instances.
What if I don't have full tagging coverage?
Start anyway. Use Cost Explorer's resource-level view to identify the biggest untagged costs by resource ID, then work backward to the owning team. Tagging retroactively is possible via AWS Tag Editor and GCP Resource Manager. Fixing 80% of tagging coverage delivers 80% of the benefit.
Should I buy 1-year or 3-year commitment discounts?
Almost always 1-year first. The additional 10-15% discount on 3-year terms doesn't justify the lock-in risk unless the workload is genuinely stable and the business is in a position where a 3-year commitment is reasonable. See the Savings Plans vs Reserved Instances guide for the full framework.
How do I stop the bill from growing back?
Three things: budget alerts at 50/80/100% of expected spend per account, anomaly detection turned on at the service level, and a monthly 30-minute cost review with the engineering leads. Without these, savings typically erode within 6-12 months as new resources are added without cost oversight.
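A minimal sketch of the budget-alert half of that, using boto3; the account ID, amount, and email address are placeholders, and one budget can carry several threshold notifications:

```python
import boto3

budgets = boto3.client("budgets")
account_id = "123456789012"   # placeholder

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "monthly-cloud-spend",
        "BudgetLimit": {"Amount": "50000", "Unit": "USD"},   # expected monthly spend
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": threshold,       # percent of the budget limit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
        for threshold in (50.0, 80.0, 100.0)
    ],
)
```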
If you'd rather have someone do this for you on a pay-for-savings basis, book a call. 30-minute conversation. I'll tell you where the money is going before we discuss anything else. You only pay if I actually save you money.