ElastiCache Redis Cost Optimisation: Stop Overpaying
ElastiCache Redis bills inflate through oversized nodes, unused replicas, and ignored eviction. Here's how to cut 30-50% without any application changes.
By Andrii Votiakov
Redis is one of those services that gets provisioned during a performance crisis and never revisited. The node size that stopped the fire stays running for two years, the replica count grows "for safety," and nobody checks whether the cache is actually being used well. I've seen Redis bills that were 4x what they needed to be.
Quick answer
Most ElastiCache Redis bills carry 30-50% waste. The main sources: over-sized nodes from one-time peak provisioning, replicas that aren't read from, cluster mode enabled where it's not needed, missing Reserved Node coverage, and persistence (AOF/RDB) enabled on caches that don't need it. None of this requires application changes to fix.
Node right-sizing: the numbers to pull
For ElastiCache Redis, the key metrics are used memory, CPU utilisation, and cache hit rate. Pull them for 14 days.
The sizing rule I use:
- Used memory at peak: your node should have 20-30% headroom above the actual peak watermark. Not 200% headroom.
- Engine CPU (not host CPU): Redis is single-threaded for commands. If Engine CPU is under 25% at peak, you're not CPU-constrained.
- Cache hit rate: if it's above 98%, great. If it's 85%, you might have a key eviction problem — not a node size problem.
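The 14-day pull can be scripted with the AWS CLI. A sketch, assuming a cluster member ID of `my-cache-001` (hypothetical) and GNU `date`; `CacheHitRate` is only published on recent engine versions:

```shell
# Hypothetical cluster member ID; substitute your own.
CLUSTER_ID="my-cache-001"
START=$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u +%Y-%m-%dT%H:%M:%SZ)

for METRIC in DatabaseMemoryUsagePercentage EngineCPUUtilization CacheHitRate; do
  printf '%s peak: ' "$METRIC"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ElastiCache \
    --metric-name "$METRIC" \
    --dimensions Name=CacheClusterId,Value="$CLUSTER_ID" \
    --start-time "$START" --end-time "$END" \
    --period 3600 --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' \
    --output text
done
```

The hourly `Maximum` statistic matters more than the average here: you size for the peak watermark, not the mean.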
A common pattern: cache.r6g.2xlarge (52 GB RAM, roughly $485/month per node on-demand) running with 8 GB actually used. The team sized for "what if we cache everything" and never cached everything. Drop to cache.r6g.large (13 GB RAM, roughly $121/month) with buffer. Same performance, a quarter of the cost.
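Scaling down is an online operation on modern engine versions. A sketch, assuming a replication group ID of `my-cache` (hypothetical):

```shell
# Resize every node in the replication group in place.
aws elasticache modify-replication-group \
  --replication-group-id my-cache \
  --cache-node-type cache.r6g.large \
  --apply-immediately
```

Schedule it for a low-traffic window anyway: nodes are replaced one at a time and failovers cause brief connection resets.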
Cluster mode: only when you need horizontal scale
ElastiCache Redis Cluster Mode partitions data across shards. Each shard has a primary and replicas. The cost multiplies fast: 3 shards × (1 primary + 2 replicas) = 9 node-hours per hour.
Cluster mode makes sense when:
- Your dataset is too large for a single node's RAM
- You need write throughput that a single primary can't sustain
- You need to scale reads across multiple primaries
It does not make sense when your data fits in one node. I find cluster mode enabled on 30 GB datasets on 3-shard configurations — that's paying for 3 primaries when 1 would do. Single-node or single-shard with replicas handles most workloads below 50 GB.
Check current configuration with:
```shell
aws elasticache describe-replication-groups \
  --query 'ReplicationGroups[*].[ReplicationGroupId,ClusterEnabled,MemberClusters]' \
  --output table
```
If ClusterEnabled is true and your dataset fits comfortably in the primary node size, you're paying a cluster premium for nothing.
Replicas: are they actually being read?
Each ElastiCache replica is a full node-hour. On a cache.r6g.xlarge at roughly $242/month on-demand, adding two replicas triples the cost of that replication group.
Replicas exist for two reasons: read scale-out and failover. For failover, one replica is enough. For read scale-out, reads need to actually route to replicas.
Check your application's Redis client configuration. Most codebases I review use a primary endpoint for all operations — reads and writes — because it's simpler and Redis is fast enough that developers never bothered routing reads to replicas. If that's the case, each replica beyond the first one is pure redundancy cost. One replica for failover, zero for scaling you're not using.
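Dropping surplus replicas is an online operation. A sketch, assuming a replication group ID of `my-cache` (hypothetical) and keeping one replica for failover:

```shell
# Remove replicas down to one per shard; ElastiCache picks which to delete.
aws elasticache decrease-replica-count \
  --replication-group-id my-cache \
  --new-replica-count 1 \
  --apply-immediately
```

Before running it, compare `GetTypeCmds` in CloudWatch across the group's nodes: if the replicas show near-zero reads, nothing is routed to them.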
Reserved Nodes: the easy saving nobody does
ElastiCache Reserved Nodes work the same way as EC2 Reserved Instances. 1-year all-upfront gives roughly 30-40% off on-demand pricing.
After right-sizing, buy Reserved Nodes for steady-state production clusters. The only clusters I wouldn't reserve: dev/test environments (too volatile) and anything you're planning to migrate or shut down within 6 months.
A cache.r6g.large at on-demand $0.166/hour costs about $1,454/year. A 1-year all-upfront reserved node for the same node type costs roughly $910, saving about $544 per node per year. Multiply by every production node you're running.
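As a sanity check on that arithmetic (rounded to whole dollars):

```shell
awk 'BEGIN {
  od = 0.166 * 8760   # on-demand $/hour x hours/year
  ri = 910            # approximate 1-year all-upfront price
  printf "on-demand: $%.0f/yr, reserved: $%.0f/yr, saving: $%.0f/yr\n", od, ri, od - ri
}'
# prints: on-demand: $1454/yr, reserved: $910/yr, saving: $544/yr
```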
Persistence costs: AOF and RDB on caches that don't need it
Redis supports two persistence modes: RDB (periodic snapshots) and AOF (an append-only file that logs every write). Both add I/O overhead. On ElastiCache specifically, AOF is only available on old engine versions (2.8.21 and earlier, and never on T-family nodes); on current engine versions, persistence means RDB snapshots, with Multi-AZ replication as AWS's recommended durability mechanism.
The cost issue: many teams leave persistence enabled on caches used as pure ephemeral caches: session data, rate-limit counters, query result caches. If your cache can be rebuilt from the source of truth after a restart, you don't need persistence, and disabling it can let you use smaller nodes.
More directly: if persistence is enabled on large caches with high write rates, you may be forced onto larger node sizes just to absorb the I/O. Disabling persistence and right-sizing in tandem can be a meaningful saving.
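For a pure cache, automatic RDB snapshots can be switched off by setting the retention window to zero. A sketch, with a hypothetical replication group ID:

```shell
# A retention limit of 0 disables automatic backups entirely.
aws elasticache modify-replication-group \
  --replication-group-id my-session-cache \
  --snapshot-retention-limit 0 \
  --apply-immediately
```

Only do this where a cold restart is genuinely acceptable; there is no snapshot to restore from afterwards.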
Redis Cloud vs ElastiCache vs Memorystore: which is cheaper
If you're evaluating options rather than optimising an existing setup:
ElastiCache is the right choice if you're already deep in AWS and value native VPC integration, IAM auth, and CloudWatch metrics. Pricing is competitive at scale.
Hosted Redis providers (Upstash, Redis Cloud / Redis Enterprise Cloud) make sense for small teams who want Redis-as-a-service without VPC complexity, or for multi-cloud setups. Upstash's per-command pricing is often cheaper below roughly 50 million commands/month. Above that, a reserved ElastiCache node usually wins on cost.
Google Cloud Memorystore is the clean choice if you're on GCP. Pricing is similar to ElastiCache. The same right-sizing logic applies.
Don't mix providers without a reason. Managing Redis across ElastiCache and Redis Cloud simultaneously adds operational overhead and usually isn't cheaper. The same right-sizing discipline applies to managed Postgres — oversized instances and unused replicas are the most common waste patterns across both.
Eviction policies: memory efficiency that doesn't cost money
The right eviction policy means your cache uses memory efficiently and you don't need to over-provision. The wrong policy means you either lose data you needed or waste RAM on data you don't.
Policies that work well for most use cases:
- allkeys-lru: evicts the least recently used key across all keys. Good for general caching.
- volatile-lru: only evicts keys with a TTL set. Good if you have a mix of permanent and temporary data and need to protect the permanent keys.
- noeviction: Redis returns an error on writes when memory is full. Only right for queues or other scenarios where data loss is unacceptable, not for caches.
I regularly find production caches running noeviction because that's the default. The application throws errors during traffic spikes rather than gracefully evicting old cache entries, and the team responds by bumping the node size. The real fix is changing the eviction policy.
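On ElastiCache, the eviction policy lives in the parameter group, not on the node. Default parameter groups can't be modified, so this sketch assumes the cluster already uses a custom group (`my-cache-params` is a hypothetical name):

```shell
# Switch the eviction policy to allkeys-lru; applies to every cluster
# using this parameter group, without a node restart.
aws elasticache modify-cache-parameter-group \
  --cache-parameter-group-name my-cache-params \
  --parameter-name-values "ParameterName=maxmemory-policy,ParameterValue=allkeys-lru"
```

If the cluster is on a default group, create a custom group first and attach it via modify-replication-group.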
What I check in an actual audit
- Used memory vs. node capacity (14-day peak)
- Engine CPU utilisation pattern
- Replica count and whether reads route to replicas
- Cluster mode configuration vs. actual dataset size
- Reserved Node coverage percentage
- Persistence mode (AOF/RDB on) vs. whether data is reconstructible
- Eviction policy and eviction counter over time
Realistic numbers
Recent client running ElastiCache (~$7,800/month):
- Right-sizing 6 oversize nodes: $2,100/month
- Removing 4 unused replicas: $1,400/month
- Disabling cluster mode on 2 small single-shard clusters and reconfiguring: $600/month
- 1-year Reserved Nodes on production clusters: $1,200/month
- Persistence disabled on 3 pure-cache clusters (enabled smaller node tier): $400/month
Total saved: $5,700/month, about 73% of the original bill. Three days of configuration work.
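The line items above add up, as a quick check:

```shell
awk 'BEGIN {
  saved = 2100 + 1400 + 600 + 1200 + 400   # monthly savings per step
  bill  = 7800                             # original monthly bill
  printf "saved: $%d/mo (%.0f%% of bill)\n", saved, 100 * saved / bill
}'
# prints: saved: $5700/mo (73% of bill)
```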
If you want me to do the same on your Redis or ElastiCache setup on a pay-for-savings basis, book a call.