EKS Cost Optimisation: Karpenter, Spot, Right-Sizing
EKS bills explode quietly. Here's the playbook I run on every cluster: control plane, node sizing, Karpenter, Spot, and the request/limit fixes that pay back fastest.
By Andrii Votiakov
EKS clusters are where engineering best intentions go to overspend. The control plane is a fixed cost, but everything else — nodes, networking, observability, ingress — compounds quietly until your $5k/month workload is a $25k/month bill. Most of it is fixable in two to four weeks.
Quick answer
EKS optimisation has five levers in priority order: (1) Karpenter for bin-packing and instance flexibility, (2) Spot for stateless workloads, (3) right-sized requests/limits, (4) Graviton wherever possible, (5) cluster consolidation. Apply all five and a typical EKS bill drops 40-65% with no functional change.
The fixed costs first
- Control plane: $0.10/hour per cluster ($72/month). Flat per cluster; negligible for one cluster, real money across a fleet of them.
- Cluster autoscaler vs Karpenter: free.
- CNI plugin (VPC CNI): free, but per-instance ENI/IP limits cap pod density per node (prefix delegation raises the cap).
- Add-ons (CoreDNS, kube-proxy, metrics-server): free.
The big spend is everywhere else.
1. Replace Cluster Autoscaler with Karpenter
If you're still on Cluster Autoscaler in 2026, this is the fastest 20-30% reduction available. Karpenter:
- Picks instance types dynamically based on pending pods, not pre-defined node groups
- Handles Spot natively, including interruption
- Bin-packs better, leading to higher utilisation
- Consolidates underused nodes automatically (consolidationPolicy: WhenEmptyOrUnderutilized)
Minimal NodePool example:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:            # required in karpenter.sh/v1; assumes an EC2NodeClass named "default"
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["2", "4", "8", "16"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
Note the broad instance flexibility — Karpenter picks the cheapest fit, including ARM where allowed.
2. Spot for stateless
Anything stateless and tolerant to a 2-minute eviction notice should run on Spot. That's typically 60-75% off on-demand. With a broad pool of allowed instance types, Spot interruption rates stay low (single-digit percent in eu-west-1 across 2025 in my experience).
Mix Spot + on-demand in the same cluster:
- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot", "on-demand"]
When both capacity types are allowed, Karpenter prefers Spot where capacity exists. If you instead split Spot and on-demand into separate NodePools, prioritise the Spot pool with the NodePool-level spec.weight field (e.g. weight: 100); weight is not a valid field inside a requirement.
Plus a separate NodePool for workloads tagged "on-demand only" (databases, queue consumers with stateful work); a sketch of that pool follows.
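A minimal sketch of that on-demand pool, assuming the same "default" EC2NodeClass as above; the pool name and taint key are illustrative:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-only           # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumes an existing EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # Only pods carrying a matching toleration land on these nodes
      taints:
        - key: workload/on-demand-only   # illustrative taint key
          effect: NoSchedule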
3. Right-size requests, not just limits
This is the cheapest fix and the one nobody does. Most workloads are wildly over-provisioned at the request level.
Audit:
kubectl top pods --all-namespaces --containers --no-headers \
| awk '{print $1, $2, $3, $4}' \
| sort -k 4 -rh
Compare to deployments:
kubectl get deployments -A -o json \
| jq -r '.items[] | "\(.metadata.namespace) \(.metadata.name) \(.spec.template.spec.containers[0].resources.requests.cpu) \(.spec.template.spec.containers[0].resources.requests.memory)"'
If actual CPU usage is 5% of requested CPU consistently, the request is wrong. Lower it. Same for memory (with a safety margin — OOM is worse than wasted request).
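As a concrete before/after on one container (the numbers are illustrative, not a recommendation):
resources:
  requests:
    cpu: 100m        # was 1000m; observed p95 usage ~60m
    memory: 256Mi    # was 1Gi; observed p95 ~180Mi, keep headroom
  limits:
    memory: 512Mi    # keep the memory limit generous: an OOMKill costs more than the slack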
Tools that automate this: Goldilocks, KRR (Kubernetes Resource Recommender), VPA in recommendation mode. Goldilocks is a good first stop because it surfaces recommendations as a dashboard.
4. Graviton for nodes
Add arm64 to your requirements and rebuild your container images for linux/arm64. Most modern stacks (Node, Python, Go, Java 11+) work without changes. Graviton instances are typically 20% cheaper at equal or better performance.
Add a taint for arm64 and let pods opt in via tolerations during the migration if you want a phased rollout.
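A sketch of that phased opt-in; the taint key is illustrative, the arch label and toleration syntax are standard Kubernetes:
# On the arm64 NodePool (under spec.template.spec):
taints:
  - key: arch/arm64-only         # illustrative taint key
    effect: NoSchedule

# On a migrated workload's pod spec:
nodeSelector:
  kubernetes.io/arch: arm64
tolerations:
  - key: arch/arm64-only
    operator: Exists
    effect: NoSchedule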
5. Cluster consolidation
Many companies run 5-15 EKS clusters when 2-3 would do. Reasons it spreads:
- "Per-team isolation" — solved with namespaces, NetworkPolicies, and quotas
- "Per-environment" — usually fine with one staging + one prod
- "Per-region" — only if you have real multi-region traffic
Each cluster pays for control plane + add-on observability + ingress + Karpenter overhead. Consolidating from 8 clusters to 3 saves ~$2-5k/month before you touch a single workload.
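The per-team quota mentioned above is one ResourceQuota per namespace; a minimal sketch with illustrative names and numbers:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments       # illustrative team namespace
spec:
  hard:
    requests.cpu: "40"           # cap total requested CPU for the team
    requests.memory: 80Gi
    pods: "200"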
6. Networking and observability
Quick checks I always run:
- NAT Gateway processing: see the NAT Gateway post. EKS is a major NAT spender.
- Cross-AZ traffic: enable topology-aware routing on Services (example after this list)
- CloudWatch Container Insights: check cardinality (see the CloudWatch post)
- Datadog/New Relic agent: per-pod or per-host pricing — pick what fits your fleet shape
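For the cross-AZ item, topology-aware routing is a single annotation on the Service (Kubernetes 1.27+; the service name and ports are illustrative):
apiVersion: v1
kind: Service
metadata:
  name: api                                      # illustrative
  annotations:
    service.kubernetes.io/topology-mode: Auto    # keep traffic in-zone where endpoint distribution allows
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080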
7. Pause non-prod overnight
Dev and staging clusters running 24/7 are paying for nights and weekends. KEDA, kube-downscaler, or a cron job that scales non-prod workloads to zero between 19:00 and 07:00 cuts non-prod spend ~50% with one day's work; Karpenter then consolidates the empty nodes away.
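With kube-downscaler, for example, the schedule is a namespace annotation (this assumes the controller is installed in the cluster; the namespace, window, and timezone are illustrative):
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  annotations:
    # kube-downscaler scales workloads in this namespace to zero outside the uptime window
    downscaler/uptime: Mon-Fri 07:00-19:00 Europe/Berlin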
Real numbers
Recent client (~$11k/month EKS). Monthly savings by lever:
- Cluster Autoscaler → Karpenter (with consolidation): $1,800/month
- Spot for ~70% of nodes: $2,400/month
- Resource request right-sizing across 80 deployments: $1,200/month
- Graviton migration on 3 services: $700/month
- Non-prod cluster overnight pause: $900/month
- ECR + Logs VPC endpoints (NAT cuts): $600/month
Final: $3,400/month, ~69% reduction.
If you want help running this on your cluster — including the Karpenter and Spot transition — book a call.