EKS Cost Optimisation: Karpenter, Spot, Right-Sizing

EKS bills explode quietly. Here's the playbook I run on every cluster: control plane, node sizing, Karpenter, Spot, and the request/limit fixes that pay back fastest.

By Andrii Votiakov on 2026-03-19

EKS clusters are where engineering best intentions go to overspend. The control plane is a fixed cost, but everything else — nodes, networking, observability, ingress — compounds quietly until your $5k/month workload is a $25k/month bill. Most of it is fixable in two to four weeks.

Quick answer

EKS optimisation has five levers in priority order: (1) Karpenter for bin-packing and instance flexibility, (2) Spot for stateless workloads, (3) right-sized requests/limits, (4) Graviton wherever possible, (5) cluster consolidation. Apply all five and a typical EKS bill drops 40-65% with no functional change.

The fixed costs first

  • Control plane: $0.10/hour per cluster ($72/month). Doesn't scale, doesn't matter at small scale.
  • Cluster Autoscaler and Karpenter: both free and open source; you only pay for the nodes they launch.
  • CNI plugin (VPC CNI): free, but ENI count limits per node affect density.
  • Add-ons (CoreDNS, kube-proxy, metrics-server): free.

The big spend is everywhere else.

1. Replace Cluster Autoscaler with Karpenter

If you're still on Cluster Autoscaler in 2026, this is the fastest 20-30% reduction available. Karpenter:

  • Picks instance types dynamically based on pending pods, not pre-defined node groups
  • Handles Spot natively, including interruption
  • Bin-packs better, leading to higher utilisation
  • Consolidates underused nodes automatically (consolidationPolicy: WhenEmptyOrUnderutilized)

Minimal NodePool example:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # required in v1; assumes an EC2NodeClass named "default" exists
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values: ["2", "4", "8", "16"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Note the broad instance flexibility — Karpenter picks the cheapest fit, including ARM where allowed.

2. Spot for stateless

Anything stateless and tolerant to a 2-minute eviction notice should run on Spot. That's typically 60-75% off on-demand. With a broad pool of allowed instance types, Spot interruption rates stay low (single-digit percent in eu-west-1 across 2025 in my experience).
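
In practice "tolerant" mostly means more than one replica, clean shutdown on SIGTERM, and a PodDisruptionBudget so a node drain (Spot interruption or Karpenter consolidation) never takes the whole service down at once. A minimal PDB sketch, with illustrative names:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb                  # illustrative name
spec:
  minAvailable: 1                # keep at least one pod running while a node drains
  selector:
    matchLabels:
      app: api                   # illustrative label; match your Deployment's pod labels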

Mix Spot and on-demand in the same cluster. With both capacity types allowed in one NodePool, Karpenter prefers Spot and only falls back to on-demand when Spot capacity isn't available:

- key: karpenter.sh/capacity-type
  operator: In
  values: ["spot", "on-demand"]

If you split Spot and on-demand into separate NodePools instead, give the Spot pool a higher spec.weight so it's evaluated first.

Plus a separate NodePool for workloads tagged "on-demand only" (databases, queue consumers with stateful work).
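
A sketch of what that on-demand-only pool can look like; the pool name, the taint key, and the "default" EC2NodeClass it references are illustrative:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: on-demand-only            # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-class     # illustrative taint key
          value: on-demand-only
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]

Stateful workloads then opt in with a matching toleration (and, if you want to pin them, a nodeSelector on karpenter.sh/capacity-type: on-demand).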

3. Right-size requests, not just limits

This is the cheapest fix and the one nobody does. Most workloads are wildly over-provisioned at the request level.

Audit:

kubectl top pods --all-namespaces --containers \
  | awk '{print $1, $2, $3, $4}' \
  | sort -k 4 -rh

Compare to deployments:

kubectl get deployments -A -o json \
  | jq -r '.items[] | "\(.metadata.namespace) \(.metadata.name) \(.spec.template.spec.containers[0].resources.requests.cpu) \(.spec.template.spec.containers[0].resources.requests.memory)"'

If actual CPU usage is 5% of requested CPU consistently, the request is wrong. Lower it. Same for memory (with a safety margin — OOM is worse than wasted request).
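
What the fix looks like in a container spec, with illustrative numbers for a service that peaks around 150m CPU and 300Mi memory:

resources:
  requests:
    cpu: 200m          # observed peak ~150m plus headroom
    memory: 400Mi      # observed peak ~300Mi plus margin
  limits:
    memory: 512Mi      # keep a memory limit; an OOM kill costs more than a little slack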

Tools that automate this: Goldilocks, KRR (Kubernetes Resource Recommender), VPA in recommendation mode. Goldilocks is a good first stop because it surfaces recommendations as a dashboard.
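
If you want continuous recommendations without automatic pod restarts, VPA in recommendation-only mode looks roughly like this (the target Deployment name is illustrative); the suggested requests then appear in the object's status:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-recommender          # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                    # illustrative target Deployment
  updatePolicy:
    updateMode: "Off"            # recommend only, never evict or resize pods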

4. Graviton for nodes

Add arm64 to your requirements and rebuild your container images for linux/arm64. Most modern stacks (Node, Python, Go, Java 11+) work without changes. Graviton instances are typically 20% cheaper at equal or better performance.

Add a taint for arm64 and let pods opt in via tolerations during the migration if you want a phased rollout.
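
A sketch of that opt-in pattern, assuming a dedicated arm64 NodePool during the migration and an illustrative taint key. On the Graviton NodePool template:

taints:
  - key: arch-migration          # illustrative key
    value: arm64
    effect: NoSchedule

And on pods whose images have been validated on ARM:

tolerations:
  - key: arch-migration
    operator: Equal
    value: arm64
    effect: NoSchedule
nodeSelector:
  kubernetes.io/arch: arm64      # land validated pods on Graviton nodes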

5. Cluster consolidation

Many companies run 5-15 EKS clusters when 2-3 would do. Reasons it spreads:

  • "Per-team isolation" — solved with namespaces, NetworkPolicies, and quotas
  • "Per-environment" — usually fine with one staging + one prod
  • "Per-region" — only if you have real multi-region traffic

Each cluster pays for control plane + add-on observability + ingress + Karpenter overhead. Consolidating from 8 clusters to 3 saves ~$2-5k/month before you touch a single workload.

6. Networking and observability

Quick checks I always run:

  • NAT Gateway processing: see the NAT Gateway post. EKS is a major NAT spender.
  • Cross-AZ traffic: enable topology-aware routing on Services (annotation sketch after this list)
  • CloudWatch Container Insights: check cardinality (see the CloudWatch post)
  • Datadog/New Relic agent: per-pod or per-host pricing — pick what fits your fleet shape
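
The topology-aware routing piece is a single Service annotation on Kubernetes 1.27+ (older versions use the topology-aware-hints annotation instead); the Service name and selector here are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: api                                       # illustrative Service
  annotations:
    service.kubernetes.io/topology-mode: Auto     # keep traffic in-zone where endpoints allow
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080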

7. Pause non-prod overnight

Dev and staging clusters running 24/7 are paying for nights and weekends. KEDA, kube-downscaler, or a cron Lambda that scales non-prod workloads to zero between 19:00 and 07:00 (Karpenter's consolidation then removes the empty nodes) cuts non-prod spend by roughly 50% with one day's work.
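
If you go the kube-downscaler route, it's one annotation per Deployment (or per namespace to cover everything in it); a sketch assuming its downscaler/uptime annotation and Berlin office hours:

metadata:
  annotations:
    downscaler/uptime: "Mon-Fri 07:00-19:00 Europe/Berlin"   # scaled to zero outside these hours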

Real numbers

Recent client (~$11k/month EKS), monthly savings per lever:

  • Cluster Autoscaler → Karpenter (with consolidation): $1,800/month
  • Spot for ~70% of nodes: $2,400/month
  • Resource request right-sizing across 80 deployments: $1,200/month
  • Graviton migration on 3 services: $700/month
  • Non-prod cluster overnight pause: $900/month
  • ECR + Logs VPC endpoints (NAT cuts): $600/month

Final bill: $3,400/month, a ~69% reduction.


If you want help running this on your cluster — including the Karpenter and Spot transition — book a call.