AWS Cost Optimization: Spot Instances, Reserved Capacity, and Not Burning Money
I just finished a cost review for a client and we found about 40% savings on their EC2 bill. Most of it was low-hanging fruit — on-demand instances running 24/7 that should have been reserved, and stateless workloads on expensive on-demand that could be running on Spot for a fraction of the cost. This is embarrassingly common. Here's the framework I use.
The Three Purchasing Models
AWS gives you three ways to pay for EC2 capacity, and picking the wrong one is how you end up with a CFO asking uncomfortable questions.
On-Demand — full price, no commitment, available immediately. Use this for: unpredictable workloads, short-term experiments, anything that can't tolerate interruption and doesn't run long enough to justify commitment. It's the most expensive option per hour and the right choice less often than people think.
Reserved Instances (RIs) — you commit to a 1- or 3-year term in exchange for up to 72% off on-demand pricing. Standard RIs lock you to a specific instance type (and, if you buy zonal, a specific AZ). Convertible RIs give you flexibility to change instance family, OS, or tenancy at the cost of a smaller discount (around 54%). Use RIs for: your baseline steady-state capacity — things that are always on, like your RDS instances (RDS has its own reserved pricing) and the core nodes in your Kubernetes cluster that run system workloads. (NAT gateways, despite their steady hourly bill, are a managed service and can't be reserved.)
Spot Instances — you pay the current Spot price for unused EC2 capacity, at up to 90% off on-demand. (You don't bid anymore — AWS retired the bidding model back in 2017; you can set an optional max price, but by default you just pay the going rate.) The catch: AWS can reclaim a Spot instance with two minutes' notice when it needs the capacity back. Use Spot for: stateless, interruption-tolerant workloads — Kubernetes worker nodes running batch jobs, CI/CD build agents, data processing pipelines. If your app can handle a node disappearing gracefully, Spot is basically free money.
The mistake I see most often: running everything on-demand because "we might need flexibility." You don't need flexibility on your production database that's been running the same instance type for two years. Buy the RI.
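"Buy the RI" has a simple sanity check: an RI bills for every hour of the term whether the instance runs or not, so it wins whenever utilization exceeds the ratio of the RI rate to the on-demand rate. A sketch with illustrative prices (placeholders, not live AWS rates):

```shell
# Illustrative hourly prices - not current AWS rates.
OD=0.096   # on-demand $/hr
RI=0.060   # effective 1-year RI $/hr (billed for all 8,760 hours of the year)
awk -v od="$OD" -v ri="$RI" 'BEGIN {
  # RI cost over a year: ri * 8760, regardless of usage.
  # On-demand cost: od * 8760 * utilization.
  # Break-even utilization: ri / od.
  printf "RI breaks even at %.1f%% utilization\n", 100 * ri / od
}'
# prints "RI breaks even at 62.5% utilization"
```

Anything that runs more than about two-thirds of the time clears that bar easily — which is why "it's always on" is the whole decision.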
Spot Instance Strategy: Don't Be Naive About It
The wrong way to use Spot is to pick one instance type in one AZ and call it done. When AWS reclaims that capacity, your pool is empty and your workload stalls.
The right approach is diversification. Use multiple instance types across multiple AZs. If m4.xlarge in us-east-1a gets reclaimed, your fleet falls back to m5.xlarge in us-east-1b. AWS's Spot pricing fluctuates by instance type and AZ independently, so diversification both reduces interruption risk and helps you find the cheapest capacity at any given moment.
For Kubernetes worker nodes, I use an Auto Scaling Group with a Mixed Instances Policy. Here's the relevant config in the JSON shape the AWS CLI accepts:
```json
{
  "AutoScalingGroupName": "k8s-workers-spot",
  "MixedInstancesPolicy": {
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    },
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "k8s-worker",
        "Version": "$Latest"
      },
      "Overrides": [
        {"InstanceType": "m4.xlarge"},
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "m4.2xlarge"},
        {"InstanceType": "m5.2xlarge"},
        {"InstanceType": "r4.xlarge"}
      ]
    }
  },
  "MinSize": 3,
  "MaxSize": 20,
  "DesiredCapacity": 6
}
```

`OnDemandBaseCapacity: 2` means the first 2 instances in the group are always on-demand — a floor of stable capacity. `OnDemandPercentageAboveBaseCapacity: 0` means everything above that floor is Spot. `SpotAllocationStrategy: capacity-optimized` launches Spot instances into whichever of your override pools (instance type × AZ combinations) currently have the deepest spare capacity, which reduces interruption rates. (Watch out: `diversified` is a Spot Fleet allocation strategy, not an ASG one — ASGs accept `lowest-price`, optionally with `SpotInstancePools`, or `capacity-optimized` and its variants.)
You can apply this via the AWS CLI. Note that `mixed-instances-policy.json` should contain just the `MixedInstancesPolicy` object from above — the name and sizes are passed as flags:
```shell
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name k8s-workers-spot \
  --mixed-instances-policy file://mixed-instances-policy.json \
  --vpc-zone-identifier "subnet-aaa,subnet-bbb,subnet-ccc" \
  --min-size 3 \
  --max-size 20 \
  --desired-capacity 6
```
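Before committing to a set of overrides, it's worth eyeballing what those pools actually cost right now — Spot prices really do diverge across instance types and AZs. One way to check (instance types here match the overrides above; swap in your own):

```shell
# Current Spot price for each candidate pool, by AZ.
# Passing a start time of "now" returns only the latest price per pool.
aws ec2 describe-spot-price-history \
  --instance-types m4.xlarge m5.xlarge m4.2xlarge m5.2xlarge r4.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].[AvailabilityZone,InstanceType,SpotPrice]' \
  --output table
```

If one pool is persistently priced near on-demand, drop it from the overrides — it's not contributing real diversification.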
Handling the 2-Minute Interruption Warning
When AWS decides it needs your Spot instance back, it sends an interruption notice via the instance metadata service. You have about 2 minutes before the instance gets terminated. That's enough time to drain gracefully if you've set things up correctly.
For Kubernetes, this is where the AWS Node Termination Handler comes in. It watches the instance metadata endpoint and, when it sees an interruption notice, cordons the node and evicts pods before the axe falls. Install it as a DaemonSet:
```shell
kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.0.0/all-resources.yaml
```
For non-Kubernetes workloads, you can poll the metadata endpoint yourself:
```shell
# Check for interruption notice - returns 404 if no notice, 200 if terminating
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -o /dev/null -w "%{http_code}" \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/termination-time
```
Run this every 5 seconds — note that cron can't fire more often than once a minute, so use a `while`/`sleep` loop under a systemd service (or a systemd timer with `OnUnitActiveSec=5s`). If you get a 200, start your drain process.
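Wired into a loop, a minimal watcher looks like this. It's a sketch: `/usr/local/bin/drain.sh` is a hypothetical hook you supply with your own drain logic (stop accepting work, flush state, deregister from the load balancer), not a real tool.

```shell
#!/bin/sh
# Minimal Spot interruption watcher (sketch). Polls IMDSv2 every 5 seconds
# and fires a drain hook once the termination-time endpoint returns 200.
while :; do
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: $TOKEN" \
    "http://169.254.169.254/latest/meta-data/spot/termination-time")
  if [ "$CODE" = "200" ]; then
    /usr/local/bin/drain.sh   # hypothetical: your graceful-shutdown hook
    break
  fi
  sleep 5
done
```

Keep the drain hook fast — two minutes sounds like a lot until you're waiting on connection draining and a final S3 flush.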
Practical Kubernetes Recommendations
Here's the pattern I actually use:
- System node pool: 3x on-demand Reserved Instances (1-year, m5.large). These run kube-system, monitoring, and anything with `priorityClassName: system-cluster-critical`. Never Spot. You do not want your metrics stack evicted.
- General workload pool: mixed-instances ASG as above, 80%+ Spot. Set a `PodDisruptionBudget` on your deployments so evictions don't take down more than N-1 replicas at a time.
- Batch/CI pool: 100% Spot, can scale to zero. `cluster-autoscaler` handles scaling. If the pool hits zero during a Spot shortage, jobs queue. Not ideal but acceptable for batch.
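To make the PodDisruptionBudget piece concrete: for a hypothetical deployment named `web` with 4 replicas and an `app=web` label (both made-up names for illustration), this caps evictions at one pod at a time during a Spot node drain:

```shell
# The eviction API will refuse to take more than one matching pod
# down at a time, so a node drain proceeds pod-by-pod.
kubectl create poddisruptionbudget web-pdb \
  --selector=app=web \
  --max-unavailable=1
```

Without a PDB, a drain happily evicts every replica that happens to be co-located on the reclaimed node at once.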
With this setup and 1-year RIs on the on-demand nodes, I typically see:
- Spot nodes: ~70% savings vs on-demand
- Reserved on-demand nodes: ~40% savings vs on-demand
- Blended across the cluster: somewhere around 55-65% off the equivalent all-on-demand bill
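That blended number is just weighted arithmetic over the fleet. A sanity check under an assumed spend split (illustrative, not client data):

```shell
# Assume 20% of baseline spend sits on reserved nodes (40% off)
# and 80% on Spot nodes (70% off). Weighted average savings:
awk 'BEGIN {
  blended = 0.20 * 0.40 + 0.80 * 0.70
  printf "blended savings: %.0f%%\n", 100 * blended
}'
# prints "blended savings: 64%"
```

Shift more spend onto the on-demand floor and the blend drops toward 40%; that's the lever to watch when the general pool grows.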
The Part People Skip
Before you buy RIs, look at your actual utilization. AWS Cost Explorer has a "Reserved Instance Recommendations" feature that analyzes your last 7/30/60 days of on-demand usage and suggests what to reserve. Use it. Buying the wrong RIs because you eyeballed it is how you end up locked into c3.large when you've actually migrated to Graviton.
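The same recommendations are available from the Cost Explorer API if you'd rather script the review (assumes Cost Explorer is enabled on the account):

```shell
# EC2 RI purchase recommendations from the last 60 days of usage,
# for a 1-year no-upfront term.
aws ce get-reservation-purchase-recommendation \
  --service "Amazon Elastic Compute Cloud - Compute" \
  --lookback-period-in-days SIXTY_DAYS \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT
```

Use the 60-day lookback unless your usage genuinely changed recently — the 7-day window will happily recommend reservations based on one busy week.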
Also: check your Savings Plans. They're more flexible than RIs and can cover Lambda and Fargate too. But that's a whole other post.