AWS Cost Optimization: An SRE's Playbook
The Bill That Woke Me Up
Three years ago, I got an AWS bill for $2,800. For a side project. That was the moment I stopped treating cloud costs as someone else's problem.
Since then, I've cut that bill to $1,800/month while increasing reliability and capacity. This isn't theoretical — these are the exact techniques I use, with real numbers.
1. Right-Sizing: The Low-Hanging Fruit
Most teams over-provision because they guessed during setup. AWS's own data shows 40% of EC2 instances are under-utilized.
# Check your actual utilization (daily averages over the last 14 days; assumes GNU date)
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890 \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 86400 \
  --statistics Average
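Rule of thumb: If your average CPU is below 20% and memory below 40% for 14 days, you can probably drop one instance size. Note that EC2 doesn't publish memory metrics to CloudWatch by default; you need the CloudWatch agent on the instance to collect them. I saved $180/month just by moving a t3.xlarge to t3.large.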
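The rule of thumb above can be sketched as a small check over the datapoints that get-metric-statistics returns. This is a minimal sketch, not a sizing tool: the function name and parameters are illustrative, and the memory datapoints are assumed to come from a metric published by the CloudWatch agent.

```python
from statistics import mean


def can_downsize(cpu_datapoints, mem_datapoints,
                 cpu_threshold=20.0, mem_threshold=40.0, min_days=14):
    """Apply the 20% CPU / 40% memory rule to daily CloudWatch datapoints.

    Each datapoint is a dict with an 'Average' key, as returned by
    get-metric-statistics with --period 86400 --statistics Average.
    """
    # Refuse to decide on less than the full observation window.
    if len(cpu_datapoints) < min_days or len(mem_datapoints) < min_days:
        return False
    avg_cpu = mean(dp["Average"] for dp in cpu_datapoints)
    avg_mem = mean(dp["Average"] for dp in mem_datapoints)
    return avg_cpu < cpu_threshold and avg_mem < mem_threshold


# Hypothetical data: 14 days at 12% CPU and 35% memory.
cpu = [{"Average": 12.0}] * 14
mem = [{"Average": 35.0}] * 14
print(can_downsize(cpu, mem))
```

Averages hide spikes, so it may be worth repeating the check with the Maximum statistic before resizing anything latency-sensitive.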
2. Reserved Instances and Savings Plans
If you're running anything stable (databases, core services, CI/CD), you're leaving money on the table without commitments.
- 1-year No Upfront RI: ~30% savings on baseline compute
- Compute Savings Plans: ~25% savings with flexibility across instance families
- S3 Intelligent-Tiering: Not a commitment, but the same set-and-forget idea for storage. It automatically moves objects between access tiers
I run Savings Plans on my baseline (3x m6i.large) and use Spot for burst workloads. This alone saves me ~$400/month.
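Whether a commitment pays off depends on how many hours the workload really runs. Here is a rough sketch of the break-even arithmetic, assuming a commitment bills every hour of the month at the discounted rate while On-Demand bills only the hours used. The $0.10/hour rate is a made-up figure for illustration.

```python
def break_even_utilization(discount):
    """Fraction of hours an instance must run for a commitment to pay off.

    A commitment bills every hour at (1 - discount) of the On-Demand rate,
    while On-Demand bills only the hours actually used.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def monthly_costs(hourly_rate, discount, hours_used, hours_in_month=730):
    """Return (on_demand_cost, committed_cost) for one instance."""
    on_demand = hourly_rate * hours_used
    committed = hourly_rate * (1 - discount) * hours_in_month
    return on_demand, committed


# Hypothetical $0.10/hour instance with the ~30% RI discount quoted above.
print(break_even_utilization(0.30))
print(monthly_costs(0.10, 0.30, hours_used=730))  # always on: commitment wins
print(monthly_costs(0.10, 0.30, hours_used=400))  # part time: On-Demand wins
```

With a 30% discount, anything running less than roughly 70% of the month is cheaper On-Demand, which is why commitments suit the stable baseline and not burst capacity.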
3. Spot Instances for Fault-Tolerant Workloads
Spot instances are typically 60-90% cheaper than On-Demand. The catch: they can be interrupted with two minutes' notice. Use them for:
- CI/CD build runners
- Data processing jobs
- Stateless web servers behind an ALB
- Kubernetes worker nodes (with proper drain handling)
# Spot Fleet request for CI runners (role ARN, AMI, key, and group ID are placeholders)
{
  "SpotFleetRequestConfig": {
    "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role",
    "AllocationStrategy": "capacityOptimized",
    "TargetCapacity": 2,
    "LaunchSpecifications": [
      {
        "InstanceType": "c5.xlarge",
        "ImageId": "ami-12345678",
        "KeyName": "ci-key",
        "SecurityGroups": [{"GroupId": "sg-ci"}]
      }
    ]
  }
}
I use Spot for all my CI/CD pipelines. My average runner cost dropped from $0.17/hour to $0.04/hour.
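Handling the two-minute warning usually means polling the instance metadata service for an interruption notice. The sketch below assumes IMDSv2 and the spot/instance-action metadata path. The function names are illustrative, and what you do once a notice arrives (stop taking jobs, drain the node) is left out.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def seconds_until_interruption(body, now=None):
    """Parse a spot/instance-action document and return seconds remaining."""
    notice = json.loads(body)
    # The 'time' field is UTC, e.g. "2025-01-01T12:02:00Z".
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (when - now).total_seconds()


def fetch_interruption_notice(timeout=2):
    """Return the notice body, or None if no interruption is scheduled.

    Only works on an EC2 instance; uses IMDSv2 session tokens.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is pending
            return None
        raise


# Offline example using the shape of an interruption notice.
sample = '{"action": "terminate", "time": "2025-01-01T12:02:00Z"}'
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_interruption(sample, now))
```

A CI runner could call fetch_interruption_notice every few seconds and stop accepting new jobs as soon as it returns a body.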
4. Storage: Where Costs Hide
Storage costs sneak up on you. EBS volumes, S3 buckets, snapshots — they accumulate silently.
- Delete unattached EBS volumes: Check monthly, automate the cleanup
- Snapshot lifecycle policies: Keep 7 daily, 4 weekly, 12 monthly — not forever
- S3 lifecycle rules: Move to Glacier after 90 days, expire after 365
- EBS gp3 > gp2: About 20% cheaper per GB, with a 3,000 IOPS baseline regardless of volume size, and IOPS and throughput tunable independently of capacity
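The S3 lifecycle rule from the list above (Glacier after 90 days, expire after 365) could be expressed as the configuration dict that boto3's put_bucket_lifecycle_configuration accepts. This is a sketch under assumptions: the rule ID is made up, and the empty prefix applies the rule to the whole bucket.

```python
def lifecycle_config(glacier_after_days=90, expire_after_days=365, prefix=""):
    """Build a LifecycleConfiguration dict for put_bucket_lifecycle_configuration."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [{
            "ID": "archive-then-expire",      # hypothetical rule name
            "Filter": {"Prefix": prefix},     # empty prefix = whole bucket
            "Status": "Enabled",
            "Transitions": [
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": expire_after_days},
        }]
    }


def apply_lifecycle(bucket_name):
    """Apply the rule to a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without it
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=lifecycle_config())


config = lifecycle_config()
print(config["Rules"][0]["Transitions"], config["Rules"][0]["Expiration"])
```

Be aware that this call replaces the bucket's entire lifecycle configuration, so any existing rules need to be included in the same request.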
# Find unattached EBS volumes (often $50-200/month wasted)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Type:VolumeType}'
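To put a dollar figure on what that query finds, the output can be fed into a small estimator. This is a rough sketch: the per-GB prices are assumed us-east-1 list prices and will differ by region, and provisioned IOPS or throughput charges are ignored.

```python
# Assumed per-GB monthly prices (us-east-1 list prices; check your region).
PRICE_PER_GB_MONTH = {"gp2": 0.10, "gp3": 0.08}


def monthly_waste(volumes, prices=PRICE_PER_GB_MONTH, default_price=0.10):
    """Estimate the monthly storage cost of unattached volumes.

    `volumes` is the list produced by the describe-volumes query above:
    dicts with 'ID', 'Size' (GiB), and 'Type'.
    """
    return sum(v["Size"] * prices.get(v["Type"], default_price) for v in volumes)


# Hypothetical output of the CLI query, parsed from JSON.
volumes = [
    {"ID": "vol-aaa", "Size": 500, "Type": "gp2"},
    {"ID": "vol-bbb", "Size": 100, "Type": "gp3"},
]
print(f"${monthly_waste(volumes):.2f}/month")
```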
5. Network Costs: The Silent Killer
Data transfer is where AWS makes its real money. Inter-AZ traffic, NAT Gateway, and cross-region replication add up fast.
- NAT Gateway vs NAT Instance: A t3.micro NAT instance costs $7/month. A NAT Gateway costs $32/month + data processing fees. For low-traffic environments, use the instance.
- VPC Endpoints: S3 and DynamoDB Gateway endpoints are free. They keep traffic off the NAT.
- Consolidate services in the same AZ: Inter-AZ traffic is $0.01/GB each way, so every GB that crosses an AZ boundary costs $0.02 in total.
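The NAT and inter-AZ numbers above are easy to turn into a quick comparison. In this sketch the fixed monthly costs come from the list above, while the $0.045/GB NAT Gateway processing fee is an assumed regional rate. Internet egress is left out because it is billed the same either way.

```python
NAT_GATEWAY_FIXED = 32.0     # $/month, from the comparison above
NAT_INSTANCE_FIXED = 7.0     # $/month for a t3.micro, from the comparison above
NAT_GATEWAY_PER_GB = 0.045   # assumed processing fee; check your region
INTER_AZ_PER_GB = 0.01       # charged in each direction


def nat_gateway_cost(gb_per_month):
    """Fixed hourly charge plus per-GB data processing."""
    return NAT_GATEWAY_FIXED + NAT_GATEWAY_PER_GB * gb_per_month


def nat_instance_cost(gb_per_month):
    """A NAT instance has no per-GB processing fee, so traffic doesn't matter."""
    return NAT_INSTANCE_FIXED


def inter_az_cost(gb_per_month):
    """Billed on both sides, so each GB crossing AZs costs 2 x $0.01."""
    return 2 * INTER_AZ_PER_GB * gb_per_month


for gb in (10, 100, 1000):
    extra = nat_gateway_cost(gb) - nat_instance_cost(gb)
    print(f"{gb:>5} GB: NAT Gateway costs ${extra:.2f}/month more")
```

The comparison ignores the operational cost of patching and monitoring a NAT instance yourself, which is the main reason to pay for the managed gateway in production.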
6. Automated Cleanup
Manual cleanup doesn't scale. Set up automation:
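# Lambda: Delete old EBS snapshots (older than 90 days)
import boto3
from datetime import datetime, timedelta, timezone

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    # StartTime is timezone-aware (UTC), so compare against an aware cutoff
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    # describe_snapshots is paginated; walk every page, not just the first
    paginator = ec2.get_paginator('describe_snapshots')
    for page in paginator.paginate(OwnerIds=['self']):
        for snap in page['Snapshots']:
            if snap['StartTime'] < cutoff:
                # Raises an error for snapshots still backing a registered AMI
                ec2.delete_snapshot(SnapshotId=snap['SnapshotId'])
                print(f"Deleted: {snap['SnapshotId']}")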
7. Environment Tiering
Not every environment needs production-grade infrastructure:
- Production: RIs, multi-AZ, reserved capacity
- Staging: Spot + On-Demand mix, single-AZ acceptable
- Dev/Test: Scheduled shutdown (Lambda to stop instances at 8 PM), Spot only, smallest viable sizes
My dev environment auto-stops at night and on weekends. That's a 65% reduction in non-production costs.
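The 65% figure follows from simple schedule arithmetic, and the shutdown itself is a few lines of Lambda. The sketch below assumes dev instances carry an Environment=dev tag (a hypothetical convention) and that the environment runs 12 hours a day on weekdays. The 8 AM start is my assumption; only the 8 PM stop comes from the list above.

```python
def uptime_fraction(hours_per_day, days_per_week):
    """Share of the week an environment is running."""
    return (hours_per_day * days_per_week) / (24 * 7)


def running_dev_instance_ids(describe_instances_response, tag_value="dev"):
    """Pick running instances tagged Environment=<tag_value>.

    The input is shaped like the response of boto3's ec2.describe_instances().
    """
    ids = []
    for reservation in describe_instances_response["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            is_running = inst["State"]["Name"] == "running"
            if is_running and tags.get("Environment") == tag_value:
                ids.append(inst["InstanceId"])
    return ids


def lambda_handler(event, context):
    """Stop dev instances; meant to be triggered by a schedule at 8 PM."""
    import boto3  # lazy import so the helpers above run without it
    ec2 = boto3.client("ec2")
    # Single call for brevity; large accounts would need pagination.
    ids = running_dev_instance_ids(ec2.describe_instances())
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids


# 12 hours a day, weekdays only.
saved = 1 - uptime_fraction(12, 5)
print(f"{saved:.0%} fewer instance-hours")
```

Sixty running hours out of 168 is about a 64% cut in instance-hours, in line with the 65% above. Stopped instances still pay for their EBS volumes, so the saving on the total bill is slightly smaller.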
The Results
Before optimization:
EC2: $1,200/mo
RDS: $450/mo
S3/EBS: $320/mo
Data Xfer: $280/mo
NAT/Network: $150/mo
Other: $400/mo
Total: $2,800/mo
After optimization:
EC2 (Spot+RI): $680/mo
RDS (RI): $310/mo
S3 (Lifecycle): $180/mo
Data Xfer: $120/mo
NAT (VPC EP): $45/mo
Other: $265/mo
Total: $1,800/mo
Savings: $1,000/month (36%)
Where to Start
- Week 1: Enable AWS Cost Explorer. Identify your top 5 cost drivers.
- Week 2: Right-size your instances. Delete unattached volumes.
- Week 3: Set up Spot for non-production workloads.
- Week 4: Purchase Savings Plans for your baseline.
- Ongoing: Monthly cost review. Automate cleanup.
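For the Week 1 step, the top cost drivers can also be pulled programmatically from Cost Explorer. This is a minimal sketch assuming boto3's get_cost_and_usage grouped by service. The sample response is invented and trimmed to the fields the ranking uses.

```python
from collections import defaultdict


def top_cost_drivers(response, n=5):
    """Rank services by total UnblendedCost in a get_cost_and_usage response."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def fetch_costs(start, end):
    """Query Cost Explorer grouped by service (needs boto3 and credentials)."""
    import boto3  # lazy import so the ranking above runs without it
    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},  # "YYYY-MM-DD" strings
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


# Hypothetical response fragment, trimmed to the fields used above.
sample = {"ResultsByTime": [{"Groups": [
    {"Keys": ["Amazon Relational Database Service"],
     "Metrics": {"UnblendedCost": {"Amount": "450.0", "Unit": "USD"}}},
    {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
     "Metrics": {"UnblendedCost": {"Amount": "1200.0", "Unit": "USD"}}},
]}]}
for service, cost in top_cost_drivers(sample):
    print(f"{service}: ${cost:,.2f}")
```

Cost Explorer API requests are billed per call, so a monthly scheduled run fits the review cadence better than polling.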
Cost optimization isn't a one-time project. It's an ongoing discipline. Start with the biggest wins (right-sizing + Spot), then layer on the refinements.
Questions about your AWS bill? Get in touch.