How To Upgrade Production-Grade Kubernetes (EKS) Clusters With Zero Downtime
EKS & Terraform, image credit: Usama Malik, Medium
Prerequisites
- Ensure you have the latest awscli, kubectl, and terragrunt installed.
- Verify access to the AWS account and permissions to modify EKS resources.
- Review AWS upgrade documentation for compatibility concerns.
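The tool check above can be scripted. This is a minimal sketch, assuming a POSIX shell; `missing_tools` is a hypothetical helper name, and the binary installed by awscli is `aws`.

```shell
# Print the name of every tool passed as an argument that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Usage (prints nothing when everything is installed):
#   missing_tools aws kubectl terragrunt
# To see which AWS identity your credentials resolve to:
#   aws sts get-caller-identity
```

This only confirms that the binaries exist; it does not check that they are the latest versions or that the identity has permission to modify EKS resources.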
Upgrade Steps
1. Upgrade EKS Cluster Version
Update the cluster_version in your Terraform/Terragrunt configuration and apply the changes:
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "my-cluster"
  cluster_version = "1.26" # upgraded from "1.25"
  # ....
}
terragrunt apply --auto-approve
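EKS upgrades the control plane one minor version at a time, so it may be worth guarding the apply with a quick check. The sketch below assumes versions are written as MAJOR.MINOR (for example 1.25); `is_single_minor_step` is a hypothetical helper, not part of any CLI.

```shell
# Succeed only if the target is exactly one minor version above the
# current version and the major version is unchanged.
is_single_minor_step() {
  current_major=${1%%.*}
  current_minor=${1#*.}
  target_major=${2%%.*}
  target_minor=${2#*.}
  [ "$current_major" = "$target_major" ] &&
    [ "$target_minor" -eq $((current_minor + 1)) ]
}

# Usage, reading the live version of the cluster named in the module above:
#   current=$(aws eks describe-cluster --name my-cluster \
#     --query 'cluster.version' --output text)
#   is_single_minor_step "$current" 1.26 || echo "not a single minor step" >&2
```

Running `terragrunt plan` before the apply is another way to confirm that only the expected version change is pending.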
2. Update Kubernetes Add-ons
After upgrading the cluster, update critical add-ons:
Check the AWS documentation for each add-on to find the version that matches the new cluster version, then update the image in the corresponding Deployment or DaemonSet. In the commands below, <new-image-tag> stands for the full image reference (repository and tag):
kubectl set image daemonset/kube-proxy -n kube-system kube-proxy=<new-image-tag>
kubectl set image deployment/coredns -n kube-system coredns=<new-image-tag>
kubectl set image daemonset/aws-node -n kube-system aws-node=<new-image-tag>
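Before changing an image, it can help to record what is currently deployed. The following is a sketch rather than a definitive procedure: `image_tag` and `current_image` are hypothetical helpers, and `current_image` assumes the add-on's main container is the first one in the pod template.

```shell
# Print the tag portion of an image reference of the form repository:tag.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Print the image of the first container of a kube-system workload.
# Arguments: resource kind and name, e.g. "daemonset kube-proxy".
current_image() {
  kubectl get "$1" "$2" -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[0].image}'
}

# Usage:
#   image_tag "$(current_image daemonset kube-proxy)"
#   image_tag "$(current_image deployment coredns)"
#   image_tag "$(current_image daemonset aws-node)"
```

After each `kubectl set image`, a command such as `kubectl rollout status daemonset/kube-proxy -n kube-system` shows whether the new pods came up.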
3. Upgrade Managed Node Groups
To avoid downtime, run a node group on the new version alongside the old one, then reduce the old group's desired node count to zero:
eks_managed_node_groups = {
  # Old node group: scale down to zero
  "default-1.25" = {
    min_size     = 0
    max_size     = 0
    desired_size = 0
  }
  # New node group on the upgraded version
  "default-1.26" = {
    min_size     = 2
    max_size     = 5
    desired_size = 2
  }
}
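To see which nodes still run the old kubelet version while the groups change size, something like the following sketch may be useful. `nodes_on_version` and `list_node_versions` are hypothetical helper names; the filter assumes kubelet versions are reported with a leading `v`, as in `v1.25.16-eks-...`.

```shell
# Read "NAME VERSION" lines on stdin and print the names of nodes whose
# kubelet version starts with the given MAJOR.MINOR (for example 1.25).
nodes_on_version() {
  awk -v prefix="v$1." 'index($2, prefix) == 1 { print $1 }'
}

# Produce the "NAME VERSION" listing from the live cluster.
list_node_versions() {
  kubectl get nodes --no-headers \
    -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
}

# Usage:
#   list_node_versions | nodes_on_version 1.25
```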
4. Prepare for Node Upgrade
To prevent disruption, taint all nodes except the one you plan to drain, so that evicted pods cannot be rescheduled onto them (key=value is a placeholder taint):
kubectl taint node <node> key=value:NoSchedule
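Applying the taint node by node is tedious on larger clusters, so a loop can help. This is a sketch under the assumption that every other node should receive the same placeholder taint; `nodes_except` and `taint_all_except` are hypothetical helpers.

```shell
# Read node names on stdin and print every name except the one given.
nodes_except() {
  grep -v -x -F -- "$1"
}

# Taint every node except the one about to be drained.
# key=value is the placeholder taint from the command above.
taint_all_except() {
  kubectl get nodes -o name | sed 's|^node/||' | nodes_except "$1" |
    while read -r node; do
      kubectl taint node "$node" key=value:NoSchedule --overwrite
    done
}

# Usage:
#   taint_all_except <node>
```

The `--overwrite` flag lets the loop be re-run safely if some nodes already carry the taint.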
5. Drain Nodes and Monitor Autoscaling
Gradually drain nodes and let Karpenter (or another autoscaler) create replacements:
kubectl drain --ignore-daemonsets --delete-emptydir-data --force node/<node>
Monitor new node provisioning and workload stability before continuing with additional nodes.
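The one-node-at-a-time approach can be wrapped in a small script. The sketch below defaults to a dry run that only prints the drain commands; `DRY_RUN` and `PAUSE_SECONDS` are assumed variables of this script, not kubectl settings, and the fixed pause is a stand-in for whatever monitoring you actually do between nodes.

```shell
# Print the drain command for a node, or run it when DRY_RUN=0.
drain_node() {
  set -- kubectl drain --ignore-daemonsets --delete-emptydir-data --force "node/$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Drain the given nodes in order, stopping at the first failure and
# pausing between nodes so replacement capacity can register.
drain_each() {
  for node in "$@"; do
    drain_node "$node" || return 1
    [ "${DRY_RUN:-1}" = "1" ] || sleep "${PAUSE_SECONDS:-60}"
  done
}

# Usage:
#   drain_each <node-a> <node-b>             # dry run, prints commands
#   DRY_RUN=0 drain_each <node-a> <node-b>   # actually drains
```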
6. Validate Cluster Health
- Confirm new nodes are in Ready state: kubectl get nodes
- Check pod status: kubectl get pods -A -o wide
- Verify application functionality.
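The first two checks can be reduced to numbers that are easy to watch or alert on. This is a coarse sketch: the helper names are hypothetical, a cordoned node (status `Ready,SchedulingDisabled`) counts as not Ready, and a pod in `Running` state may still have containers that are not ready.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Input: output of `kubectl get nodes --no-headers`.
count_unready_nodes() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Count pods that are neither Running nor Completed.
# Input: output of `kubectl get pods -A --no-headers` (column 4 is STATUS).
count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Usage:
#   kubectl get nodes --no-headers | count_unready_nodes
#   kubectl get pods -A --no-headers | count_unhealthy_pods
```

Both counts should settle at zero before you move on; they do not replace checking the applications themselves.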
7. Remove Taints
Once all nodes are upgraded and workloads are stable, remove the taints:
kubectl taint node <node> key-
8. Post-Upgrade Cleanup and Verification
- Confirm logs and metrics are normal.
- Ensure there are no pending node terminations.
- Run cluster-wide health checks.
Troubleshooting
- If workloads fail to schedule, check taints and tolerations.
- If add-ons fail to upgrade, review their logs and roll back if necessary.
- Use kubectl describe node <node> to investigate node issues.