How To Upgrade Production-Grade Kubernetes (EKS) Clusters With Zero Downtime

EKS & Terraform, image credit - Usama Malik - Medium

Prerequisites

  • Ensure you have recent versions of awscli, kubectl, and terragrunt installed; kubectl should be within one minor version of the cluster's control plane.
  • Verify access to the AWS account and permissions to modify EKS resources.
  • Review AWS upgrade documentation for compatibility concerns.
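EKS upgrades the control plane one minor version at a time, so it can help to sanity-check the target version before touching any configuration. The helper below is a minimal sketch: is_next_minor is a hypothetical name (not part of any AWS tooling), and it assumes versions are written as MAJOR.MINOR, as in the examples in this post.

```shell
#!/bin/sh
# is_next_minor CURRENT TARGET   (hypothetical helper)
# Succeeds only when TARGET is exactly one minor version above CURRENT
# within the same major version. Assumes MAJOR.MINOR input, e.g. 1.25.
is_next_minor() {
  cur_major=${1%%.*}; cur_minor=${1#*.}
  tgt_major=${2%%.*}; tgt_minor=${2#*.}
  [ "$cur_major" = "$tgt_major" ] && [ "$tgt_minor" -eq $((cur_minor + 1)) ]
}

# Example:
#   is_next_minor 1.25 1.26 && echo "single-step upgrade"
```

If the check fails, the upgrade would need to be split into several single-step upgrades.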

Upgrade Steps

1. Upgrade EKS Cluster Version

Update the cluster_version in your Terraform/Terragrunt configuration and apply the changes:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "my-cluster"
  cluster_version = "1.26" # previously "1.25"
  # ...
}

terragrunt apply --auto-approve
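Once the apply finishes, it is worth confirming that the control plane actually reports the new version before moving on. The following is a rough sketch using aws eks wait cluster-active and aws eks describe-cluster; verify_cluster_version is a hypothetical helper name, and my-cluster is the cluster name from the configuration above.

```shell
#!/bin/sh
# verify_cluster_version CLUSTER EXPECTED   (hypothetical helper)
# Waits until the cluster is ACTIVE, then compares the control-plane
# version reported by EKS with the expected one.
verify_cluster_version() {
  aws eks wait cluster-active --name "$1" || return 1
  actual=$(aws eks describe-cluster --name "$1" \
    --query 'cluster.version' --output text) || return 1
  if [ "$actual" = "$2" ]; then
    echo "cluster $1 is on $actual"
  else
    echo "cluster $1 is on $actual, expected $2" >&2
    return 1
  fi
}

# Example (requires AWS credentials):
#   verify_cluster_version my-cluster 1.26
```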

2. Update Kubernetes Add-ons

After upgrading the cluster, update critical add-ons:

Check the AWS documentation for each add-on to find the version that matches the new cluster version, then upgrade each one by updating the image tag in its Deployment or DaemonSet:

kubectl set image daemonset/kube-proxy -n kube-system kube-proxy=<image>:<new-image-tag>
kubectl set image deployment/coredns -n kube-system coredns=<image>:<new-image-tag>
kubectl set image daemonset/aws-node -n kube-system aws-node=<image>:<new-image-tag>
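After changing the images, the rollouts can be watched before continuing. This is a small sketch built on kubectl rollout status; the function name check_addon_rollouts and the 180-second timeout are illustrative choices, not values from AWS documentation.

```shell
#!/bin/sh
# check_addon_rollouts   (hypothetical helper)
# Waits for each add-on workload in kube-system to finish rolling out
# and stops at the first one that is not ready within the timeout.
check_addon_rollouts() {
  for workload in daemonset/kube-proxy deployment/coredns daemonset/aws-node; do
    kubectl rollout status "$workload" -n kube-system --timeout=180s || return 1
  done
}

# Example:
#   check_addon_rollouts && echo "add-ons rolled out"
```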

3. Upgrade Managed Node Groups

To avoid downtime, add a node group on the new version, then scale the old node group down to zero:

eks_managed_node_groups = {
  # Old node group: scale down to zero
  "default-1.25" = {
    min_size     = 0
    max_size     = 1 # EKS requires max_size to be at least 1
    desired_size = 0
  }
  # New node group on the upgraded version
  "default-1.26" = {
    min_size     = 2
    max_size     = 5
    desired_size = 2
  }
}
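To see how far the migration has progressed, one option is to count nodes per kubelet version. The snippet below is a sketch of that idea; nodes_per_version is a hypothetical helper name.

```shell
#!/bin/sh
# nodes_per_version   (hypothetical helper)
# Prints one line per kubelet version in the form "<count> <version>".
nodes_per_version() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
    sort | uniq -c | awk '{print $1, $2}'
}

# Example:
#   nodes_per_version
```

When only the new version shows up in the output, no old nodes remain registered.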

4. Prepare for Node Upgrade

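To prevent disruption, taint every node still on the old version except the one you plan to drain, so evicted pods are not rescheduled onto nodes that will be drained next: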

kubectl taint node <node> key=value:NoSchedule
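Tainting nodes one by one gets tedious on larger clusters, so here is a sketch that loops over a node group. It assumes the nodes carry the eks.amazonaws.com/nodegroup label that EKS sets on managed node group instances; taint_other_nodes is a hypothetical helper, and key=value is the same placeholder taint used above.

```shell
#!/bin/sh
# taint_other_nodes GROUP SKIP_NODE   (hypothetical helper)
# Taints every node in managed node group GROUP except SKIP_NODE.
# GROUP must be the node group name as EKS reports it in the
# eks.amazonaws.com/nodegroup label.
taint_other_nodes() {
  for node in $(kubectl get nodes -l "eks.amazonaws.com/nodegroup=$1" -o name); do
    name=${node#node/}                      # strip the "node/" prefix
    [ "$name" = "$2" ] && continue          # leave the node to be drained alone
    kubectl taint nodes "$name" key=value:NoSchedule --overwrite || return 1
  done
}

# Example:
#   taint_other_nodes <old-node-group> <node>
```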

5. Drain Nodes and Monitor Autoscaling

Gradually drain nodes and let Karpenter (or another autoscaler) create replacements:

kubectl drain --ignore-daemonsets --delete-emptydir-data --force node/<node>

Monitor new node provisioning and workload stability before continuing with additional nodes.
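The drain-then-watch cycle can be wrapped in a small function so that nodes are handled one at a time. This is only a sketch: drain_and_wait is a hypothetical name, the 300-second timeout is arbitrary, and kubectl wait only covers nodes that have already registered with the cluster.

```shell
#!/bin/sh
# drain_and_wait NODE   (hypothetical helper)
# Drains one node with the flags used above, waits until every
# registered node reports Ready, then lists pods that are still Pending.
drain_and_wait() {
  kubectl drain --ignore-daemonsets --delete-emptydir-data --force "node/$1" || return 1
  kubectl wait --for=condition=Ready nodes --all --timeout=300s || return 1
  kubectl get pods -A --field-selector=status.phase=Pending
}

# Example:
#   drain_and_wait <node>
```

An empty Pending list is a reasonable signal that it is safe to move on to the next node.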

6. Validate Cluster Health

  • Confirm new nodes are in Ready state:
    kubectl get nodes
    
  • Check pod status:
    kubectl get pods -A -o wide
    
  • Verify application functionality.
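For the node check, scanning the output by eye works on small clusters; on larger ones a filter helps. A minimal sketch, with not_ready_nodes as a hypothetical helper name:

```shell
#!/bin/sh
# not_ready_nodes   (hypothetical helper)
# Prints the names of nodes whose STATUS column is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1}'
}

# Example:
#   not_ready_nodes
```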

7. Remove Taints

Once all nodes are upgraded and workloads are stable, remove the taints:

kubectl taint node <node> key-
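To find out which nodes still carry the taint, the taint keys can be read from each node's spec. The sketch below assumes the placeholder taint key named key from the earlier step; nodes_with_taint is a hypothetical helper.

```shell
#!/bin/sh
# nodes_with_taint KEY   (hypothetical helper)
# Prints the names of nodes that have a taint with the given key.
nodes_with_taint() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.taints[*].key}{"\n"}{end}' |
    awk -v key="$1" '{ for (i = 2; i <= NF; i++) if ($i == key) { print $1; break } }'
}

# Example: remove the taint from every node that still has it.
#   for n in $(nodes_with_taint key); do kubectl taint nodes "$n" key-; done
```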

8. Post-Upgrade Cleanup and Verification

  • Confirm logs and metrics are normal.
  • Ensure there are no pending node terminations.
  • Run cluster-wide health checks.

Troubleshooting

  • If workloads fail to schedule, check taints and tolerations.
  • If add-ons fail to upgrade, review their logs and roll back if necessary.
  • Use kubectl describe node <node> to investigate node issues.
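Warning events are often the quickest way to spot scheduling or node problems across the whole cluster. A small sketch, where recent_warnings is a hypothetical helper and the default of 20 lines is arbitrary:

```shell
#!/bin/sh
# recent_warnings [COUNT]   (hypothetical helper)
# Shows the last COUNT (default 20) lines of Warning events from all
# namespaces, oldest first.
recent_warnings() {
  kubectl get events -A --field-selector type=Warning \
    --sort-by=.lastTimestamp | tail -n "${1:-20}"
}

# Example:
#   recent_warnings 50
```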