Description
The Redis pod depends on a PersistentVolume that is currently bound to an EBS volume in the us-east-1c Availability Zone. I'm not sure whether this volume was created by hand (I don't see it declared anywhere in this repo, but maybe it's implicit). The volume persists between cluster redeploys (I'm not sure that's strictly needed either; do we keep any data in Redis that can't be discarded?). In any case, because an EBS volume can only be attached to instances in its own AZ, Kubernetes can only schedule the Redis pod onto a node in us-east-1c, so the cluster needs at least one worker node there.
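For anyone unfamiliar with why the AZ matters here: a PV backed by an EBS volume carries a `nodeAffinity` term pinning it to the volume's zone, and that is what forces the scheduler's hand. A rough sketch of what the existing PV presumably looks like (the name, size, volume ID, and driver are guesses; the source could also be the legacy in-tree `awsElasticBlockStore` block; only the shape of the `nodeAffinity` section is the point):

```yaml
# Hypothetical PersistentVolume; values are placeholders, not taken from the repo.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: redis-data                          # placeholder name
spec:
  capacity:
    storage: 10Gi                           # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0     # placeholder EBS volume ID
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone   # exact key depends on which driver created the PV
              operator: In
              values:
                - us-east-1c                # pods using this PV can only run on nodes in this AZ
```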
But nothing in our Terraform config guarantees that, and with the cluster spread across three AZs in us-east-1 and only two worker nodes, it's entirely possible for EKS to place both nodes outside us-east-1c. In that case the EBS volume can't be attached to any node, so Redis can't start; without Redis there's no caching of API responses, and the backend becomes very slow.
I attempted to fix this in #62: by using three t3.large nodes instead of two t3.xlarge nodes, I hoped to guarantee that at least one node would land in us-east-1c. But the fix didn't deploy successfully, because the app pod requests more CPU cores than a t3.large has left once the other pods have been placed. Rather than fiddle with the Helm configuration, I reverted the commit in #63 to restore service.
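For reference on the sizing mismatch (the actual request values live in the Helm chart; the numbers below are placeholders): a t3.xlarge has 4 vCPUs while a t3.large has only 2, so a request on the order of two full cores can't be scheduled once DaemonSets and other system pods have claimed their share of a t3.large.

```yaml
# Hypothetical app pod resources, for illustration only; see the Helm values
# for the real numbers.
resources:
  requests:
    cpu: "2"       # roughly an entire t3.large (2 vCPU) before kube-system pods are counted
    memory: 4Gi    # placeholder
```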
Ironically, the new Terraform deployment triggered by the revert did happen to place a worker node in us-east-1c, so Redis is back up and running and the backend is operating normally. But this issue isn't fixed yet, and is likely to bite again the next time someone changes the cluster configuration in a way that requires deploying new nodes (e.g. updating to a new k8s version or AMI).