Problem
The external-attacher maintains an invisible backlog of work that is not revealed in its logs or in any emitted metric.
So what
- Service operators do not have any indication that they have misconfigured the external-attacher
- The backlogs result in an inability to complete the volume attach/detach lifecycle within a reasonable amount of time
- Stateful services performing deployments experience reduced availability for extended periods of time
Proposal
Emit metrics and/or logs about the backlog, and about progress against it, by default.
What happened
My service began to observe some pods taking 45 minutes to come back to active despite being successfully rescheduled. Everything appeared fine except that the volumes weren't attached. At first it was just one or two, and then it was hundreds. Here's a timeline for an example where a detached volume was not attached to the new node for 27 minutes:
06:36:22 — Old node unmounts volume
06:36:24 — kube-controller-manager started detach from old node
06:38:24 — KCM started attach attempt on new node indicating the pod had been successfully rescheduled (volume still attached to old node)
06:44:42 — Last log with "volume attachment is being deleted" in KCM
06:45:44 — CSI controller - logged detaching
06:45:59 — CSI controller - detach completed from old node
< 27 minute gap >
07:13:30 — CSI controller attached volume to new node
07:13:32 — CSI controller attach completed
You'll notice that there aren't any logs from the external-attacher in this timeline. That is because, by default, it emitted nothing to indicate that it was doing anything (or waiting). It took the help of EKS, EBS, and EC2 to pinpoint the problem, because the log entries across all involved components were essentially barren with respect to the volume IDs, the PVCs, or the pod. Nothing was happening and nobody knew why.
The fix ended up being to increase the worker-threads configuration from 10 to 1,000 in order to remove the bottleneck in the external-attacher. It had built up a queue of thousands of volumes and it wasn't signaling that it was underwater. This configuration is well documented in the ebs-csi-driver readme for large-scale clusters, but using it requires you to be psychically aware that the problem exists in the external-attacher in the first place, and to know what "large-scale" means. The default worker count has recently been increased to 100, which will be plenty for most service owners, but my problem didn't go away even after I'd increased it from 10 to 300. Service owners at every scale need a way to calibrate the external-attacher's throughput.