Invisible backlog of queued work #706

@kishiel


Problem

The external-attacher maintains a backlog of queued work that is invisible: it is not revealed in its logs or in any emitted metric.

So what

  • Service operators do not have any indication that they have misconfigured the external-attacher
  • The backlogs result in an inability to complete the volume attach/detach lifecycle within a reasonable amount of time
  • Stateful services performing deployments experience reduced availability for extended periods of time

Proposal

By default, emit metrics and/or logs about the size of the backlog and the progress being made against it.
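To make the proposal concrete, here is a minimal sketch of a work queue that can report its own backlog. It is not the external-attacher's code: `InstrumentedQueue`, `format_backlog_line`, the field names, and the volume keys are all hypothetical and only illustrate the kind of signal being asked for (queue depth, age of the oldest waiting item, and a running count of processed items).

```python
import threading
import time
from collections import deque


class InstrumentedQueue:
    """FIFO work queue that can report its own backlog (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._items = deque()          # (enqueue_time, key) pairs, oldest first
        self._lock = threading.Lock()
        self._clock = clock            # injectable so the example is testable
        self.processed = 0             # items handed to workers so far

    def add(self, key):
        with self._lock:
            self._items.append((self._clock(), key))

    def get(self):
        """Pop the oldest key, or return None when the queue is empty."""
        with self._lock:
            if not self._items:
                return None
            _, key = self._items.popleft()
            self.processed += 1
            return key

    def stats(self):
        """Snapshot of the backlog: depth and age of the oldest waiting item."""
        with self._lock:
            depth = len(self._items)
            oldest = self._clock() - self._items[0][0] if self._items else 0.0
            return {"depth": depth, "oldest_wait_seconds": oldest,
                    "processed_total": self.processed}


def format_backlog_line(stats):
    """Render the snapshot as a single log line an operator could grep for."""
    return ("backlog depth={depth} oldest_wait_seconds={oldest_wait_seconds:.0f} "
            "processed_total={processed_total}").format(**stats)


if __name__ == "__main__":
    now = [0.0]                                  # fake clock for the demo
    q = InstrumentedQueue(clock=lambda: now[0])
    q.add("vol-a")                               # hypothetical volume keys
    now[0] = 60.0
    q.add("vol-b")
    now[0] = 90.0
    print(format_backlog_line(q.stats()))
```

A line like this, emitted periodically, would have shown a depth in the thousands and an oldest wait measured in tens of minutes, which is exactly what was missing during the incident described below.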

What happened

My service began to observe some pods taking 45 minutes to come back to active despite being successfully rescheduled. Everything appeared fine except that the volumes weren't attached. At first it was just one or two, and then it was hundreds. Here's a timeline for an example where a detached volume was not attached to the new node for 27 minutes:

06:36:22 — Old node unmounts volume
06:36:24 — kube-controller-manager started detach from old node
06:38:24 — KCM started attach attempt on new node indicating the pod had been successfully rescheduled (volume still attached to old node)
06:44:42 — Last log with "volume attachment is being deleted" in KCM
06:45:44 — CSI controller - logged detaching
06:45:59 — CSI controller - detach completed from old node
< 27 minute gap >
07:13:30 — CSI controller attached volume to new node
07:13:32 — CSI controller attach completed
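For reference, the gaps between the logged events above can be worked out with a few lines of Python. The timestamps are copied from the timeline; the only assumption is that they all fall on the same day.

```python
from datetime import datetime

# Timestamps and events copied from the timeline above (same day assumed).
TIMELINE = [
    ("06:36:22", "old node unmounts volume"),
    ("06:36:24", "KCM started detach from old node"),
    ("06:38:24", "KCM started attach attempt on new node"),
    ("06:44:42", "last 'volume attachment is being deleted' log in KCM"),
    ("06:45:44", "CSI controller logged detaching"),
    ("06:45:59", "CSI controller detach completed from old node"),
    ("07:13:30", "CSI controller attached volume to new node"),
    ("07:13:32", "CSI controller attach completed"),
]


def gaps(timeline):
    """Return (seconds_since_previous_event, event) for each event after the first."""
    times = [datetime.strptime(t, "%H:%M:%S") for t, _ in timeline]
    return [(int((b - a).total_seconds()), label)
            for a, b, (_, label) in zip(times, times[1:], timeline[1:])]


if __name__ == "__main__":
    for seconds, label in gaps(TIMELINE):
        print(f"+{seconds:>5}s  {label}")
```

The largest gap is 1,651 seconds (27 minutes 31 seconds) between the detach completing and the attach to the new node, out of 2,230 seconds end to end.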

You'll notice that this timeline contains no logs from the external-attacher: by default, it emitted nothing to indicate that it was doing anything (or waiting). It took the help of EKS, EBS, and EC2 to pinpoint the problem, because the log entries across all the components involved were essentially barren with respect to the volume IDs, PVCs, or the pod. Nothing was happening and nobody knew why.

The fix ended up being to increase the worker-threads configuration from 10 to 1,000, which removed the bottleneck in the external-attacher. It had built up a queue of thousands of volumes and wasn't signaling that it was underwater. This configuration is well documented in the ebs-csi-driver readme for large-scale clusters, but using it requires you to be psychically aware that the problem lies in the external-attacher in the first place, and to know what large-scale means. The default number of workers has recently been increased to 100, which will be plenty for most service owners, but my problem didn't go away even after I'd increased it from 10 to 300. Service owners at every scale need a way to calibrate the external-attacher's throughput.
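As a rough illustration of why the worker count matters, here is a back-of-the-envelope model. It is a simplification, not a description of how the external-attacher schedules work: it assumes every operation takes the same time, each worker handles one operation at a time, and (for `drain_seconds`) no new work arrives while the backlog drains. The backlog of 3,000, the 15 seconds per operation, and the arrival rate of 5 per second are made-up inputs, not measurements from this incident.

```python
import math


def drain_seconds(backlog, workers, seconds_per_op):
    """Time to clear `backlog` items with `workers` parallel workers, no new arrivals."""
    rounds = math.ceil(backlog / workers)   # batches of `workers` items at a time
    return rounds * seconds_per_op


def min_workers(arrivals_per_second, seconds_per_op):
    """Fewest workers that keep up with a steady arrival rate (Little's law)."""
    return math.ceil(arrivals_per_second * seconds_per_op)


if __name__ == "__main__":
    backlog, seconds_per_op = 3000, 15      # hypothetical inputs
    for workers in (10, 100, 300, 1000):
        total = drain_seconds(backlog, workers, seconds_per_op)
        print(f"workers={workers:<5} drain_time={total}s")
    print("workers needed at 5 ops/s:", min_workers(5, seconds_per_op))
```

Under these assumed inputs, 10 workers need 4,500 seconds (75 minutes) to clear the backlog, while 1,000 workers need 45 seconds. Real operation latency varies and new work keeps arriving, which is why a measured backlog metric, rather than a model like this, is what operators need to calibrate the setting.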

Labels: lifecycle/stale (denotes an issue or PR that has remained open with no activity and has become stale)