Volume still hangs on Karpenter Node Consolidation/Termination #1955
I updated to Karpenter 0.35 and I'm using the AL2023 image; the problem still happens.
Some update: it seems to be caused by the Elasticsearch StatefulSet. Terminating the ES pod is very slow, so it's possible the driver pod gets killed before the ES pod, and then the volume is never released.
Update: when I set …
Nope, in another try the PV is still stuck 🤦
Hey there @levanlongktmt, thank you for reporting this. It appears your spot instances are being ungracefully terminated. Take a look at the documentation for guidance on navigating this issue: 6-Minute Delays in Attaching Volumes - What steps can be taken to mitigate this issue? If you are still running into delays but have already configured the Kubelet and enabled interruption handling in Karpenter, please let us know. Thanks!
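For anyone following along, the two mitigations referenced in that FAQ entry boil down to Kubelet graceful node shutdown plus Karpenter interruption handling. A minimal sketch, assuming a recent Karpenter chart and placeholder values (verify the exact Helm keys and kubelet fields against your versions):

```bash
# Kubelet graceful node shutdown (set via KubeletConfiguration in the node's user data);
# field names are from upstream KubeletConfiguration, values here are placeholders:
#   shutdownGracePeriod: 45s
#   shutdownGracePeriodCriticalPods: 15s

# Karpenter spot interruption handling, pointed at an SQS queue that receives EC2
# interruption notices (queue name is a placeholder; older charts used a different key):
helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace kube-system --reuse-values \
  --set settings.interruptionQueue=<interruption-queue-name>
```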
@torredil oops... my user data missed the last line; let me try with it.
Hey @torredil, I tried but still no luck :(
After a graceful shutdown of the Elasticsearch pod, the pod …
I will try setting …
Still no luck; the volume is still stuck for 6 minutes and the preStop hook is not working 😭
@levanlongktmt Thanks for the follow-up info. Something that came to mind here is that it might be possible for newer versions of Karpenter to be incompatible with the pre-stop lifecycle hook due to this change: kubernetes-sigs/karpenter#508. You'll notice that the LCH won't run if the following taint is not present on the node during a termination event:
We'll take a closer look and report back with more details or a fix. Thank you!
@torredil here are the logs of Karpenter and the Elasticsearch pod; as far as I can see …
Logs of Karpenter
Logs of Elasticsearch pod
@torredil do you think this quick and dirty fix will work 😆?
@torredil seemingly k8s does not call …
@torredil any good news on this 😀?
Hi, sorry about the wait - your issue is probably caused by what #1969 solves - Karpenter changed the taints they use when draining nodes, and our LCH needs to be changed to account for it. That fix should be available in the next release of the EBS CSI Driver, expected to happen later this month.
Amazing, thanks so much @ConnorJC3 😍
@ConnorJC3 I was trying to understand from which version of Karpenter this was changed. I was about to upgrade my CSI driver to address the fix, but I guess I will wait for the next release. I see kubernetes-sigs/karpenter#508 was merged in October 🤔
@primeroz here is the list of Karpenter versions affected: …
v1.29.0 has been released, which contains the fix.
I just upgraded my dev cluster to …
@primeroz do you have the logs from the driver from when your node was disrupted? (Also, it would be useful for you to increase your log level while collecting these.)
@alexandermarston there are no events whatsoever when I delete the node; in the EBS CSI pod running on that node I can see the … Note that in my case I am deleting the node to get it replaced by Karpenter.
@primeroz what exactly do you mean by "delete a node"? Assuming you mean something like deleting the node via the AWS Console - that sounds like ungraceful termination, which will not run pre-stop hooks as it results in an immediate and uninterruptible shutdown.
@ConnorJC3 I mean … Anyway, I am testing now with a Karpenter disruption, adding an … and collecting logs on the next node termination.
OK, I've been able to test this by manually tainting a node and then running the preStop hook with:
The preStop hook is doing everything it should, from what I understand. Again, from my limited understanding, I imagine the issue is that your service (which is using the volume) is taking longer than the … You could either try increasing the terminationGracePeriod of the EBS CSI Driver or lower the grace period of your service.
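For readers trying to reproduce this themselves, a hedged sketch of that kind of manual test follows. The taint key/value and the hook command are assumptions (they differ between Karpenter versions and driver releases), so check your node object and the ebs-csi-node DaemonSet's lifecycle definition before relying on them:

```bash
# Simulate Karpenter marking the node for disruption (taint shown is the pre-v1 Karpenter
# disruption taint; adjust for your Karpenter version).
kubectl taint node <node-name> karpenter.sh/disruption=disrupting:NoSchedule

# Run the node pod's preStop hook by hand and watch it wait for volumes to be unpublished
# (command copied from recent DaemonSet manifests; verify it matches your installed version).
kubectl exec -n kube-system <ebs-csi-node-pod> -c ebs-plugin -- \
  /bin/aws-ebs-csi-driver pre-stop-hook
```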
This is entirely possible; I am indeed testing with a service with an … I will test this 🙏
Increasing the … Also, what's interesting, to me at least, is that in the logs of the ebs-csi node I never see … I know for sure that Karpenter does not … I have been …
You won't see the logs from the preStop hook handler, as they are not logged in the same way normal pod stdout is. You will be able to see whether the preStop hook failed in the Kubernetes events, though. Can you try deleting the node again, but manually execute the preStop hook and share the output?
@AndrewSirenko thanks a lot for the detailed explanation, one question:
Do you have any link to the …?
@primeroz Apologies, I have edited my statement. I was referencing the ability to disable the 6-minute force-detach timeout, which you can see the details of here; that was added in Kubernetes 1.30 (off by default). According to the SIG Storage planning sheet, this will not be enabled by default until at least Kubernetes 1.32 (and most likely later than that if this general issue is not fixed at the upstream Kubernetes level).
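For context, the upstream feature being referenced is a kube-controller-manager flag (off by default in 1.30). The flag name below is my best understanding of the 1.30 change, and managed control planes such as EKS do not expose it, so treat this purely as an illustration:

```bash
# Self-managed control planes only; disables the force-detach-on-unmount-timeout behavior.
kube-controller-manager --disable-force-detach-on-timeout=true ...
```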
Seemingly this is what happened with me. I did 2 quick tests:
Second test: interrupt the spot node with 1 ES pod; the …
It seems the node was somehow deleted (by AWS or Karpenter) before all pods terminated, causing the 6+ minute delay.
If your pod is stuck in Terminating there is nothing we can do, because volumes are not unmounted until all pods using them are terminated. You would need to work out whatever is preventing the pod from terminating correctly and fix that issue.
I've also been doing some testing tonight to validate the workaround from the Karpenter side. What I've done is:
I whipped up a simple test where I provisioned a StatefulSet with a preStop hook, waited for the Karpenter-provisioned node to come ready and the pod to begin running, and then deleted the Karpenter node. This kicks off Karpenter's termination flow and the drain begins. I expect this to work; the workload pods begin draining before the CSI driver, and the CSI driver has a greater … Where it gets weird is that the StatefulSet pods are left in a terminating state well after their pre-stop hook … I'm honestly not familiar enough with the Kubernetes internals around StatefulSet termination and CSI drivers to understand why this fixes the problem, but with this change the suggested mitigation strategy appears to work consistently in my testing. Would love to know if either of you have any insights here @ConnorJC3 @AndrewSirenko.
Maybe I'm missing something, but by default … We are using STS/Deployments with EBS, and after setting … will the functionality of all nodes in the cluster remain the same? I mean, can any STS/Deployment on any node claim an EBS volume and release it without any issue, even on nodes with taints? And finally, can someone explain to me how this simple change solves the issue? Because DaemonSet pods are ignored during node draining, so why doesn't volume detaching work when the ebs-node pod is deployed to all nodes, no matter which taints they have?
Regarding … Regarding the preStop hook: a graceful node shutdown, or calling the eviction API to stop the EBS CSI node DaemonSet pod, invokes the preStop hook, which cleans up volumes; see the FAQ.
If we use the default value (tolerateAllTaints: true), we can easily mount an EBS volume on any node we have. Why can't volume reuse work with the default values?
This does not explain why volume reuse starts working if we run the ebs-driver DaemonSet only on specific nodes (nodes without taints, or nodes with the specific taints defined in the tolerations for the ebs-driver). Why can't volume reuse work with the default values, where the ebs-driver DaemonSet is deployed on any node with any taint?
To my understanding this is because of the combination of … During … So with the default values.yaml the … If the pre-stop hook does not run, the needed cleanup does not happen and you have the 6-minute problem.
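To make the knob being discussed concrete: in the driver's Helm chart the node DaemonSet tolerations are controlled by a value along the lines of node.tolerateAllTaints (key name taken from recent chart versions; confirm against your chart's values.yaml). A sketch of turning it off so that the node pod stops tolerating Karpenter's disruption taint and is therefore drained (running its preStop hook) like the rest of the pods:

```bash
# Assumes the standard chart repo alias "aws-ebs-csi-driver" and an existing Helm release.
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --reuse-values \
  --set node.tolerateAllTaints=false
# If some of your nodes carry custom taints the driver must still run on, add those
# specific tolerations back via node.tolerations instead of tolerating everything.
```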
@ConnorJC3 @jmdeal how can I set …? Do you mean the …?
It's not currently surfaced as part of the Helm chart or add-on configuration. I set it by patching the DaemonSet after installing the add-on:

```bash
kubectl patch ds -n kube-system ebs-csi-node --patch "$(cat <<EOF
spec:
  template:
    spec:
      terminationGracePeriodSeconds: ...
EOF
)"
```

By setting the …

@primeroz's summary was spot on. By setting …
I have been wondering why we need this but could not understand it. If I look at the Karpenter termination code https://github.com/kubernetes-sigs/karpenter/blob/5bc07be72a1acd553a2a692edd27e79c20e0e1c1/pkg/controllers/node/termination/terminator/terminator.go#L118-L130 it will evict in order (as long as they don't tolerate the …

So I would think it would …

So, unless you are running stateful workloads that are …
Currently Karpenter does not wait for one set of pods to complete termination before moving on to the next set. So long as all of the pods in the previous set have begun terminating, eviction of the next set can begin. If we changed this behavior and waited for the previous set to complete termination, then you would be right: there shouldn't be a need to configure …
Some additional information: in my cluster (EKS v1.30, Karpenter v0.37), when k8s evicts a StatefulSet pod, I randomly see the pod stay in …
Hi folks, I have just posted a draft Request For Comments on … The Karpenter + EBS CSI Driver teams will hopefully decide which solutions we are moving forward with via this RFC.
You will be able to increase the EBS CSI Driver node pod's … Thanks to @ElijahQuinones for #2060.
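Once #2060 ships, the grace period should be settable through chart/add-on configuration rather than by patching the DaemonSet. A sketch, assuming the value lands under the node section as node.terminationGracePeriodSeconds (check the released chart for the final key name and default):

```bash
helm upgrade aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system --reuse-values \
  --set node.terminationGracePeriodSeconds=300
```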
At the moment, we cannot add a sleep pre-stop hook to …
@jmdeal FYI, we may be able to avoid needing to add a sleep pre-stop hook to … This would increase the robustness of our pre-stop lifecycle hook. Our team will look into this. Thanks to Connor for mentioning this feature.
I was made aware of this issue and that we have some workloads (StatefulSets with PVs) hitting this with Karpenter v0.37, EKS 1.28, ebs-csi v1.32.0. The ideal behavior would be for the ebs-csi controller to detect that the node was de-provisioned and force-detach any remaining volumes that escaped cleanup by the DaemonSet driver pod. Most of the suggestions on this issue are "hacks" or workarounds that try to ensure the DaemonSet driver pod can always succeed at cleanup. This seems like a mistake to me, as the controller should have the responsibility to clean up volumes if the node is permanently destroyed. However, when I started looking into how this might be implemented, I discovered this is a gap in the CSI spec: container-storage-interface/spec#512 Has AWS considered reaching out to the SIG Node folks or the authors of that original issue? There seems to have been a push in late 2022 to fix the issue that did not cross the finish line: container-storage-interface/spec#512 (comment) & container-storage-interface/spec#477
EDIT: I'm not 100% certain this was related to Karpenter. We found that another operator in our cluster might have been the culprit here: it created a PodDisruptionBudget that meant a node running the operator's pod was never allowed to be killed, so please disregard this.
We run …
I'm also facing this same issue when shifting my STS workloads from one node to another. It takes almost 6 minutes to resolve the Multi-Attach volume error. To reduce this time, I manually deleted the VolumeAttachments of the respective PVC used in the STS, which then allows the volume to attach to the new node without delay. Is there a more effective way to address this issue when using Karpenter as the node autoscaler?
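For anyone reaching for the same stop-gap, the manual workaround described above looks roughly like this (a band-aid rather than a fix; object names are placeholders):

```bash
# Find the stale attachment still pointing at the old node.
kubectl get volumeattachments | grep <pv-name>
# Deleting it lets the attach/detach controller attach the volume to the new node
# without waiting out the 6-minute force-detach timeout.
kubectl delete volumeattachment <volumeattachment-name>
```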
I can confirm this issue is fixed with the new release of Karpenter v1.0.0, which included the following change: kubernetes-sigs/karpenter#1294. There are no 6+ or 2+ minute delays for disrupted stateful workloads when they start on a new node! Possible exceptions in v1.0.0:
You can validate that this fix works by performing the following actions:
no-stateful-disruption-delay.yaml
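The attached manifest is collapsed in this rendering. Purely as an illustration of the kind of validation workload involved (not the original file), a minimal sketch might look like the following, assuming an EBS-backed StorageClass named ebs-sc exists in the cluster; apply it, let the pod land on a Karpenter node, then disrupt that node and time how long the replacement pod takes to become Ready:

```bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stateful-test
spec:
  serviceName: stateful-test
  replicas: 1
  selector:
    matchLabels: {app: stateful-test}
  template:
    metadata:
      labels: {app: stateful-test}
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:stable
          command: ["sh", "-c", "while true; do sleep 3600; done"]
          volumeMounts:
            - {name: data, mountPath: /data}
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc   # assumption: an EBS StorageClass with this name exists
        resources:
          requests:
            storage: 1Gi
EOF
```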
Big thanks to @jmdeal and @jonathan-innis for getting this across the finish line for v1.0.0! CC @primeroz @iamhritik @seizethedave @johnjeffers @levanlongktmt @youwalther65 @StepanS-Enverus
@cnmcavoy let me sync with you on the Kubernetes Slack channel about this early next week. I agree that an optional force-detach-orphaned-volumes mode would be a preferable long-term solution to auto-scalers like Karpenter being aware of this issue. Traditionally EBS has been very against any type of actual EBS … Alternatively, we have floated the idea of a clearer SIG Node definition of node draining that would be aware of VolumeAttachment objects. Sorry for the delay on this response; I've been prioritizing the short-term fix. Thank you for your patience.
@AndrewSirenko amazing 😍 let me try
/close

Quite a few customers have reported that upgrading to Karpenter v1.0.0 resolves their issues. I have updated our FAQ with this information. If you run into further delayed-attachment issues, please open a new issue (because this one now contains out-of-date information). Thank you!
@AndrewSirenko: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/kind bug
What happened?
As discussed in #1665, @torredil said it's fixed in v1.27 (#1665 (comment)), but we still have the problem with v1.28.
What you expected to happen?
How to reproduce it (as minimally and precisely as possible)?
Multi-Attach error for volume "pvc-xxxxx-xxxxx-xxx" Volume is already exclusively attached to one node and can't be attached to another
ebs-csi-controller
Anything else we need to know?:
I set up the CSI driver using the EKS add-on.
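For completeness, installing the driver as the managed add-on looks roughly like this (cluster name and role ARN are placeholders; the version shown is the one from the environment details below):

```bash
aws eks create-addon \
  --cluster-name <cluster-name> \
  --addon-name aws-ebs-csi-driver \
  --addon-version v1.28.0-eksbuild.1 \
  --service-account-role-arn <ebs-csi-controller-irsa-role-arn>
```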
Environment
- Kubernetes version (use `kubectl version`): v1.28.0-eksbuild.1