We have observed that nodes running the AWS EFS CSI driver occasionally fail to clear the efs.csi.aws.com/agent-not-ready:NoExecute startup taint due to a transient driver failure (we are not sure why; it happens rarely). In such cases, the pod never gets scheduled onto the new node because of the taint. Karpenter neither provisions a replacement node nor deprovisions the stuck node, believing it is fit for scheduling.
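For reference, this is roughly how we confirm a node is stuck with the startup taint and inspect the EFS CSI node pod on it. This is a minimal sketch: the app=efs-csi-node label and efs-plugin container name are from a default Helm install and may differ in other setups.

```sh
# List every node together with its taints; a stuck node still shows
# efs.csi.aws.com/agent-not-ready long after registration.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'

# Find the EFS CSI node pod on the affected node and check its logs.
# (Label and container names assume the default aws-efs-csi-driver install.)
NODE=ip-10-64-47-242.ap-southeast-1.compute.internal
POD=$(kubectl -n kube-system get pods -l app=efs-csi-node \
  --field-selector spec.nodeName="$NODE" -o name)
kubectl -n kube-system logs "$POD" -c efs-plugin --tail=100
```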
Here is a sequence of events:
2025-01-14T01:17:15Z [Warning] 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
2025-01-14T01:17:16Z [Normal] Pod should schedule on: nodeclaim/lab-cpu-lm7x6
2025-01-14T01:22:36Z [Warning] 0/4 nodes are available: 1 node(s) had untolerated taint {efs.csi.aws.com/agent-not-ready: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
2025-01-14T01:25:46Z [Normal] Pod should schedule on: nodeclaim/lab-cpu-lm7x6, node/ip-10-64-47-242.ap-southeast-1.compute.internal
2025-01-14T01:26:30Z [Warning] skip schedule deleting pod: REDACTED
Logs from the Karpenter pod (contiguous block):
{"level":"INFO","time":"2025-01-14T01:17:16.792Z","logger":"controller","message":"found provisionable pod(s)","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","Pods":"REDACTED","duration":"96.585701ms"}
{"level":"INFO","time":"2025-01-14T01:17:16.792Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2025-01-14T01:17:16.807Z","logger":"controller","message":"created nodeclaim","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","NodePool":{"name":"lab-cpu"},"NodeClaim":{"name":"lab-cpu-lm7x6"},"requests":{"cpu":"3770m","memory":"26864Mi","pods":"6"},"instance-types":"m7i-flex.2xlarge, m7i-flex.4xlarge, m7i-flex.8xlarge"}
{"level":"INFO","time":"2025-01-14T01:17:18.726Z","logger":"controller","message":"launched nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"a93f070a-159e-4056-9161-36fe5d8ed913","provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2","instance-type":"m7i-flex.2xlarge","zone":"ap-southeast-1a","capacity-type":"on-demand","allocatable":{"cpu":"7910m","ephemeral-storage":"89Gi","memory":"29317Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2025-01-14T01:17:38.107Z","logger":"controller","message":"registered nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"b8918fbe-0fb0-4e98-acce-93a705a40d41","provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"}}
{"level":"INFO","time":"2025-01-14T01:47:30.925Z","logger":"controller","message":"tainted node","commit":"3298d91","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"namespace":"","name":"ip-10-64-47-242.ap-southeast-1.compute.internal","reconcileID":"b8caa83e-b903-402e-898e-fb3e3166b82a","taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}
{"level":"INFO","time":"2025-01-14T01:47:31.982Z","logger":"controller","message":"deleted node","commit":"3298d91","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"namespace":"","name":"ip-10-64-47-242.ap-southeast-1.compute.internal","reconcileID":"b343231e-b57c-444c-85c3-76279475975e"}
{"level":"INFO","time":"2025-01-14T01:47:32.219Z","logger":"controller","message":"deleted nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"eacb1ad5-f4ff-4563-9180-2b07558c7f68","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2"}
It seems Karpenter does not detect that pod placement has failed, so it does not create a new nodeclaim, and the pod remains earmarked for scheduling onto this node. The node eventually got disrupted and deprovisioned after about 30 minutes (something else was probably scheduled on it; I'm still going through the logs).
Edit: the node was manually deleted in the EC2 console by a team member. According to the kube-scheduler logs, nothing else was scheduled on it, so it seems Karpenter did not disrupt the node once it was empty, despite the nodepool being configured like this:
Given that node initialization can fail transiently, is there a way to deprovision nodes that do not clear a startup taint within some time window, so that Karpenter can try to provision a suitable node again?
Is there a way to troubleshoot this sort of situation further should it happen again? The ad hoc checks we ran are sketched below.
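For context, these are the kinds of ad hoc checks we ran (or would run next time); if we understand correctly, the nodeclaim's Initialized condition should remain False while a startup taint is still present. We are hoping there is a more systematic approach.

```sh
# Inspect the nodeclaim Karpenter created for the pod and its status conditions.
kubectl get nodeclaims
kubectl describe nodeclaim lab-cpu-lm7x6

# Confirm which taints are still on the node.
kubectl describe node ip-10-64-47-242.ap-southeast-1.compute.internal | grep -A5 Taints

# Check why the pod is still pending.
kubectl get events --field-selector involvedObject.kind=Pod,reason=FailedScheduling
```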