
Karpenter Node Provisioning Blocked by Pending Deployment with Resource Mismatch #7578

Open
SSajjadHH opened this issue Jan 9, 2025 · 0 comments
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

Description

Observed Behavior:

Recently, we observed unexpected behavior with Karpenter. A deployment had resource requirements that didn't match any of the defined node pools, causing the associated pods to remain in a pending state for an extended period. During this time, Karpenter logs were filled with errors like the following:

{
   "level":"ERROR",
   "time":"2025-01-06T17:28:06.957Z",
   "logger":"controller",
   "message":"could not schedule pod",
   "commit":"a2875e3",
   "controller":"provisioner",
   "namespace":"",
   "name":"",
   "reconcileID":"3097aa7a-aea6-4b11-9484-fb71c6b160f3",
   "Pod":{
      "name":"someservice-7dc7dbb7df-ffdfc",
      "namespace":"someservice"
   },
   "error":"incompatible with nodepool \"karpenter-default-prometheus-13\", daemonset overhead={\"cpu\":\"370m\",\"memory\":\"576Mi\",\"pods\":\"8\"}, did not tolerate platform.ourorg.com/nodetype=prometheus:NoSchedule; incompatible with nodepool \"karpenter-default-al2023-13\", daemonset overhead={\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"10\"}, no instance type satisfied resources {\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"11\"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-size NotIn [24xlarge 32xlarge 48xlarge large medium and 5 others], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [karpenter-default-al2023-13], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], platform.ourorg.com/nodetype In [al2023-default], topology.istio.io/subzone In [devcluster], topology.kubernetes.io/zone In [us-west-2a us-west-2b us-west-2c] (no instance type met all requirements)",
   "errorCauses":[
      {
         "error":"incompatible with nodepool \"karpenter-default-prometheus-13\", daemonset overhead={\"cpu\":\"370m\",\"memory\":\"576Mi\",\"pods\":\"8\"}, did not tolerate platform.ourorg.com/nodetype=prometheus:NoSchedule"
      },
      {
         "error":"incompatible with nodepool \"karpenter-default-al2023-13\", daemonset overhead={\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"10\"}, no instance type satisfied resources {\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"11\"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-size NotIn [24xlarge 32xlarge 48xlarge large medium and 5 others], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [karpenter-default-al2023-13], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], platform.ourorg.com/nodetype In [al2023-default], topology.istio.io/subzone In [devcluster], topology.kubernetes.io/zone In [us-west-2a us-west-2b us-west-2c] (no instance type met all requirements)"
      }
   ]
}

As a result, the total number of nodes in the cluster grew significantly, reaching about 124 (highly unusual for our workload), most of them severely underutilized. Our understanding is that the pending pods and the associated errors caused Karpenter to keep provisioning new nodes, which led to inefficient resource usage and unnecessary cost.

Steps Taken: To mitigate the situation, we scaled the affected deployment down to 0 replicas, which removed the pending pods. This had an immediate effect: the errors stopped appearing in the logs, and Karpenter started deleting and consolidating nodes. Eventually, the node count dropped to around 60 with healthy utilization.
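For reference, the mitigation was a plain scale-down of the deployment; the deployment and namespace names below are the anonymized ones from the log above:

kubectl scale deployment/someservice --replicas=0 -n someservice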

Expected Behavior:
Deployments with resource requirements that do not match any node pool should be allowed to remain in a pending state without causing issues with node provisioning for other workloads. We would expect Karpenter to handle resource mismatches without triggering unnecessary node provisioning. A pending deployment should not block Karpenter from consolidating existing nodes or provisioning nodes for other workloads.

Actual Behavior: The pending deployment blocked Karpenter's node consolidation process and forced Karpenter to provision new nodes for incoming workloads, even when existing capacity should have been sufficient.

Reproduction Steps (Please include YAML):
Create a deployment whose resource requests cannot be satisfied by any of the defined NodePools, so that its pods remain pending indefinitely; a minimal sketch follows.
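The following is a minimal, hypothetical example (names, image, and resource figures are illustrative, not taken from our cluster). The requests are deliberately larger than anything the NodePool requirements shown in the log allow, so no instance type can satisfy them and the pods stay pending:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: unschedulable-demo        # hypothetical name
  namespace: someservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: unschedulable-demo
  template:
    metadata:
      labels:
        app: unschedulable-demo
    spec:
      containers:
        - name: app
          image: busybox:1.36     # any image works; the pod never schedules
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "512"          # far beyond any instance type permitted by the NodePools
              memory: 4Ti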

Versions:

  • Chart Version: 1.0.8
  • EKS Server Version: v1.25.16

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment