
Karpenter Node Provisioning Blocked by Pending Deployment with Resource Mismatch #7578

Open
SSajjadHH opened this issue Jan 9, 2025 · 0 comments
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

Description

Observed Behavior:

Recently, we observed unexpected behavior with Karpenter. A deployment had resource requirements that didn't match any of the defined node pools, causing the associated pods to remain in a pending state for an extended period. During this time, Karpenter logs were filled with errors like the following:

{
   "level":"ERROR",
   "time":"2025-01-06T17:28:06.957Z",
   "logger":"controller",
   "message":"could not schedule pod",
   "commit":"a2875e3",
   "controller":"provisioner",
   "namespace":"",
   "name":"",
   "reconcileID":"3097aa7a-aea6-4b11-9484-fb71c6b160f3",
   "Pod":{
      "name":"someservice-7dc7dbb7df-ffdfc",
      "namespace":"someservice"
   },
   "error":"incompatible with nodepool \"karpenter-default-prometheus-13\", daemonset overhead={\"cpu\":\"370m\",\"memory\":\"576Mi\",\"pods\":\"8\"}, did not tolerate platform.ourorg.com/nodetype=prometheus:NoSchedule; incompatible with nodepool \"karpenter-default-al2023-13\", daemonset overhead={\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"10\"}, no instance type satisfied resources {\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"11\"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-size NotIn [24xlarge 32xlarge 48xlarge large medium and 5 others], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [karpenter-default-al2023-13], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], platform.ourorg.com/nodetype In [al2023-default], topology.istio.io/subzone In [devcluster], topology.kubernetes.io/zone In [us-west-2a us-west-2b us-west-2c] (no instance type met all requirements)",
   "errorCauses":[
      {
         "error":"incompatible with nodepool \"karpenter-default-prometheus-13\", daemonset overhead={\"cpu\":\"370m\",\"memory\":\"576Mi\",\"pods\":\"8\"}, did not tolerate platform.ourorg.com/nodetype=prometheus:NoSchedule"
      },
      {
         "error":"incompatible with nodepool \"karpenter-default-al2023-13\", daemonset overhead={\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"10\"}, no instance type satisfied resources {\"cpu\":\"570m\",\"memory\":\"1088Mi\",\"pods\":\"11\"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-hypervisor In [nitro], karpenter.k8s.aws/instance-size NotIn [24xlarge 32xlarge 48xlarge large medium and 5 others], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [karpenter-default-al2023-13], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], platform.ourorg.com/nodetype In [al2023-default], topology.istio.io/subzone In [devcluster], topology.kubernetes.io/zone In [us-west-2a us-west-2b us-west-2c] (no instance type met all requirements)"
      }
   ]
}

As a result, the total number of nodes in the cluster grew significantly, reaching about 124 (highly unusual for our workload), most of them severely underutilized. Our understanding is that the pending pods and the associated errors caused Karpenter to keep provisioning new nodes, which led to inefficient resource usage and unnecessary cost.

Steps Taken: To mitigate the situation, we scaled the affected deployment down to 0 replicas, which removed the pending pods. This had an immediate effect: the errors stopped appearing in the logs, and Karpenter started deleting and consolidating nodes. Eventually, the node count dropped to around 60 with healthy utilization.
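For reference, the mitigation was a plain scale-down of the deployment; the deployment and namespace names below are the anonymized ones from the log above:

kubectl scale deployment/someservice --replicas=0 -n someservice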

Expected Behavior:
Deployments with resource requirements that do not match any node pool should be allowed to remain in a pending state without causing issues with node provisioning for other workloads. We would expect Karpenter to handle resource mismatches without triggering unnecessary node provisioning. A pending deployment should not block Karpenter from consolidating existing nodes or provisioning nodes for other workloads.

Actual Behavior: The pending deployment blocked Karpenter's node consolidation process and forced Karpenter to provision new nodes for incoming workloads, even when existing capacity should have been sufficient.

Reproduction Steps (Please include YAML):
Create a deployment whose resource requests cannot be satisfied by any of the defined NodePools, so that its pods remain pending indefinitely; a minimal sketch follows.
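The following is a minimal, hypothetical example (names, image, and resource figures are illustrative, not taken from our cluster). The requests are deliberately larger than anything the NodePool requirements shown in the log allow, so no instance type can satisfy them and the pods stay pending:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: unschedulable-demo        # hypothetical name
  namespace: someservice
spec:
  replicas: 1
  selector:
    matchLabels:
      app: unschedulable-demo
  template:
    metadata:
      labels:
        app: unschedulable-demo
    spec:
      containers:
        - name: app
          image: busybox:1.36     # any image works; the pod never schedules
          command: ["sleep", "infinity"]
          resources:
            requests:
              cpu: "512"          # far beyond any instance type permitted by the NodePools
              memory: 4Ti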

Versions:

  • Chart Version: 1.0.8
  • EKS Server Version: v1.25.16

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment