From 61ec86fac0220eda2e2ca7f8f9f2b81da3e41a6f Mon Sep 17 00:00:00 2001 From: Morten Torkildsen Date: Fri, 30 May 2025 21:23:26 +0000 Subject: [PATCH 1/5] KEP-5194: Initial version for DRA ReservedFor Workloads --- keps/prod-readiness/sig-scheduling/5194.yaml | 3 + .../5194-reserved-for-workloads/README.md | 796 ++++++++++++++++++ .../5194-reserved-for-workloads/kep.yaml | 50 ++ 3 files changed, 849 insertions(+) create mode 100644 keps/prod-readiness/sig-scheduling/5194.yaml create mode 100644 keps/sig-scheduling/5194-reserved-for-workloads/README.md create mode 100644 keps/sig-scheduling/5194-reserved-for-workloads/kep.yaml diff --git a/keps/prod-readiness/sig-scheduling/5194.yaml b/keps/prod-readiness/sig-scheduling/5194.yaml new file mode 100644 index 00000000000..7654598b151 --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5194.yaml @@ -0,0 +1,3 @@ +kep-number: 5194 +alpha: + approver: "@johnbelamaric" diff --git a/keps/sig-scheduling/5194-reserved-for-workloads/README.md b/keps/sig-scheduling/5194-reserved-for-workloads/README.md new file mode 100644 index 00000000000..dbf5c9e7171 --- /dev/null +++ b/keps/sig-scheduling/5194-reserved-for-workloads/README.md @@ -0,0 +1,796 @@ + +# KEP-5194: DRA ReservedFor Workloads + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Risks and Mitigations](#risks-and-mitigations) + - [Higher memory usage by the device_taint_eviction controller](#higher-memory-usage-by-the-device_taint_eviction-controller) + - [The number of pods that can share a ResourceClaim will not be unlimited](#the-number-of-pods-that-can-share-a-resourceclaim-will-not-be-unlimited) +- [Design Details](#design-details) + - [Background](#background) + - [Deallocation](#deallocation) + - [Finding pods using a ResourceClaim](#finding-pods-using-a-resourceclaim) + - [Proposal](#proposal-1) + - [API](#api) + - [Implementation](#implementation) + - [Deallocation](#deallocation-1) + - [Finding pods using a ResourceClaim](#finding-pods-using-a-resourceclaim-1) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Increase the size limit on the ReservedFor field](#increase-the-size-limit-on-the-reservedfor-field) + - [Relax validation without API changes](#relax-validation-without-api-changes) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [x] (R) Production readiness review completed +- [x] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +One of the features of Dynamic Resource Allocation is that multiple pods can +share a single ResourceClaim, which means they also share the allocated devices. This +enables several important use-cases. However, currently the number +of pods that can share a single ResourceClaim is limited to 256. We have concrete +use-cases that require that thousands of pods share a single ResourceClaim. With this +KEP, the hard limit on the number of pods will be removed. + +## Motivation + +Training workloads that uses TPUs can be very large, requiring over 9,000 +TPUs for a single training job. The number of TPUs for each node is usually 4, meaning +that the job will run across more than 2,000 nodes. Due to topology constraints, TPU slices +are usually modelled in DRA as multi-host devices, meaning that a single DRA device +can represent thousands of TPUs. As a result, all pods running the workload will +therefore share a single ResourceClaim. The current limit of 256 pods sharing a +ResourceClaim is therefore too low. + +### Goals + +- Enable ResourceClaims to be shared by any number of pods. + +### Non-Goals + + +## Proposal + +Rather than expecting the `ReservedFor` field to contain an exhaustive list of +all pods using the ResourceClaim, we propose letting the controller managing +a ResourceClaim specify the reference to the resource consuming the claim +in the spec. This will then be used as the consumer of the ResourceClaim, removing +the need for every pod in the workload to be listed. + +Increasing the allowed number of pods in the list was considered, but rejected +for two primary reasons: +* The size of AI workloads are getting larger so it is hard to come up with a new + threshold +* Having a list with thousands of entries is neither a good nor scalable solution. + +The `ReservedFor` list already accepts generic resource references, so this +field doesn't need to be changed. 
However, we are proposing adding two new +fields to the `ResourceClaim` type: +* `spec.ReservedFor` which allows the creator of a `ResourceClaim` to specify in + the spec which resource is the consumer of the `ResourceClaim`. When the first pod + referencing the `ResourceClaim` is scheduled, the reference will be copied into + the `status.ReservedFor` list. +* `status.allocation.ReservedForAnyPod` which will be set to `true` by the DRA + scheduler plugin at allocation time when the `spec.ReservedFor` field is copied + into the `status.ReservedFor` list. If `status.allocation.ReservedForAnyPod` is + set to `true`, the kubelet will skip the check that requires pods to be listed as + a consumer of the claim when starting the pod. + +### Risks and Mitigations + +#### Higher memory usage by the device_taint_eviction controller +The device_taint_eviction controller will need to keep an index of which pods are +referenced from each ResourceClaim, so it can evict the correct pods when devices +are tainted. This will require some additional memory. + +#### The number of pods that can share a ResourceClaim will not be unlimited +Removing this limit does not mean that the number of pods that can share a ResourceClaim +will be unlimited. As part of the +[scale testing effort for DRA](https://github.com/kubernetes/kubernetes/issues/131198), +we will test the scalability of the number of pods sharing a ResourceClaim so we can +provide guidance as to what is a safe number. + +## Design Details + +### Background + +The `ReservedFor` field on the `ResourceClaimStatus` is currently used for two purposes: + +#### Deallocation +Devices are allocated to `ResourceClaim`s when the first pod referencing the claim is +scheduled. Other pods can also share the `ResourceClaim` in which case they share the +devices. Once no pods are consuming the claim, the devices should be deallocated to they +can be allocted to other claims. The `ReservedFor` list is used to keep track of pods +consuming a `ResourceClaim`. Pods are added to the list by the DRA scheduler plugin +during scheduling and removed from the list by the resourceclaim controller when pods are +deleted or finish running. An empty list means there are no current consumers of the claim +and it can be deallocated. + +#### Finding pods using a ResourceClaim +It is used by the DRA scheduler plugin, the kubelet, and the device_taint_eviction +controller to find pods that are using the ResourceClaim: + +1. The kubelet uses this to make sure it only runs pods that where the claims have been allocated + to the pod. It can verify this by checking that the Pod is listed in the `ReservedFor` list. + +1. The DRA scheduler plugin uses the list to find claims that have zero or only + a single pod using it, and is therefore a candidate for deallocation in the `PostFilter` function. + +1. The device_taint_eviction controller uses the `ReservedFor` list to find the pods that needs to be evicted + when one or more of the devices allocated to a ResourceClaim is tainted (and the ResourceClaim + does not have a toleration). + +So the solution needs to: +* Give the resourceclaim controller a way to know when there are no more consumers of a ResourceClaim so + it can be deallocated. +* Give controllers a way to list the pods consuming or referencing a ResourceClaim. 
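To illustrate the second requirement, a controller can recover the set of consuming
pods from the pods themselves rather than from the `ReservedFor` list, for example by
indexing the pod informer cache. The sketch below is illustrative only (the helper
names `RegisterClaimIndex` and `PodsForClaim` are not part of this proposal); it
assumes the resolved claim names are available in the pod's
`status.resourceClaimStatuses`:

```go
package claimindex

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

const podsByClaim = "pods-by-resourceclaim"

// RegisterClaimIndex adds an index from "<namespace>/<claim name>" to the pods
// whose status records that ResourceClaim. Claims generated from templates are
// covered because status.resourceClaimStatuses holds the resolved claim names.
// Must be called before the informer factory is started.
func RegisterClaimIndex(factory informers.SharedInformerFactory) (cache.Indexer, error) {
	podInformer := factory.Core().V1().Pods().Informer()
	err := podInformer.AddIndexers(cache.Indexers{
		podsByClaim: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok {
				return nil, nil
			}
			var keys []string
			for _, cs := range pod.Status.ResourceClaimStatuses {
				if cs.ResourceClaimName != nil {
					keys = append(keys, pod.Namespace+"/"+*cs.ResourceClaimName)
				}
			}
			return keys, nil
		},
	})
	return podInformer.GetIndexer(), err
}

// PodsForClaim returns the pods currently referencing the given ResourceClaim.
func PodsForClaim(indexer cache.Indexer, namespace, claimName string) []*v1.Pod {
	objs, err := indexer.ByIndex(podsByClaim, namespace+"/"+claimName)
	if err != nil {
		return nil
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods
}
```

A controller such as device_taint_eviction or the DRA scheduler plugin could use an
index of this shape instead of relying on an exhaustive `ReservedFor` list.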
+ +### Proposal + +#### API + +The exact set of proposed API changes can be seen below (`...` is used in places where new fields +are added to existing types): + +```go +// ResourceClaimSpec defines what is being requested in a ResourceClaim and how to configure it. +type ResourceClaimSpec struct { + ... + + // ReservedFor specifies the resource that will be consuming the claim. If set, the + // reference will be copied into the status.ReservedFor list when the claim is allocated. + // + // When this field is set it is the responsibility of the entity that created the + // ResourceClaim to remove the reference from the status.ReservedFor list when there + // are no longer any pods consuming the claim. + // + // +featureGate=DRAReservedForWorkloads + // +optional + ReservedFor *ResourceClaimConsumerReference +} + +// AllocationResult contains attributes of an allocated resource. +type AllocationResult struct { + ... + + // ReservedForAnyPod specifies whether the ResourceClaim can be used by + // any pod referencing it. If set to true, the kubelet will not check whether + // the pod is listed in the staus.ReservedFor list before running the pod. + // + // +featureGate=DRAReservedForWorkloads + // +optional + ReservedForAnyPod *bool +} +``` + +The `ResourceClaimConsumerReference type already exists: + +```go +// ResourceClaimConsumerReference contains enough information to let you +// locate the consumer of a ResourceClaim. The user must be a resource in the same +// namespace as the ResourceClaim. +type ResourceClaimConsumerReference struct { + // APIGroup is the group for the resource being referenced. It is + // empty for the core API. This matches the group in the APIVersion + // that is used when creating the resources. + // +optional + APIGroup string + // Resource is the type of resource being referenced, for example "pods". + // +required + Resource string + // Name is the name of resource being referenced. + // +required + Name string + // UID identifies exactly one incarnation of the resource. + // +required + UID types.UID +} +``` + +#### Implementation +Whenever the scheduler (i.e. the DRA scheduler plugin) tries to schedule a pod that +references a `ResourceClaim` with an empty `status.ReservedFor` list, it knows that this +is the first pod that will be consuming the claim. + +If the `spec.ReservedFor` field in the ResourceClaim is not set, the scheduler will handle +the `ResourceClaim` in the same was as now, and will add the `Pod` to the `ReservedFor` list +if devices could be allocated for the claim. Any additional pods that reference the `ResourceClaim` +will also be added to the list. + +If the `spec.ReservedFor` field is set, the scheduler will copy this reference to the +`ReservedFor` list, rather than adding a reference to the `Pod`. It will also update the +`status.allocation.ReservedForAnyPod` field to `true`. When any other pods referencing +the `ResourceClaim` is scheduled and the scheduler sees a non-Pod reference in the `ReservedFor` +list, it will not add a reference to the pod. + +##### Deallocation +The resourceclaim controller will remove Pod references from the `ReservedFor` list just +like it does now, but it will never remove references to non-Pod resources. Instead, it +will be the responsibility of the controller/user that created the `ResourceClaim` to +remove the reference to the non-Pod resource from the `ReservedFor` list when no pods +are consuming the `ResourceClaim` and no new pods will be created that references +the `ResourceClaim`. 
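As a rough illustration of that responsibility, the workload controller's cleanup step
could look like the following sketch. It is not a prescribed implementation; the API
group/version (`resource.k8s.io/v1beta1` here) and the helper name `releaseClaim` are
assumptions made for the example:

```go
package workloadcleanup

import (
	"context"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// releaseClaim removes the workload's own entry from status.reservedFor once no
// pods consume the claim and no new pods will be created. When the list becomes
// empty, the resourceclaim controller can deallocate the devices.
func releaseClaim(ctx context.Context, client kubernetes.Interface, claim *resourceapi.ResourceClaim, workloadUID types.UID) error {
	// Work on a copy in case the object came from an informer cache.
	claim = claim.DeepCopy()
	var kept []resourceapi.ResourceClaimConsumerReference
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID {
			kept = append(kept, ref)
		}
	}
	claim.Status.ReservedFor = kept
	// Requires permission to update the resourceclaims/status subresource.
	_, err := client.ResourceV1beta1().ResourceClaims(claim.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```

A production controller would also need to retry on update conflicts; that is omitted
here for brevity.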
+ +The resourceclaim controller will then discover that the `ReservedFor` list is empty +and therefore know that it is safe to deallocate the `ResourceClaim`. + +This requires that the controller/user has permissions to update the status +subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if +the resource referenced in the `ReservedFor` list has been deleted from the cluster, but +that requires that the controller has permissions to get or list resources of the type. If the +resourceclaim controller is not able to check, it will just wait until the reference in +the `ReservedFor` list is removed. + +##### Finding pods using a ResourceClaim +If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer +use the list to find all pods consuming the `ResourceClaim`. Instead they will look up all +pods referencing the `ResourceClaim`, which can be done by using a watch on Pods and maintaining +an index of `ResourceClaim` to pods referencing it. This can be done using the informer cache. + +The list of pods referencing a `ResourceClaim` is not exactly the same as the list of pods +consuming a `ResourceClaim` as specified in the `ReservedFor` list. References to pods in the +`ReservedFor` list only contains pods that have been processed by the DRA scheduler plugin and +is scheduled to use the `ResourceClaim`. It is possible to have pods that reference `ResourceClaim`, +but haven't yet been scheduled. This distinction is important for some of the usages of the +`ReservedFor` list described above: + +1. If the kubelet sees that the `status.allocation.ReservedForAnyPod` is set, it will skip + the check that the Pod is listed in the `ReservedFor` list and just run the pod. + +1. If the DRA scheduler plugin is trying to find candidates for deallocation in + the `PostFilter` function and sees a `ResourceClaim` with a non-Pod reference, it will not + attempt to deallocate. The plugin has no way to know how many pods are actually consuming + the `ResourceClaim` without the explit list in the `ReservedFor` list and therefore it will + not be safe to deallocate. + +1. The device_taint_eviction controller will use the list of pods referencing the `ResourceClaim` + to determine the list of pods that needs to be evicted. In this situation, it is ok if the + list includes pods that haven't yet been scheduled. + + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +None + +##### Unit tests + + + + +- `k8s.io/kubernetes/pkg/controller/devicetainteviction`: `06/05/2025` - 89.9% +- `k8s.io/kubernetes/pkg/controller/resourceclaim`: `06/05/2025` - 74.2% +- `k8s.io/kubernetes/pkg/kubelet/cm/dra`: `06/05/2025` - 79.4% +- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources`: `06/05/2025` - 79.3% + +##### Integration tests + +Scheduler perf tests will be added to assess the performance impact of this change. + +##### e2e tests + +Additional e2e tests will be added to verify the behavior added in this KEP. 
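The core behavior these tests need to exercise is the consumer check described under
"Finding pods using a ResourceClaim" above. As a simplified, self-contained sketch of
that decision (the types below are local stand-ins for the real and proposed API
fields, not the actual kubelet code):

```go
package main

import "fmt"

// consumerReference is a local stand-in for ResourceClaimConsumerReference.
type consumerReference struct {
	Resource string
	UID      string
}

// claimStatus is a local stand-in for the relevant parts of ResourceClaimStatus.
type claimStatus struct {
	Allocated         bool
	ReservedForAnyPod bool                // proposed status.allocation.reservedForAnyPod
	ReservedFor       []consumerReference // existing status.reservedFor
}

// podMayUseClaim mirrors the kubelet's admission decision: start the pod only if
// the claim is allocated and either the proposed "any pod" marker is set or the
// pod itself is listed as a consumer.
func podMayUseClaim(status claimStatus, podUID string) bool {
	if !status.Allocated {
		return false
	}
	if status.ReservedForAnyPod {
		return true
	}
	for _, ref := range status.ReservedFor {
		if ref.Resource == "pods" && ref.UID == podUID {
			return true
		}
	}
	return false
}

func main() {
	shared := claimStatus{Allocated: true, ReservedForAnyPod: true}
	fmt.Println(podMayUseClaim(shared, "1be56246")) // true: any pod referencing the claim may run
}
```

The e2e tests can then cover both branches: claims reserved for a workload resource
and claims reserved for individual pods.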
+ +### Graduation Criteria + +#### Alpha + +- Feature implemented behind a feature flag +- Initial e2e tests completed and enabled + +#### Beta + +- Gather feedback from developers and surveys +- Additional tests are in Testgrid and linked in KEP +- Performance impact of the feature has been measured and found to be acceptable + +#### GA + +- 3 examples of real-world usage +- Allowing time for feedback +- [conformance tests] + +[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md + + +### Upgrade / Downgrade Strategy + +The feature will no longer work if downgrading to a release without support for it. +The API server will no longer accept the new fields and the other components will +not know what to do with them. So the result is that the `ReservedFor` list will only +have references to pod resources like today. + +### Version Skew Strategy + +If the kubelet is on a version that doesn't support the feature but the rest of the +components are, workloads will be scheduled, but the kubelet will refuse to run it +since it will still check whether the `Pod` is references in the `ReservedFor` list. + +If the API server is on a version that supports the feature, but the scheduler +is not, the scheduler will not know about the new fields added, so it will +put the reference to the `Pod` in the `ReservedFor` list rather than the reference +in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but +it will be subject to the 256 limit on the size of the `ReservedFor` list and the +controller creating the `ResourceClaim` will not find the reference it expects +in the `ReservedFor` list when it tries to remove it. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: DRAReservedForWorkloads + - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler + - kube-controller-manager + - kubelet + +###### Does enabling the feature change any default behavior? + +No + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. Applications that were already running will continue to run and the allocated +devices will remain so. + +###### What happens if we reenable the feature if it was previously rolled back? + +It will take affect again and will impact how the `ReservedFor` field is used during allocation +and deallocation. + +###### Are there any tests for feature enablement/disablement? + +This will be covered through unit tests for the apiserver, scheduler, resourceclaim controller and +kubelet. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? 
+ + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + +It does require that: +- The device_taint_eviction controller watches Pods + +###### Will enabling / using this feature result in introducing new API types? + +No + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +No + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +It might require some additional memory usage in the resourceclaim controller since it will need to keep an index +of ResourceClaim to Pods. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + +- 1.34: first KEP revision + +## Drawbacks + +This complicates the allocation and deallocation logic somewhat as there will be two +separate ways to manage the allocation and deallocation process for ResourceClaims. + +It also leads to additional work for the device_taint_eviction controller since it needs +to maintain an index to find all pods using a ResourceClaim rather than just looking at +the list of pods in the `ReservedFor` list. + +## Alternatives + +### Increase the size limit on the ReservedFor field +The simplest solution here would be to just increase the size limit on the +`ReservedFor` field to a larger number. But having a large list of pod references +is not a good way to handle it and could at least in theory run into the size limit +of Kubernetes resources. Also, we would need to have some limit on the size, and whatever +number we choose it might still be too small for the largest workloads. + +### Relax validation without API changes +The current proposal adds explicit support for non-pod references in the `ReservedFor` list +by adding the new `spec. An alternative is to let the workload controller be responsible for +not only removing the reference in the `ReservedFor` list when there are no longer any pods +consuming the `ResourceClaim`, but also adding the reference after creating the `ResourceClaim`. +This will require that the validation is relaxed to allow entries in the `ReservedFor` list +without any allocation. 
This would also require that the Kubelet checks for non-Pod references +in the `ReservedFor` list and skips the check before running pods if it finds any. + +This isn't all that different than the proposed solution, but the solution described above +was considered superior as it makes the new feature more explicit. + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-scheduling/5194-reserved-for-workloads/kep.yaml b/keps/sig-scheduling/5194-reserved-for-workloads/kep.yaml new file mode 100644 index 00000000000..d9ff2a245ef --- /dev/null +++ b/keps/sig-scheduling/5194-reserved-for-workloads/kep.yaml @@ -0,0 +1,50 @@ +title: DRA ReservedFor Workloads +kep-number: 5194 +authors: + - "@mortent" +owning-sig: sig-scheduling +participating-sigs: + - sig-autoscaling +status: implementable +creation-date: 2025-05-29 +reviewers: + - "@pohly" + - "@johnbelamaric" +approvers: + - "@dom4ha" # SIG-Scheduling + - "@jackfrancis" # SIG-Autoscaling + - "@thockin" # API Review + +see-also: + - "/keps/sig-node/4381-dra-structured-parameters" + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.34" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.34" + beta: "v1.35" + stable: "v1.36" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: DRAReservedForWorkloads + components: + - kube-apiserver + - kube-scheduler + - kube-controller-manager + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + #- my_feature_metric From 2c98d9d75676d0b9b66c66d2e7c5d47605d16424 Mon Sep 17 00:00:00 2001 From: Morten Torkildsen Date: Sat, 7 Jun 2025 18:38:26 +0000 Subject: [PATCH 2/5] Addressed comments --- keps/prod-readiness/sig-scheduling/5194.yaml | 2 +- .../5194-reserved-for-workloads/README.md | 50 +++++++++++++------ 2 files changed, 35 insertions(+), 17 deletions(-) diff --git a/keps/prod-readiness/sig-scheduling/5194.yaml b/keps/prod-readiness/sig-scheduling/5194.yaml index 7654598b151..c6101a46039 100644 --- a/keps/prod-readiness/sig-scheduling/5194.yaml +++ b/keps/prod-readiness/sig-scheduling/5194.yaml @@ -1,3 +1,3 @@ kep-number: 5194 alpha: - approver: "@johnbelamaric" + approver: "@wojtek-t" diff --git a/keps/sig-scheduling/5194-reserved-for-workloads/README.md b/keps/sig-scheduling/5194-reserved-for-workloads/README.md index dbf5c9e7171..f1656ebd2be 100644 --- a/keps/sig-scheduling/5194-reserved-for-workloads/README.md +++ b/keps/sig-scheduling/5194-reserved-for-workloads/README.md @@ -177,7 +177,7 @@ KEP, the hard limit on the number of pods will be removed. Training workloads that uses TPUs can be very large, requiring over 9,000 TPUs for a single training job. The number of TPUs for each node is usually 4, meaning that the job will run across more than 2,000 nodes. Due to topology constraints, TPU slices -are usually modelled in DRA as multi-host devices, meaning that a single DRA device +are usually modeled in DRA as multi-host devices, meaning that a single DRA device can represent thousands of TPUs. 
As a result, all pods running the workload will therefore share a single ResourceClaim. The current limit of 256 pods sharing a ResourceClaim is therefore too low. @@ -237,7 +237,7 @@ provide guidance as to what is a safe number. The `ReservedFor` field on the `ResourceClaimStatus` is currently used for two purposes: #### Deallocation -Devices are allocated to `ResourceClaim`s when the first pod referencing the claim is +Devices are allocated to a `ResourceClaim` when the first pod referencing the claim is scheduled. Other pods can also share the `ResourceClaim` in which case they share the devices. Once no pods are consuming the claim, the devices should be deallocated to they can be allocted to other claims. The `ReservedFor` list is used to keep track of pods @@ -256,7 +256,7 @@ controller to find pods that are using the ResourceClaim: 1. The DRA scheduler plugin uses the list to find claims that have zero or only a single pod using it, and is therefore a candidate for deallocation in the `PostFilter` function. -1. The device_taint_eviction controller uses the `ReservedFor` list to find the pods that needs to be evicted +1. The device_taint_eviction controller uses the `ReservedFor` list to find the pods that need to be evicted when one or more of the devices allocated to a ResourceClaim is tainted (and the ResourceClaim does not have a toleration). @@ -284,6 +284,9 @@ type ResourceClaimSpec struct { // ResourceClaim to remove the reference from the status.ReservedFor list when there // are no longer any pods consuming the claim. // + // Most user-created ResourceClaims should not set this field. It is more typically + // used by ResourceClaims created and managed by controllers. + // // +featureGate=DRAReservedForWorkloads // +optional ReservedFor *ResourceClaimConsumerReference @@ -303,7 +306,7 @@ type AllocationResult struct { } ``` -The `ResourceClaimConsumerReference type already exists: +The `ResourceClaimConsumerReference` type already exists: ```go // ResourceClaimConsumerReference contains enough information to let you @@ -345,21 +348,24 @@ list, it will not add a reference to the pod. ##### Deallocation The resourceclaim controller will remove Pod references from the `ReservedFor` list just -like it does now, but it will never remove references to non-Pod resources. Instead, it -will be the responsibility of the controller/user that created the `ResourceClaim` to -remove the reference to the non-Pod resource from the `ReservedFor` list when no pods +like it does now using the same logic. For non-Pod references, the controller will recognize +a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will +remove the reference from the list when those resources are removed. For other types, +it will be the responsibility of the workload controller/user that created the `ResourceClaim` +to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods are consuming the `ResourceClaim` and no new pods will be created that references the `ResourceClaim`. The resourceclaim controller will then discover that the `ReservedFor` list is empty and therefore know that it is safe to deallocate the `ResourceClaim`. -This requires that the controller/user has permissions to update the status -subresource of the `ResourceClaim`. 
The resourceclaim controller will also try to detect if -the resource referenced in the `ReservedFor` list has been deleted from the cluster, but -that requires that the controller has permissions to get or list resources of the type. If the -resourceclaim controller is not able to check, it will just wait until the reference in -the `ReservedFor` list is removed. +This requires that the resourceclaim controller watches the workload types that will +be supported. For other types of workloads, there will be a requirement that the workload +controller has permissions to update the status subresource of the `ResourceClaim`. The +resourceclaim controller will also try to detect if an unknown resource referenced in the +`ReservedFor` list has been deleted from the cluster, but that requires that the controller +has permissions to get or list resources of the type. If the resourceclaim controller is +not able to check, it will just wait until the reference in the `ReservedFor` list is removed. ##### Finding pods using a ResourceClaim If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer @@ -444,11 +450,17 @@ Additional e2e tests will be added to verify the behavior added in this KEP. - Gather feedback from developers and surveys - Additional tests are in Testgrid and linked in KEP - Performance impact of the feature has been measured and found to be acceptable +- More rigorous forms of testing—e.g., downgrade tests and scalability tests +- All functionality completed +- All security enforcement completed +- All monitoring requirements completed +- All testing requirements completed +- All known pre-release issues and gaps resolved #### GA -- 3 examples of real-world usage - Allowing time for feedback +- All issues and gaps identified as feedback during beta are resolved - [conformance tests] [conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md @@ -531,8 +543,13 @@ No ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes. Applications that were already running will continue to run and the allocated +Applications that were already running will continue to run and the allocated devices will remain so. +For the resource types supported directly, the resource claim controller will not remove the +reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload +controller is responsible for removing the reference, deallocation will work as long as the +feature isn't also disabled in the controllers. If they are, deallocation will not happen in this +situation either. ###### What happens if we reenable the feature if it was previously rolled back? @@ -715,7 +732,8 @@ No ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? It might require some additional memory usage in the resourceclaim controller since it will need to keep an index -of ResourceClaim to Pods. +of ResourceClaim to Pods. The resourceclaim controller will also have watches for Deployments, StatefulSets, and +Jobs which might also cause a slight increase in memory usage. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? 
From 7c589192d730c2f390f89f1eec9a3ff3027d2b15 Mon Sep 17 00:00:00 2001 From: Morten Torkildsen Date: Thu, 12 Jun 2025 20:00:54 +0000 Subject: [PATCH 3/5] Addressed more comments --- .../5194-reserved-for-workloads/README.md | 80 ++++++++++++------- 1 file changed, 50 insertions(+), 30 deletions(-) diff --git a/keps/sig-scheduling/5194-reserved-for-workloads/README.md b/keps/sig-scheduling/5194-reserved-for-workloads/README.md index f1656ebd2be..d2dd280b978 100644 --- a/keps/sig-scheduling/5194-reserved-for-workloads/README.md +++ b/keps/sig-scheduling/5194-reserved-for-workloads/README.md @@ -336,7 +336,7 @@ references a `ResourceClaim` with an empty `status.ReservedFor` list, it knows t is the first pod that will be consuming the claim. If the `spec.ReservedFor` field in the ResourceClaim is not set, the scheduler will handle -the `ResourceClaim` in the same was as now, and will add the `Pod` to the `ReservedFor` list +the `ResourceClaim` in the same way as now, and will add the `Pod` to the `ReservedFor` list if devices could be allocated for the claim. Any additional pods that reference the `ResourceClaim` will also be added to the list. @@ -348,24 +348,24 @@ list, it will not add a reference to the pod. ##### Deallocation The resourceclaim controller will remove Pod references from the `ReservedFor` list just -like it does now using the same logic. For non-Pod references, the controller will recognize -a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will -remove the reference from the list when those resources are removed. For other types, -it will be the responsibility of the workload controller/user that created the `ResourceClaim` -to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods +like it does now using the same logic. But for non-Pod references, it +will be the responsibility of the controller/user that created the `ResourceClaim` to +remove the reference to the non-Pod resource from the `ReservedFor` list when no pods are consuming the `ResourceClaim` and no new pods will be created that references the `ResourceClaim`. The resourceclaim controller will then discover that the `ReservedFor` list is empty and therefore know that it is safe to deallocate the `ResourceClaim`. -This requires that the resourceclaim controller watches the workload types that will -be supported. For other types of workloads, there will be a requirement that the workload -controller has permissions to update the status subresource of the `ResourceClaim`. The -resourceclaim controller will also try to detect if an unknown resource referenced in the -`ReservedFor` list has been deleted from the cluster, but that requires that the controller -has permissions to get or list resources of the type. If the resourceclaim controller is -not able to check, it will just wait until the reference in the `ReservedFor` list is removed. +This requires that the controller/user has permissions to update the status +subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if +the resource referenced in the `ReservedFor` list has been deleted from the cluster, but +that requires that the controller has permissions to get or list resources of the type. If the +resourceclaim controller is not able to check, it will just wait until the reference in +the `ReservedFor` list is removed. 
The resourceclaim controller will not have a watch +on the workload resource, so there is no guarantee that the controller will realize that +the resource has been deleted. This is an extra check since it is the responsibility of +the workload controller to update the claim. ##### Finding pods using a ResourceClaim If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer @@ -386,7 +386,7 @@ but haven't yet been scheduled. This distinction is important for some of the us 1. If the DRA scheduler plugin is trying to find candidates for deallocation in the `PostFilter` function and sees a `ResourceClaim` with a non-Pod reference, it will not attempt to deallocate. The plugin has no way to know how many pods are actually consuming - the `ResourceClaim` without the explit list in the `ReservedFor` list and therefore it will + the `ResourceClaim` without the explicit list in the `ReservedFor` list and therefore it will not be safe to deallocate. 1. The device_taint_eviction controller will use the list of pods referencing the `ResourceClaim` @@ -455,7 +455,10 @@ Additional e2e tests will be added to verify the behavior added in this KEP. - All security enforcement completed - All monitoring requirements completed - All testing requirements completed -- All known pre-release issues and gaps resolved +- All known pre-release issues and gaps resolved +- Revisit whether the responsibility of removing the workload resource reference from + the `ReservedFor` list should be with the workload controller (as proposed in this design) + or be handled by the resourceclaim controller. #### GA @@ -473,6 +476,19 @@ The API server will no longer accept the new fields and the other components wil not know what to do with them. So the result is that the `ReservedFor` list will only have references to pod resources like today. +Any ResourceClaims that have already been allocated when the feature was active will +have non-pod references in the `ReservedFor` list after a downgrade, but the controllers +will not know how to handle it. There are two problems that will arise as a result of +this: +- The workload controller will also have been downgraded if it is in-tree, meaning that + it will not remove the reference to workload resource from the `ReservedFor` list, thus + leading to a situation where the claim will never be deallocated. +- For new pods that gets scheduled, the scheduler will add pod references in the + `ReservedFor` list, despite there being a non-pod reference here. So it ends up with + both pod and non-pod references in the list. We need to make sure the system can + handle this, as it might also happen as a result of disablement and the enablement + of the feature. + ### Version Skew Strategy If the kubelet is on a version that doesn't support the feature but the rest of the @@ -482,10 +498,9 @@ since it will still check whether the `Pod` is references in the `ReservedFor` l If the API server is on a version that supports the feature, but the scheduler is not, the scheduler will not know about the new fields added, so it will put the reference to the `Pod` in the `ReservedFor` list rather than the reference -in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but -it will be subject to the 256 limit on the size of the `ReservedFor` list and the -controller creating the `ResourceClaim` will not find the reference it expects -in the `ReservedFor` list when it tries to remove it. +in the `spec.ReservedFor` list. 
It will do this even if there is already a non-pod +reference in the `spec.ReservedFor` list. This leads to the challenge described +in the previous section. ## Production Readiness Review Questionnaire @@ -543,18 +558,21 @@ No ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Applications that were already running will continue to run and the allocated -devices will remain so. -For the resource types supported directly, the resource claim controller will not remove the -reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload -controller is responsible for removing the reference, deallocation will work as long as the -feature isn't also disabled in the controllers. If they are, deallocation will not happen in this -situation either. +Applications that were already running will continue to run. But if a pod have to be +re-admitted by a kubelet where the feature has been disabled, it will not be able to, since +the kubelet will not find a reference to the pod in the `ReservedFor` list. + +The feature will also be disabled for in-tree workload controllers, meaning that they will +not remove the reference to the pod from the `ReservedFor` list. This means the list will never +be empty and the resourceclaim controller will never deallocate the claim. ###### What happens if we reenable the feature if it was previously rolled back? It will take affect again and will impact how the `ReservedFor` field is used during allocation -and deallocation. +and deallocation. Since this scenario allows a ResourceClaim with the `spec.ReservedFor` field +to be set and then have the scheduler populate the `ReservedFor` list with pods when the feature +is disabled, we will end up in a situation where the `ReservedFor` list can contain both non-pod +and pod references. We need to make sure all components can handle that. ###### Are there any tests for feature enablement/disablement? @@ -723,7 +741,10 @@ No ###### Will enabling / using this feature result in increasing size or count of the existing API objects? -No +Yes and no. We are adding two new fields to the ResourceClaim type, but neither are of a collection type +so they should have limited impact on the total size of the objects. However, this feature means that +we no longer need to keep a complete list of all pods using a ResourceClaim, which can significantly +reduce the size of ResourceClaim objects shared by many pods. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? @@ -732,8 +753,7 @@ No ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? It might require some additional memory usage in the resourceclaim controller since it will need to keep an index -of ResourceClaim to Pods. The resourceclaim controller will also have watches for Deployments, StatefulSets, and -Jobs which might also cause a slight increase in memory usage. +of ResourceClaim to Pods. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? 
From a6a044eaed42fe18e0928b9708e5128bee816f26 Mon Sep 17 00:00:00 2001 From: Morten Torkildsen Date: Fri, 13 Jun 2025 16:07:55 +0000 Subject: [PATCH 4/5] Fixed typo --- keps/sig-scheduling/5194-reserved-for-workloads/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-scheduling/5194-reserved-for-workloads/README.md b/keps/sig-scheduling/5194-reserved-for-workloads/README.md index d2dd280b978..e07bfbf644a 100644 --- a/keps/sig-scheduling/5194-reserved-for-workloads/README.md +++ b/keps/sig-scheduling/5194-reserved-for-workloads/README.md @@ -563,7 +563,7 @@ re-admitted by a kubelet where the feature has been disabled, it will not be abl the kubelet will not find a reference to the pod in the `ReservedFor` list. The feature will also be disabled for in-tree workload controllers, meaning that they will -not remove the reference to the pod from the `ReservedFor` list. This means the list will never +not remove the reference to the workload resource from the `ReservedFor` list. This means the list will never be empty and the resourceclaim controller will never deallocate the claim. ###### What happens if we reenable the feature if it was previously rolled back? From f8ed2595ed660a1f821897f1764d64176aa30eea Mon Sep 17 00:00:00 2001 From: Morten Torkildsen Date: Tue, 17 Jun 2025 00:08:29 +0000 Subject: [PATCH 5/5] Clarify more around the downgrade and feature disablement scenarios --- .../5194-reserved-for-workloads/README.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/keps/sig-scheduling/5194-reserved-for-workloads/README.md b/keps/sig-scheduling/5194-reserved-for-workloads/README.md index e07bfbf644a..a6af2339da5 100644 --- a/keps/sig-scheduling/5194-reserved-for-workloads/README.md +++ b/keps/sig-scheduling/5194-reserved-for-workloads/README.md @@ -485,9 +485,17 @@ this: leading to a situation where the claim will never be deallocated. - For new pods that gets scheduled, the scheduler will add pod references in the `ReservedFor` list, despite there being a non-pod reference here. So it ends up with - both pod and non-pod references in the list. We need to make sure the system can - handle this, as it might also happen as a result of disablement and the enablement - of the feature. + both pod and non-pod references in the list. We can manage both pod and non-pod + references in the list by letting the workload controllers add the non-pod reference + even if it sees pod references and making sure that the resourceclaim controller removes + pod references even if there are non-pod references in the list. For deallocation, it is + only safe when no pods are consuming the claim, so both workload and pod reference should + be removed once that is true. + +We will also provide explicit recommendations for how users can manage downgrades or +disabling this feature. This means manually updating the references in the `ReservedFor` list +to be pods rather than the reference to workload resources. We don't plan on providing +automation for this. ### Version Skew Strategy