KEP-4815+5234: DRA Update 4815 and split out 5234 for mixins #5238

mortent · 2025-04-11T17:16:43Z

One-line PR description: Update PartitionableDevices KEP and create separate ResourceSliceMixins KEP

Issue link: DRA: ResourceSlice Mixins #5234
Issue link: DRA: Partitionable Devices #4815

Other comments: This updates 4815 to reflect the functionality that was actually implemented for 1.33. The mixins feature was originally part of 4815, but got cut from the scope for 1.33. So this moves that functionality into a separate KEP 5234.

mortent · 2025-04-11T17:28:27Z

/wg device-management
/cc @pohly @johnbelamaric @klueska

k8s-ci-robot · 2025-04-18T04:02:41Z

@bg-furiosa: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

bg-chun · 2025-04-18T04:02:59Z

/lgtm

jackfrancis · 2025-04-18T16:32:57Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+  // The mixins referenced here must be defined in the same
+  // ResourceSlice.
+  //
+  // The maximum number of includes is 8.


There are a lot of maximums declared as comments in various struct properties. Does it make sense to justify these in the KEP proposal language so that there is a clear historical discovery trail for why these maximums were originally chosen?

Yeah, I've added more details in both KEPs under the "Will enabling / using this feature result in increasing size or count of the existing API objects?" sections. But it is getting somewhat complicated, so I think we should take a step back and make sure we did get this right. I will follow up on that.

pohly · 2025-04-22T07:47:47Z

keps/sig-scheduling/4815-dra-partitionable-devices/README.md

-type. The new function will be offered through the newly added fields under `BasicDevice`.
-The kube-scheduler is expected to match the kube-apiserver minor version, 
+The API extensions proposed in this KEP are added as new fields on the
+`ResourceSliceSpec` and `Device` types. All the fields will be behind


Side note: there's one oddity about KEPs. When a KEP is initially written, it uses future tense ("we will add tests"). But at some point the feature is implemented and the KEP merely serves as documentation for what has been actually done. At that point, using "we added tests" would be more appropriate.

I don't know of any "best practice" for this. What I have done in my KEPs is that when I touched some sections, I changed the wording, but it wasn't necessarily consistent overall.

Perhaps present tense would be best? It kind of works for the initial revision and when reading it later as documentation.

In this paragraph, we have present tense ("are added") and future tense ("will be")...

I've updated the language in this section to avoid both present and future tense.

pohly · 2025-04-22T08:21:46Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+
+We have discussed adding a kubectl command or a plugin that will allow
+users to see the fully flattened versions of a ResourceSlice. But this
+is not in scope for alpha.


There is one more risk: by allowing devices and counter sets that have potentially more entries, the worst-case scenario for scheduling becomes worse.

I think that's fine and we can add the usual disclaimer that "it depends on what DRA driver authors decide to use", which limits the potential for abuse because normal users can't trigger it.

k8s-ci-robot · 2025-04-23T17:37:06Z

New changes are detected. LGTM label has been removed.

johnbelamaric · 2025-04-24T17:08:00Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+  be reduced and a larger number of devices can be published within a single
+  ResourceSlice.
+- Enable defining devices with more attributes, capacities, and consumed counters.
+- Enable defining counter sets with more counters.


Two things we should consider (not sure if these are goals or implementation choices): 1) enabling mixins to be per-pool not per-resource slice; 2) enabling counters to be per-pool not per-resource slice.

If necessary, these could be considered in beta. But I do think we're going to need them in time. The second may belong in partitionable not here.

Yeah, we have an issue for that in kubernetes/kubernetes#130785. We definitely need to make a decision on this in this cycle, as I think changing this must happen over two releases. I was hoping to handle this separately from this KEP, but open to including it here.

mortent · 2025-04-24T19:53:24Z

This KEP adds a few new limits on the size of slices/maps in the ResourceSlice in addition to the ones that were added as part of the Partitionable Devices and other features. But after trying a few scenarios I've realized the result is not great, as it makes it possible to add more attributes, capacity and counters through mixins without actually reducing the storage size of the ResourceSlice.

An example is that we currently limit the total number of counters across the counter sets in a ResourceSlice to 32. As a result, it is impossible to create a counter set with more than 32 counters. But with mixins, I can create a counter set mixin that is only referenced from a single counter set to create larger counter sets. This doesn't reduce the number of counters defined in a ResourceSlice, it just forces users to "abuse" mixins to bypass the limit.

We should set the limits based on the total number of attributes, capacity, and counters across the ResourceSlice, rather than based on whether they are defined in a Device or a Mixin.

I suggest we set the limits to something like:

Total combined number of attributes and capacity in a ResourceSlice is 4096 (32 * 128 devices)
Total number of counters is 256
Total number of consumed counters is 2048 (16 * 128 devices)

So no special limits for mixins, they count against the same limits as the properties defined in devices. With these limits the worst case size for the ResourceSlice increases from 1,107,864 bytes to 1,288,825 bytes as a result of adding mixins.

I think changing the limits for the counters should be pretty straightforward since those only affect fields that are still in alpha. So we can add the new ResourceSlice-wide limits and remove the more granular ones.
But I'm not sure if we will be able to easily remove the limit of 32 attributes+capacity per device, so I think this is something that would need to happen over two releases. But I think adding the ResourceSlice-wide limit of 4096 to cover both mixins and devices should be safe so we avoid any mixin-specific limits.
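
To make the proposal concrete, here is a minimal Go sketch of validation that counts entries across the whole ResourceSlice, regardless of whether they are defined on a device or in a mixin. The types `propertyCounts` and `resourceSlice` and the function `validateTotals` are hypothetical simplifications for illustration, not the real API types or validation code. The constants are the values suggested in this comment.

```go
package main

import "fmt"

// Slice-wide limits as suggested in the comment above (not an existing API).
const (
	maxDevicesPerSlice       = 128
	maxAttributesAndCapacity = 32 * maxDevicesPerSlice // 4096
	maxCounters              = 256
	maxConsumedCounters      = 16 * maxDevicesPerSlice // 2048
)

// propertyCounts is a hypothetical stand-in for a device or a device mixin,
// reduced to the number of entries it defines.
type propertyCounts struct {
	Attributes       int
	Capacity         int
	ConsumedCounters int
}

// resourceSlice is a hypothetical, heavily simplified ResourceSlice.
type resourceSlice struct {
	Devices          []propertyCounts
	DeviceMixins     []propertyCounts
	CounterSets      []int // number of counters in each counter set
	CounterSetMixins []int // number of counters in each counter set mixin
}

// validateTotals sums entries over devices and mixins alike and compares the
// sums with the slice-wide limits. There are no mixin-specific limits.
func validateTotals(s resourceSlice) []error {
	var attrsAndCapacity, consumed, counters int
	for _, group := range [][]propertyCounts{s.Devices, s.DeviceMixins} {
		for _, p := range group {
			attrsAndCapacity += p.Attributes + p.Capacity
			consumed += p.ConsumedCounters
		}
	}
	for _, group := range [][]int{s.CounterSets, s.CounterSetMixins} {
		for _, n := range group {
			counters += n
		}
	}

	var errs []error
	if attrsAndCapacity > maxAttributesAndCapacity {
		errs = append(errs, fmt.Errorf("attributes+capacity: %d exceeds %d", attrsAndCapacity, maxAttributesAndCapacity))
	}
	if counters > maxCounters {
		errs = append(errs, fmt.Errorf("counters: %d exceeds %d", counters, maxCounters))
	}
	if consumed > maxConsumedCounters {
		errs = append(errs, fmt.Errorf("consumed counters: %d exceeds %d", consumed, maxConsumedCounters))
	}
	return errs
}

func main() {
	// One counter set with 20 counters plus a counter set mixin with 20 more.
	// Both count against the same slice-wide total of 256.
	s := resourceSlice{CounterSets: []int{20}, CounterSetMixins: []int{20}}
	fmt.Println("validation errors:", validateTotals(s))
}
```

With totals like these, splitting counters between a counter set and a mixin no longer changes whether the slice is accepted, which removes the incentive to use mixins only to get around a limit.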

jackfrancis · 2025-04-24T21:59:13Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+#### More attributes, capacities and counters might worsen worst-case scheduling
+
+With mixins, DRA driver authors can choose to create more complex devices,
+which might lead to worse scheduling performance. But it is up to DRA driver


I might replace "But it is up to DRA driver authors to do this, so it is not something that can be triggered by normal users." with something like:

"This will not negatively affect the scheduling performance of existing ResourceSlice definitions, but DRA driver authors taking advantage of mixins should be made aware of possible performance effects due to this increased referential complexity. Furthermore, this reinforces the criticality of ensuring that DRA primitives are optimized for maximum performance."

(That's my takeaway, at least.)

I updated the KEP to include your suggestion with some small edits.

jackfrancis · 2025-04-24T22:22:17Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+// The main purposes of these mixins is to reduce the memory footprint
+// of devices since they can reference the mixins provided here rather
+// than duplicate them.
+type ResourceSliceMixins struct {


Sorry if this initiates a lengthy, annoying naming debate! But: because this is a struct, would it be preferable to call this ResourceSliceMixinSpec instead (and then we'd ostensibly update the changes to ResourceSliceSpec to include a new MixinSpec property instead of a Mixins property)?

Interesting question, but when I look at the existing fields on the ResourceSliceSpec, they don't use the ...Spec naming convention in the field name or the type. So for consistency, I think we should just follow the pattern and keep the current names.

jackfrancis · 2025-04-24T22:30:31Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+### Implementation
+
+The DRA scheduler plugin will flatten the counter sets and devices before
+going through the allocation process. This will happen as part of conversion


Putting my SIG Autoscaling hat on, is this a sufficient description of the plan to provide a standard interface for rendering the "flattened" ResourceSlice (after following and processing all mixin references)? In the worst case scenario those conversions happen surgically throughout various parts of the k/k codebase, which would make it hard for downstream components like cluster-autoscaler and karpenter to plumb into durable, reusable libraries.

Yeah, I think this is a good question. Every tool that needs to understand the full device definitions will need to flatten the mixins, but the suggested implementation here doesn't lend itself easily to reuse as the flattening happens as part of conversion into the scheduler-specific format.
I added a separate section under "Risks and Mitigations" for this. The most obvious solution here is that we provide a library that handles the flattening, although I'm not sure which types (most likely v1beta2) we should do this for.
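
As a rough illustration of what such a reusable flattening helper could look like, here is a minimal Go sketch. The types `Device` and `DeviceMixin` and the function `FlattenDevice` are hypothetical simplifications, not the v1beta2 types or any existing library. The precedence rules used here (later mixins override earlier ones, and the device's own attributes override all mixins) are an assumption for the example. The authoritative rules are whatever the comments on the Includes fields in the KEP specify.

```go
package main

import "fmt"

// DeviceMixin is a hypothetical, simplified mixin: a named set of attributes.
type DeviceMixin struct {
	Name       string
	Attributes map[string]string
}

// Device is a hypothetical, simplified device that may include mixins by name.
type Device struct {
	Name       string
	Includes   []string
	Attributes map[string]string
}

// FlattenDevice returns a copy of d with all included mixins resolved.
// Assumed precedence: mixins are applied in the order listed, so a later
// mixin overrides an earlier one, and attributes set directly on the device
// override anything that came from a mixin.
func FlattenDevice(d Device, mixins map[string]DeviceMixin) (Device, error) {
	flat := Device{Name: d.Name, Attributes: map[string]string{}}
	for _, name := range d.Includes {
		m, ok := mixins[name]
		if !ok {
			return Device{}, fmt.Errorf("device %q includes unknown mixin %q", d.Name, name)
		}
		for k, v := range m.Attributes {
			flat.Attributes[k] = v
		}
	}
	for k, v := range d.Attributes {
		flat.Attributes[k] = v
	}
	return flat, nil
}

func main() {
	mixins := map[string]DeviceMixin{
		"common": {Name: "common", Attributes: map[string]string{"vendor": "example", "memory": "40Gi"}},
		"large":  {Name: "large", Attributes: map[string]string{"memory": "80Gi"}},
	}
	dev := Device{
		Name:       "gpu-0",
		Includes:   []string{"common", "large"},
		Attributes: map[string]string{"index": "0"},
	}
	flat, err := FlattenDevice(dev, mixins)
	if err != nil {
		fmt.Println("flatten failed:", err)
		return
	}
	fmt.Println(flat.Name, flat.Attributes)
}
```

A helper of this shape operates on API-level types rather than on a scheduler-internal format, which is what would let downstream components such as cluster-autoscaler or karpenter reuse it.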

jackfrancis · 2025-04-24T22:32:43Z

Great initial thoughts, I would go ahead and move your thinking and preliminary conclusions into the KEP where it will probably get the most engagement.

pohly · 2025-04-25T07:09:49Z

But I'm not sure if we will be able to easily remove the limit of 32 attributes+capacity per device, so I think this is something that would need to happen over two releases.

Yes, this would need ratcheting.

We have already gradually moved away from individual per-slice and per-map limits towards aggregating at higher levels. Your proposal is now basically to move this up to the root level of the entire slice. This makes sense to me and @thockin has approved the previous aggregated API limits, but it still is a bit unusual. Therefore I would like to hear from others what they think about taking this approach to the logical conclusion.

dom4ha

Thanks for splitting these KEPs, as it's more self-contained and it's much easier to understand this enhancement now.

dom4ha · 2025-04-25T09:34:47Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+
+### Implementation
+
+The DRA scheduler plugin will flatten the counter sets and devices before


By flattening we still risk that a large number of references combined with large mixins could cause a scheduler OOM, as there would be no mechanism for keeping the memory consumption under control.

I think the in-memory representation should stay as is, but the allocator should iterate over mixins somehow. Is that feasible?

We could do that, but it means additional work to dereference the mixins every time they are needed and more complexity in the allocator to handle it.

I also think the question around memory usage in the scheduler goes beyond just mixins. The memory usage per device does matter, but so does the number of devices. Currently we allow a maximum of 128 devices per ResourceSlice, but there is no other limit on the number of ResourceSlices than the number of objects for a single type in Kubernetes. We can make changes to the allocator to make sure we don't try to keep all devices in memory at the same time, but that comes with other challenges.

We probably should look at whether we should place a limit on the number of devices in a cluster and then see what kind of impact that has on the memory usage of the scheduler.
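
The trade-off discussed in this thread can be sketched in Go: instead of building a flattened copy of every device, the allocator could resolve a single attribute on demand by walking the device's includes. The types `device` and `mixin` and the function `lookupAttribute` are hypothetical and only illustrate the idea. The precedence (device first, then mixins from last listed to first) is an assumption for the example.

```go
package main

import "fmt"

// mixin is a hypothetical, simplified device mixin.
type mixin struct {
	Attributes map[string]string
}

// device is a hypothetical, simplified device that references mixins by name.
type device struct {
	Includes   []string
	Attributes map[string]string
}

// lookupAttribute resolves one attribute without materializing a flattened
// device. It pays for the dereferencing on every call, but the in-memory
// representation stays as compact as the stored ResourceSlice.
func lookupAttribute(d device, mixins map[string]mixin, key string) (string, bool) {
	// Attributes set directly on the device are assumed to win.
	if v, ok := d.Attributes[key]; ok {
		return v, true
	}
	// Walk the includes backwards so that the last listed mixin wins.
	for i := len(d.Includes) - 1; i >= 0; i-- {
		if v, ok := mixins[d.Includes[i]].Attributes[key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	mixins := map[string]mixin{
		"common": {Attributes: map[string]string{"vendor": "example", "memory": "40Gi"}},
		"large":  {Attributes: map[string]string{"memory": "80Gi"}},
	}
	d := device{Includes: []string{"common", "large"}, Attributes: map[string]string{"index": "0"}}

	for _, key := range []string{"index", "memory", "vendor", "missing"} {
		v, ok := lookupAttribute(d, mixins, key)
		fmt.Printf("%s: %q (found=%v)\n", key, v, ok)
	}
}
```

Each lookup costs one map access per included mixin in the worst case, which is the additional work and allocator complexity mentioned in the reply above.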

dom4ha · 2025-04-25T09:42:31Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+the devices that they manage in ResourceSlices. This information is used by the
+scheduler when selecting devices for user requests in ResourceClaims.
+
+With this KEP, DRA drivers can define metadata in mixins separately from specific


I suspect this topic might have been discussed already, but why don't we use the word templates instead of mixins? Wouldn't it be a more straightforward naming convention that would be easier to understand?

Not sure what you mean by word templates here. Could you provide an example?

dom4ha · 2025-04-25T09:50:16Z

keps/sig-scheduling/5234-dra-resourceslice-mixins/README.md

+type ResourceSliceSpec struct {
+  ...
+
+  // Mixins defines the mixins available for devices and counter sets


Can you expand the comment to clearly define the purpose of mixins and how they will be merged with other attributes (how possible conflicts are handled)?

This is described in the comments on the Includes fields on CounterSet, Device, and DeviceCounterConsumption types. I think that is the right place to document this, as the order of the mixins listed matters for conflicting properties.

k8s-ci-robot · 2025-04-25T20:58:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent
Once this PR has been reviewed and has the lgtm label, please assign dom4ha, jpbetz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-scheduling/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes label Apr 11, 2025

k8s-ci-robot requested a review from alculquicondor April 11, 2025 17:16

k8s-ci-robot added the kind/kep label Apr 11, 2025

k8s-ci-robot requested a review from Huang-Wei April 11, 2025 17:16

k8s-ci-robot added the sig/scheduling label Apr 11, 2025

github-project-automation bot added this to SIG Scheduling Apr 11, 2025

github-project-automation bot moved this to Needs Triage in SIG Scheduling Apr 11, 2025

k8s-ci-robot added the size/XXL label Apr 11, 2025

Commit 8f1251c: Update ParitionableDevices KEP and create separate ResourceSliceMixins KEP

mortent force-pushed the SplitOutMixinsFeature branch from bf1d8cf to 8f1251c April 11, 2025 17:22

mortent changed the title from "KEP-4815 and KEP-5234: DRA Update 4815 and split out 5234 for mixins" to "KEP-4815+5234: DRA Update 4815 and split out 5234 for mixins" Apr 11, 2025

k8s-ci-robot requested review from johnbelamaric, klueska and pohly April 11, 2025 17:28

k8s-ci-robot added the wg/device-management label Apr 11, 2025

github-project-automation bot added this to SIG Node: Dynamic Resource Allocation Apr 11, 2025

github-project-automation bot moved this to 🆕 New in SIG Node: Dynamic Resource Allocation Apr 11, 2025

pohly moved this from 🆕 New to 🔖 Ready in SIG Node: Dynamic Resource Allocation Apr 15, 2025

pohly moved this from 🔖 Ready to 👀 In review in SIG Node: Dynamic Resource Allocation Apr 15, 2025

mortent mentioned this pull request Apr 17, 2025 in "[WIP] DRA ResourceSlice mixins" kubernetes/kubernetes#131357 (Open)

bg-chun mentioned this pull request Apr 18, 2025 in "[KEP-4815]DRA Partitionable device" kubernetes/kubernetes#130764 (Merged)

k8s-ci-robot assigned bg-chun Apr 18, 2025

k8s-ci-robot added the lgtm label Apr 18, 2025

jackfrancis reviewed Apr 18, 2025

pohly reviewed Apr 22, 2025

k8s-ci-robot removed the lgtm label Apr 23, 2025

Commit 2558823: Addressed comments

mortent force-pushed the SplitOutMixinsFeature branch from 4e339c7 to 2558823 April 23, 2025 17:44

mortent requested a review from pohly April 23, 2025 18:07

pohly reviewed Apr 24, 2025

Commit 20b3c07: Addressed more comments

mortent force-pushed the SplitOutMixinsFeature branch from c40743b to 20b3c07 April 24, 2025 16:13

johnbelamaric reviewed Apr 24, 2025

jackfrancis reviewed Apr 24, 2025

jackfrancis reviewed Apr 24, 2025

jackfrancis reviewed Apr 24, 2025

jackfrancis reviewed Apr 24, 2025

dom4ha reviewed Apr 25, 2025

alculquicondor removed their request for review April 25, 2025 13:38

Commit 3eddab7: Addressed comments

mortent force-pushed the SplitOutMixinsFeature branch from d2deb9a to 3eddab7 April 25, 2025 21:40
