-
Notifications
You must be signed in to change notification settings - Fork 1.5k
KEP-4815+5234: DRA Update 4815 and split out 5234 for mixins #5238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
mortent
commented
Apr 11, 2025
- One-line PR description: Update ParitionableDevices KEP and create separate ResourceSliceMixins KEP
- Issue link: DRA: ResourceSlice Mixins #5234
- Issue link: DRA: Partitionable Devices #4815
- Other comments: This updates 4815 to reflect the functionality that was actually implemented for 1.33. The mixins feature was originally part of 4815, but go cut from the scope for 1.33. So this moves that functionality into a separate KEP 5234.
bf1d8cf
to
8f1251c
Compare
/wg device-management |
@bg-furiosa: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/lgtm |
// The mixins referenced here must be defined in the same | ||
// ResourceSlice. | ||
// | ||
// The maximum number of includes is 8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of maximums declared as comments in various struct properties. Does it make sense to justify these in the KEP proposal language so that there is a clear historical discovery trail for why these maximums were originally chosen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've added more details in both KEPs under the Will enabling / using this feature result in increasing size or count of the existing API objects?
sections. But it is getting somewhat complicated, so I think we should take a step back and make sure we did get this right. I will follow up on that.
type. The new function will be offered through the newly added fields under `BasicDevice`. | ||
The kube-scheduler is expected to match the kube-apiserver minor version, | ||
The API extensions proposed in this KEP are added as new fields on the | ||
`ResourceSliceSpec` and `Device` types. All the fields will be behind |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: there's one oddity about KEPs. When a KEP is initially written, it uses future tense ("we will add tests"). But at some point the feature is implemented and the KEP merely serves as documentation for what has been actually done. At that point, using "we added tests" would be more appropriate.
I don't know of any "best practice" for this. What I have done in my KEPs is that when I touched some sections, I changed the wording, but it wasn't necessarily consistent overall.
Perhaps present tense would be best? It kind of works for the initial revision and when reading it later as documentation.
In this paragraph, we have present tense ("are added") and future tense ("will be")...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the language in this section to avoid both present and future tense.
|
||
We have discussed adding a kubectl command or a plugin that will allow | ||
users to see the fully flattened versions of a ResourceSlice. But this | ||
is not in scope for alpha. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one more risk: by allowing devices and counter sets that have potentially more entries, the worst-case scenarios for scheduling becomes worse.
I think that's fine and we can add the usual disclaimer that "it depends on what DRA driver authors decide to use", which limits the potential for abuse because normal users can't trigger it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
New changes are detected. LGTM label has been removed. |
4e339c7
to
2558823
Compare
c40743b
to
20b3c07
Compare
be reduced and a larger number of devices can be published within a single | ||
ResourceSlice. | ||
- Enable defining devices with more attributes, capacities, and consumed counters. | ||
- Enable defining counter sets with more counters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things we should consider (not sure if these are goals or implementation choices): 1) enabling mixins to be per-pool not per-resource slice; 2) enabling counters to be per-pool not per-resource slice.
If necessary, these could be considered in beta. But I do think we're going to need them in time. The second may belong in partitionable not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we have an issue for that in kubernetes/kubernetes#130785. We definitely need to make a decision on this in this cycle, as I think changing this must happen over two releases. I was hoping to handle this separately from this KEP, but open to including it here.
This KEP adds a few new limits on the size of slices/maps in the ResourceSlice in addition to the ones that were added as part of the Partitionable Devices and other features. But as I've tried a few scenarios I've realized the result is not great, as it makes it possible to add more attributes, capacity and counters through mixin without actually reducing the storage size of the ResourceSlice. An example is that we currently limit the total number of counters across the counter sets in a ResourceSlice to 32. As a result, it is impossible to create a counter set with more than 32 counters. But with mixins, I can create a counter set mixin that is only referenced from a single counter set to create larger counter sets. This doesn't reduce the number of counters defined in a ResourceSlice, it just forces users to "abuse" mixins to bypass the limit. We should set the limits based on the total number of attributes, capacity, and counters across the ResourceSlice, rather than based on whether they are defined in a Device or a Mixin. I suggest we set the limits to something like:
So no special limits for mixins, they count against the same limits as the properties defined in devices. With these limits the worst case size for the ResourceSlice increases from 1,107,864 bytes to 1,288,825 bytes as a result of adding mixins. I think changing the limits for the counters should be pretty straightforward since those only affects fields that are still in alpha. So we can add the new ResourceSlice-wide limits and remove the more granular ones. |
#### More attributes, capacities and counters might worsen worst-case scheduling | ||
|
||
With mixins, DRA driver authors can choose to create more complex devices, | ||
which might lead to worse scheduling performance. But it is up to DRA driver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might replace "But it is up to DRA driver authors to do this, so it is not something that can be triggered by normal users." with something like:
"This will not negatively effect existing scheduling performance of existing ResourceSlice definitions, but DRA driver authors taking advantage of mixins should be made aware of possible performance effects due to this increased referential complexity. Furthermore, this reinforces the criticality of ensuring that DRA primitives are optimized for maximum performance."
(That's my takeaway, at least.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I update the KEP to include your suggestion with some small edits.
// The main purposes of these mixins is to reduce the memory footprint | ||
// of devices since they can reference the mixins provided here rather | ||
// than duplicate them. | ||
type ResourceSliceMixins struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if this initiates a lengthy, annoying naming debate! But: because this is a struct would it be preferable to call this ResourceSliceMixinSpec
instead (and then we'd ostensibly update the changes to ResourceSliceSpec
to include a new MixinSpec
property (instead of a Mixins
property).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting question, but when I look at the existing fields on the ResourceSliceSpec
, they don't use the ...Spec
naming convention in the field name or the type. So for consistency, I think we should just follow the pattern and keep the current names.
### Implementation | ||
|
||
The DRA scheduler plugin will flatten the counter sets and devices before | ||
going through the allocation process. This will happen as part of conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting my SIG Autoscaling hat on, is this a sufficient description of the plan to provide a standard interface for rendering the "flattened" ResourceSlice (after following and processing all mixin references)? In the worst case scenario those conversions happen surgically throughout various parts of the k/k codebase, which would make it hard for downstream components like cluster-autoscaler and karpenter to plumb into durable, reusable libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think this is a good question. Every tool that needs to understand the full device definitions will need to flatten the mixins, but the suggested implementation here doesn't lend itself easily to reuse as the flattening happens as part of conversion into the scheduler-specific format.
I added a separate section under "Risks and Mitigations" for this. The most obvious solution here is that we provide a library that handles the flattening, although I'm not sure which types (most likely v1beta2) we should do this for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @towca
Great initial thoughts, I would go ahead and move your thinking and preliminary conclusions into the KEP where it will probably get the most engagement. |
Yes, this would need ratcheting. We have already gradually moved away from individual per-slice and per-map limits towards aggregating at higher levels. You proposal is now basically to move this up to the root level of the entire slice. This makes sense to me and @thockin has approved the previous aggregated API limits, but it still is a bit unusual. Therefore I would like to hear from others what they think about taking this approach to the logical conclusion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for splitting these KEPs, as it's more self-contained and it's much easier to understand this enhancement now.
|
||
### Implementation | ||
|
||
The DRA scheduler plugin will flatten the counter sets and devices before |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By flattening we still risk that large number of references combined with a large size of mixing could cause scheduler OOM, as there would be no mechanism of keeping the consumption under control.
I think the in-memory representation should stay as is, but the allocator should iterate over mixins somehow. Is that feasible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but it means additional work to dereference the mixins every time they are needed and more complexity in the allocator to handle it.
I also think the question around memory usage in the scheduler goes beyond just mixins. The memory usage per device does matter, but so does the number of devices. Currently we allow a maximum of 127 devices per ResourceSlice, but there is no other limit on the number of ResourceSlices than the number of objects for a single type in Kubernetes. We can make changes to the allocator to make sure we don't try to keep all devices in memory at the same time, but that comes with other challenges.
We probably should look at whether we should place a limit on the number of devices in a cluster and then see what kind of impact that has on the memory usage of the scheduler.
the devices that they manage in ResourceSlices. This information is used by the | ||
scheduler when selecting devices for user requests in ResourceClaims. | ||
|
||
With this KEP, DRA drivers can define metadata in mixins separately from specific |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this topic might have been discussed already, but why don't we use a word templates instead of mixins? Wouldn't it be more straightforward naming convention that would be easier to understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean by word templates here. Could you provide an example?
type ResourceSliceSpec struct { | ||
... | ||
|
||
// Mixins defines the mixins available for devices and counter sets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand the comment to clearly define the purpose of mixins and how they will be merged with other attributes (how possible conflicts are handled)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is described in the comments on the Includes
fields on CounterSet
, Device
, and DeviceCounterConsumption
types. I think that is the right place to document this, as the order of the mixins listed matters for conflicting properties.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: mortent The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
d2deb9a
to
3eddab7
Compare