(disaggregatedset): Fix DisaggregatedSet rolling update drain ordering and maxSurge=0 handling #833
hasB4K wants to merge 3 commits
Conversation
When maxSurge=0 forces a drain before scale-up, combine both operations
in a single step to avoid capacity dips. Add orphan-prevention logic to
ensure all roles drain to 0 together: if any role would drain to 0
alone, either accelerate all roles so they drain together (when safe)
or hold that role at 1 until all can drain safely.
Add canDrainAllToZero() to check maxUnavailable constraints before
coordinated drain, and applyOrphanPrevention() to centralize the logic
in both tryProportionalDrain and tryForceDrain paths.
Fixes asymmetric scenarios like {5,2} -> {5,2} with maxUnavailable=1.
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
extractRollingUpdateConfig used `v > 0` to detect whether maxSurge was explicitly set, but this could not distinguish "not set" from "set to 0". When maxSurge=0 was configured with maxUnavailable>0, the default maxSurge=1 was silently kept, causing the planner to use batchSize=1 instead of batchSize=maxUnavailable. This resulted in 1-by-1 draining instead of batched draining (e.g. 244 steps instead of 106 for a 122-prefill/72-decode rollout).

Allow maxSurge=0 when maxUnavailable>0, and add an e2e test for surge=0 with unavailable=2.
When multiple old workload revisions exist (A→B→C scenario), drain the newest old revision first. This preserves the stable, longest-running workload (A) as a capacity safety net while removing broken intermediate revisions (B) first. Also fixes the sort key to use max timestamp across all roles instead of only the first role's timestamp, making ordering robust when roles are recreated independently.
What type of PR is this?
What this PR does / why we need it
Special notes for your reviewer
This PR doesn't need to be merged.
Does this PR introduce a user-facing change?
No.