(disaggregatedset): Fix DisaggregatedSet rolling update drain ordering and maxSurge=0 handling #833

Draft
hasB4K wants to merge 3 commits into kubernetes-sigs:main from hasB4K:fix-drain-order-newest-first

Contributor

@hasB4K hasB4K commented Apr 29, 2026

What type of PR is this?

What this PR does / why we need it

  • Fix coordinated drain: combine drain + scale-up in a single step when maxSurge=0 to avoid capacity dips; add orphan prevention so all roles drain to 0 together
  • Drain newest-first: when multiple old revisions exist (A→B→C), drain the newest old revision (B, the broken intermediate) before the stable original (A), preserving capacity safety
  • Fix maxSurge=0 ignored: extractRollingUpdateConfig couldn't distinguish "not set" from "set to 0", silently defaulting to maxSurge=1 and causing 1-by-1 draining instead of batched.

Special notes for your reviewer

This PR doesn't need to be merged.

Does this PR introduce a user-facing change?

No.

hasB4K added 3 commits March 31, 2026 16:42
When maxSurge=0 forces drain before scale-up, combine both operations
in a single step to avoid capacity dips. Add orphan prevention logic
to ensure all roles drain to 0 together - if any role would drain to 0
alone, either accelerate all roles to drain together (if safe) or hold
that role at 1 until all can drain safely.

Add canDrainAllToZero() to check maxUnavailable constraints before
coordinated drain, and applyOrphanPrevention() to centralize the logic
in both tryProportionalDrain and tryForceDrain paths.

Fixes asymmetric scenarios like {5,2} -> {5,2} with maxUnavailable=1.
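
The orphan prevention rule described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names come from the commit message, but their signatures, the slice-of-ints representation of per-role old replicas, and the safety condition (no role has more old replicas left than maxUnavailable) are all assumptions made for this example.

```go
package main

import "fmt"

// canDrainAllToZero reports whether every role can drop its remaining old
// replicas in one step. ASSUMPTION: "safe" means no role has more old
// replicas left than maxUnavailable; the PR's real check may differ.
func canDrainAllToZero(current []int, maxUnavailable int) bool {
	for _, c := range current {
		if c > maxUnavailable {
			return false
		}
	}
	return true
}

// applyOrphanPrevention adjusts a proposed per-role old-replica target so
// that no role reaches 0 while another role still has old replicas.
// The signature is hypothetical and chosen for this sketch only.
func applyOrphanPrevention(current, proposed []int, maxUnavailable int) []int {
	out := append([]int(nil), proposed...) // copy so the input is not mutated
	anyZero, anyNonZero := false, false
	for _, p := range out {
		if p == 0 {
			anyZero = true
		} else {
			anyNonZero = true
		}
	}
	// No orphan risk: either nobody hits 0, or everybody does.
	if !anyZero || !anyNonZero {
		return out
	}
	// Some role would hit 0 alone: accelerate all roles to 0 if that is safe.
	if canDrainAllToZero(current, maxUnavailable) {
		for i := range out {
			out[i] = 0
		}
		return out
	}
	// Otherwise hold the role that would be orphaned at 1.
	for i := range out {
		if out[i] == 0 && current[i] > 0 {
			out[i] = 1
		}
	}
	return out
}

func main() {
	// Second role would reach 0 alone and a full drain is not yet safe: hold it at 1.
	fmt.Println(applyOrphanPrevention([]int{2, 1}, []int{1, 0}, 1)) // prints [1 1]
	// Both roles are within maxUnavailable: drain them to 0 together.
	fmt.Println(applyOrphanPrevention([]int{1, 1}, []int{0, 1}, 1)) // prints [0 0]
}
```

Under these assumptions, a plan in which no role reaches 0 passes through unchanged, so the rule only intervenes near the end of a drain.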

Signed-off-by: Mathis Felardos <mathis@mistral.ai>

extractRollingUpdateConfig used `v > 0` to detect whether maxSurge was
explicitly set, but this couldn't distinguish "not set" from "set to 0".
When maxSurge=0 was configured with maxUnavailable>0, the default
maxSurge=1 was silently kept, causing the planner to use batchSize=1
instead of batchSize=maxUnavailable. This resulted in 1-by-1 draining
instead of batched draining (e.g. 244 steps instead of 106 for a
122-prefill/72-decode rollout).
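
The ambiguity described above can be illustrated with a small sketch. It is not the PR's implementation: the struct, both function names, the use of pointer fields to represent "not set", and the default of maxUnavailable=0 are assumptions made for this example. Only the `v > 0` check and the default maxSurge=1 come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// rollingUpdateConfig is a hypothetical resolved configuration.
type rollingUpdateConfig struct {
	MaxSurge       int
	MaxUnavailable int
}

// extractWithPositiveCheck mimics the `v > 0` pattern: an explicit 0 looks
// identical to "not set", so the default maxSurge=1 is silently kept.
func extractWithPositiveCheck(maxSurge, maxUnavailable int) rollingUpdateConfig {
	cfg := rollingUpdateConfig{MaxSurge: 1, MaxUnavailable: 0} // assumed defaults
	if maxSurge > 0 {
		cfg.MaxSurge = maxSurge
	}
	if maxUnavailable > 0 {
		cfg.MaxUnavailable = maxUnavailable
	}
	return cfg
}

// extractWithPointers treats nil as "not set", so an explicit 0 is honored.
// ASSUMPTION: pointers are one common Go way to do this; the PR may use
// a different mechanism.
func extractWithPointers(maxSurge, maxUnavailable *int) (rollingUpdateConfig, error) {
	cfg := rollingUpdateConfig{MaxSurge: 1, MaxUnavailable: 0} // assumed defaults
	if maxSurge != nil {
		cfg.MaxSurge = *maxSurge
	}
	if maxUnavailable != nil {
		cfg.MaxUnavailable = *maxUnavailable
	}
	// maxSurge=0 is only allowed when maxUnavailable>0, otherwise no progress is possible.
	if cfg.MaxSurge == 0 && cfg.MaxUnavailable == 0 {
		return cfg, errors.New("maxSurge and maxUnavailable cannot both be 0")
	}
	return cfg, nil
}

func intPtr(v int) *int { return &v }

func main() {
	// Explicit maxSurge=0 is lost by the positive check.
	fmt.Println(extractWithPositiveCheck(0, 2).MaxSurge) // prints 1

	// Explicit maxSurge=0 survives when nil means "not set".
	cfg, err := extractWithPointers(intPtr(0), intPtr(2))
	fmt.Println(cfg.MaxSurge, cfg.MaxUnavailable, err) // prints 0 2 <nil>
}
```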

Allow maxSurge=0 when maxUnavailable>0. Add e2e test for surge=0
with unavailable=2.

When multiple old workload revisions exist (A→B→C scenario), drain the
newest old revision first. This preserves the stable, longest-running
workload (A) as a capacity safety net while removing broken intermediate
revisions (B) first.

Also fixes the sort key to use max timestamp across all roles instead
of only the first role's timestamp, making ordering robust when roles
are recreated independently.

netlify Bot commented Apr 29, 2026

Deploy Preview for kubernetes-sigs-lws ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 26fb4c1 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/69f2651f4e349b0008798b73 |
| 😎 Deploy Preview | https://deploy-preview-833--kubernetes-sigs-lws.netlify.app |

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hasB4K
Once this PR has been reviewed and has the lgtm label, please assign kerthcet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test label ("Indicates a PR that requires an org member to verify it is safe to test.") on Apr 29, 2026
@k8s-ci-robot
Contributor

Hi @hasB4K. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L ("Denotes a PR that changes 100-499 lines, ignoring generated files.") and cncf-cla: yes ("Indicates the PR's author has signed the CNCF CLA.") labels on Apr 29, 2026
@hasB4K hasB4K marked this pull request as draft April 29, 2026 20:09
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label ("Indicates that a PR should not merge because it is a work in progress.") on Apr 29, 2026
Member

yankay commented May 1, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test label ("Indicates a non-member PR verified by an org member that is safe to test.") and removed the needs-ok-to-test label ("Indicates a PR that requires an org member to verify it is safe to test.") on May 1, 2026
@k8s-ci-robot
Contributor

@hasB4K: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-lws-verify-main | 26fb4c1 | link | true | `/test pull-lws-verify-main` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

