Conversation

@wojtek-t
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 28, 2025
@wojtek-t
Member Author

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 672aa68 to ce04eca Compare December 1, 2025 08:21
@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from ce04eca to 0ff3958 Compare December 1, 2025 08:52
Comment on lines 385 to 387
1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Contributor

@44past4 44past4 Dec 1, 2025

Having two independent priorities for a workload - one for scheduling and one for preemption - or a single preemption priority that can be dynamically updated, can potentially lead to a preemption cycle.

Let's assume that we have an existing workload A with high scheduling priority and low preemption priority running in a cluster.

Now let's assume that we want to schedule a workload B which has medium scheduling priority and medium preemption priority.

Workload B will preempt workload A and start running, because its scheduling priority > workload A's preemption priority.

However, when workload A restarts and is rescheduled, it will preempt workload B and start running, because its scheduling priority > workload B's preemption priority.

The same issue can happen if we have only one priority but it is reduced while the workload is running. After being preempted, the workload reappears with its original higher priority and can preempt the workload that preempted it.

Contributor

One potential solution / mitigation to the described problem could be stating that preemption priority >= scheduling priority. This way, after restarting, the preempted workload will not be able to preempt the preemptor workload.
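
To make the invariant concrete, here is a minimal sketch (not part of the KEP; the struct and field names are hypothetical, assuming the PriorityClass names have already been resolved to integers) of how an admission-time check could enforce it:

```golang
package main

import "fmt"

// workloadPriorities is a hypothetical, resolved view of a workload's two
// priorities (after the PriorityClass names have been translated to integers).
type workloadPriorities struct {
	SchedulingPriority int32
	PreemptionPriority int32
}

// validatePriorities enforces the proposed invariant: a workload must not be
// easier to preempt than it is aggressive at preempting others, i.e.
// preemption priority >= scheduling priority.
func validatePriorities(w workloadPriorities) error {
	if w.PreemptionPriority < w.SchedulingPriority {
		return fmt.Errorf("preemption priority (%d) must be >= scheduling priority (%d)",
			w.PreemptionPriority, w.SchedulingPriority)
	}
	return nil
}

func main() {
	// Workload A from the example: high scheduling priority, low preemption
	// priority - exactly the combination that enables the preemption cycle.
	fmt.Println(validatePriorities(workloadPriorities{SchedulingPriority: 1000, PreemptionPriority: 100}))
}
```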

Member Author

Thanks for pointing that out!

Yeah - "preemption priority >= scheduling priority" is definitely desired. I don't think we have any usecases that would benefit from the reversed.

That said, I need to think a bit more about whether that is enough. I think it prevents the cycles if we assume static priorities, but it can still potentially trigger cycles if the priorities change. OTOH, if the priorities are changing, this is probably desired.

Let me think about it a bit more and I will update the KEP to reflect the thoughts later this week.

Member Author

OK - I have added an unresolved section about that to the Workload priorities section above describing the problem, potential solution and alternatives. Let's continue the discussion there.

@sanposhiho
Member

/assign

Contributor

@erictune erictune left a comment

Great to see this, and I like how it is decoupled from the other work planned for 1.36.

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor

Let's assume that nodes have either a high pod-per-node count or a low pod-per-node count. It's a bimodal distribution.

Let's further assume that if Gang scheduling is used, then the node is going to usually be low pod-per-node count.

So, then we can do the following:

  1. Individual Pod as preemptor - assume high pod-per-node, use current algorithm, which is optimized for many pods per node, consider all victims.
  2. Gang as preemptor - assume low pod-per-node in all cases, consider a maximum of e.g. 4 reprieves per node to keep compute time down, and just stop reprieving in the case where there are more things on the node.
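
As a rough illustration of that split, here is a sketch (the function name and the cap of 4 are purely illustrative, taken from the example above; nothing here is scheduler code):

```golang
package main

import "fmt"

// reprieveBudget sketches the proposed split between the two preemptor kinds.
func reprieveBudget(gangPreemptor bool, victimsOnNode int) int {
	if !gangPreemptor {
		// Individual pod as preemptor: keep the current algorithm and
		// consider reprieving every potential victim on the node.
		return victimsOnNode
	}
	// Gang as preemptor: assume a low pod-per-node count and cap the work;
	// with more victims than the cap, simply stop reprieving.
	const maxGangReprieves = 4
	if victimsOnNode < maxGangReprieves {
		return victimsOnNode
	}
	return maxGangReprieves
}

func main() {
	fmt.Println(reprieveBudget(false, 30)) // 30: consider all victims
	fmt.Println(reprieveBudget(true, 30))  // 4: capped for gang preemptors
}
```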

Member Author

Every split in the algorithm/code path makes it harder to reason about. This is why I'm trying to avoid that whenever possible.

Additionally, while I agree with you that in the majority of cases it will be true, there are definitely use cases where people run gang workloads with many pods per node. So in my opinion the split as proposed could potentially result in decisions that would be really far from the optimal ones.

In the spirit of trying to simplify and unify stuff as much as possible, I actually adjusted the algorithm so that we can have a single scheme that addresses all four use cases that we have. I think this is a much better option.

PTAL

Contributor

LGTM

@github-project-automation github-project-automation bot moved this to Needs Review in SIG Scheduling Dec 2, 2025
@xigang
Member

xigang commented Dec 3, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang December 3, 2025 00:55
1. From remaining potential victims, we start to reprieve pods starting from the highest priority
and working down until the set of remaining victims still keeps the node feasible.

Once we compute the feasibility and list of victims for all nodes, we score that and choose the

Nit: it's possible that we will not do that for all nodes in the cluster. We find feasible nodes until we have max(numNodes * 0.1, 100) nodes from which we can choose: https://github.com/kubernetes/kubernetes/blob/ec1bf8a4f3a5f054065225dc8275c66b93310d17/pkg/scheduler/framework/preemption/preemption.go#L363-L364
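
For reference, a small illustration of that cap (the 10% and 100 values mirror the formula in the comment above; this is an illustration, not the actual kube-scheduler function):

```golang
package main

import "fmt"

// numCandidateNodes illustrates the cap mentioned above: preemption stops
// searching once it has roughly max(numNodes * 10%, 100) candidate nodes,
// bounded by the total number of nodes.
func numCandidateNodes(numNodes int) int {
	n := numNodes * 10 / 100
	if n < 100 {
		n = 100
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	fmt.Println(numCandidateNodes(50))   // 50: small cluster, all nodes considered
	fmt.Println(numCandidateNodes(5000)) // 500: 10% of the cluster
}
```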

Member Author

Good catch - updated (although I don't think it changes anything for this particular proposal).


Probably not for the initial implementation, but it's worth keeping in mind once we look into the scalability of workload preemption.

- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W

1. If removing all the potential victims would not make the new workload W schedulable,

I think we should point out that this depends on workload aware scheduling which is not yet implemented and is planned for 1.36.

1. If removing all the potential victims would not make the new workload W schedulable,
the workload is unschedulable even with preemption.

```

Nit: you need to indent this "code block" to keep the numbering continuous.


1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Member

What if there is a workload and an individual pod, where only one is needed to make the new workload schedulable? Which one will be chosen?

Member Author

I think we should choose the pod, but I don't have a super strong preference. I added a point about sorting to reflect that, but I'm happy to take any suggestions there.


I guess if they have the same priority then: single pod > pod from workload with gang preemptable = false > workload with gang preemptable = true?
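
Read as a sort order for picking victims at equal priority, that suggestion could look roughly like this sketch (all names are invented for illustration; this is not KEP or scheduler code):

```golang
package main

import (
	"fmt"
	"sort"
)

// victimKind is a hypothetical classification used only to illustrate the
// tie-break ordering suggested above.
type victimKind int

const (
	singlePod            victimKind = iota // standalone pod, not part of any workload
	podFromNonGangGroup                    // pod from a workload preemptable pod-by-pod
	gangPreemptableGroup                   // whole PodGroup preempted all-or-nothing
)

type victim struct {
	name     string
	priority int32
	kind     victimKind
}

// sortVictims picks lower-priority candidates first; at equal priority it
// prefers standalone pods, then pods from pod-by-pod preemptable workloads,
// and only then whole gang-preemptable groups.
func sortVictims(victims []victim) {
	sort.SliceStable(victims, func(i, j int) bool {
		if victims[i].priority != victims[j].priority {
			return victims[i].priority < victims[j].priority
		}
		return victims[i].kind < victims[j].kind
	})
}

func main() {
	vs := []victim{
		{"gang-a", 10, gangPreemptableGroup},
		{"pod-x", 10, singlePod},
		{"group-b-pod", 10, podFromNonGangGroup},
	}
	sortVictims(vs)
	fmt.Println(vs) // pod-x first, gang-a last
}
```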

Comment on lines 478 to 484
1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
`WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
`GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
already available to use. The implementation will work similarly to `WaitOnPermit` to
ensure that `GetResources` was executed for all pods from within a `PodGroup`.
Member

How will the preemption targets be released when we end up not running the RunGetResourcesPlugins? For example, when a gang turns out to be unschedulable.

Member Author

That's a very good question. I think we want something conceptually similar to the "Reserve/Unreserve" pattern from DRA.

So the scheduling phase will effectively serve as a "reserve" phase, and we will have a sibling "unschedule" method that will be able to re-assume the victims.

It requires some description though.
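
As a sketch of that shape (only GetResources is described in the KEP text; the UngetResources sibling below is hypothetical and shown only to illustrate the release path):

```golang
package plugin

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// GetResourcesPlugin sketches the "Reserve/Unreserve"-like pattern hinted at
// above; names and signatures are assumptions, not the KEP's API.
type GetResourcesPlugin interface {
	// GetResources acquires the resources needed to bind the pod, e.g. by
	// actuating the preemption of the victims nominated during scheduling.
	GetResources(ctx context.Context, pod *v1.Pod) error
	// UngetResources releases (re-assumes) those victims when the gang turns
	// out to be unschedulable after all, mirroring Unreserve.
	UngetResources(ctx context.Context, pod *v1.Pod)
}
```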

We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
becomes a challenge, thus we modify to the approach below.

To check if a workload W can be scheduled on a given cluster with preemption we:
Member

Shouldn't we talk about a "gang pod group" rather than a "workload"?

Member Author

I don't have a strong opinion here - let me change it.

@Argh4k

Argh4k commented Dec 4, 2025

Do we want to add as a part of this KEP a description of how the preemption fits into workload-aware scheduling (codewise)? Or do we want to have it the other way around, and have the KEP for workload-aware scheduling reference this one when talking about preemption?

In the gang scheduling KEP we talk about adding a "Workload" phase where we will end up with the pods from the gang having nominated node names. I assume that this preemption will be a part of this phase. The open question is what actually will be the outcome of the preemption:

  • will the workload preemption trigger the preemption, counting on delayed preemption to actuate it
  • will the workload preemption mark pods for preemption and the trigger will be done by the current preemption in the pod PostFilter? This is actually my preferred option, as it will also take into consideration changes that happened in the cluster between the workload scheduling cycle and pod scheduling.
  • something else?


As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
preemptions. However, this is not true for the current gang scheduling implementation.
In the current implementation, preemption is triggered in the `PostFilter`. However, it's entirely

So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that actually in this doc we could describe why we need it in terms of the workload preemption and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.

I added this also in a PR discussion, I think it would be beneficial to have a section on what will be the outcome of workload preemption and if it does not actuate the preemptions, what actually will do that.

Member Author

So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that actually in this doc we could describe why we need it in terms of the workload preemption and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.

Great point - I updated this paragraph to reflect that.

I added this also in a PR discussion, I think it would be beneficial to have a section on what will be the outcome of workload preemption and if it does not actuate the preemptions, what actually will do that.

I hope that an updated KEP for gang scheduling describing the workload scheduling phase will be opened pretty soon, and I will be able to just link to it here :)
@macsko ^^

1. New field in the workload object (delayed preemption will not bring much value in
case of scheduling individual pods, though there would be significant benefit from
unification, so probably this isn't ideal option).
1. Storing it in private kube-scheduler' structures (PodInfo for individual pods and

This does not allow external schedulers to use the same concept for victims nomination.

Member Author

I would like to keep external schedulers out of scope for now - added explicitly to the non-goals section.

Member Author

@wojtek-t wojtek-t left a comment

I tried to address most of the comments, I will try to respond/address the remaining ones later today/tomorrow.

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 4293d98 to b24a962 Compare December 4, 2025 14:55
Contributor

@erictune erictune left a comment

LGTM

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor

LGTM

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erictune, wojtek-t
Once this PR has been reviewed and has the lgtm label, please ask for approval from sanposhiho. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@andreyvelich andreyvelich left a comment

Thanks @wojtek-t, overall looks great!
I left a few questions.

- Define the scheduler changes needed to implement workload-aware preemption
- Provide full backward compatibility for all existing scheduling features

### Non-Goals
Member

What about partial preemption of a Workload?
I would imagine with the DependsOn API in JobSet, that is something we should talk about at some point.
E.g. supporting Argo workflows in Kueue: kubernetes-sigs/kueue#74

cc @kannon92 @tenzen-y @mimowo

Member Author

Can you clarify what exactly you mean here?

  1. We definitely don't want to require the whole gang to always be preempted together, this should be optional. This is reflected together.
  2. We don't yet want to allow arbitrary granularity, though I wouldn't exclude defining lower granularity units later.

But not sure if either of those is actually what you're seeking with this comment.

Member

If in the future we allow users to preempt a group of pods from a gang, how will that work? Will we introduce a new API for that?

Member Author

Right - we were considering the concept of PodSubGroup. I added the "Potential future extensions" section and sketched how this can be achieved later (as well as some other potential stuff).

Contributor

Similar requests: kubernetes-sigs/kueue#3762 and kubernetes-sigs/kueue#975.

Let's say you have MinCount set below the max number of pods (say a Job with x parallelism and y MinCount where y < x). In theory you could preempt down to MinCount, and the workload still satisfies the gang requirement.

So I could see a case where this is useful for LWS or JobSet where maybe we can preempt entire replicated jobs or worker groups. And if the workload can tolerate that disruption it adjusts.

Now I see this more useful for deployments/serving as they usually could tolerate upscaling/downscaling easier than a batch workload. But I believe Ray and Spark can tolerate use cases like this.

But honestly this is maybe not a trivial task itself and could be considered for the future.

Member

I believe, with the Spark Dynamic Allocation feature, that will be critical to have.
cc @bigsur0 @akshaychitneni @shravan-achar

Member Author

So to be clear - I'm 100% convinced that we will need other policies. And the policy of "preempt individual pods up to minCount and then the whole PodGroup" is absolutely a valid policy that I can imagine.

The API facilitates this extension and the implementation can also be adjusted to that.
But the goal of this KEP is not to introduce all policies that we believe will be useful, but build the foundations and allow for those extensions later.
Clarified that in the goals/non-goals section.

Comment on lines 264 to 287
type GangSchedulingPolicy struct {
// Existing field(s).

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
}
Member

Shall we try to design an API that is future-proof?
What if in the future we allow partially preempting a group of pods from a gang for elastic training?

Member Author

Here is how I was thinking about it:

  • at some point we will introduce "PodSubGroup" as it was described in the original gang scheduling doc: https://tiny.cc/hvhs001 (whatever the name will be)
  • at this point, PodSubGroup may actually become the preemption unit
  • we will have a corresponding boolean flag at the level of PodSubGroup at this point
  • if you want PodSubGroup to be the preemption unit, you will set that field instead of setting it here

The above model will be compatible with this addition.

If that doesn't address your usecase, can you please explain your usecase in more detail?

Member

PodSubGroup sounds great! Do we have any tentative API design for it?
Are we planning to introduce this as part of the PodGroup API object?

type PodGroup struct {
    Name *string
     ...

    PodSubGroups []PodSubGroup
}

I am also curious: what if, in the future, someone wants to preempt multiple PodSubGroups within a single PodGroup?

Just like an idea, we can introduce PreemptionPolicy API which can describe such groups:

type Workload struct {
	PreemptionPolicy *PreemptionPolicy
}

type PreemptionPolicy struct {
	PriorityClassName           *string
	PreemptionPriorityClassName *string
	TargetPodGroups             []PreemptionGroup
}

type PreemptionGroup struct {
	// Name of the group.
	Name string

	// Target PodGroup or PodSubGroup name to be preempted together.
	TargetPodGroup []string
}

Member Author

PodSubGroup sounds great! Do we have any tentative API design for it?

It was described in the original doc as future extensions:
https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?tab=t.3pkx7y4zvho2
(you're co-author there :) )

Regarding PreemptionPolicy - the reason why I didn't go with that is that I believe it doesn't make sense if we aren't gang-scheduling (so it only makes sense in gang-scheduling mode). I can't imagine any use case where the preemption unit is larger than the scheduling unit - and without a gang policy the scheduling unit is an individual pod.

I'm happy to adjust the API, but we need to take that into account somehow, and cross-field validations are always more confusing to users.

Member

Oh, you are right!

I can't imagine any use case where the preemption unit is larger than the scheduling unit

That is a good point, however, how can we preempt multiple gangs within the workload together?

Let's say our Workload is a workflow that contains multiple steps, some steps use gang policy, some of them don't.
Additionally, users might want to preempt only desired steps from the workflow.

How would preemption work in that case?

Member

@andreyvelich andreyvelich Dec 19, 2025

cc @kannon92 @tenzen-y @astefanutti

In case you are also interested in the two-layer scheduling problem.

Contributor

In this case I think we could consider skipping workloads for JobSet and have each job be a separate workload scheduling.

So we let the Job controller create and schedule as-is, and the JobSet may not be treated as a whole gang.

I don't know if this needs to be in scope here. I think if a job gets preempted under a JobSet, the JobSet will keep reconciling state.

Member

@andreyvelich andreyvelich Dec 19, 2025

We could end up with a mixed situation where some ReplicatedJobs should be preempted together, but others should not from a single JobSet.

For example, consider a JobSet with three ReplicatedJob templates: Initializer, MPI Launcher, and MPI Worker.
The Initializer job should be scheduled first. After it completes, the MPI Launcher and MPI Worker jobs should be scheduled together as a single gang.

How would the Workload resources look in that case?

Contributor

🤷 Honestly I'm not entirely sure everything will be solved with this PR.

JobSet knows a lot more about how these different pods should be scheduled than the kube-scheduler.

It feels like this could be a future KEP. ala WorkloadSequence.

Worst case, a JobSet author can schedule everything together and let jobset handle orchestration via dependsOn.

I think preemptive/smart batch workloads is a hard problem and maybe for the first draft we focus on preempting the workloads.

Contributor

For example, consider a JobSet with three ReplicatedJob templates: Initializer, MPI Launcher, and MPI Worker.
The Initializer job should be scheduled first. After it completes, the MPI Launcher and MPI Worker jobs should be scheduled together as a single gang.

I guess I see two "gangs". Initializer is one. And Launcher and Worker are a separate gang that has to start after Initializer is ready.

So dependsOn with Initializer Ready.

JobSet creates the Initializer workload request and uses dependsOn to wait for that Job to be ready.

And then JobSet creates Launcher/Worker as a separate gang.

But I don't see why/how this should block this KEP. Scheduler should be aware of gangs and preempt them. And workload controllers can figure out how the gang should work in the context of the API.

#5547 seems to be more in scope than this KEP IMO.

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
}

Have we considered whether something similar to preemptionPolicy: Never makes sense for Workloads? Do we know whether there are use cases for a workload that should just wait for a place on the cluster without preempting other pods/workloads, but also requires the whole gang to start at once?

Member Author

Yes - I can definitely imagine use cases where it makes sense (CI workloads as an example).

But this is also the power of not reinventing the concept of priority from scratch and using existing PriorityClasses - by using it at the workload level, we effectively get all of its features roughly for free.

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from b24a962 to 873a281 Compare December 5, 2025 13:04
we want to support and is compatible with the current pod-based preemption algorithm. This means
we will be able to achieve in-place replacement with relatively localized changes.

### Delayed preemption
Member

@sanposhiho sanposhiho Dec 7, 2025

Actually, it would make one big difference from our current preemption: what if the pods' deletion takes time and meanwhile other places become available for this workload?
Today, a preemptor pod doesn't wait for victim pod deletion(s) to be completed. That helps: if other places become available meanwhile, the preemptor pod can still be scheduled there.
The time to complete the deletion could be a lot longer/worse when it comes to a victim workload, because a victim workload could contain thousands of pods. (Also, it's typical for ML clusters that victim pods have to do something fancy (checkpointing etc.) at termination.)
The current proposal looks like a preemptor pod is just going to be blocked at WaitForGetResources? But is it really ideal that a high-priority preemptor workload might have to wait a long time for all victim pods to be deleted, while meanwhile some new empty space might appear in the cluster where the preemptor workload could actually be scheduled?

Member Author

This is great point - I thought about it in the past, but later completely forgot it.

One point that I would phrase differently is: I'm less concerned about a workload consisting of multiple pods. This is true, but on its own it generally will not be the primary factor for problems.
The primary factor for problems most often will be the grace period, and that is a problem independently of whether we're preempting a whole workload or just an individual pod.

When I thought about it in the past, the only reasonable answer I came up with was:

  1. let's introduce a timeout and if all the victims are not preempted within that timeout, we return all the pods back to the scheduling queue
  2. However, it should not result in clearing up the nominatedNodeName for these - in other words we continue to assume that they are still waiting for preemptions to happen and be scheduled there
  3. in this case we try to schedule them again - if we can schedule them without preemption we go for it, if preemption is still required we actually don't change it

I think that works on paper and seems good enough conceptually, but the question is whether it doesn't break some implementation assumptions that I'm not aware of.
@sanposhiho @macsko @dom4ha @tosi3k

Member

Why do we need a timeout actually? I mean, today's scheduler behavior with pod preemption equals setting timeout=0 in your explanation, right? Because we just return pods to the queue immediately today.
Today's behavior makes more sense to me, because if a huge workload finishes immediately after some workloads triggered preemptions, and there's a huge empty space in the cluster, I believe we want to just schedule those pending workloads there without waiting for the new victim workloads to be terminated or for them to reach the timeout and be returned to the queue.

Member Author

I mean, today's scheduler behavior with pod preemption equals setting timeout=0 in your explanation, right? Because we just return pods to the queue immediately today.

I think not fully. So the part of "return immediately to the queue" is correct (so in that sense it equals the timeout=0 case).
But unless I'm missing something, currently we're always running a full scheduling cycle, so if we realize that it's better to place the pod on some other node N2 instead of the previously chosen node N1 (even though we already triggered preemption on N1), we can do that.
What I described is changing that: we no longer run preemption (PostFilter) in further attempts - we may change the placement if there is already an available one, but if we need to preempt something, we rather stick to the original placement.

I think that timeout=0 makes sense, but we should try to avoid triggering unnecessary preemptions.

Member

@sanposhiho sanposhiho Dec 15, 2025

we no longer run preemption (PostFilter) in further attempts

But, what if some higher priority pods/workloads already took over some places? So, I believe we should actually do that. So, my overall thoughts are:

  1. workloads are immediately going back to the queue after triggering the preemption.
  2. workloads can retry scheduling while waiting for all victim pods to complete. It allows workloads to be scheduled asap, without waiting for victim pods' termination if possible.
  3. workloads can try preemption at those retries, but it should take all on-going preemption into consideration and should try not to make any unnecessary preemptions wisely.

I'm imagining an example specific scenario like:

  1. workloadA triggers preemption. Pod#1 ... Pod#4 will be deleted. workloadA goes back to the queue immediately.
  2. Pod#1 is terminated pretty soon. Other Pod#2 .. Pod#4 are still being terminated.
  3. A higher priority workloadB took the empty place made by Pod#1 termination simply because it's higher priority than workloadA.
  4. workloadA is rescheduled for some reason (some cluster events, e.g., a new node is added etc).
  5. The scheduler still runs preemption. It "tries to" pick up just one pod (assuming all pods are the same size to simplify this example) because the place for just Pod#1 is already taken by workloadB. This preemption should be aware of the fact that workloadA is still waiting for Pod#2 .. Pod#4 and hence should try not to make any further unnecessary preemptions. Speaking of the implementation, when selecting the victims, it should first prioritize scheduling the pods from workloadA onto the domain/nodes where Pod#2...Pod#4 are running.

The reason I stressed "try to" at (5): At that point, workloadA might end up needing to preempt the whole different set of pods on different domain because it might not be able to find a new victim pod on the domain of Pod#2...Pod#4. In this case, it has to preempt different 4 pods, but that is NOT unnecessary preemptions because workloadA won't be schedulable after Pod#2...Pod#4 are terminated.

Member Author

workloads can try preemption at those retries, but it should take all on-going preemption into consideration and should try not to make any unnecessary preemptions wisely.

Sure - that was my point (just not stated clearly). We need to take into account that the original placement may no longer be valid one.
But we shouldn't try to arbitrarily choose a different place that requires different preemptions if the original one is still valid - we're just waiting for preemptions to finish.
So as an example, if we previously preempted workloadA and once it is gone we will have a place to run our workload, then in the second attempt we shouldn't try to preempt workloadB running in a completely different place just because that space would be scored higher.
We should treat "triggered preemptions" as "free space" and trigger additional preemptions only if the already triggered ones are not enough.

Member

Yup, I think we're on the same page here.


So I think I'm on the same page, but then the question is whether we really need the GetResources plugin. I understand that it is necessary for the current implementation of gang scheduling, and according to this discussion it should actuate preemption + put the pod back in the queue (so it has a chance to take another spot if we get new free space). But in a world where we have a separate workload scheduling phase, which includes this workload preemption phase, should we actuate preemptions after this phase, together with putting the pods back in the queue? Or do we want this flow: calculate NNN and preemptions for all pods from the workload in workload scheduling/preemption -> run pod-by-pod scheduling to confirm placement -> when a pod reaches GetResources, trigger preemption and put it back in the queue? In that case the pod would go through the scheduling phase 2 times + 1 time through workload scheduling.

Member Author

The pod-by-pod scheduling provides a final confirmation that the placement works, so I think we want to have that at least in the foreseeable future.

I have updated the KEP to reflect that better.

Member Author

@sanposhiho @dom4ha @macsko @Argh4k - PTAL at the updated version

Comment on lines 271 to 272
// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
Member

The comment is unclear: what does it mean if it's false? Can it be preempted partially, or can it not be preempted at all?

Member Author

The switch to enum should make it cleaner now.


// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
Member

I feel like we should consider an enum here instead of a bool? Like preemptionPolicy: gang || never, and we can add new value(s) later, e.g., partially.

Member Author

I wouldn't introduce Never here explicitly - we can use PriorityClass for that where we are already able to represent these kinds of concepts:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/types.go#L2794-L2801

But overall - the comment about switching to enum makes sense to me - for now we will have:
individual pod (default) and gang (better names needed) and eventually we can extend it further.

I will adjust that.
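
For reference, the existing PriorityClass API can already express the non-preempting case; a hedged sketch with purely illustrative values (the class name and number are made up):

```golang
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// neverPreemptingClass builds a PriorityClass whose workloads wait for
// capacity instead of preempting, which is why no extra "Never" value is
// needed in the new enum.
func neverPreemptingClass() *schedulingv1.PriorityClass {
	never := corev1.PreemptNever
	return &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "ci-batch"},
		Value:            100,
		Description:      "CI workloads: never preempt others, but may be preempted.",
		PreemptionPolicy: &never,
	}
}

func main() {
	fmt.Println(neverPreemptingClass().Name)
}
```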

Member Author

Switched to enum.

- it needs to preempt one of the already running workloads
- workload B has scheduling priority `med` but preemption cost `low`
- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
Member

Flyby comment: wouldn't it be sufficient to target lowest priority workloads possible and use cost only as a tie breaker?

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 873a281 to 6a2cc3f Compare December 12, 2025 13:16

const (
// PreemptionModePod means that individual pods can be preempted independently.
PreemptionModePod = "Pod"

When gang scheduling policy is configured with PreemptionModePod, is it correct to assume that pods are evicted individually until the PodGroup size drops below minCount, at which point the remaining pods are preempted collectively?


I don't think that's the case. I expect that those actually do not care about minCount and they either allow pod-by-pod preemption or preemption of the whole pod group, no matter what the size of the pod group is in relation to its minCount. Maybe the behavior you describe can be added as a third option?

Btw, there is an ongoing discussion about exact minCount semantics in workload scheduling doc #5730 (comment)

Member Author

Right - I expanded the comment trying to clarify this.

Member

It does seem like there's a middle ground between "any pod can be preempted, ignoring the atomicity of the pod group" and "preempt the whole podgroup". In fact it would seem pretty common that you might need to preempt a few pods from one group to make space (again thinking topological constraints) for a new workload, and that those pods from that podgroup would still need a place to start back up. In other words, the podgroup still needs to be able to be made complete, otherwise the cost metric should be "whole podgroup lost".

Member Author

Sure - I definitely agree that more policies will be needed.
I just don't want to invent every potential policy as part of this KEP - we can have follow-up KEPs (or extensions to this one) to introduce more policies - the API structure should facilitate this extensibility.

Comment on lines +503 to +505
1. For the remaining potential victims, using binary search across priorities find the minimal
priority N for which scheduling the preemptor can be achieved without preempting any victims
with priority higher than N. This allows to reduce the potential cascading preemptions later.

In theory, it is possible that the only feasible solution might require evicting pods with different priorities, which a binary search would miss because the problem is fundamentally combinatorial. This can happen when the preemptor has additional constraints, such as affinity, that can only be satisfied by removing a specific set of pods rather than those that naturally fall out of the current sort order. That said, it might cause cascading preemptions later on, so it ultimately comes down to a trade-off. I thought it was worth highlighting, even though I'm unsure whether a fully robust algorithm can be implemented within the current scheduling framework.

Member Author

In theory, it is possible that the only feasible solution might require evicting pods with different priorities, which a binary search would miss because the problem is fundamentally combinatorial.

Maybe the description is not clear, but a binary search at level X assumes "we preempt everything with priority X or lower". So if we need to evict pods of different priorities, we just try to minimize the max one, and that's what this binary search does. So that works.

That said, it might cause cascading preemptions later on, so it ultimately comes down to a trade-off.

Sure - we use heuristics like the above to minimize them, but we're not attempting to find the ideal solution.
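
A minimal sketch of that binary-search step, assuming a feasibility oracle and a toy capacity model (names and structure are illustrative, not the scheduler code):

```golang
package main

import (
	"fmt"
	"sort"
)

// feasibleFn reports whether the preemptor fits if every potential victim
// with priority <= p is assumed to be removed. In the real scheduler this
// would be a full (and expensive) feasibility check; here it is injected so
// the search itself can be shown in isolation.
type feasibleFn func(p int32) bool

// minimalPreemptionPriority finds the smallest priority level N such that
// removing all victims with priority <= N makes the preemptor schedulable.
// Victims above N are never touched, which limits cascading preemptions.
func minimalPreemptionPriority(victimPriorities []int32, feasible feasibleFn) (int32, bool) {
	if len(victimPriorities) == 0 {
		return 0, false
	}
	sort.Slice(victimPriorities, func(i, j int) bool { return victimPriorities[i] < victimPriorities[j] })
	// sort.Search returns the smallest index for which the predicate is true;
	// feasibility is monotone in the priority level, so binary search applies.
	i := sort.Search(len(victimPriorities), func(i int) bool {
		return feasible(victimPriorities[i])
	})
	if i == len(victimPriorities) {
		return 0, false // not schedulable even if all potential victims go away
	}
	return victimPriorities[i], true
}

func main() {
	// Toy example: the preemptor needs 3 "slots"; victims at priorities
	// 10, 10, 20, 30 each hold one slot. Removing everything at <= 20 suffices.
	victims := map[int32]int{10: 2, 20: 1, 30: 1}
	feasible := func(p int32) bool {
		freed := 0
		for prio, n := range victims {
			if prio <= p {
				freed += n
			}
		}
		return freed >= 3
	}
	fmt.Println(minimalPreemptionPriority([]int32{10, 10, 20, 30}, feasible)) // 20 true
}
```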

Comment on lines +340 to +341
There is one direct implication of the above - the `pod.Spec.PriorityClassName` and `pod.Spec.Priority`
may no longer reflect the actual pod priority, which could be misleading to users.
Member

How should the current users consume this? For example, Graceful Node Shutdown uses priorities to decide on the order of termination of pods. I suspect it is used in other node drain/maintenance scenarios such as picking the least disruptive node priority-wise.

Which priority should be the chosen one for disruption? Should it be Pod, Workload, or PodGroup?

Comment on lines 324 to 331
PriorityClassName *string

// PreemptionPriorityClassName, if specified, indicates the workload's
// priority that should be used when attempting to preempt this workload.
// If not specified, it will default to PriorityClassName.
//
// This field is mutable.
PreemptionPriorityClassName *string
Member

@atiratree atiratree Dec 16, 2025

As mentioned, PreemptionPriorityClassName could be useful in other scenarios (e.g. GNS). Maybe DisruptionPriorityClassName would be more fitting?

Comment on lines 415 to 420
<<[UNRESOLVED priority status]>>
We should introduce/describe `workload.status` to reflect:
- actual priority (a number) that is inferred from the priority class
- express the preemption priority that is currently used by scheduler to reflect that
it acknowledged the change
<<[/UNRESOLVED]>>
Member

@atiratree atiratree Dec 16, 2025

What should the authoritative source be for priorities for all kinds of disruptions?

Have we considered adding this information to the Pod's status? For example, .status.disruptionPriority which would be reconciled according to all the available information (Pod's .spec.PriorityClassName and .spec.priority, Workload's .spec.preemptionPriorityClassName and .spec.PodGroups[].preemptionPriorityClassName)

Another option would be to track the priorities for each PodGroup and the final Workload priority in the Workload status. For now, we could just copy the .spec.preemptionPriorityClassName resolved integer priority for both the pod groups and the final one. We could extend the space later for better granularity and strategies for computing the aggregated priority.

Introducing one of these options would improve observability and enable simulation of preemption and other disruption scenarios.
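
A sketch of the second option (field names are only illustrative of the suggestion above, not an agreed API):

```golang
package sketch

// WorkloadStatus sketches exposing the resolved integer priorities in the
// Workload status so that other disruption mechanisms can observe them.
type WorkloadStatus struct {
	// PreemptionPriority is the integer value resolved from
	// spec.preemptionPriorityClassName, as acknowledged by the scheduler.
	PreemptionPriority *int32
	// PodGroupPreemptionPriorities holds the resolved value per PodGroup,
	// keyed by PodGroup name.
	PodGroupPreemptionPriorities map[string]int32
}
```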

// priority that should be used when attempting to preempt this workload.
// If not specified, it will default to PriorityClassName.
//
// This field is mutable.
Member

Is there a pressing use case that we have to cover here? FYI, we have received a lot of negative feedback about the pod-deletion-cost feature over the years. One issue is that updating a cost (in this KEP's case a priority) is not scalable, especially when considering a large number of pods that are updated often. This issue should not be that pronounced in the Workload, since it groups a set of pods, but do we need to make this field mutable in alpha?

One case for mutability is periodic checkpointing to prevent disruption until it completes. Presumably, the priority will return back to normal afterward, and the workload will probably get disrupted.

An alternative to this is the EvictionRequest, which lets the interceptors complete the checkpointing. It is more versatile because it can also trigger checkpointing. Since EvictionRequest will be used in other disruption scenarios, it would be ideal if we could use the same mechanism. This needs more analysis, but we could add an observable disruption priority to the interceptors, for example.

In essence, setting a critical priority to PreemptionPriorityClassName is similar to using a blocking interceptor.

Member Author

  1. There are multiple potential use cases for deletion - the fact that we don't need to set it at the individual pod level finally makes that feasible.

  2. Regarding mutability - checkpointing is my primary use case. Using EvictionRequest for that is certainly an option, but even now we're not using the "evict" API for preemption, e.g. due to incompatibility with PDBs.
    I would definitely like to eventually use EvictionRequest, but I don't want to block on that to happen.

That said, we can definitely consider moving mutability to post-Alpha if there are doubts here.

Member

Thanks for considering it for post-Alpha. We will see how much progress we make with EvictionRequest in 1.36, and then we can decide the best way to make it all work together.

Member Author

After thinking more about it, I think this actually is a purely additive feature. Given that it adds non-negligible complexity, visibility requirements (having a clear status to reflect the state is a must-have), etc., I actually removed it from the scope of this KEP.
We should have a separate dedicated KEP for that.

So in this KEP, we're only introducing the preemptionPriority concept, but it is immutable.

1. If removing all potential victims would not make the preemptor schedulable, the preemptor
is unschedulable with preemption in currently considered domain D.

1. Sort all the potential victims to reflect their "importance" (from the most important to the
Member

I am curious about where the following scoreFuncs or other scoreFuncs come into action:

https://github.com/kubernetes/kubernetes/blob/c180d6762d7ac5059d9b50457cafb0d7f4cf74a9/pkg/scheduler/framework/preemption/preemption.go#L702-L714

Can we use the node scoring for the pod scoring as well, since we will consider all nodes at the same time for Workloads? I.e. run scoreFuncs on node from .spec.nodeName.

Member Author

These functions are specific in a way that they already assume that we computed victims for a given node.
So we can't really use them at this point.

Where they come into play is the "top-level 3 point" where we score different scheduling decisions.

I'm definitely open to a more sophisticated sorting algorithm here too, but let's postpone it until after alpha to prove the model (changing the sorting function is a relatively simple change).

PreemptionModePod = "Pod"
// PreemptionModePodGroup means that the whole PodGroup replica needs to be
// preempted together.
PreemptionModePodGroup = "PodGroup"
Member

We could use this grouping in the EvictionRequest as well. Please see https://github.com/atiratree/kube-enhancements/blob/evacuation-api/keps/sig-node/4563-eviction-request-api/README.md#workload-api-support.

Do you expect preemption-only behavior? Could we name this DisruptionMode?

Member Author

I don't expect preemption-specific behavior yet.

I think that renaming it to DisruptionMode makes sense, but for now I added it as "unresolved" point above. I will wait for potential concerns from others.

```golang
// PreemptionMode describes the mode in which a PodGroup can be preempted.
// +enum
type PreemptionMode string
Member

@atiratree atiratree Dec 16, 2025

Would it be useful to add support for preempting/disrupting all PodGroups at the same time?

Member Author

I don't think so.

PodGroups can be scheduled individually and I claim that if these are scheduled independently, they can also be preempted independently.
If they really should be preempted together, they should probably form a single gang.

Member

Okay, I think this answers the same question for the EvictionRequest disruption modes. We will most likely use the same resolution and either disrupt whole PodGroup or individual pods.

Contributor

How does a distributed program know whether it is being told to have one pod exit, or all pods? I suspect this varies by framework and application.

According to this doc, torchrun forwards sigterm to all other processes:
pytorch/pytorch#154849

For torch elastic, things seem more complex: pytorch/pytorch#67742

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 3854bbb to 305bad0 Compare December 17, 2025 15:09
Copy link
Member Author

@wojtek-t wojtek-t left a comment


I will try to respond to the remaining comments (and do a bit more adjustments to the KEP) tomorrow.

- Define the scheduler changes needed to implement workload-aware preemption
- Provide full backward compatibility for all existing scheduling features

### Non-Goals
Copy link
Member Author


Right - we were considering the concept of PodSubGroup. I added the "Potential future extensions" section and sketched how this can be achieved later (as well as some other potential stuff).


const (
// PreemptionModePod means that individual pods can be preempted independently.
PreemptionModePod = "Pod"
Copy link
Member Author


Right - I expanded the comment trying to clarify this.

Comment on lines 264 to 287
type GangSchedulingPolicy struct {
// Existing field(s).

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemptable *bool
}
Copy link
Member Author


Does PodGroup indicate a scheduling unit within the Workload API?

The scheduling unit can be a PodGroup (in the case of gangs), but it can also be individual pods.

Let's say I have a JobSet with two ReplicatedJobs that run in sequence (Initializer + Trainer). Trainer depends on Initializer completion.

The following structure doesn't reflect the fact that Trainer depends on Initializer completion - these will all be attempted to be scheduled at the same time, or it may even happen that the Trainer will be scheduled and the Initializer won't be.

The Workload API as currently designed doesn't have a reasonable place to reflect sequencing. We will either need to design a higher-level concept (WorkloadSequence) that builds on top of Workload and e.g. Reservation, or visibly reshape Workload anyway to accommodate it.
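To make the limitation concrete, a rough sketch (with assumed, simplified types that are not the real Workload API) of the Initializer + Trainer example flattened into one Workload - note that nothing in this shape can express that Trainer must wait for Initializer:

```golang
package main

import "fmt"

// Hypothetical, simplified shapes for illustration only - these do not
// necessarily match the real Workload API field names.
type PodGroup struct {
	Name     string
	Replicas int32
	MinCount int32
}

type Workload struct {
	Name      string
	PodGroups []PodGroup
}

func main() {
	// A JobSet with Initializer + Trainer flattened into one Workload: nothing
	// here says "Trainer starts only after Initializer completes", so both
	// groups would be treated as independently schedulable units.
	w := Workload{
		Name: "training-job",
		PodGroups: []PodGroup{
			{Name: "initializer", Replicas: 1, MinCount: 1},
			{Name: "trainer", Replicas: 16, MinCount: 16},
		},
	}
	fmt.Printf("%+v\n", w)
}
```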

// priority that should be used when attempting to preempt this workload.
// If not specified, it will default to PriorityClassName.
//
// This field is mutable.
Copy link
Member Author


  1. There are multiple potential use cases for deletion - the fact that we don't need to set it at the individual pod level finally makes that feasible.

  2. Regarding mutability - checkpointing is my primary use case. Using EvictionRequest for that is certainly an option, but even now we're not using the "evict" API for preemption, e.g. due to incompatibility with PDBs.
    I would definitely like to eventually use EvictionRequest, but I don't want to block on that happening.

That said, we can definitely consider moving mutability to post-Alpha if there are doubts here.

Why should this KEP _not_ be implemented?
-->

## Alternatives
Copy link
Contributor


Comment on lines 601 to 612
For pods being part of a workload, `PermitDisruptions` will be implemented by the already
existing `GangScheduling` plugin. The implementation will be a sibling to the existing
`Permit` extension point - the plugin will be waiting for at least `gang.minPods` to be
successfully nominated (or assumed) and only after satisfying this condition the preemptions
will be actuated.
Copy link
Member


Couldn't PermitDisruptions just be called at the end of the workload scheduling phase, when it sees the workload can't be scheduled? Then, the actuation could be left to the DefaultPreemption plugin that computed the victims. I think this would simplify the logic.

Copy link
Member Author


Let me rephrase it because something doesn't parse for me in this comment.

  1. PermitDisruptions conceptually should be called when (a) we can't schedule a workload/pod without preemption (b) we call PostFilter plugins (c) PostFilter lets us find a placement for the workload/pod

  2. So did you mean "[...] when it sees the workload can be scheduled?" instead?

If so, I would clarify it to "can be scheduled with preemption" and that conceptually makes sense.

So what we would do is effectively couple "PermitDisruptions" with "PostFilter" so that:

  1. PermitDisruptions is called only if PostFilter was called before and it allowed for a successful placement

If so - that conceptually works and I thought about it, but there are two drawbacks:
(a) it introduces a dependency on the WorkloadSchedulingCycle - this doesn't work in the current model, as we still need to coordinate across different pods.
(b) the computed placement can still be invalidated by pod-by-pod processing. But that should probably be fairly rare and is probably ok.

So I think both drawbacks are acceptable - I would be ok with that if that's what you had in mind.
If you thought about something different, can you please clarify?

Copy link
Member


If so, I would clarify it to "can be scheduled with preemption" and that conceptually makes sense.

Right, I meant when the workload can be scheduled, but with preemption.

(a) it introduces a dependency on the WorkloadSchedulingCycle - this doesn't work in the current model, as we still need to coordinate across different pods.

I'm not sure if I see your point. What dependency do you have in mind?

I meant that for the workload pods in the workload scheduling cycle we could do:

for each pod from a pod group run { Filter -> Score -> Reserve } or { Filter -> PostFilter -> Reserve } (in case of preemption)
after that, if we need any preemptions, call PermitDisruptions and put the pod group back to the queue

For non-workload pods, in the standard scheduling cycle, do:

Filter -> PostFilter -> PermitDisruptions -> put it back to the queue
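A rough, self-contained sketch of the workload-cycle flow described above; every function and type name here is invented for illustration and does not exist in the scheduler:

```golang
package main

import "fmt"

type Pod struct{ Name string }

// Stubbed stand-ins for the real Filter/Score/PostFilter plugins; they just
// pretend the first pod fits and the remaining ones need preemption.
func filterAndScore(p *Pod) (string, bool) { return "node-a", p.Name == "pod-0" }
func postFilter(p *Pod) (string, bool)     { return "node-b", true }
func reserve(p *Pod, node string)          { fmt.Println("reserved", p.Name, "on", node) }
func permitDisruptions(pods []*Pod)        { fmt.Println("actuating preemptions for", len(pods), "pods") }
func requeue(pods []*Pod)                  { fmt.Println("requeueing", len(pods), "pods") }

// scheduleGroup: per pod, run Filter -> Score -> Reserve (or Filter ->
// PostFilter -> Reserve when preemption is needed); afterwards, call
// PermitDisruptions once for the whole group and put it back into the queue.
func scheduleGroup(group []*Pod) {
	needsPreemption := false
	for _, p := range group {
		if node, ok := filterAndScore(p); ok {
			reserve(p, node)
			continue
		}
		if node, ok := postFilter(p); ok {
			reserve(p, node)
			needsPreemption = true
		}
	}
	if needsPreemption {
		permitDisruptions(group)
		requeue(group)
	}
}

func main() {
	scheduleGroup([]*Pod{{Name: "pod-0"}, {Name: "pod-1"}, {Name: "pod-2"}})
}
```

The non-workload path would have the same shape, just with PermitDisruptions invoked for a single pod after PostFilter.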

Copy link
Member Author


Discussed offline - this is exactly the dependency that I was pointing out.
However, we're probably ok with taking this dependency so I will adjust the proposal to reflect that (mentioning the drawbacks explicitly)

Copy link
Member Author


I've updated the design to reflect it.

`Permit` extension point - the plugin will be waiting for at least `gang.minPods` to be
successfully nominated (or assumed) and only after satisfying this condition the preemptions
will be actuated.
Note that as part of this change, we should treat an arbitrary preemption victim as effectively
Copy link


So do I understand correctly that in pod-by-pod scheduling, we will have some pods from the PodGroup failing to be scheduled and going to PermitDisruptions, where they will wait until (assumed pods + pods in PermitDisruptions) > min count for the PodGroup, and once that is reached we will actuate the disruptions and put the pods back in the unschedulable queue (keeping the NNN)?

I am thinking about one case. What if more space becomes available during pod-by-pod scheduling, but still not enough to fit the whole workload? Some pods will land in PermitDisruptions and we will trigger the preemption for the PodGroup, but it can contain some unnecessary preemptions. I guess we will have to trim the proposed pod group preemptions to only the preemptions related to pods that eventually landed in PermitDisruptions?
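For what it's worth, a tiny sketch of the gate being described in this question; the function name and the exact comparison are assumptions made purely for illustration, not the KEP's design:

```golang
package main

import "fmt"

// Hypothetical illustration of the gate described above: preemptions for a
// pod group are only actuated once enough of its pods are either assumed or
// waiting in PermitDisruptions.
func shouldActuate(assumedPods, waitingInPermitDisruptions, minCount int) bool {
	return assumedPods+waitingInPermitDisruptions >= minCount
}

func main() {
	fmt.Println(shouldActuate(3, 1, 4)) // true: 3 assumed + 1 waiting reaches minCount=4
}
```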

Copy link
Member Author


I'm not sure I'm following, but with the updated version this is clearly no longer a relevant concern.

- Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it.
If we decide to change that it will be addressed in a dedicated KEP.
- Propose any tradeoff between preemption and cluster scale-up.
- Design workload-level preemption triggerred by external schedulers
Copy link
Member


s/triggerred/triggered/

will result in preempting another workload, the cost of that second preemption should be
included in the cost of the original preemption.

The implication of the second one is that we should always try to avoid cascading preemptions.
Copy link
Member


Interesting. So, this seems to suggest that the decision to preempt is made on a single workload rather than possibly multiple workloads. Instead, I would think of preemption as one tool to satisfy the scheduling of a higher priority workload when lower priority ones are consuming resources. That may involve preempting (and rescheduling) parts of several workloads to meet, for example, the topology constraints of the higher priority workload.

I guess I am concerned that trying to implement "workload aware preemption" from the point of view of the workload being preempted is the wrong approach. Instead, I would expect that the scheduling cycle for the high priority workload would emit different plans (sort of like the placements @44past4 discusses, but with broader actions), and we would choose the lowest cost plan. That plan would take into account multiple workloads of varying priorities (this violates the "non-goal" of not worrying about rescheduling, btw). For example, if you had workloads A, B, C with priorities in that order, and A had a "rack" topology constraint, you may end up with a plan like:

  • Preempt 20 pods from workload C across 5 different racks to free up devices for workload B. These workload C pods will sit in pending until more resources free up.
  • Move 20 pods from workload B out of rack 100 and spread them across the five racks freed up above
  • Schedule 40 pods from workload A that has a rack topology constraint onto the rack freed up by B pods

Is the goal of this KEP just to define how workloads can advertise their preemption policies and constraints, plus maybe some individual, localized decision making that can eventually roll up into a larger plan like that via different KEPs? If so, I can see it as a step towards a smarter planner that doesn't use "cascading preemptions" but instead constructs a multi-step plan. Maybe this will become clear as I read further.
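Purely to visualize what such a multi-step plan could look like as data - nothing like this exists in the KEP or the scheduler, and every name below is made up:

```golang
package main

import "fmt"

// planStep is a purely illustrative representation of one action in a
// multi-step placement plan like the one sketched above.
type planStep struct {
	action   string // "preempt", "move", or "schedule"
	workload string
	pods     int
	detail   string
}

func main() {
	plan := []planStep{
		{action: "preempt", workload: "C", pods: 20, detail: "across 5 racks to free devices for B"},
		{action: "move", workload: "B", pods: 20, detail: "out of rack 100 into the freed racks"},
		{action: "schedule", workload: "A", pods: 40, detail: "onto rack 100, satisfying its rack constraint"},
	}
	for i, s := range plan {
		fmt.Printf("%d. %s %d pods of workload %s: %s\n", i+1, s.action, s.pods, s.workload, s.detail)
	}
}
```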

Copy link
Member Author


Instead, I would think of preemption as one tool to satisfy the scheduling of a higher priority workload when lower priority ones are consuming resources. That may involve preempting (and rescheduling) parts of several workloads to meet, for example, the topology constraints of the higher priority workload.

+100 to this - I don't think anything in the KEP is contradicting this.
Scheduling a workload may require preempting (potentially partially) multiple workloads.
The only point that we're making here is that when there are multiple options to achieve it, we should try to take into account the cascading effect of the decision and factor it into the prioritization.

I guess I am concerned that trying to implement "workload aware preemption" from the point of view of the workload being preempted is the wrong approach.

Let me clarify this principle - the goal is always scheduling the preemptor; the question is what we should preempt if there are multiple options, and that is what the principle above is about.

Is the goal of this KEP just to define how workloads can advertise their preemption policies and constraints, plus maybe some individual, localized decision making that can eventually roll up into a larger plan like that via different KEPs?

Right - so that is stated in the goals - let me clarify it a bit better.
The goal is the foundation and the concepts. The actual scoring will be fairly naive and improvements to it are not the goal of this KEP.

Copy link
Member Author


I tried to update the framing in the KEP.

same priority. We may decide to relax that assumption in the future follow-up enhancement.
1. However, we start with separating the concepts of scheduling and preemption priorities from
the very beginning. The first one is simple generalization of pod priority concept. The
later allow for dynamically adjust the expected cost of preemption of a given workload.
Copy link
Member


s/later allow/latter allows/

+1 to this separation
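For illustration, a hypothetical sketch of what this separation plus the defaulting described in the KEP ("if not specified, it defaults to PriorityClassName") could look like; the field names are assumptions, not the final API:

```golang
package main

import "fmt"

// Hypothetical sketch of the two-priority separation discussed above;
// field names are illustrative, not the final Workload API.
type WorkloadSpec struct {
	// PriorityClassName drives where the workload sits in the scheduling
	// queue and whom it may preempt (the generalization of pod priority).
	PriorityClassName string
	// PreemptionPriorityClassName drives how expensive this workload is to
	// preempt; nil means "same as PriorityClassName". Being mutable, it can
	// be lowered e.g. once a checkpoint has been taken.
	PreemptionPriorityClassName *string
}

// effectivePreemptionClass shows the defaulting described in the KEP text.
func effectivePreemptionClass(s WorkloadSpec) string {
	if s.PreemptionPriorityClassName != nil {
		return *s.PreemptionPriorityClassName
	}
	return s.PriorityClassName
}

func main() {
	low := "low-after-checkpoint"
	fmt.Println(effectivePreemptionClass(WorkloadSpec{PriorityClassName: "high"}))                                    // high
	fmt.Println(effectivePreemptionClass(WorkloadSpec{PriorityClassName: "high", PreemptionPriorityClassName: &low})) // low-after-checkpoint
}
```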


const (
// PreemptionModePod means that individual pods can be preempted independently.
PreemptionModePod = "Pod"
Copy link
Member


It does seem like there's a middle ground between "any pod can be preempted, ignoring the atomicity of the pod group" and "preempt the whole podgroup". In fact, it would seem pretty common that you might need to preempt a few pods from one group to make space (again, thinking of topological constraints) for a new workload, and that those pods from that podgroup would still need a place to start back up. In other words, the podgroup still needs to be able to be made complete again; otherwise the cost metric should be "whole podgroup lost".

It's worth mentioning here, that we want to introduce the same defaulting rules for
`workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
and there exists PriorityClass marked as `globalDefault`, we default it to that value.
This consistency will allow us to properly handle when users are not setting neither pods
Copy link
Member


s/users are not setting/users set/


Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
set at the pod level may not reflect priority used for preemption). We argue that its mutable
nature makes it infeasible for reconsiling this information back to pods for scalability reasons
Copy link
Member


s/reconsiling/reconciling/

- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
if it gets recreated, it will preempt workload C causing unnecessary cascading preemption.
This is the reason why a cost-based model was discarded.
Copy link
Member


I don't think we need to assume the preemption decision is simply a "priority, then cost" decision; it could in fact be some function of them. I guess that's what you mean by "scoring". I think when you combine "cost" with non-isolated decisions, you can get a better result. By isolated, I mean not considering A, B, and C all in the same scheduling decision for "A", but instead just pairwise decisions of "A" and "B" vs "A" and "C". From what I am understanding, the plan is to consider only the pairwise options; I think cascading preemptions may be inevitable in that case (or we may have to severely limit utility).
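To make the distinction concrete, a toy comparison of a strict "priority, then cost" ordering versus a single combined scoring function; the types, numbers, and weights are all hypothetical:

```golang
package main

import "fmt"

// victim is a hypothetical candidate for preemption.
type victim struct {
	name     string
	priority int32   // preemption priority of the victim workload
	cost     float64 // estimated disruption cost (e.g. pods lost, restart time)
}

// lessStrict implements "priority, then cost": always pick the lowest-priority
// victim, using cost only as a tie-breaker.
func lessStrict(a, b victim) bool {
	if a.priority != b.priority {
		return a.priority < b.priority
	}
	return a.cost < b.cost
}

// lessCombined folds both signals into one score, so a slightly higher-priority
// victim can still be preferred if it is much cheaper to disrupt.
func lessCombined(a, b victim) bool {
	score := func(v victim) float64 { return float64(v.priority)*10 + v.cost }
	return score(a) < score(b)
}

func main() {
	b := victim{name: "workload-B", priority: 5, cost: 100} // medium priority, cheap to preempt
	c := victim{name: "workload-C", priority: 4, cost: 900} // low priority, very expensive to preempt
	fmt.Println(lessStrict(c, b))   // true: C preferred purely on priority
	fmt.Println(lessCombined(b, c)) // true: B preferred once cost is folded in
}
```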

disruption.
Note that this problem already exists in the current gang scheduling implementation. A given gang may
not proceed with binding if the `minCount` pods from it can't be scheduled. But the preemptions are
currently triggerred immediately after choosing a place for individual pods. So similarly as above,
Copy link
Member


s/triggerred/triggered/

feasible (e.g. because higher priority pods were scheduled in the meantime).

The rationale behind the above design is to maintain the current scheduling property where preemption
doesn't result in a committment for a particular placement. If a different possible placement appears
Copy link
Member


s/committment/commitment/

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 305bad0 to e8c515f Compare December 19, 2025 08:39
@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from e8c515f to 24ca8cf Compare December 19, 2025 10:02
Copy link
Member Author

@wojtek-t wojtek-t left a comment


I think that I addressed all the comments/concerns brought up so far. PTAL

@k8s-ci-robot
Copy link
Contributor

@wojtek-t: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 30b916a | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
