Conversation

@wojtek-t
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Nov 28, 2025
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 28, 2025
@wojtek-t
Member Author

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 672aa68 to ce04eca Compare December 1, 2025 08:21
@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from ce04eca to 0ff3958 Compare December 1, 2025 08:52
Comment on lines 385 to 387
1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Contributor

@44past4 44past4 Dec 1, 2025

Having two independent priorities for a workload - one for scheduling and one for preemption - or a single preemption priority that can be dynamically updated, can potentially lead to a preemption cycle.

Let's assume that we have an existing workload A with high scheduling priority and low preemption priority running in a cluster.

Now let's assume that we want to schedule a workload B which has medium scheduling priority and medium preemption priority.

Workload B will preempt workload A and start running, because its scheduling priority > workload A's preemption priority.

However, when workload A restarts and is rescheduled, it will preempt workload B and start running, because its scheduling priority > workload B's preemption priority.

The same issue can happen if we have only one priority but it is reduced while the workload is running. After being preempted, the workload reappears with its original higher priority and can preempt the workload that preempted it.

Contributor

One potential solution / mitigation to the described problem could be stating that preemption priority >= scheduling priority. This way, after restarting, the preempted workload will not be able to preempt the preemptor workload.
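
To make the invariant concrete, here is a minimal sketch (not part of the KEP; the struct and field names are hypothetical, assuming the PriorityClass names have already been resolved to integers) of how an admission-time check could enforce it:

```golang
package main

import "fmt"

// workloadPriorities is a hypothetical, resolved view of a workload's two
// priorities (after the PriorityClass names have been translated to integers).
type workloadPriorities struct {
	SchedulingPriority int32
	PreemptionPriority int32
}

// validatePriorities enforces the proposed invariant: a workload must not be
// easier to preempt than it is aggressive at preempting others, i.e.
// preemption priority >= scheduling priority.
func validatePriorities(w workloadPriorities) error {
	if w.PreemptionPriority < w.SchedulingPriority {
		return fmt.Errorf("preemption priority (%d) must be >= scheduling priority (%d)",
			w.PreemptionPriority, w.SchedulingPriority)
	}
	return nil
}

func main() {
	// Workload A from the example: high scheduling priority, low preemption
	// priority - exactly the combination that enables the preemption cycle.
	fmt.Println(validatePriorities(workloadPriorities{SchedulingPriority: 1000, PreemptionPriority: 100}))
}
```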

Member Author

Thanks for pointing that out!

Yeah - "preemption priority >= scheduling priority" is definitely desired. I don't think we have any usecases that would benefit from the reversed.

That said, I need to think a bit more about whether that is enough. I think it prevents the cycles if we assume static priorities, but it can still potentially trigger cycles if the priorities change. OTOH, if the priorities are changing, this is probably desired.

Let me think about it a bit more and I will update the KEP to reflect the thoughts later this week.

Member Author

OK - I have added an unresolved section about that to the Workload priorities section above describing the problem, potential solution and alternatives. Let's continue the discussion there.

@sanposhiho
Member

/assign

Contributor

@erictune erictune left a comment

Great to see this, and I like how it is decoupled from the other work planned for 1.36.

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor

Let's assume that nodes have either a high pod-per-node count or a low pod-per-node count. It's a bimodal distribution.

Let's further assume that if Gang scheduling is used, then the node is going to usually be low pod-per-node count.

So, then we can do the following:

  1. Individual Pod as preemptor - assume high pod-per-node, use current algorithm, which is optimized for many pods per node, consider all victims.
  2. Gang as preemptor - assume low pod-per-node in all cases, consider a maximum of e.g. 4 reprieves per node to keep compute time down, and just stop reprieving in the case where there are more things on the node.
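
As a rough illustration of that split, here is a sketch (the function name and the cap of 4 are purely illustrative, taken from the example above; nothing here is scheduler code):

```golang
package main

import "fmt"

// reprieveBudget sketches the proposed split between the two preemptor kinds.
func reprieveBudget(gangPreemptor bool, victimsOnNode int) int {
	if !gangPreemptor {
		// Individual pod as preemptor: keep the current algorithm and
		// consider reprieving every potential victim on the node.
		return victimsOnNode
	}
	// Gang as preemptor: assume a low pod-per-node count and cap the work;
	// with more victims than the cap, simply stop reprieving.
	const maxGangReprieves = 4
	if victimsOnNode < maxGangReprieves {
		return victimsOnNode
	}
	return maxGangReprieves
}

func main() {
	fmt.Println(reprieveBudget(false, 30)) // 30: consider all victims
	fmt.Println(reprieveBudget(true, 30))  // 4: capped for gang preemptors
}
```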

Member Author

Every split in the algorithm/code path makes it harder to reason about. This is why I'm trying to avoid that whenever possible.

Additionally, while I agree with you that in the majority of cases it will be true, there are definitely use cases where people run gang workloads with many pods per node. So in my opinion the split as proposed could potentially result in decisions that would be really far from the optimal ones.

In the spirit of trying to simplify and unify stuff as much as possible, I actually adjusted the algorithm so that we can have a single scheme that addresses all four use cases that we have. I think this is a much better option.

PTAL

Contributor

LGTM

@github-project-automation github-project-automation bot moved this to Needs Review in SIG Scheduling Dec 2, 2025
@xigang
Member

xigang commented Dec 3, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from xigang December 3, 2025 00:55
1. From remaining potential victims, we start to reprieve pods starting from the highest priority
and working down until the set of remaining victims still keeps the node feasible.

Once we compute the feasibility and list of victims for all nodes, we score that and choose the

Nit: it's possible that we will not do that for all nodes in the cluster. We find feasible nodes until we have max(numNodes * 0.1, 100) nodes from which we can choose: https://github.com/kubernetes/kubernetes/blob/ec1bf8a4f3a5f054065225dc8275c66b93310d17/pkg/scheduler/framework/preemption/preemption.go#L363-L364
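
For reference, a small illustration of that cap (the 10% and 100 values mirror the formula in the comment above; this is an illustration, not the actual kube-scheduler function):

```golang
package main

import "fmt"

// numCandidateNodes illustrates the cap mentioned above: preemption stops
// searching once it has roughly max(numNodes * 10%, 100) candidate nodes,
// bounded by the total number of nodes.
func numCandidateNodes(numNodes int) int {
	n := numNodes * 10 / 100
	if n < 100 {
		n = 100
	}
	if n > numNodes {
		n = numNodes
	}
	return n
}

func main() {
	fmt.Println(numCandidateNodes(50))   // 50: small cluster, all nodes considered
	fmt.Println(numCandidateNodes(5000)) // 500: 10% of the cluster
}
```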

Member Author

Good catch - updated (although I don't think it changes anything for this particular proposal).


Probably not for the initial implementation, but it's worth keeping in mind once we look into the scalability of workload preemption.

- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W

1. If removing all the potential victims would not make the new workload W schedulable,

I think we should point out that this depends on workload aware scheduling which is not yet implemented and is planned for 1.36.

1. If removing all the potential victims would not make the new workload W schedulable,
the workload is unschedulable even with preemption.

```

Nit: you need to indent this "code block" to keep the numbering continuous.


1. Identify the list of potential victims:
- all running workloads with (preemption) priority lower than the new workload W
- all individual pods (not being part of workloads) with priority lower than the new workload W
Member

What if there is a workload and an individual pod, where only one is needed to make the new workload schedulable? Which one will be chosen?

Member Author

I think we should choose the pod, but I don't have a super strong preference. I added a point about sorting to reflect that, but I'm happy to take any suggestions there.


I guess if they have the same priority then: single pod > pod from workload with gang preemptable = false > workload with gang preemptable = true?
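
Read as a sort order for picking victims at equal priority, that suggestion could look roughly like this sketch (all names are invented for illustration; this is not KEP or scheduler code):

```golang
package main

import (
	"fmt"
	"sort"
)

// victimKind is a hypothetical classification used only to illustrate the
// tie-break ordering suggested above.
type victimKind int

const (
	singlePod            victimKind = iota // standalone pod, not part of any workload
	podFromNonGangGroup                    // pod from a workload preemptable pod-by-pod
	gangPreemptableGroup                   // whole PodGroup preempted all-or-nothing
)

type victim struct {
	name     string
	priority int32
	kind     victimKind
}

// sortVictims picks lower-priority candidates first; at equal priority it
// prefers standalone pods, then pods from pod-by-pod preemptable workloads,
// and only then whole gang-preemptable groups.
func sortVictims(victims []victim) {
	sort.SliceStable(victims, func(i, j int) bool {
		if victims[i].priority != victims[j].priority {
			return victims[i].priority < victims[j].priority
		}
		return victims[i].kind < victims[j].kind
	})
}

func main() {
	vs := []victim{
		{"gang-a", 10, gangPreemptableGroup},
		{"pod-x", 10, singlePod},
		{"group-b-pod", 10, podFromNonGangGroup},
	}
	sortVictims(vs)
	fmt.Println(vs) // pod-x first, gang-a last
}
```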

Comment on lines 478 to 484
1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
`WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
`GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
already available to use. The implementation will work similarly to `WaitOnPermit` to
ensure that `GetResources` was executed for all pods from within a `PodGroup`.
Member

How will the preemption targets be released when we end up not running the RunGetResourcesPlugins? For example, when a gang turns out to be unschedulable.

Member Author

That's a very good question. I think we want something conceptually similar to the "Reserve/Unreserve" pattern from DRA.

So the scheduling phase will effectively serve as a "reserve" phase, and we will have a sibling "unschedule" method that will be able to re-assume the victims.

It requires some description though.
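
As a sketch of that shape (only GetResources is described in the KEP text; the UngetResources sibling below is hypothetical and shown only to illustrate the release path):

```golang
package plugin

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// GetResourcesPlugin sketches the "Reserve/Unreserve"-like pattern hinted at
// above; names and signatures are assumptions, not the KEP's API.
type GetResourcesPlugin interface {
	// GetResources acquires the resources needed to bind the pod, e.g. by
	// actuating the preemption of the victims nominated during scheduling.
	GetResources(ctx context.Context, pod *v1.Pod) error
	// UngetResources releases (re-assumes) those victims when the gang turns
	// out to be unschedulable after all, mirroring Unreserve.
	UngetResources(ctx context.Context, pod *v1.Pod)
}
```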

We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
becomes a challenge, thus we modify to the approach below.

To check if a workload W can be scheduled on a given cluster with preemption we:
Member

Shouldn't we talk about a "gang pod group" rather than a "workload"?

Member Author

I don't have a strong opinion here - let me change it.

@Argh4k

Argh4k commented Dec 4, 2025

Do we want to add as a part of this KEP a description of how the preemption fits into workload-aware scheduling (codewise)? Or do we want to have it the other way around, and have the KEP for workload-aware scheduling reference this one when talking about preemption?

In the gang scheduling KEP we talk about adding a "Workload" phase where we will end up with the pods from the gang having nominated node names. I assume that this preemption will be a part of this phase. The open question is what actually will be the outcome of the preemption:

  • will the workload preemption trigger the preemption, counting on delayed preemption to actuate it
  • will the workload preemption mark pods for preemption and the trigger will be done by the current preemption in the pod PostFilter? This is actually my preferred option, as it will also take into consideration changes that happened in the cluster between the workload scheduling cycle and pod scheduling.
  • something else?


As part of minimizing preemptions goal, arguably the most important thing to do is to avoid unnecessary
preemptions. However, this is not true for the current gang scheduling implementation.
In the current implementation, preemption is triggered in the `PostFilter`. However, it's entirely

So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that actually in this doc we could describe why we need it in terms of the workload preemption and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.

I added this also in a PR discussion, I think it would be beneficial to have a section on what will be the outcome of workload preemption and if it does not actuate the preemptions, what actually will do that.

Member Author

So the reasoning here is that we want delayed preemption because it helps with the current gang scheduling implementation. But I believe that actually in this doc we could describe why we need it in terms of the workload preemption and IIUC this is to have an option to run workload preemption as part of the workload scheduling without immediately actuating the preemptions.

Great point - I updated this paragraph to reflect that.

I added this also in a PR discussion, I think it would be beneficial to have a section on what will be the outcome of workload preemption and if it does not actuate the preemptions, what actually will do that.

I hope that an updated KEP for gang scheduling describing the workload scheduling phase will be opened pretty soon, and I will be able to just link to it here :)
@macsko ^^

1. New field in the workload object (delayed preemption will not bring much value in
case of scheduling individual pods, though there would be significant benefit from
unification, so probably this isn't ideal option).
1. Storing it in private kube-scheduler' structures (PodInfo for individual pods and

This does not allow external schedulers to use the same concept for victims nomination.

Member Author

I would like to keep external schedulers out of scope for now - added explicitly to the non-goals section.

Member Author

@wojtek-t wojtek-t left a comment

I tried to address most of the comments, I will try to respond/address the remaining ones later today/tomorrow.

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 4293d98 to b24a962 Compare December 4, 2025 14:55
Contributor

@erictune erictune left a comment

LGTM

can't reprieve any of those, learning about that would require O(N) full workload schedulings
with N being number of workload/pods violating PDB.
<<[/UNRESOLVED]>>
```
Contributor

LGTM

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erictune, wojtek-t
Once this PR has been reviewed and has the lgtm label, please ask for approval from sanposhiho. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@andreyvelich andreyvelich left a comment

Thanks @wojtek-t, overall looks great!
I left a few questions.

- Define the scheduler changes needed to implement workload-aware preemption
- Provide full backward compatibility for all existing scheduling features

### Non-Goals
Member

What about partial preemption of a Workload?
I would imagine with the DependsOn API in JobSet, that is something we should talk about at some point.
E.g. supporting Argo workflows in Kueue: kubernetes-sigs/kueue#74

cc @kannon92 @tenzen-y @mimowo

Member Author

Can you clarify what exactly you mean here?

  1. We definitely don't want to require the whole gang to always be preempted together, this should be optional. This is reflected together.
  2. We don't yet want to allow arbitrary granularity, though I wouldn't exclude defining lower granularity units later.

But not sure if either of those is actually what you're seeking with this comment.

Member

If in the future we allow users to preempt a group of pods from a gang, how will that work? Will we introduce a new API for that?

Member Author

Right - we were considering the concept of PodSubGroup. I added the "Potential future extensions" section and sketched how this can be achieved later (as well as some other potential stuff).

Contributor

Similar requests: kubernetes-sigs/kueue#3762 and kubernetes-sigs/kueue#975.

Let's say you have MinCount set below the max number of pods (say a Job with x parallelism and y MinCount where y < x). In theory you could preempt down to MinCount, and the workload still satisfies the gang requirement.

So I could see a case where this is useful for LWS or JobSet where maybe we can preempt entire replicated jobs or worker groups. And if the workload can tolerate that disruption it adjusts.

Now I see this more useful for deployments/serving as they usually could tolerate upscaling/downscaling easier than a batch workload. But I believe Ray and Spark can tolerate use cases like this.

But honestly this is maybe not a trivial task itself and could be considered for the future.

Member

I believe, with the Spark Dynamic Allocation feature, that will be critical to have.
cc @bigsur0 @akshaychitneni @shravan-achar

Member Author

So to be clear - I'm 100% convinced that we will need other policies. And the policy of "preempt individual pods up to minCount and then the whole PodGroup" is absolutely a valid policy that I can imagine.

The API facilitates this extension and the implementation can also be adjusted to that.
But the goal of this KEP is not to introduce all policies that we believe will be useful, but build the foundations and allow for those extensions later.
Clarified that in the goals/non-goals section.

Comment on lines 264 to 287
type GangSchedulingPolicy struct {
// Existing field(s).

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
}
Member

Shall we try to design an API that is future-proof?
What if in the future we allow partially preempting a group of pods from a gang for elastic training?

Member Author

Here is how I was thinking about it:

  • at some point we will introduce "PodSubGroup" as it was described in the original gang scheduling doc: https://tiny.cc/hvhs001 (whatever the name will be)
  • at this point, PodSubGroup may actually become the preemption unit
  • we will have a corresponding boolean flag at the level of PodSubGroup at this point
  • if you want PodSubGroup to be the preemption unit, you will set that field instead of setting it here

The above model will be compatible with this addition.

If that doesn't address your usecase, can you please explain your usecase in more detail?

Member

PodSubGroup sounds great! Do we have any tentative API design for it?
Are we planning to introduce this as part of the PodGroup API object?

type PodGroup struct {
    Name *string
     ...

    PodSubGroups []PodSubGroup
}

I am also curious: what if, in the future, someone wants to preempt multiple PodSubGroups within a single PodGroup?

Just like an idea, we can introduce PreemptionPolicy API which can describe such groups:

type Workload struct {
	PreemptionPolicy *PreemptionPolicy
}

type PreemptionPolicy struct {
	PriorityClassName           *string
	PreemptionPriorityClassName *string
	TargetPodGroups             []PreemptionGroup
}

type PreemptionGroup struct {
	// Name of the group.
	Name string

	// Target PodGroup or PodSubGroup name to be preempted together.
	TargetPodGroup []string
}

Member Author

PodSubGroup sounds great! Do we have any tentative API design for it?

It was described in the original doc as future extensions:
https://docs.google.com/document/d/1ulO5eUnAsBWzqJdk_o5L-qdq5DIVwGcE7gWzCQ80SCM/edit?tab=t.3pkx7y4zvho2
(you're co-author there :) )

Regarding PreemptionPolicy - the reason why I didn't go with that is that I believe it doesn't make sense if we aren't gang-scheduling (so it only makes sense in gang-scheduling mode). I can't imagine any use case where the preemption unit is larger than the scheduling unit - and without a gang policy the scheduling unit is an individual pod.

I'm happy to adjust the API, but we need to take that into account somehow, and cross-field validations are always more confusing to users.

Member

Oh, you are right!

I can't imagine any use case where the preemption unit is larger than the scheduling unit

That is a good point, however, how can we preempt multiple gangs within the workload together?

Let's say our Workload is a workflow that contains multiple steps, some steps use gang policy, some of them don't.
Additionally, users might want to preempt only desired steps from the workflow.

How would preemption work in that case?

Member

@andreyvelich andreyvelich Dec 19, 2025

cc @kannon92 @tenzen-y @astefanutti

In case you are also interested in the two-layer scheduling problem.

Contributor

In this case I think we could consider skipping workloads for JobSet and have each job be a separate workload scheduling.

So we let the Job controller create and schedule as-is, and the JobSet may not be treated as a whole gang.

I don't know if this needs to be in scope here. I think if a job gets preempted under a JobSet, the JobSet will keep reconciling state.

Member

@andreyvelich andreyvelich Dec 19, 2025

We could end up with a mixed situation where some ReplicatedJobs should be preempted together, but others should not from a single JobSet.

For example, consider a JobSet with three ReplicatedJob templates: Initializer, MPI Launcher, and MPI Worker.
The Initializer job should be scheduled first. After it completes, the MPI Launcher and MPI Worker jobs should be scheduled together as a single gang.

How would the Workload resources look in that case?

Contributor

🤷 Honestly I'm not entirely sure everything will be solved with this PR.

JobSet knows a lot more about how these different pods should be scheduled than the kube-scheduler.

It feels like this could be a future KEP. ala WorkloadSequence.

Worst case, a JobSet author can schedule everything together and let jobset handle orchestration via dependsOn.

I think preemptive/smart batch workloads is a hard problem and maybe for the first draft we focus on preempting the workloads.

Contributor

For example, consider a JobSet with three ReplicatedJob templates: Initializer, MPI Launcher, and MPI Worker.
The Initializer job should be scheduled first. After it completes, the MPI Launcher and MPI Worker jobs should be scheduled together as a single gang.

I guess I see two "gangs". Initializer is one. And Launcher and Worker are a separate gang that has to start after Initializer is ready.

So dependsOn with Initializer Ready.

JobSet creates the Initializer workload request and uses dependsOn to wait for that Job to be ready.

And then JobSet creates Launcher/Worker as a separate gang.

But I don't see why/how this should block this KEP. Scheduler should be aware of gangs and preempt them. And workload controllers can figure out how the gang should work in the context of the API.

#5547 seems to be more in scope than this KEP IMO.

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
}

Have we considered whether something similar to preemptionPolicy: Never makes sense for Workloads? Do we know whether there are use cases for a workload that should just wait for a place on the cluster without preempting other pods/workloads, but also requires the whole gang to start at once?

Member Author

Yes - I can definitely imagine use cases where it makes sense (CI workloads as an example).

But this is also the power of not reinventing the concept of priority from scratch and using existing PriorityClasses - by using it at the workload level, we effectively get all of its features roughly for free.

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from b24a962 to 873a281 Compare December 5, 2025 13:04
we want to support and is compatible with the current pod-based preemption algorithm. This means
we will be able to achieve in-place replacement with relatively localized changes.

### Delayed preemption
Member

@sanposhiho sanposhiho Dec 7, 2025

Actually, it would make one big difference from our current preemption: what if the pods' deletion takes time and meanwhile other places become available for this workload?
Today, a preemptor pod doesn't wait for victim pod deletion(s) to be completed. That helps: if other places become available meanwhile, the preemptor pod can still be scheduled there.
The time to complete the deletion could be a lot longer/worse when it comes to a victim workload, because a victim workload could contain thousands of pods. (Also, it's typical for ML clusters that victim pods have to do something fancy (checkpointing etc.) at termination.)
The current proposal looks like a preemptor pod is just going to be blocked at WaitForGetResources? But is it really ideal that a high-priority preemptor workload might have to wait a long time for all victim pods to be deleted, while meanwhile some new empty space might appear in the cluster where the preemptor workload could actually be scheduled?

Member Author

This is great point - I thought about it in the past, but later completely forgot it.

One point that I would phrase differently is: I'm less concerned about a workload consisting of multiple pods. This is true, but on its own it generally will not be the primary factor for problems.
The primary factor for problems most often will be the grace period, and that is a problem independently of whether we're preempting a whole workload or just an individual pod.

When I thought about it in the past, the only reasonable answer I came up with was:

  1. let's introduce a timeout and if all the victims are not preempted within that timeout, we return all the pods back to the scheduling queue
  2. However, it should not result in clearing up the nominatedNodeName for these - in other words we continue to assume that they are still waiting for preemptions to happen and be scheduled there
  3. in this case we try to schedule them again - if we can schedule them without preemption we go for it, if preemption is still required we actually don't change it

I think that works on paper and seems good enough conceptually, but the question is whether it doesn't break some implementation assumptions that I'm not aware of.
@sanposhiho @macsko @dom4ha @tosi3k

Member

Why do we need a timeout actually? I mean, today's scheduler behavior with pod preemption equals setting timeout=0 in your explanation, right? Because we just return pods to the queue immediately today.
Today's behavior makes more sense to me, because if a huge workload finishes immediately after some workloads triggered preemptions, and there's a huge empty space in the cluster, I believe we want to just schedule those pending workloads there without waiting for the new victim workloads to be terminated or for them to reach the timeout and be returned to the queue.

Member Author

I mean, today's scheduler behavior with pod preemption equals setting timeout=0 in your explanation, right? Because we just return pods to the queue immediately today.

I think not fully. So the part of "return immediately to the queue" is correct (so in that sense it equals the timeout=0 case).
But unless I'm missing something, currently we're always running a full scheduling cycle, so if we realize that it's better to place the pod on some other node N2 instead of the previously chosen node N1 (even though we already triggered preemption on N1), we can do that.
What I described is changing that: we no longer run preemption (PostFilter) in further attempts - we may change the placement if there is already an available one, but if we need to preempt something, we rather stick to the original placement.

I think that timeout=0 makes sense, but we should try to avoid triggering unnecessary preemptions.

Member

@sanposhiho sanposhiho Dec 15, 2025

we no longer run preemption (PostFilter) in further attempts

But, what if some higher priority pods/workloads already took over some places? So, I believe we should actually do that. So, my overall thoughts are:

  1. workloads are immediately going back to the queue after triggering the preemption.
  2. workloads can retry scheduling while waiting for all victim pods to complete. It allows workloads to be scheduled asap, without waiting for victim pods' termination if possible.
  3. workloads can try preemption at those retries, but it should take all on-going preemption into consideration and should try not to make any unnecessary preemptions wisely.

I'm imagining an example specific scenario like:

  1. workloadA triggers preemption. Pod#1 ... Pod#4 will be deleted. workloadA goes back to the queue immediately.
  2. Pod#1 is terminated pretty soon. Other Pod#2 .. Pod#4 are still being terminated.
  3. A higher priority workloadB took the empty place made by Pod#1 termination simply because it's higher priority than workloadA.
  4. workloadA is rescheduled for some reason (some cluster events, e.g., a new node is added etc).
  5. The scheduler still runs preemption. It "tries to" pick up just one pod (assuming all pods are the same size to simplify this example) because the place for just Pod#1 is already taken by workloadB. This preemption should be aware of the fact that workloadA is still waiting for Pod#2 .. Pod#4 and hence should try not to make any further unnecessary preemptions. Speaking of the implementation, when selecting the victims, it should first prioritize scheduling the pods from workloadA onto the domain/nodes where Pod#2...Pod#4 are running.

The reason I stressed "try to" at (5): At that point, workloadA might end up needing to preempt the whole different set of pods on different domain because it might not be able to find a new victim pod on the domain of Pod#2...Pod#4. In this case, it has to preempt different 4 pods, but that is NOT unnecessary preemptions because workloadA won't be schedulable after Pod#2...Pod#4 are terminated.

Member Author

workloads can try preemption at those retries, but it should take all on-going preemption into consideration and should try not to make any unnecessary preemptions wisely.

Sure - that was my point (just not stated clearly). We need to take into account that the original placement may no longer be valid one.
But we shouldn't try to arbitrarily choose a different place that requires different preemptions if the original one is still valid - we're just waiting for preemptions to finish.
So as an example, if we previously preempted workloadA and once it is gone we will have a place to run our workload, then in the second attempt we shouldn't try to preempt workloadB running in a completely different place just because that space would be scored higher.
We should treat "triggered preemptions" as "free space" and trigger additional preemptions only if the already triggered ones are not enough.

Member

Yup, I think we're on the same page here.


So I think I'm on the same page, but then the question is whether we really need the GetResources plugin. I understand that it is necessary for the current implementation of gang scheduling, and according to this discussion it should actuate preemption + put the pod back in the queue (so it has a chance to take another spot if we get new free space). But in a world where we have a separate workload scheduling phase, which includes this workload preemption phase, should we actuate preemptions after this phase, together with putting the pods back in the queue? Or do we want this flow: calculate NNN and preemptions for all pods from the workload in workload scheduling/preemption -> run pod-by-pod scheduling to confirm placement -> when a pod reaches GetResources, trigger preemption and put it back in the queue? In that case the pod would go through the scheduling phase 2 times + 1 time through workload scheduling.

Member Author

The pod-by-pod scheduling provides a final confirmation that the placement works, so I think we want to have that at least in the foreseeable future.

I have updated the KEP to reflect that better.

Member Author

@sanposhiho @dom4ha @macsko @Argh4k - PTAL at the updated version

Comment on lines 271 to 272
// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
Member

The comment is unclear: what does it mean if it's false? Can it be preempted partially, or can it not be preempted at all?

Member Author

The switch to enum should make it cleaner now.


// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemtable *bool
Member

I feel like we should consider an enum here instead of a bool? Like preemptionPolicy: gang || never, and we can add new value(s) later, e.g., partially.

Member Author

I wouldn't introduce Never here explicitly - we can use PriorityClass for that where we are already able to represent these kinds of concepts:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/core/v1/types.go#L2794-L2801

But overall - the comment about switching to enum makes sense to me - for now we will have:
individual pod (default) and gang (better names needed) and eventually we can extend it further.

I will adjust that.
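
For reference, the existing PriorityClass API can already express the non-preempting case; a hedged sketch with purely illustrative values (the class name and number are made up):

```golang
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// neverPreemptingClass builds a PriorityClass whose workloads wait for
// capacity instead of preempting, which is why no extra "Never" value is
// needed in the new enum.
func neverPreemptingClass() *schedulingv1.PriorityClass {
	never := corev1.PreemptNever
	return &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "ci-batch"},
		Value:            100,
		Description:      "CI workloads: never preempt others, but may be preempted.",
		PreemptionPolicy: &never,
	}
}

func main() {
	fmt.Println(neverPreemptingClass().Name)
}
```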

Member Author

Switched to enum.

- it needs to preempt one of the already running workloads
- workload B has scheduling priority `med` but preemption cost `low`
- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
Member

Flyby comment: wouldn't it be sufficient to target lowest priority workloads possible and use cost only as a tie breaker?

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 873a281 to 6a2cc3f Compare December 12, 2025 13:16

const (
// PreemptionModePod means that individual pods can be preempted independently.
PreemptionModePod = "Pod"

When gang scheduling policy is configured with PreemptionModePod, is it correct to assume that pods are evicted individually until the PodGroup size drops below minCount, at which point the remaining pods are preempted collectively?


I don't think that's the case. I expect that those actually do not care about minCount and they either allow pod-by-pod preemption or preemption of the whole pod group, no matter what the size of the pod group is in relation to its minCount. Maybe the behavior you describe can be added as a third option?

Btw, there is an ongoing discussion about exact minCount semantics in workload scheduling doc #5730 (comment)

Member Author

Right - I expanded the comment trying to clarify this.

Member

It does seem like there's a middle ground between "any pod can be preempted, ignoring the atomicity of the pod group" and "preempt the whole podgroup". In fact it would seem pretty common that you might need to preempt a few pods from one group to make space (again thinking topological constraints) for a new workload, and that those pods from that podgroup would still need a place to start back up. In other words, the podgroup still needs to be able to be made complete, otherwise the cost metric should be "whole podgroup lost".

Member Author

Sure - I definitely agree that more policies will be needed.
I just don't want to invent every potential policy as part of this KEP - we can have follow-up KEPs (or extensions to this one) to introduce more policies - the API structure should facilitate this extensibility.

Comment on lines +503 to +505
1. For the remaining potential victims, using binary search across priorities find the minimal
priority N for which scheduling the preemptor can be achieved without preempting any victims
with priority higher than N. This allows to reduce the potential cascading preemptions later.

In theory, it is possible that the only feasible solution might require evicting pods with different priorities, which a binary search would miss because the problem is fundamentally combinatorial. This can happen when the preemptor has additional constraints, such as affinity, that can only be satisfied by removing a specific set of pods rather than those that naturally fall out of the current sort order. That said, it might cause cascading preemptions later on, so it ultimately comes down to a trade-off. I thought it was worth highlighting, even though I'm unsure whether a fully robust algorithm can be implemented within the current scheduling framework.

Member Author

In theory, it is possible that the only feasible solution might require evicting pods with different priorities, which a binary search would miss because the problem is fundamentally combinatorial.

Maybe the description is not clear, but a binary search at level X assumes "we preempt everything with priority X or lower". So if we need to evict pods of different priorities, we just try to minimize the max one, and that's what this binary search does. So that works.

That said, it might cause cascading preemptions later on, so it ultimately comes down to a trade-off.

Sure - we use heuristics like the above to minimize them, but we're not attempting to find the ideal solution.
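
A minimal sketch of that binary-search step, assuming a feasibility oracle and a toy capacity model (names and structure are illustrative, not the scheduler code):

```golang
package main

import (
	"fmt"
	"sort"
)

// feasibleFn reports whether the preemptor fits if every potential victim
// with priority <= p is assumed to be removed. In the real scheduler this
// would be a full (and expensive) feasibility check; here it is injected so
// the search itself can be shown in isolation.
type feasibleFn func(p int32) bool

// minimalPreemptionPriority finds the smallest priority level N such that
// removing all victims with priority <= N makes the preemptor schedulable.
// Victims above N are never touched, which limits cascading preemptions.
func minimalPreemptionPriority(victimPriorities []int32, feasible feasibleFn) (int32, bool) {
	if len(victimPriorities) == 0 {
		return 0, false
	}
	sort.Slice(victimPriorities, func(i, j int) bool { return victimPriorities[i] < victimPriorities[j] })
	// sort.Search returns the smallest index for which the predicate is true;
	// feasibility is monotone in the priority level, so binary search applies.
	i := sort.Search(len(victimPriorities), func(i int) bool {
		return feasible(victimPriorities[i])
	})
	if i == len(victimPriorities) {
		return 0, false // not schedulable even if all potential victims go away
	}
	return victimPriorities[i], true
}

func main() {
	// Toy example: the preemptor needs 3 "slots"; victims at priorities
	// 10, 10, 20, 30 each hold one slot. Removing everything at <= 20 suffices.
	victims := map[int32]int{10: 2, 20: 1, 30: 1}
	feasible := func(p int32) bool {
		freed := 0
		for prio, n := range victims {
			if prio <= p {
				freed += n
			}
		}
		return freed >= 3
	}
	fmt.Println(minimalPreemptionPriority([]int32{10, 10, 20, 30}, feasible)) // 20 true
}
```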

Comment on lines +340 to +341
There is one direct implication of the above - the `pod.Spec.PriorityClassName` and `pod.Spec.Priority`
may no longer reflect the actual pod priority, which could be misleading to users.
Member

How should the current users consume this? For example, Graceful Node Shutdown uses priorities to decide on the order of termination of pods. I suspect it is used in other node drain/maintenance scenarios such as picking the least disruptive node priority-wise.

Which priority should be the chosen one for disruption? Should it be Pod, Workload, or PodGroup?

Comment on lines 324 to 331
PriorityClassName *string

// PreemptionPriorityClassName, if specified, indicates the workload's
// priority that should be used when attempting to preempt this workload.
// If not specified, it will default to PriorityClassName.
//
// This field is mutable.
PreemptionPriorityClassName *string
Member

@atiratree atiratree Dec 16, 2025

As mentioned, PreemptionPriorityClassName could be useful in other scenarios (e.g. GNS). Maybe DisruptionPriorityClassName would be more fitting?

Comment on lines 415 to 420
<<[UNRESOLVED priority status]>>
We should introduce/describe `workload.status` to reflect:
- actual priority (a number) that is inferred from the priority class
- express the preemption priority that is currently used by scheduler to reflect that
it acknowledged the change
<<[/UNRESOLVED]>>
Member

@atiratree atiratree Dec 16, 2025

What should the authoritative source be for priorities for all kinds of disruptions?

Have we considered adding this information to the Pod's status? For example, .status.disruptionPriority which would be reconciled according to all the available information (Pod's .spec.PriorityClassName and .spec.priority, Workload's .spec.preemptionPriorityClassName and .spec.PodGroups[].preemptionPriorityClassName)

Another option would be to track the priorities for each PodGroup and the final Workload priority in the Workload status. For now, we could just copy the .spec.preemptionPriorityClassName resolved integer priority for both the pod groups and the final one. We could extend the space later for better granularity and strategies for computing the aggregated priority.

Introducing one of these options would improve observability and enable simulation of preemption and other disruption scenarios.
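
A sketch of the second option (field names are only illustrative of the suggestion above, not an agreed API):

```golang
package sketch

// WorkloadStatus sketches exposing the resolved integer priorities in the
// Workload status so that other disruption mechanisms can observe them.
type WorkloadStatus struct {
	// PreemptionPriority is the integer value resolved from
	// spec.preemptionPriorityClassName, as acknowledged by the scheduler.
	PreemptionPriority *int32
	// PodGroupPreemptionPriorities holds the resolved value per PodGroup,
	// keyed by PodGroup name.
	PodGroupPreemptionPriorities map[string]int32
}
```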

// priority that should be used when attempting to preempt this workload.
// If not specified, it will default to PriorityClassName.
//
// This field is mutable.
Member

Is there a pressing use case that we have to cover here? FYI, we have received a lot of negative feedback about the pod-deletion-cost feature over the years. One issue is that updating a cost (in this KEP's case a priority) is not scalable, especially when considering a large number of pods that are updated often. This issue should not be that pronounced in the Workload, since it groups a set of pods, but do we need to make this field mutable in alpha?

One case for mutability is periodic checkpointing to prevent disruption until it completes. Presumably, the priority will return back to normal afterward, and the workload will probably get disrupted.

An alternative to this is the EvictionRequest, which lets the interceptors complete the checkpointing. It is more versatile because it can also trigger checkpointing. Since EvictionRequest will be used in other disruption scenarios, it would be ideal if we could use the same mechanism. This needs more analysis, but we could add an observable disruption priority to the interceptors, for example.

In essence, setting a critical priority to PreemptionPriorityClassName is similar to using a blocking interceptor.

Member Author

  1. There are multiple potential use cases for deletion - the fact that we don't need to set it at the individual pod level finally makes that feasible.

  2. Regarding mutability - checkpointing is my primary use case. Using EvictionRequest for that is certainly an option, but even now we're not using the "evict" API for preemption, e.g. due to incompatibility with PDBs.
    I would definitely like to eventually use EvictionRequest, but I don't want to block on that to happen.

That said, we can definitely consider moving mutability to post-Alpha if there are doubts here.

Member

Thanks for considering it for post-Alpha. We will see how much progress we make with EvictionRequest in 1.36, and then we can decide the best way to make it all work together.

Member Author

After thinking more about it, I think this actually is a purely additive feature. Given that it adds non-negligible complexity, visibility requirements (having a clear status to reflect the state is a must-have), etc., I actually removed it from the scope of this KEP.
We should have a separate dedicated KEP for that.

So in this KEP, we're only introducing the preemptionPriority concept, but it is immutable.

1. If removing all potential victims would not make the preemptor schedulable, the preemptor
is unschedulable with preemption in currently considered domain D.

1. Sort all the potential victims to reflect their "importance" (from the most important to the
Member

I am curious about where the following scoreFuncs or other scoreFuncs come into action:

https://github.com/kubernetes/kubernetes/blob/c180d6762d7ac5059d9b50457cafb0d7f4cf74a9/pkg/scheduler/framework/preemption/preemption.go#L702-L714

Can we use the node scoring for the pod scoring as well, since we will consider all nodes at the same time for Workloads? I.e. run scoreFuncs on node from .spec.nodeName.

Member Author

These functions are specific in a way that they already assume that we computed victims for a given node.
So we can't really use them at this point.

Where they come into play is the "top-level 3 point" where we score different scheduling decisions.

I'm definitely open to a more sophisticated sorting algorithm here too, but let's postpone it until after alpha to prove the model (changing the sorting function is a relatively simple change).

PreemptionModePod = "Pod"
// PreemptionModePodGroup means that the whole PodGroup replica needs to be
// preempted together.
PreemptionModePodGroup = "PodGroup"
Member

We could use this grouping in the EvictionRequest as well. Please see https://github.com/atiratree/kube-enhancements/blob/evacuation-api/keps/sig-node/4563-eviction-request-api/README.md#workload-api-support.

Do you expect preemption-only behavior? Could we name this DisruptionMode?

Member Author

I don't expect preemption-specific behavior yet.

I think that renaming it to DisruptionMode makes sense, but for now I added it as "unresolved" point above. I will wait for potential concerns from others.

```golang
// PreemptionMode describes the mode in which a PodGroup can be preempted.
// +enum
type PreemptionMode string
Member

@atiratree atiratree Dec 16, 2025

Would it be useful to add support for preempting/disrupting all PodGroups at the same time?

Member Author

I don't think so.

PodGroups can be scheduled individually and I claim that if these are scheduled independently, they can also be preempted independently.
If they really should be preempted together, they should probably form a single gang.

Member

Okay, I think this answers the same question for the EvictionRequest disruption modes. We will most likely use the same resolution and either disrupt whole PodGroup or individual pods.

Contributor

How does a distributed program know whether it is being told to have one pod exit, or all pods? I suspect this varies by framework and application.

According to this doc, torchrun forwards sigterm to all other processes:
pytorch/pytorch#154849

For torch elastic, things seem more complex: pytorch/pytorch#67742

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 3854bbb to 305bad0 Compare December 17, 2025 15:09
Copy link
Member Author

@wojtek-t wojtek-t left a comment


I will try to respond to the remaining comments (and do a bit more adjustments to the KEP) tomorrow.

- Define the scheduler changes needed to implement workload-aware preemption
- Provide full backward compatibility for all existing scheduling features

### Non-Goals
Copy link
Member Author


Right - we were considering the concept of PodSubGroup. I added the "Potential future extensions" section and sketched how this can be achieved later (as well as some other potential stuff).


const (
// PreemptionModePod means that individual pods can be preempted independently.
PreemptionModePod = "Pod"
Copy link
Member Author


Right - I expanded the comment trying to clarify this.

Comment on lines 264 to 287
type GangSchedulingPolicy struct {
// Existing field(s).

// IsGangPreemptable defines whether all pods from this group should
// be preempted in all-or-nothing fashion.
IsGangPreemptable *bool
}
Copy link
Member Author


Does PodGroup indicate a scheduling unit within the Workload API?

The scheduling unit can be a PodGroup (in the case of gangs), but it can also be individual pods.

Let's say I have a JobSet with two ReplicatedJobs that run in sequence (Initializer + Trainer). Trainer depends on Initializer completion.

The following structure doesn't reflect the fact that Trainer depends on Initializer completion - these will all be attempted to be scheduled at the same time, or it may even happen that the Trainer will be scheduled and the Initializer won't be.

The Workload API as currently designed doesn't have a reasonable place to reflect sequencing. We will either need to design a higher-level concept (WorkloadSequence) that builds on top of Workload and e.g. Reservation, or visibly reshape Workload anyway to accommodate it.
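To make the limitation concrete, a rough sketch (with assumed, simplified types that are not the real Workload API) of the Initializer + Trainer example flattened into one Workload - note that nothing in this shape can express that Trainer must wait for Initializer:

```golang
package main

import "fmt"

// Hypothetical, simplified shapes for illustration only - these do not
// necessarily match the real Workload API field names.
type PodGroup struct {
	Name     string
	Replicas int32
	MinCount int32
}

type Workload struct {
	Name      string
	PodGroups []PodGroup
}

func main() {
	// A JobSet with Initializer + Trainer flattened into one Workload: nothing
	// here says "Trainer starts only after Initializer completes", so both
	// groups would be treated as independently schedulable units.
	w := Workload{
		Name: "training-job",
		PodGroups: []PodGroup{
			{Name: "initializer", Replicas: 1, MinCount: 1},
			{Name: "trainer", Replicas: 16, MinCount: 16},
		},
	}
	fmt.Printf("%+v\n", w)
}
```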

// priority that should be used when attempting to preempt this workload.
// If not specified, it will default to PriorityClassName.
//
// This field is mutable.
Copy link
Member Author


  1. There are multiple potential use cases for deletion - the fact that we don't need to set it at the individual pod level finally makes that feasible.

  2. Regarding mutability - checkpointing is my primary use case. Using EvictionRequest for that is certainly an option, but even now we're not using the "evict" API for preemption, e.g. due to incompatibility with PDBs.
    I would definitely like to eventually use EvictionRequest, but I don't want to block on that happening.

That said, we can definitely consider moving mutability to post-Alpha if there are doubts here.

Why should this KEP _not_ be implemented?
-->

## Alternatives
Copy link
Contributor


Comment on lines 601 to 612
For pods being part of a workload, `PermitDisruptions` will be implemented by the already
existing `GangScheduling` plugin. The implementation will be a sibling to the existing
`Permit` extension point - the plugin will be waiting for at least `gang.minPods` to be
successfully nominated (or assumed) and only after satisfying this condition the preemptions
will be actuated.
Copy link
Member


Couldn't PermitDisruptions just be called at the end of the workload scheduling phase, when it sees the workload can't be scheduled? Then, the actuation could be left to the DefaultPreemption plugin that computed the victims. I think this would simplify the logic.

Copy link
Member Author


Let me rephrase it because something doesn't parse for me in this comment.

  1. PermitDisruptions conceptually should be called when (a) we can't schedule a workload/pod without preemption (b) we call PostFilter plugins (c) PostFilter lets us find a placement for the workload/pod

  2. So did you mean "[...] when it sees the workload can be scheduled?" instead?

If so, I would clarify it to "can be scheduled with preemption" and that conceptually makes sense.

So what we would do is effectively couple "PermitDisruptions" with "PostFilter" so that:

  1. PermitDisruptions is called only if PostFilter was called before and it allowed for a successful placement

If so - that conceptually works and I thought about it, but there are two drawbacks:
(a) it introduces a dependency on the WorkloadSchedulingCycle - this doesn't work in the current model, as we still need to coordinate across different pods.
(b) the computed placement can still be invalidated by pod-by-pod processing. But that should probably be fairly rare and is probably ok.

So I think both drawbacks are acceptable - I would be ok with that if that's what you had in mind.
If you thought about something different, can you please clarify?

Copy link
Member


If so, I would clarify it to "can be scheduled with preemption" and that conceptually makes sense.

Right, I meant when the workload can be scheduled, but with preemption.

(a) it introduces a dependency on the WorkloadSchedulingCycle - this doesn't work in the current model, as we still need to coordinate across different pods.

I'm not sure if I see your point. What dependency do you have in mind?

I meant that for the workload pods in the workload scheduling cycle we could do:

for each pod from a pod group run { Filter -> Score -> Reserve } or { Filter -> PostFilter -> Reserve } (in case of preemption)
after that, if we need any preemptions, call PermitDisruptions and put the pod group back to the queue

For non-workload pods, in the standard scheduling cycle, do:

Filter -> PostFilter -> PermitDisruptions -> put it back to the queue
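A rough, self-contained sketch of the workload-cycle flow described above; every function and type name here is invented for illustration and does not exist in the scheduler:

```golang
package main

import "fmt"

type Pod struct{ Name string }

// Stubbed stand-ins for the real Filter/Score/PostFilter plugins; they just
// pretend the first pod fits and the remaining ones need preemption.
func filterAndScore(p *Pod) (string, bool) { return "node-a", p.Name == "pod-0" }
func postFilter(p *Pod) (string, bool)     { return "node-b", true }
func reserve(p *Pod, node string)          { fmt.Println("reserved", p.Name, "on", node) }
func permitDisruptions(pods []*Pod)        { fmt.Println("actuating preemptions for", len(pods), "pods") }
func requeue(pods []*Pod)                  { fmt.Println("requeueing", len(pods), "pods") }

// scheduleGroup: per pod, run Filter -> Score -> Reserve (or Filter ->
// PostFilter -> Reserve when preemption is needed); afterwards, call
// PermitDisruptions once for the whole group and put it back into the queue.
func scheduleGroup(group []*Pod) {
	needsPreemption := false
	for _, p := range group {
		if node, ok := filterAndScore(p); ok {
			reserve(p, node)
			continue
		}
		if node, ok := postFilter(p); ok {
			reserve(p, node)
			needsPreemption = true
		}
	}
	if needsPreemption {
		permitDisruptions(group)
		requeue(group)
	}
}

func main() {
	scheduleGroup([]*Pod{{Name: "pod-0"}, {Name: "pod-1"}, {Name: "pod-2"}})
}
```

The non-workload path would have the same shape, just with PermitDisruptions invoked for a single pod after PostFilter.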

Copy link
Member Author


Discussed offline - this is exactly the dependency that I was pointing out.
However, we're probably ok with taking this dependency so I will adjust the proposal to reflect that (mentioning the drawbacks explicitly)

Copy link
Member Author


I've updated the design to reflect it.

`Permit` extension point - the plugin will be waiting for at least `gang.minPods` to be
successfully nominated (or assumed) and only after satisfying this condition the preemptions
will be actuated.
Note that as part of this change, we should treat an arbitrary preemption victim as effectively
Copy link


So do I understand correctly that in pod-by-pod scheduling, we will have some pods from the PodGroup failing to be scheduled and going to PermitDisruptions, where they will wait until (assumed pods + pods in PermitDisruptions) > min count for the PodGroup, and once that is reached we will actuate the disruptions and put the pods back in the unschedulable queue (keeping the NNN)?

I am thinking about one case. What if more space becomes available during pod-by-pod scheduling, but still not enough to fit the whole workload? Some pods will land in PermitDisruptions and we will trigger the preemption for the PodGroup, but it can contain some unnecessary preemptions. I guess we will have to trim the proposed pod group preemptions to only the preemptions related to pods that eventually landed in PermitDisruptions?
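For what it's worth, a tiny sketch of the gate being described in this question; the function name and the exact comparison are assumptions made purely for illustration, not the KEP's design:

```golang
package main

import "fmt"

// Hypothetical illustration of the gate described above: preemptions for a
// pod group are only actuated once enough of its pods are either assumed or
// waiting in PermitDisruptions.
func shouldActuate(assumedPods, waitingInPermitDisruptions, minCount int) bool {
	return assumedPods+waitingInPermitDisruptions >= minCount
}

func main() {
	fmt.Println(shouldActuate(3, 1, 4)) // true: 3 assumed + 1 waiting reaches minCount=4
}
```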

Copy link
Member Author


I'm not sure I'm following, but with the updated version this is clearly no longer a relevant concern.

- Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it.
If we decide to change that it will be addressed in a dedicated KEP.
- Propose any tradeoff between preemption and cluster scale-up.
- Design workload-level preemption triggerred by external schedulers
Copy link
Member


s/triggerred/triggered/

will result in preempting another workload, the cost of that second preemption should be
included in the cost of the original preemption.

The implication of the second one is that we should always try to avoid cascading preemptions.
Copy link
Member


Interesting. So, this seems to suggest that the decision to preempt is made on a single workload rather than possibly multiple workloads. Instead, I would think of preemption as one tool to satisfy the scheduling of a higher priority workload when lower priority ones are consuming resources. That may involve preempting (and rescheduling) parts of several workloads to meet, for example, the topology constraints of the higher priority workload.

I guess I am concerned that trying to implement "workload aware preemption" from the point of view of the workload being preempted is the wrong approach. Instead, I would expect that the scheduling cycle for the high priority workload would emit different plans (sort of like the placements @44past4 discusses, but with broader actions), and we would choose the lowest cost plan. That plan would take into account multiple workloads of varying priorities (this violates the "non-goal" of not worrying about rescheduling, btw). For example, if you had workloads A, B, C with priorities in that order, and A had a "rack" topology constraint, you may end up with a plan like:

  • Preempt 20 pods from workload C across 5 different racks to free up devices for workload B. These workload C pods will sit in pending until more resources free up.
  • Move 20 pods from workload B out of rack 100 and spread them across the five racks freed up above
  • Schedule 40 pods from workload A that has a rack topology constraint onto the rack freed up by B pods

Is the goal of this KEP just to define how workloads can advertise their preemption policies and constraints, plus maybe some individual, localized decision making that can eventually roll up into a larger plan like that via different KEPs? If so, I can see it as a step towards a smarter planner that doesn't use "cascading preemptions" but instead constructs a multi-step plan. Maybe this will become clear as I read further.
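Purely to visualize what such a multi-step plan could look like as data - nothing like this exists in the KEP or the scheduler, and every name below is made up:

```golang
package main

import "fmt"

// planStep is a purely illustrative representation of one action in a
// multi-step placement plan like the one sketched above.
type planStep struct {
	action   string // "preempt", "move", or "schedule"
	workload string
	pods     int
	detail   string
}

func main() {
	plan := []planStep{
		{action: "preempt", workload: "C", pods: 20, detail: "across 5 racks to free devices for B"},
		{action: "move", workload: "B", pods: 20, detail: "out of rack 100 into the freed racks"},
		{action: "schedule", workload: "A", pods: 40, detail: "onto rack 100, satisfying its rack constraint"},
	}
	for i, s := range plan {
		fmt.Printf("%d. %s %d pods of workload %s: %s\n", i+1, s.action, s.pods, s.workload, s.detail)
	}
}
```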

Copy link
Member Author


Instead, I would think of preemption as one tool to satisfy the scheduling of a higher priority workload when lower priority ones are consuming resources. That may involve preempting (and rescheduling) parts of several workloads to meet, for example, the topology constraints of the higher priority workload.

+100 to this - I don't think anything in the KEP is contradicting this.
Scheduling a workload may require preempting (potentially partially) multiple workloads.
The only point that we're making here is that when there are multiple options to achieve it, we should try to take into account the cascading effect of the decision and factor it into the prioritization.

I guess I am concerned that trying to implement "workload aware preemption" from the point of view of the workload being preempted is the wrong approach.

Let me clarify this principle - the goal is always scheduling the preemptor; the question is what we should preempt if there are multiple options, and that is what the principle above is about.

Is the goal of this KEP just to define how workloads can advertise their preemption policies and constraints, plus maybe some individual, localized decision making that can eventually roll up into a larger plan like that via different KEPs?

Right - so that is stated in the goals - let me clarify it a bit better.
The goal is the foundation and the concepts. The actual scoring will be fairly naive and improvements to it are not the goal of this KEP.

Copy link
Member Author


I tried to update the framing in the KEP.

same priority. We may decide to relax that assumption in the future follow-up enhancement.
1. However, we start with separating the concepts of scheduling and preemption priorities from
the very beginning. The first one is simple generalization of pod priority concept. The
later allow for dynamically adjust the expected cost of preemption of a given workload.
Copy link
Member


s/later allow/latter allows/

+1 to this separation
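For illustration, a hypothetical sketch of what this separation plus the defaulting described in the KEP ("if not specified, it defaults to PriorityClassName") could look like; the field names are assumptions, not the final API:

```golang
package main

import "fmt"

// Hypothetical sketch of the two-priority separation discussed above;
// field names are illustrative, not the final Workload API.
type WorkloadSpec struct {
	// PriorityClassName drives where the workload sits in the scheduling
	// queue and whom it may preempt (the generalization of pod priority).
	PriorityClassName string
	// PreemptionPriorityClassName drives how expensive this workload is to
	// preempt; nil means "same as PriorityClassName". Being mutable, it can
	// be lowered e.g. once a checkpoint has been taken.
	PreemptionPriorityClassName *string
}

// effectivePreemptionClass shows the defaulting described in the KEP text.
func effectivePreemptionClass(s WorkloadSpec) string {
	if s.PreemptionPriorityClassName != nil {
		return *s.PreemptionPriorityClassName
	}
	return s.PriorityClassName
}

func main() {
	low := "low-after-checkpoint"
	fmt.Println(effectivePreemptionClass(WorkloadSpec{PriorityClassName: "high"}))                                    // high
	fmt.Println(effectivePreemptionClass(WorkloadSpec{PriorityClassName: "high", PreemptionPriorityClassName: &low})) // low-after-checkpoint
}
```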


const (
// PreemptionModePod means that individual pods can be preempted independently.
PreemptionModePod = "Pod"
Copy link
Member


It does seem like there's a middle ground between "any pod can be preempted, ignoring the atomicity of the pod group" and "preempt the whole podgroup". In fact, it would seem pretty common that you might need to preempt a few pods from one group to make space (again, thinking of topological constraints) for a new workload, and that those pods from that podgroup would still need a place to start back up. In other words, the podgroup still needs to be able to be made complete again; otherwise the cost metric should be "whole podgroup lost".

It's worth mentioning here, that we want to introduce the same defaulting rules for
`workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
and there exists PriorityClass marked as `globalDefault`, we default it to that value.
This consistency will allow us to properly handle when users are not setting neither pods
Copy link
Member


s/users are not setting/users set/


Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
set at the pod level may not reflect priority used for preemption). We argue that its mutable
nature makes it infeasible for reconsiling this information back to pods for scalability reasons
Copy link
Member


s/reconsiling/reconciling/

- workload C has scheduling priority `low` but preemption cost `high`
In such case, the preemption cost would result in choosing workload B for preemption. But
if it gets recreated, it will preempt workload C causing unnecessary cascading preemption.
This is the reason why a cost-based model was discarded.
Copy link
Member


I don't think we need to assume the preemption decision is simply a "priority, then cost" decision; it could in fact be some function of them. I guess that's what you mean by "scoring". I think when you combine "cost" with non-isolated decisions, you can get a better result. By isolated, I mean not considering A, B, and C all in the same scheduling decision for "A", but instead just pairwise decisions of "A" and "B" vs "A" and "C". From what I am understanding, the plan is to consider only the pairwise options; I think cascading preemptions may be inevitable in that case (or we may have to severely limit utility).
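To make the distinction concrete, a toy comparison of a strict "priority, then cost" ordering versus a single combined scoring function; the types, numbers, and weights are all hypothetical:

```golang
package main

import "fmt"

// victim is a hypothetical candidate for preemption.
type victim struct {
	name     string
	priority int32   // preemption priority of the victim workload
	cost     float64 // estimated disruption cost (e.g. pods lost, restart time)
}

// lessStrict implements "priority, then cost": always pick the lowest-priority
// victim, using cost only as a tie-breaker.
func lessStrict(a, b victim) bool {
	if a.priority != b.priority {
		return a.priority < b.priority
	}
	return a.cost < b.cost
}

// lessCombined folds both signals into one score, so a slightly higher-priority
// victim can still be preferred if it is much cheaper to disrupt.
func lessCombined(a, b victim) bool {
	score := func(v victim) float64 { return float64(v.priority)*10 + v.cost }
	return score(a) < score(b)
}

func main() {
	b := victim{name: "workload-B", priority: 5, cost: 100} // medium priority, cheap to preempt
	c := victim{name: "workload-C", priority: 4, cost: 900} // low priority, very expensive to preempt
	fmt.Println(lessStrict(c, b))   // true: C preferred purely on priority
	fmt.Println(lessCombined(b, c)) // true: B preferred once cost is folded in
}
```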

disruption.
Note that this problem already exists in the current gang scheduling implementation. A given gang may
not proceed with binding if the `minCount` pods from it can't be scheduled. But the preemptions are
currently triggerred immediately after choosing a place for individual pods. So similarly as above,
Copy link
Member


s/triggerred/triggered/

feasible (e.g. because higher priority pods were scheduled in the meantime).

The rationale behind the above design is to maintain the current scheduling property where preemption
doesn't result in a committment for a particular placement. If a different possible placement appears
Copy link
Member


s/committment/commitment/

@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from 305bad0 to e8c515f Compare December 19, 2025 08:39
@wojtek-t wojtek-t force-pushed the workload_aware_preemption branch from e8c515f to 24ca8cf Compare December 19, 2025 10:02
Copy link
Member Author

@wojtek-t wojtek-t left a comment


I think that I addressed all the comments/concerns brought up so far. PTAL

@k8s-ci-robot
Copy link
Contributor

@wojtek-t: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 30b916a | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
