Commit 99db7f3

Improved delayed preemption design
1 parent 6a2cc3f commit 99db7f3

File tree

1 file changed

+36
-12
lines changed
  • keps/sig-scheduling/5710-workload-aware-preemption

keps/sig-scheduling/5710-workload-aware-preemption/README.md

Lines changed: 36 additions & 12 deletions
@@ -571,18 +571,42 @@ The last one would probably allow for the best unification.
571571
<<[/UNRESOLVED]>>
572572
```
573573
574-
1. Introduce a new extension point to the Plugins - tentatively call it: `GetResources`.
575-
Initially, it will only be implemented by the `DefaultPreemption` plugin and it will
576-
actuate the preemptions. Longer term, the exact same mechanism can be used to provision
577-
the new capacity (nodes, devices, ...) in the cluster.
578-
579-
1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
580-
`WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
581-
before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
582-
`GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
583-
work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
584-
already available to use. The implementation will work similarly to `WaitOnPermit` to
585-
ensure that `GetResources` was executed for all pods from within a `PodGroup`.
574+
1. Introduce a new extension point to the Plugins - tentatively call it `PermitDisruption`.
575+
The idea behind it can be thought of as an extension of the scheduling vs binding cycle - we
576+
want to introduce a clear distinction between places where potentially disruptive decisions
577+
should be made and where these should be actuated.
578+
In particular, `PostFilter` should remain the point where preemption decisions are
579+
made, but their actuation should be moved to the new `PermitDisruption`.
580+
581+
1. The framework implementation will be adjusted so that `PermitDisruption` plugins
582+
will be called whenever `schedulingCycle` fails. The input to this plugin will include
583+
the `nomination` for this pod.
584+
585+
For an individual pod (not part of a workload), `PermitDisruption` will be implemented
586+
by the `DefaultPreemption` plugin and will simply actuate the previously computed
587+
preemption victims for a given pod.
588+
589+
For pods that are part of a workload, `PermitDisruption` will be implemented by the already
590+
existing `GangScheduling` plugin. The implementation will be a sibling to the existing
591+
`Permit` extension point - the plugin will wait for at least `gang.minPods` pods to be
592+
successfully nominated (or assumed) and only after satisfying this condition the preemptions
593+
will be actuated.
594+
Note that as part of this change, we should treat an arbitrary preemption victim as effectively
595+
blocking any pod from being scheduled. While this isn't a hard requirement, it doesn't introduce
596+
restrictions on already assumed pods if a different placement can be found in subsequent
597+
scheduling cycles.
598+
599+
1. To reduce the number of unnecessary preemptions, in case a preemption has already been triggered
600+
and the already nominated placement remains valid, no new preemptions can be triggered.
601+
In other words, a different placement can be chosen in subsequent scheduling phases only if
602+
it doesn't require additional preemptions or the previously chosen placement is no longer
603+
feasible (e.g. because higher priority pods were scheduled in the meantime).
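
The rule in the last step can be read as a small predicate. The following is a minimal sketch in Go, not the proposed implementation: the `candidate` type, its fields, and the `mayChoose` function are hypothetical names introduced only for illustration and are not part of the scheduler framework.

```go
package main

import "fmt"

// candidate is a hypothetical summary of one possible placement for a pod.
type candidate struct {
	node              string
	additionalVictims int // pods that would have to be preempted beyond those already triggered
}

// mayChoose reports whether a later scheduling attempt may pick c.
// Assumption: the caller already knows whether a preemption was triggered for
// the pod and whether its previously nominated placement is still feasible.
func mayChoose(preemptionTriggered, nominatedStillFeasible bool, c candidate) bool {
	if !preemptionTriggered || !nominatedStillFeasible {
		// No earlier preemption to protect, or the old placement is gone.
		return true
	}
	// The old placement is still valid: only placements that need no
	// additional preemptions are allowed.
	return c.additionalVictims == 0
}

func main() {
	free := candidate{node: "node-b", additionalVictims: 0}
	costly := candidate{node: "node-c", additionalVictims: 2}

	fmt.Println(mayChoose(true, true, free))    // true
	fmt.Println(mayChoose(true, true, costly))  // false
	fmt.Println(mayChoose(true, false, costly)) // true
}
```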
604+
605+
The rationale behind the above design is to maintain the current scheduling property where preemption
606+
doesn't result in a commitment to a particular placement. If a different possible placement appears
607+
in the meantime (e.g. due to other pods terminating or new nodes appearing), subsequent scheduling
608+
attempts may pick it up, improving the end-to-end scheduling latency. Returning pods to the scheduling
609+
queue if these need to wait for preemption to become schedulable maintains that property.
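
The gang behavior described for `PermitDisruption` (wait until at least `gang.minPods` pods are nominated or assumed, then actuate the preemptions) might be sketched as follows. This is an illustration under stated assumptions: `gangGate`, `newGangGate`, and `permitDisruption` are hypothetical names, pods and victims are represented as plain strings, and none of this reflects the real plugin interfaces.

```go
package main

import (
	"fmt"
	"sort"
)

// gangGate is a hypothetical stand-in for the state a gang plugin might keep
// per PodGroup; it is not a type from the scheduler framework.
type gangGate struct {
	minPods   int                 // stand-in for gang.minPods
	nominated map[string][]string // nominated or assumed pod -> victims chosen for it
	preempted map[string]bool     // victims whose preemption was already actuated
}

func newGangGate(minPods int) *gangGate {
	return &gangGate{
		minPods:   minPods,
		nominated: map[string][]string{},
		preempted: map[string]bool{},
	}
}

// permitDisruption records a nomination (an assumed pod has no victims) and
// returns the victims to preempt now. It returns nil while fewer than
// minPods pods of the gang have been nominated or assumed.
func (g *gangGate) permitDisruption(pod string, victims []string) []string {
	g.nominated[pod] = victims
	if len(g.nominated) < g.minPods {
		return nil // keep waiting; nothing is disrupted yet
	}
	var out []string
	for _, vs := range g.nominated {
		for _, v := range vs {
			if !g.preempted[v] {
				g.preempted[v] = true // never actuate the same victim twice
				out = append(out, v)
			}
		}
	}
	sort.Strings(out) // deterministic order, since map iteration is unordered
	return out
}

func main() {
	g := newGangGate(3)
	fmt.Println(g.permitDisruption("worker-0", []string{"low-a"}))
	fmt.Println(g.permitDisruption("worker-1", nil))
	fmt.Println(g.permitDisruption("worker-2", []string{"low-b", "low-a"}))
}
```

In this sketch no victim is returned for the first two calls; only the third call, which reaches the threshold, returns the deduplicated victims for the whole gang.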
586610
587611
[Kubernetes Scheduling Races Handling]: https://docs.google.com/document/d/1VdE-yCre69q1hEFt-yxL4PBKt9qOjVtasOmN-XK12XU/edit?resourcekey=0-KJc-YvU5zheMz92uUOWm4w
588612
