@@ -571,18 +571,42 @@ The last one would probably allow for the best unification.
571571<<[ /UNRESOLVED] >>
572572```
573573
574- 1. Introduce a new extension point to the Plugins - tentatively call it: `GetResources`.
575- Initially, it will only be implemented by the `DefaultPreemption` plugin and it will
576- actuate the preemptions. Longer term, the exact same mechanism can be used to provision
577- the new capacity (nodes, devices, ...) in the cluster.
578-
579- 1. Extend `SchedulingFramework` with two new steps: `RunGetResourcesPlugins` and
580- `WaitForGetResources`. These will be called immediately after `WaitOnPermit` phase and
581- before running `RunPreBindPlugins`. The `RunGetResourcesPlugins` will simply be calling
582- `GetResources` methods from all plugins implementing it. And `WaitForGetResources` will
583- work similarly to `WaitOnPermit`, serving as a barrier to ensure all the resources are
584- already available to use. The implementation will work similarly to `WaitOnPermit` to
585- ensure that `GetResources` was executed for all pods from within a `PodGroup`.
574+ 1. Introduce a new extension point to the Plugins - tentatively call it `PermitDisruption`.
575+ The idea behind it can be thought of as an extension of the scheduling vs binding cycle split - we
576+ want to clearly separate the places where potentially disruptive decisions are made
577+ from the places where they are actuated.
578+ In particular, `PostFilter` should remain the point where preemption decisions are
579+ made, but their actuation should be moved to the new `PermitDisruption`.
580+
581+ 1. The framework implementation will be adjusted so that `PermitDisruption` plugins
582+ are called whenever `schedulingCycle` fails. The input to this plugin will include
583+ the `nomination` for this pod.
584+
585+ For an individual pod (not part of a workload), `PermitDisruption` will be implemented
586+ by the `DefaultPreemption` plugin and will simply actuate the previously computed
587+ preemption victims for the given pod.
588+
589+ For pods that are part of a workload, `PermitDisruption` will be implemented by the already
590+ existing `GangScheduling` plugin. The implementation will be a sibling to the existing
591+ `Permit` extension point - the plugin will wait for at least `gang.minPods` to be
592+ successfully nominated (or assumed), and only once this condition is satisfied will the preemptions
593+ be actuated.
594+ Note that as part of this change, we should treat an arbitrary preemption victim as effectively
595+ blocking any pod from being scheduled. While this isn't a hard requirement, it doesn't introduce
596+ restrictions on already assumed pods if a different placement can be found in subsequent
597+ scheduling cycles.
598+
599+ 1. To reduce the number of unnecessary preemptions, in case a preemption has already been triggered
600+ and the already nominated placement remains valid, no new preemptions can be triggered.
601+ In other words, a different placement can be chosen in a subsequent scheduling cycle only if
602+ it doesn't require additional preemptions or the previously chosen placement is no longer
603+ feasible (e.g. because higher priority pods were scheduled in the meantime).
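
The gang behaviour described in the second step can be sketched as a small barrier: nominations are collected per pod, and victims are only released for preemption once at least `gang.minPods` pods have been nominated. This is a minimal illustration, not the proposed implementation; the `Nomination` and `GangBarrier` types and the `PermitDisruption` method signature are assumptions made for this example and do not come from the scheduler framework.

```go
package main

import (
	"fmt"
	"sort"
)

// Nomination is a hypothetical record of what PostFilter decided for one pod:
// the nominated node and the victims that would have to be preempted.
type Nomination struct {
	Pod     string
	Node    string
	Victims []string
}

// GangBarrier is an illustrative stand-in for gang-level state; it is not the
// GangScheduling plugin's actual data structure.
type GangBarrier struct {
	minPods  int
	pending  map[string]Nomination // latest nomination per pod
	actuated map[string]bool       // victims already released for preemption
}

func NewGangBarrier(minPods int) *GangBarrier {
	return &GangBarrier{
		minPods:  minPods,
		pending:  map[string]Nomination{},
		actuated: map[string]bool{},
	}
}

// PermitDisruption records a nomination and returns the victims to preempt now.
// It returns nil until at least minPods pods of the gang have been nominated.
func (g *GangBarrier) PermitDisruption(n Nomination) []string {
	g.pending[n.Pod] = n
	if len(g.pending) < g.minPods {
		return nil
	}
	var victims []string
	for _, nom := range g.pending {
		for _, v := range nom.Victims {
			if !g.actuated[v] {
				g.actuated[v] = true
				victims = append(victims, v)
			}
		}
	}
	sort.Strings(victims) // map iteration order is random; sort for stable output
	return victims
}

func main() {
	g := NewGangBarrier(2)
	fmt.Println(g.PermitDisruption(Nomination{Pod: "a", Node: "n1", Victims: []string{"v1"}}))
	fmt.Println(g.PermitDisruption(Nomination{Pod: "b", Node: "n2", Victims: []string{"v2", "v1"}}))
}
```

The first call releases nothing because only one of the two required pods has been nominated; the second call crosses the threshold and releases each victim once. Tracking already-released victims is a design choice of this sketch so that later nominations do not trigger the same preemption twice.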
604+
605+ The rationale behind the above design is to maintain the current scheduling property where preemption
606+ doesn't result in a commitment to a particular placement. If a different possible placement appears
607+ in the meantime (e.g. due to other pods terminating or new nodes appearing), subsequent scheduling
608+ attempts may pick it up, improving the end-to-end scheduling latency. Returning pods to the scheduling
609+ queue when they need to wait for preemption to become schedulable maintains that property.
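
The placement rule from the third step, which the rationale above relies on, reduces to a small decision function. The sketch below is only an illustration under stated assumptions: `Placement`, its `NewPreemptions` field, and `ChoosePlacement` are hypothetical names, and the candidate placement is assumed to have already passed feasibility checks.

```go
package main

import "fmt"

// Placement is a hypothetical summary of a candidate placement for a pod.
type Placement struct {
	Node string
	// NewPreemptions counts victims beyond those already triggered for the
	// current nomination.
	NewPreemptions int
}

// ChoosePlacement decides whether a later scheduling cycle may switch from the
// currently nominated placement (for which preemption was already triggered)
// to a different candidate.
func ChoosePlacement(current Placement, currentFeasible bool, candidate Placement) Placement {
	if !currentFeasible {
		// The earlier nomination is gone, e.g. higher priority pods took the
		// space, so new preemptions are allowed again.
		return candidate
	}
	if candidate.NewPreemptions == 0 {
		// Switching costs no additional disruption, so take the new placement.
		return candidate
	}
	// Otherwise keep waiting for the preemption that was already triggered.
	return current
}

func main() {
	current := Placement{Node: "n1"}
	fmt.Println(ChoosePlacement(current, true, Placement{Node: "n2", NewPreemptions: 1}).Node)
	fmt.Println(ChoosePlacement(current, true, Placement{Node: "n3", NewPreemptions: 0}).Node)
	fmt.Println(ChoosePlacement(current, false, Placement{Node: "n2", NewPreemptions: 1}).Node)
}
```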
586610
587611[Kubernetes Scheduling Races Handling]: https://docs.google.com/document/d/1VdE-yCre69q1hEFt-yxL4PBKt9qOjVtasOmN-XK12XU/edit?resourcekey=0-KJc-YvU5zheMz92uUOWm4w
588612