
Commit e8c515f

Few proposed actions in unresolved sections as plan of record
1 parent 99db7f3 commit e8c515f

File tree

1 file changed (+121 -53 lines)
  • keps/sig-scheduling/5710-workload-aware-preemption


keps/sig-scheduling/5710-workload-aware-preemption/README.md

Lines changed: 121 additions & 53 deletions
@@ -21,6 +21,7 @@
 - [Workload priorities](#workload-priorities)
 - [Preemption algorithm](#preemption-algorithm)
 - [Delayed preemption](#delayed-preemption)
+- [Potential future extensions](#potential-future-extensions)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
@@ -130,23 +131,25 @@ and many others) and bring the true value for every Kubernetes user.
 
 - Define API for describing units of preemption within a workload
 - Define API for describing priority of a preemption unit
-- Describe the principles and semantics of workload-aware preemption
+- Describe the principles and semantics of workload-aware preemption
+- Define the base preemption policies
 - Define the scheduler changes needed to implement workload-aware preemption
 - Provide full backward compatibility for all existing scheduling features
 
 ### Non-Goals
 
 - Change the way how individual pods (not being part of workloads) are preempted
 - Provide the most optimal preemption algorithm from day 1
-- Address arbitrary preemption policies
+- Address arbitrary preemption policies (more preemption policies will be needed,
+  but these should be added in follow-up KEPs)
 - Introduce workload-awareness for handling different kinds of disruptions
   (e.g. caused by hardware failures)
 - Design rescheduling for workloads that will be preempted (rescheduling will
   be addressed in a separate dedicated KEP)
 - Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it.
   If we decide to change that it will be addressed in a dedicated KEP.
- Propose any tradeoff between preemption and cluster scale-up.
-- Design workload-level preemption triggerred by external schedulers
+- Design workload-level preemption triggered by external schedulers
 
 ## Proposal
 
@@ -164,7 +167,10 @@ Disruption itself is never desired and this defines our core principles.
   will result in preempting another workload, the cost of that second preemption should be
   included in the cost of the original preemption.
 
-The implication of the second one is that we should always try to avoid cascading preemptions.
+While cascading preemptions are inevitable in some cases (e.g. if a high-priority preemptor workload
+has very strict placement requirements), in general, if there are multiple options for scheduling
+a higher-priority workload with preemptions, some of which are expected to cause cascading
+preemptions and some of which are not, the latter should be chosen.
 
 ### High-level approach
 
@@ -181,7 +187,8 @@ pieces of the solution and discuss them in more detail in the following sections
    same priority. We may decide to relax that assumption in the future follow-up enhancement.
 1. However, we start with separating the concepts of scheduling and preemption priorities from
    the very beginning. The first one is simple generalization of pod priority concept. The
-   later allow for dynamically adjust the expected cost of preemption of a given workload.
+   latter reflects the consequences of preemption and will eventually allow us to dynamically
+   adjust those consequences over time.
 1. We start with a simple sub-optimal preemption algorithm that is based on the existing
    pod preemption algorithm used by kube-scheduler.
 1. We introduce a mechanism of "delayed preemption" to postpone actuation of preemption
@@ -231,17 +238,34 @@ This might be a good place to talk about core concepts and how they relate.
 
 ### Risks and Mitigations
 
-<!--
-What are the risks of this proposal, and how do we mitigate? Think broadly.
-For example, consider both security and how this will impact the larger
-Kubernetes ecosystem.
+1. Extensibility - what is proposed in this KEP will clearly not be the final step and we
+   will keep evolving it. How can we ensure that we don't paint ourselves into a corner?
 
-How will security be reviewed, and by whom?
+   Mitigation: We enumerate potential extensions after the detailed design and briefly sketch
+   how the proposed design can be extended to accommodate them.
 
-How will UX be reviewed, and by whom?
+1. Incompatible scheduler profiles - different scheduling profiles may enable different sets of
+   plugins, and if only a subset of profiles enables the `GangScheduling` plugin (responsible also
+   for workload-aware preemption), we may break user expectations.
+
+   Mitigation: We will document that the `GangScheduling` plugin has to be enabled in all profiles
+   or the logic will need to be reimplemented by other custom plugins. Eventually we may consider
+   built-in validation, but we make it out of scope for this KEP.
+
+1. Blocking preemptions - by setting a very high preemption priority despite having a relatively
+   low scheduling priority, one can make their low-priority workload effectively non-preemptable.
+
+   Mitigation: We will recommend that cluster administrators configure additional admission to
+   prevent such cases (e.g. preemption priority cannot be more than X above the scheduling priority,
+   or preemption priority can differ from the scheduling priority only for a subset of
+   scheduling priorities).
+
+1. Scalability - finding the optimal set of workloads/pods to preempt is a computationally
+   expensive problem, yet we need to ensure preemption can be used even in the largest
+   Kubernetes clusters.
+
+   Mitigation: We propose a simplified algorithm that is computationally feasible at the cost of
+   providing "reasonably good" preemption victim candidates.
 
-Consider including folks who also work outside the SIG or subproject.
--->
 
 ## Design Details
 
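To make the admission mitigation for "Blocking preemptions" above more concrete, here is a minimal sketch of such a check. The function name, the `maxDelta` bound and the integer-priority inputs are illustrative assumptions, not part of the proposed API:

```golang
package main

import "fmt"

// validatePreemptionPriority is a hypothetical admission check: it rejects
// workloads whose preemption priority exceeds their scheduling priority by
// more than maxDelta, so that a low-priority workload cannot make itself
// effectively non-preemptable.
func validatePreemptionPriority(schedulingPriority, preemptionPriority, maxDelta int32) error {
	if preemptionPriority > schedulingPriority+maxDelta {
		return fmt.Errorf(
			"preemption priority %d exceeds scheduling priority %d by more than %d",
			preemptionPriority, schedulingPriority, maxDelta)
	}
	return nil
}

func main() {
	// A workload scheduled at priority 100 asking for preemption priority 10000
	// would be rejected when the allowed delta is 500.
	fmt.Println(validatePreemptionPriority(100, 10000, 500))
}
```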
@@ -269,8 +293,13 @@ Based on that, we will extend the the existing `GangSchedulingPolicy` as followi
 // +enum
 type PreemptionMode string
 
+<<[UNRESOLVED PreemptionMode vs DisruptionMode]>>
+Should we rename it to DisruptionMode to allow reusing it e.g. in the EvictionRequest API?
+<<[/UNRESOLVED]>>
+
 const (
 	// PreemptionModePod means that individual pods can be preempted independently.
+	// It doesn't depend on the exact set of Pods currently running in this PodGroup.
 	PreemptionModePod = "Pod"
 	// PreemptionModePodGroup means that the whole PodGroup replica needs to be
 	// preempted together.
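To illustrate the intended semantics of the two modes, a minimal sketch of how a preemption plugin might expand a candidate victim into the set of pods that actually have to be evicted. The `PodGroupReplica` type and `victimsFor` helper are hypothetical, not part of the proposed API:

```golang
package main

import "fmt"

// PreemptionMode mirrors the enum proposed above.
type PreemptionMode string

const (
	PreemptionModePod      PreemptionMode = "Pod"
	PreemptionModePodGroup PreemptionMode = "PodGroup"
)

// PodGroupReplica is a simplified stand-in for the pods forming one PodGroup replica.
type PodGroupReplica struct {
	Mode PreemptionMode
	Pods []string
}

// victimsFor returns the pods that must be preempted together with the candidate.
// With PreemptionModePod only the candidate itself is evicted; with
// PreemptionModePodGroup the whole replica has to go.
func victimsFor(candidate string, replica PodGroupReplica) []string {
	if replica.Mode == PreemptionModePodGroup {
		return replica.Pods
	}
	return []string{candidate}
}

func main() {
	replica := PodGroupReplica{Mode: PreemptionModePodGroup, Pods: []string{"worker-0", "worker-1"}}
	fmt.Println(victimsFor("worker-0", replica)) // [worker-0 worker-1]
}
```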
@@ -299,14 +328,20 @@ e.g. preempt our workload. But intuition is not enough here.
 
 As described in user stories above, a simple static priority doesn't seem to be enough. Arguably it is
 not even a single priority because a priority used for scheduling can be different than the priority
-that should be used for preemption. So in the ideal world a workload owner should be able to define:
+that should be used for preemption. So in the ideal world a workload owner should be able to:
 
-- priority used for scheduling (potentially also separately for every PodGroup)
-- priority used for preemption (again potentially also for every PodGroup) and be able to arbitrarily
-  mutate it during the whole lifecycle of the workload
+- define the priority used for scheduling (potentially also separately for every PodGroup)
+- define the priority used for preemption (again potentially also for every PodGroup)
+- mutate the preemption priority during the whole lifecycle of the workload to reflect the importance
+  of that workload when it's running
 
-We start simpler though and assume that all PodGroups have the same scheduling and preemption
-policy by extending the `Workload` API as following:
+However, while we believe that all of these are eventually needed, we start simpler by:
+- assuming all PodGroups within a Workload have the same scheduling and preemption priorities
+  (see the Potential future extensions section on how this can be relaxed later)
+- starting with a static preemption priority (mutability brings additional complexity that is
+  purely additive and thus should be added in a follow-up KEP)
+
+The proposed `Workload` API extensions look as follows.
 
 ```golang
 type WorkloadSpec struct {
@@ -327,7 +362,7 @@ type WorkloadSpec struct {
 	// priority that should be used when attempting to preempt this workload.
 	// If not specified, it will default to PriorityClassName.
 	//
-	// This field is mutable.
+	// This field is immutable.
 	PreemptionPriorityClassName *string
 }
 ```
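As a usage sketch of the fields above (the trimmed-down `WorkloadSpec` and the class names are illustrative assumptions, not the full proposed type), a workload that schedules at a modest priority but is expensive to disrupt once running could look like this:

```golang
package main

import "fmt"

// WorkloadSpec is a trimmed-down stand-in for the proposed API, carrying only
// the two priority-related fields discussed in this KEP.
type WorkloadSpec struct {
	PriorityClassName           string  // used when scheduling the workload
	PreemptionPriorityClassName *string // used when something tries to preempt it
}

func main() {
	preemption := "expensive-to-disrupt"
	w := WorkloadSpec{
		// Scheduled like any other batch workload...
		PriorityClassName: "batch-normal",
		// ...but treated as high priority when considered as a preemption victim.
		PreemptionPriorityClassName: &preemption,
	}
	fmt.Printf("schedule as %q, preempt as %q\n", w.PriorityClassName, *w.PreemptionPriorityClassName)
}
```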
@@ -351,7 +386,7 @@ There are several options we can approach it (from least to most invasive):
   which priority doesn't match the priority of the workload object.
 - Introducing an admission to validate that if a pod is referencing a workload object, its
   `pod.Spec.PriorityClassName` equals `workload.Spec.PriorityClassName`. However, we allow creating
-  pods before the workload object, and there don't see, to be an easy way to avoid races.
+  pods before the workload object, and there doesn't seem to be an easy way to avoid races.
 - Making `pod.Spec.PriorityClassName` and `pod.Spec.Priority` mutable fields and having a new
   workload controller responsible for reconciling these. However, that could introduce another
   divergence between the priority of pods and the priority defined in the PodTemplate in true
@@ -367,14 +402,14 @@ Workload status (second option) and potentially improving it later.
 It's worth mentioning here, that we want to introduce the same defaulting rules for
 `workload.Spec.PriorityClassName` that we have for pods. Namely, if `PriorityClassName` is unset
 and there exists PriorityClass marked as `globalDefault`, we default it to that value.
-This consistency will allow us to properly handle when users are not setting neither pods
+This consistency will allow us to properly handle cases when users set neither pod
 nor workload priorities.
 Similarly, we will ensure that `PriorityClass.preemptionPolicy` works exactly the same way for
 workloads as for pods. Such level of consistency would make adoption of Workload API much easier.
 
 Moving to `PreemptionPriorityClassName`, the same issue of confusion holds (the actual priority
 set at the pod level may not reflect priority used for preemption). We argue that its mutable
-nature makes it infeasible for reconsiling this information back to pods for scalability reasons
+nature makes it infeasible for reconciling this information back to pods for scalability reasons
 (we can absolutely handle frequent updates to `Workload.Spec.PreemptionPriorityClassName`,
 but we can't handle updating potentially hundreds or thousands of pods within that workload
 that frequently). So in this case, we limit ourselves to documentation.
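A minimal sketch of the defaulting rules described above, under two stated assumptions: an unset `PriorityClassName` falls back to the `globalDefault` PriorityClass, and an unset `PreemptionPriorityClassName` falls back to `PriorityClassName` (as in the API comment earlier). The helper and its inputs are illustrative, not the actual implementation:

```golang
package main

import "fmt"

// resolvePriorities is an illustrative helper showing the proposed defaulting:
// an empty scheduling class falls back to the cluster's global default, and an
// unset preemption class falls back to the scheduling class.
func resolvePriorities(priorityClass string, preemptionClass *string,
	classValues map[string]int32, globalDefault string) (int32, int32) {

	if priorityClass == "" {
		priorityClass = globalDefault
	}
	effectivePreemptionClass := priorityClass
	if preemptionClass != nil {
		effectivePreemptionClass = *preemptionClass
	}
	return classValues[priorityClass], classValues[effectivePreemptionClass]
}

func main() {
	classValues := map[string]int32{"default": 0, "batch": 100, "critical": 10000}

	// No classes set on the workload: both priorities come from the global default.
	p, pp := resolvePriorities("", nil, classValues, "default")
	fmt.Println(p, pp) // 0 0

	// Scheduled as "batch", but preempted as "critical".
	critical := "critical"
	p, pp = resolvePriorities("batch", &critical, classValues, "default")
	fmt.Println(p, pp) // 100 10000
}
```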
@@ -411,13 +446,25 @@ not higher then preemption policy.
 <<[/UNRESOLVED]>>
 ```
 
-```
-<<[UNRESOLVED priority status]>>
-We should introduce/describe `workload.status` to reflect:
-- actual priority (a number) that is inferred from the priority class
-- express the preemption priority that is currently used by scheduler to reflect that
-  it aknowledged the change
-<<[/UNRESOLVED]>>
+Given that components operate on integer priorities, we will introduce corresponding fields
+that reflect the priority and preemption priority of a workload (similarly to how it's done in
+the Pod API). However, since these are derivatives of the fields introduced above and to allow
+future mutability of `PreemptionPriorityClassName`, we propose introducing them as part
+of the status:
+
+```golang
+type WorkloadStatus struct {
+	// Priority reflects the priority of the workload.
+	// The higher the value, the higher the priority.
+	// This field is populated from the PriorityClassName.
+	Priority *int32
+
+	// PreemptionPriority reflects the priority of the workload when it is
+	// considered for preemption.
+	// The higher the value, the higher the priority.
+	// This field is populated from the PreemptionPriorityClassName.
+	PreemptionPriority *int32
+}
 ```
 
 ### Preemption algorithm
@@ -488,17 +535,14 @@ with preemption:
    these can be placed in the exact same place they are running now. If they can we simply leave
    them where they are running now and remove from the potential victims list.
 
-   ```
-   <<[UNRESOLVED PodDisruptionBudget violations]>>
-   The above reprieval works identically to current algorithm if the domain D is a single node.
-   For larger domains, different placements of a preemptor are potentially possible and may result
-   in potentially different sets of victims violating PodDisruptionBudgets to remain feasible.
-   This means that the above algorithm is not optimizing for minimizing the number of victims that
-   would violate their PodDisruptionBudgets.
-   However, we claim that algorithm optimizing for it would be extremely expensive computationally
-   and propose to stick with this simple version at least for a foreseable future.
-   <<[/UNRESOLVED]
-   ```
+   For domain D being a single node (current pod-based preemption), the above algorithm works
+   identically to the current algorithm. For larger domains, different placements of a preemptor
+   are potentially possible and may result in different sets of victims violating
+   PodDisruptionBudgets to remain feasible. This means that the proposed algorithm does not
+   necessarily minimize the number of victims that would violate their PodDisruptionBudgets.
+   However, optimizing for it would be extremely computationally expensive, so to not significantly
+   hurt performance we propose to accept this limitation (if needed, a better algorithm may be
+   proposed as a separate KEP).
 
 1. For the remaining potential victims, using binary search across priorities find the minimal
    priority N for which scheduling the preemptor can be achieved without preempting any victims
@@ -512,17 +556,9 @@ with preemption:
    if they can be placed where they are currently running. If so assume it back and remove from
    potential victims list.
 
-   ```
-   <<[UNRESOLVED minimizing preemptions]>>
-   The above algorithm is definitely non optimal, but is (a) compatible with the current pod-based
-   algorithm (b) computationally feasible (c) simple to reason about.
-   As a result, I suggest that we proceed with it at least as a starting point.
-
-   As a bonus we may consider few potential placements of the preemptor and choose the one that
-   somehow optimizes the number of victims. However, that will appear to be more critical once we
-   get to Topology-Aware-Scheduling and I would leave that improvement until then.
-   <<[/UNRESOLVED]>>
-   ```
+   We acknowledge that the above algorithm is not optimal, but it (a) is compatible with the
+   current pod-based one, (b) is computationally feasible, and (c) is simple to reason about.
+   We will proceed with it and may consider improvements in follow-up KEPs in the future.
 
 1. We score scheduling decisions for each of the domains and choose the best one. The exact criteria
    for that will be figured out during the implementation phase.
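A rough sketch of the binary-search step mentioned above, assuming a monotone feasibility predicate that reports whether the preemptor fits when only victims with priority lower than a given threshold are preempted. The helper names and the predicate shape are assumptions, not the actual scheduler code:

```golang
package main

import (
	"fmt"
	"sort"
)

// minimalFeasiblePriority sketches the binary-search step: given the distinct
// priorities of the potential victims and a feasibility predicate, it returns
// the smallest priority N for which the preemptor can be scheduled while only
// victims with priority lower than N are preempted. The bool result is false
// if no such N exists among the candidates.
func minimalFeasiblePriority(priorities []int32, feasible func(n int32) bool) (int32, bool) {
	sort.Slice(priorities, func(i, j int) bool { return priorities[i] < priorities[j] })
	// sort.Search finds the first index whose priority satisfies the predicate;
	// feasibility is monotone because allowing more victims never hurts.
	idx := sort.Search(len(priorities), func(i int) bool { return feasible(priorities[i]) })
	if idx == len(priorities) {
		return 0, false
	}
	return priorities[idx], true
}

func main() {
	victims := []int32{100, 200, 500, 1000}
	// Toy predicate: assume the preemptor only fits once victims below 500 may be preempted.
	feasible := func(n int32) bool { return n >= 500 }
	fmt.Println(minimalFeasiblePriority(victims, feasible)) // 500 true
}
```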
@@ -610,9 +646,41 @@ queue if these need to wait for preemption to become schedulable maintains that
 
 [Kubernetes Scheduling Races Handling]: https://docs.google.com/document/d/1VdE-yCre69q1hEFt-yxL4PBKt9qOjVtasOmN-XK12XU/edit?resourcekey=0-KJc-YvU5zheMz92uUOWm4w
 
-
 [API Design For Gang and Workload-Aware Scheduling]: https://tiny.cc/hvhs001
 
+### Potential future extensions
+
+Here we discuss a couple of extensions that we envision just to ensure that we can build them
+in an additive and backward-compatible way. The approval of this KEP doesn't mean approval
+of any of them, and proceeding with any of them will require dedicated KEP(s) in the future.
+
+1. Improved preemption algorithm.
+
+   Instead of considering a single placement of a preemptor for a given set of victims, we may
+   consider multiple different placements. This will have a much bigger impact once kube-scheduler
+   supports topology-aware scheduling. As a result, we're leaving it as a future extension -
+   the algorithm can always be improved and will result in pretty local code changes.
+
+1. Non-uniform priority across PodGroups.
+
+   As already signaled above, we predict the need for different PodGroups to have different
+   priorities. As an extension, we can even envision introducing a `PodSubGroup` concept and
+   a case where different `PodSubGroups` have different priorities.
+   To achieve that, we could introduce the `PriorityClassName` field also at the `PodGroup` (and
+   potentially also at the `PodSubGroup`) level, with the semantic that the lower-level structure
+   overwrites the higher-level one (e.g. a priority set for a `PodGroup` overwrites the priority
+   of the `Workload`). So the API and semantics proposed in this KEP would allow for achieving
+   it in a backward-compatible way.
+
+1. Non-uniform PodGroups.
+
+   In addition to non-uniform priorities, we may expect other non-uniform behaviors. As an
+   example, consider `LeaderWorkerSet` and a use case where we allow for preempting individual
+   workers (with a given unit working in a degraded mode), but don't allow for preempting the
+   leader. The enum-based `PreemptionMode` allows for introducing more sophisticated policies
+   (e.g. only a subset of `PodSubGroups` can be preempted).
+
+
 ### Test Plan
 
 <!--
