Commit 1010aff

kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA
Signed-off-by: Cyclinder Kuo <[email protected]>
1 parent bfe5e39 commit 1010aff

3 files changed

+55
-46
lines changed

Diff for: keps/prod-readiness/sig-node/4622.yaml

+2
@@ -1,3 +1,5 @@
 kep-number: 4622
 beta:
   approver: "@jpbetz"
+stable:
+  approver: "@jpbetz"

Diff for: keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/README.md

+42-41
@@ -69,20 +69,19 @@ checklist items _must_ be updated for the enhancement to be released.

 Items marked with (R) are required *prior to targeting to a milestone / release*.

-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [X] e2e Tests for all Beta API Operations (endpoints)
+- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [X] (R) Graduation criteria is in place
+- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Production readiness review completed
+- [X] (R) Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 <!--
 **Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -108,6 +107,7 @@ In this KEP, we propose a new TopologyManager Policy Option called `max-allowabl
 - It does not attempt to remove the state explosion that still exists in the TopologyManager.

 ### User Stories (Optional)
+
 #### Story 1

 As a developer in the AI space, I want to use AI accelerators "super chips" which expose ARM cores with more than 8 NUMA nodes.
@@ -139,17 +139,17 @@ The risk associated with implementing this new proposal is minimal. It pertains
 Users can configure the value of maxAllowableNUMANodes in the TopologyManager when the kubelet starts up, It will fail and abort if the user sets the value is lower than the current hardcoded default (8).

 ```go
-case MaxAllowableNUMANodes:
-    optValue, err := strconv.Atoi(value)
-    if err != nil {
-        return opts, fmt.Errorf("bad value for option %q: %w", name, err)
-    }
-    opts.MaxAllowableNUMANodes = optValue
+case MaxAllowableNUMANodes:
+    optValue, err := strconv.Atoi(value)
+    if err != nil {
+        return opts, fmt.Errorf("bad value for option %q: %w", name, err)
+    }
+    opts.MaxAllowableNUMANodes = optValue
 ...

-if opts.MaxAllowableNUMANodes < defaultMaxAllowableNUMANodes {
-    return opts, fmt.Errorf("value for option %q is lower than defaultMaxAllowableNUMANodes: %d", MaxAllowableNUMANodes, opts.MaxAllowableNUMANodes)
-}
+if opts.MaxAllowableNUMANodes < defaultMaxAllowableNUMANodes {
+    return opts, fmt.Errorf("value for option %q is lower than defaultMaxAllowableNUMANodes: %d", MaxAllowableNUMANodes, opts.MaxAllowableNUMANodes)
+}
 ```

 ### Test Plan
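
The validation shown in the hunk above (parse the option value, then reject anything below the hardcoded default of 8) can be tried in isolation. What follows is a minimal sketch under stated assumptions, not the kubelet's actual code: `parseMaxAllowableNUMANodes` is a hypothetical helper name introduced for illustration, the option name string is taken from the commit title, and the real implementation stores the value in an options struct rather than returning it.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultMaxAllowableNUMANodes mirrors the hardcoded default (8) named in the README text.
const defaultMaxAllowableNUMANodes = 8

// optionName is the policy option name as it appears in the commit title.
const optionName = "max-allowable-numa-nodes"

// parseMaxAllowableNUMANodes is a hypothetical helper (not a kubelet function).
// It parses the option value and rejects non-integers and values below the default.
func parseMaxAllowableNUMANodes(value string) (int, error) {
	optValue, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("bad value for option %q: %w", optionName, err)
	}
	if optValue < defaultMaxAllowableNUMANodes {
		return 0, fmt.Errorf("value for option %q is lower than the default %d: %d",
			optionName, defaultMaxAllowableNUMANodes, optValue)
	}
	return optValue, nil
}

func main() {
	// Two acceptable values, one below the default, one non-numeric.
	for _, v := range []string{"16", "8", "4", "abc"} {
		n, err := parseMaxAllowableNUMANodes(v)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", v, err)
			continue
		}
		fmt.Printf("%q accepted: %d\n", v, n)
	}
}
```

In this sketch a value equal to the default is accepted, because the check in the hunk uses a strict `<` comparison.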
@@ -165,7 +165,7 @@ when drafting this test plan.
 [testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
 -->

-[x] I/we understand the owners of the involved components may require updates to
+[X] I/we understand the owners of the involved components may require updates to
 existing tests to make this code solid enough prior to committing the changes necessary
 to implement this enhancement.

@@ -234,10 +234,6 @@ For beta:

 - Verify the input validation with the existing e2e tests(e.g. 9 or 10 or something bigger than the current default but not "too big")

-For GA:
-
-- degrading the node and checking the node is reported as degraded
-
 ### Graduation Criteria

 #### Beta
@@ -249,7 +245,7 @@ For GA:

 #### GA

-- Add a metrics: `kubelet_topology_manager_admission_time`.
+- An existing metric: `kubelet_topology_manager_admission_time` can be used.

 ### Upgrade / Downgrade Strategy

@@ -313,11 +309,13 @@ you need any help or guidance.
 This section must be completed when targeting alpha to a release.
 -->
 1.31:
+
 - enable by default
 - allow gate to disable the feature
 - release note

-1.32:
+1.33:
+
 - promote to GA
 - cannot be disabled
 - release note
@@ -334,24 +332,25 @@ well as the [existing list] of feature gates.
 [existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
 -->

-- [x] Feature gate (also fill in values in `kep.yaml`)
-- Feature gate name:
+- [X] Feature gate (also fill in values in `kep.yaml`)
+- Feature gate name:
 - `TopologyManagerPolicyBetaOptions`
 - `TopologyManagerPolicyOptions`
 - Components depending on the feature gate: `kubelet`
-- [x] Change the kubelet configuration to set a TopologyManager policy of static and a TopologyManager policy option of `max-allowable-numa-nodes`
-- Will enabling / disabling the feature require downtime of the control plane?
+- [X] Change the kubelet configuration to set a TopologyManager policy of static and a TopologyManager policy option of `max-allowable-numa-nodes`
+- Will enabling / disabling the feature require downtime of the control plane?
 No
 - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
 Yes -- kubelet restart is required.
+
 ###### Does enabling the feature change any default behavior?

 <!--
 Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->

-No.
+No.

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

@@ -414,7 +413,7 @@ What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

-We have a metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.
+We have an existing metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -441,7 +440,7 @@ This section must be completed when targeting beta to a release.
 For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
-We add a metric: `kubelet_topology_manager_admission_time` for kubelet, which can be used to check if the setting is causing unacceptable performance drops.
+An existing metric: `kubelet_topology_manager_admission_time` for kubelet can be used to check if the setting is causing unacceptable performance drops.

 ###### How can an operator determine if the feature is in use by workloads?

@@ -469,10 +468,10 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->

 - [ ] Events
-- Event Reason:
+- Event Reason:
 - [ ] API .status
-- Condition name:
-- Other field:
+- Condition name:
+- Other field:
 - [x] Other (treat as last resort)
 - Details: If their system has more than 8 NUMA nodes, the TopologyManager is turned on and the kubelet is not crashing, then the feature is working.

@@ -501,7 +500,7 @@ The value of max-allowable-numa-nodes does not (in and of itself) affect the lat
 Pick one more of these and delete the rest.
 -->

-- [ ] Metrics
+- [X] Metrics
 - Metric name: kubelet_topology_manager_admission_time
 - Components exposing the metric: kubelet

@@ -658,6 +657,7 @@ details). For now, we leave it here.
 -->

 ###### How does this feature react if the API server and/or etcd is unavailable?
+
 N/A

 ###### What are other known failure modes?
@@ -683,6 +683,7 @@ Major milestones might include:

 - 2024-05-08 - initial KEP draft created
 - 2024-06-06 - updates per review feedback
+- 2025-02-12 - promote it to GA

 ## Drawbacks

Diff for: keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/kep.yaml

+11-5
@@ -5,27 +5,33 @@ authors:
 owning-sig: sig-node
 participating-sigs: []
 status: implementable
-creation-date: "2024-05-08"
+creation-date: "2025-02-15"
 reviewers:
   - "@klueska"
   - "@ffromani"
 approvers:
   - "@sig-node-tech-leads"
-see-also: []
+see-also:
+  - "keps/sig-node/2625-cpumanager-policies-thread-placement/"
+  - "keps/sig-node/2902-cpumanager-distribute-cpus-policy-option/"
+  - "keps/sig-node/3545-improved-multi-numa-alignment/"
+  - "keps/sig-node/4176-cpumanager-spread-cpus-preferred-policy/"
+  - "keps/sig-node/4540-strict-cpu-reservation"
+  - "keps/sig-node/4800-cpumanager-split-uncorecache/"
 replaces: []

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.31"
+latest-milestone: "v1.33"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   beta: "v1.31"
-  stable: "v1.32"
+  stable: "v1.33"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
