keps/sig-node/4622-topologymanager-max-allowable-numa-nodes/README.md
+42-41
@@ -69,20 +69,19 @@ checklist items _must_ be updated for the enhancement to be released.

 Items marked with (R) are required *prior to targeting to a milestone / release*.

-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [X] (R) KEP approvers have approved the KEP status as `implementable`
+- [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [X] e2e Tests for all Beta API Operations (endpoints)
+- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [X] (R) Graduation criteria is in place
+- [X] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [X] (R) Production readiness review completed
+- [X] (R) Production readiness review approved
+- [X] "Implementation History" section is up-to-date for milestone
+- [X] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [X] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 <!--
 **Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -108,6 +107,7 @@ In this KEP, we propose a new TopologyManager Policy Option called `max-allowabl
 - It does not attempt to remove the state explosion that still exists in the TopologyManager.

 ### User Stories (Optional)
+
 #### Story 1

 As a developer in the AI space, I want to use AI accelerators "super chips" which expose ARM cores with more than 8 NUMA nodes.
@@ -139,17 +139,17 @@ The risk associated with implementing this new proposal is minimal. It pertains
 Users can configure the value of maxAllowableNUMANodes in the TopologyManager when the kubelet starts up. The kubelet will fail and abort if the user sets a value lower than the current hardcoded default (8).
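
The startup check described above can be sketched as a small standalone Go program. This is only an illustration under stated assumptions: `parseMaxAllowableNUMANodes` and the `optionName` constant are hypothetical names introduced here, not the kubelet's actual code; only the default of 8 and the reject-if-lower behavior come from the text.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultMaxAllowableNUMANodes mirrors the hardcoded default (8) mentioned in the text.
const defaultMaxAllowableNUMANodes = 8

// optionName is the policy option name used in this KEP.
const optionName = "max-allowable-numa-nodes"

// parseMaxAllowableNUMANodes is a hypothetical helper (not kubelet code).
// It rejects non-integer values and values below the default.
func parseMaxAllowableNUMANodes(value string) (int, error) {
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("bad value for option %q: %w", optionName, err)
	}
	if n < defaultMaxAllowableNUMANodes {
		return 0, fmt.Errorf("value %d for option %q is lower than the default %d",
			n, optionName, defaultMaxAllowableNUMANodes)
	}
	return n, nil
}

func main() {
	// Try one accepted value, one that is too low, and one that is not a number.
	for _, v := range []string{"16", "4", "abc"} {
		n, err := parseMaxAllowableNUMANodes(v)
		if err != nil {
			fmt.Printf("%q rejected: %v\n", v, err)
			continue
		}
		fmt.Printf("%q accepted: %d\n", v, n)
	}
}
```

Returning an error instead of silently clamping the value matches the fail-and-abort behavior described in the text.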

 ```go
-case MaxAllowableNUMANodes:
-optValue, err := strconv.Atoi(value)
-if err != nil {
-return opts, fmt.Errorf("bad value for option %q: %w", name, err)
-}
-opts.MaxAllowableNUMANodes = optValue
+case MaxAllowableNUMANodes:
+optValue, err := strconv.Atoi(value)
+if err != nil {
+return opts, fmt.Errorf("bad value for option %q: %w", name, err)
+}
+opts.MaxAllowableNUMANodes = optValue
 ...

-if opts.MaxAllowableNUMANodes < defaultMaxAllowableNUMANodes {
-return opts, fmt.Errorf("value for option %q is lower than defaultMaxAllowableNUMANodes: %d", MaxAllowableNUMANodes, opts.MaxAllowableNUMANodes)
-}
+if opts.MaxAllowableNUMANodes < defaultMaxAllowableNUMANodes {
+return opts, fmt.Errorf("value for option %q is lower than defaultMaxAllowableNUMANodes: %d", MaxAllowableNUMANodes, opts.MaxAllowableNUMANodes)
-- [x] Feature gate (also fill in values in `kep.yaml`)
-- Feature gate name:
+- [X] Feature gate (also fill in values in `kep.yaml`)
+- Feature gate name:
 - `TopologyManagerPolicyBetaOptions`
 - `TopologyManagerPolicyOptions`
 - Components depending on the feature gate: `kubelet`
-- [x] Change the kubelet configuration to set a TopologyManager policy of static and a TopologyManager policy option of `max-allowable-numa-nodes`
-- Will enabling / disabling the feature require downtime of the control plane?
+- [X] Change the kubelet configuration to set a TopologyManager policy of static and a TopologyManager policy option of `max-allowable-numa-nodes`
+- Will enabling / disabling the feature require downtime of the control plane?
 No
 - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
 Yes -- kubelet restart is required.
+
 ###### Does enabling the feature change any default behavior?

 <!--
 Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->

-No.
+No.

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

@@ -414,7 +413,7 @@ What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->

-We have a metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.
+We have an existing metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -441,7 +440,7 @@ This section must be completed when targeting beta to a release.
 For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
-We add a metric: `kubelet_topology_manager_admission_time` for kubelet, which can be used to check if the setting is causing unacceptable performance drops.
+An existing metric: `kubelet_topology_manager_admission_time` for kubelet can be used to check if the setting is causing unacceptable performance drops.
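
One way an operator might use this metric is to compare the mean admission time before and after raising `max-allowable-numa-nodes`. The Go sketch below computes that mean from scraped metrics text. It assumes the metric is exposed as a histogram in the Prometheus text format, with `_sum` and `_count` series; the helper name `meanObservation` and the sample data are made up for illustration, and the metric name is passed in as a parameter, not hardcoded.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// meanObservation is a hypothetical helper. It scans Prometheus text-format
// output and returns sum/count for the named histogram, adding up the
// <metric>_sum and <metric>_count samples across all label sets.
func meanObservation(text, metric string) (float64, error) {
	var sum, count float64
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		var name, rest string
		if i := strings.IndexByte(line, '{'); i >= 0 {
			j := strings.LastIndexByte(line, '}')
			if j < i {
				continue // malformed label set
			}
			name, rest = line[:i], line[j+1:]
		} else {
			k := strings.IndexAny(line, " \t")
			if k < 0 {
				continue // no value on this line
			}
			name, rest = line[:k], line[k:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		// The first field after the name/labels is the sample value.
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		switch name {
		case metric + "_sum":
			sum += v
		case metric + "_count":
			count += v
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if count == 0 {
		return 0, fmt.Errorf("no observations found for %q", metric)
	}
	return sum / count, nil
}

func main() {
	// Made-up sample data; not real kubelet output.
	sample := `# TYPE kubelet_topology_manager_admission_time histogram
kubelet_topology_manager_admission_time_bucket{le="+Inf"} 4
kubelet_topology_manager_admission_time_sum 30
kubelet_topology_manager_admission_time_count 4
`
	mean, err := meanObservation(sample, "kubelet_topology_manager_admission_time")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mean admission time: %v (unit as reported by the metric)\n", mean)
}
```

A mean hides tail latency, so a real check would likely also look at the histogram buckets; the sketch only shows the simplest comparison.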

 ###### How can an operator determine if the feature is in use by workloads?

@@ -469,10 +468,10 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->

 - [ ] Events
-- Event Reason:
+- Event Reason:
 - [ ] API .status
-- Condition name:
-- Other field:
+- Condition name:
+- Other field:
 - [x] Other (treat as last resort)
 - Details: If their system has more than 8 NUMA nodes, the TopologyManager is turned on and the kubelet is not crashing, then the feature is working.
@@ -501,7 +500,7 @@ The value of max-allowable-numa-nodes does not (in and of itself) affect the lat