kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA #5166
base: master
Conversation
cyclinder commented on Feb 12, 2025:
- One-line PR description: topologyManager policy: max-allowable-numa-nodes to GA (1.33)
- Issue link: KEP-4622: Add a TopologyManager policy option for MaxAllowableNUMANodes #4622
- Other comments:
- it was created in 1.31 as a beta feature: KEP-4622: New TopologyManager Policy: max-allowable-numa-nodes #4624
- The GA graduation would mean:
- the max-allowable-numa-nodes option will graduate to GA
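For reference, a sketch of the kubelet configuration that enables this option today (the `best-effort` policy and the value `12` are illustrative choices, not taken from this PR; field names follow the KubeletConfiguration API):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManagerPolicyOptions: true  # locked to true once the option goes GA
topologyManagerPolicy: best-effort    # any policy other than "none"
topologyManagerPolicyOptions:
  max-allowable-numa-nodes: "12"      # raise the default cap of 8
```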
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: cyclinder. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
initial review. We're very close to the deadline, but let's try to get this in.
[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
why was this removed?
vscode removed it 😅, reverted it now
For GA:

- degrading the node and checking the node is reported as degraded
why was this removed?
I don't quite remember the context here; do we need this e2e test for it?
that was about making the state of the feature observable to others. We already have issues reported in this area: kubernetes/kubernetes#131738
With the new sig-arch requirements effective 1.34, if we do this for GA we need to have another beta.
[EDIT] The point is: exactly as this issue demonstrates, we have this performance issue already. So the slowdown is not caused by the maximum allowed, but rather by how the topology manager currently handles machines with a high NUMA zone count. In hindsight, I don't think degrading the node adds value, because it is not related to this setting. Actually the degradation, or the signal in general, seems to be better represented by the existing metric about admission time. The only improvement I can imagine is extending that metric.
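To make the admission-time signal concrete, here is a small sketch of how an operator might derive a mean admission latency from a kubelet `/metrics` scrape. The reviewer later points at `topology_manager_admission_duration_ms` in `pkg/kubelet/metrics/metrics.go`; the fully prefixed name and the sample scrape text below are assumptions for illustration, not real output.

```python
# Sketch: mean Topology Manager admission latency from Prometheus-format text.
# The metric name (with its kubelet_ prefix) and the sample below are assumed.
METRIC = "kubelet_topology_manager_admission_duration_ms"

def mean_admission_ms(exposition: str, metric: str = METRIC) -> float:
    """Return histogram sum/count (mean latency in ms) for `metric`."""
    total = count = None
    for line in exposition.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith(metric + "_sum"):
            total = float(parts[-1])
        elif parts[0].startswith(metric + "_count"):
            count = float(parts[-1])
    if total is None or not count:
        raise ValueError(f"histogram {metric} not found or empty")
    return total / count

# Illustrative scrape fragment (not real kubelet output):
sample = """\
kubelet_topology_manager_admission_duration_ms_bucket{le="10"} 40
kubelet_topology_manager_admission_duration_ms_sum 1250
kubelet_topology_manager_admission_duration_ms_count 50
"""
print(mean_admission_ms(sample))  # 25.0
```

A rising mean here on high-NUMA-count machines would be the degradation signal the comment describes, without needing a new metric.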
#### GA

- Add a metric: `kubelet_topology_manager_admission_time`.
please add a note documenting we can use an existing metric
updated.
We already have this metric in the kubelet code.
- Feature gate name:
  - `TopologyManagerPolicyBetaOptions`
  - `TopologyManagerPolicyOptions` - going to be locked to true once GA
no need to mention the locking, it's the standard process for each feature
removed.
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: "TopologyManagerPolicyBetaOptions"
    components:
why was this removed?
I'm curious too; maybe vscode's markdown linter removed it. Reverting it now.
I am supportive of graduating features to GA but not sure if we will get PRR in time as this came in late.
same. I'll still be helping here with the reviews and cleaning up the feature, but if we miss the deadline (as unfortunately seems likely) it's not too bad: we can prepare and be ready for an early 1.34 merge.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
/retitle kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA
- [X] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name:
    - `TopologyManagerPolicyBetaOptions`
@ffromani what should one do here for GA policies with TopologyManager?
I take it that this is no longer toggleable via beta options?
I think we should remove TopologyManagerPolicyBetaOptions, leaving only TopologyManagerPolicyOptions
from sig-node the feature itself is done. I will defer observability concerns to PRR review
#### GA

- An existing metric: `kubelet_topology_manager_admission_time` can be used.
please check pkg/kubelet/metrics/metrics.go. I think it is topology_manager_admission_duration_ms
We have an existing metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.
ditto about the metric name
Updated.
An existing metric: `kubelet_topology_manager_admission_time` for the kubelet can be used to check if the setting is causing unacceptable performance drops.
ditto
done.
@cyclinder could you kindly check you are using the latest KEP template? https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template
Overall looks good. Just curious about the current state of e2e test for this feature.
Do we already have tests in the codebase to ensure that this policy option works as expected and is compatible with other Topology Manager policy options or planned as part of GA graduation?
Agree, do you mean that we need another metric for this option? We already have an existing metric:
We already have an e2e test for this in kubernetes/kubernetes#124148. see https://github.com/kubernetes/kubernetes/blob/0154f8a222c38d5808f1aecf218379cedf15e8a7/test/e2e_node/topology_manager_test.go#L257
- "@klueska"
- "@ffromani"
approvers:
  - "@sig-node-tech-leads"
should be @klueska to match the OWNERS file in this directory
Thank you for the review. updated.
I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests. This should be easy, cheap enough and will give us enough coverage. I can't say if this warrants a beta2 or can be done in the context of the GA graduation.
1.34:

- promote to GA
- cannot be disabled
This probably means LockToDefault: true. OK for me; remove in 1.35. The feature gate enables a flag that is disabled by default, so it is opt-in.
thanks.
A rollout or rollout failure does not impact already running workloads; it only impacts new workloads.
This is a general additional comment. This is subtler. Rolling out this option would mean removing the flag from the kubelet. Depending on the hardware, this will prevent the kubelet from starting. I don't think rollout concerns really apply to this specific feature.
If you have a machine with 9+ NUMA nodes AND you want to use a topology manager policy which is not None, THEN you need this option.
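The condition above can be sketched as a small predicate. The helper name and the sysfs-path parameter are hypothetical (introduced only for illustration); on a real Linux node the NUMA cells appear under /sys/devices/system/node, and 8 is the documented default for max-allowable-numa-nodes.

```python
# Sketch: when does max-allowable-numa-nodes need raising? The function name
# and sysfs parameter are hypothetical; the 9+/non-"none" rule is from the
# review comment above.
from pathlib import Path

DEFAULT_MAX_ALLOWABLE = 8  # kubelet default for max-allowable-numa-nodes

def needs_max_allowable_bump(sysfs_node_dir: str, policy: str,
                             max_allowable: int = DEFAULT_MAX_ALLOWABLE) -> bool:
    """True when the machine exposes more NUMA cells than the cap and a
    Topology Manager policy other than 'none' is configured."""
    numa_cells = len(list(Path(sysfs_node_dir).glob("node[0-9]*")))
    return policy != "none" and numa_cells > max_allowable
```

With the `none` policy the kubelet skips topology admission entirely, so the cap never comes into play regardless of NUMA cell count.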
Thanks for the review. I will update the KEP to address your comments.
Setting a value lower than 8 causes the kubelet to crash.
maybe improve it like:
keeping the default value will cause the kubelet to fail to start on machines with 9 or more NUMA cells if any but the `none` topology manager policy is also configured.
Updated.
Signed-off-by: Cyclinder Kuo <[email protected]>
I am open to having some other e2e tests. Do we need to do this in 1.34 or 1.35?
from what I collected, it's fair and accepted to add e2e tests in the same cycle in which we graduate to beta. But everything should be in place before moving to GA.
looks like the label
according to #5242 my understanding is we need to have another beta to complete the missing e2e tests. Other than that, LGTM from me. Not adding the label because the PR wants to move to GA, and it turns out we need a beta2 to address the gaps.
<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.

[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->
if you resubmit, please remove comments like this from the completed sections.
Thanks for your reminder.
According to my understanding, I need to submit another PR, similar to #5242, and also add an e2e test to Kubernetes. Only after these are completed can this PR receive the required label and be merged, right?
We already have an e2e test for this in kubernetes/kubernetes#124148; see https://github.com/kubernetes/kubernetes/blob/0154f8a222c38d5808f1aecf218379cedf15e8a7/test/e2e_node/topology_manager_test.go#L1449. This is the most basic verification that the changed setting is not breaking things for maxNUMANodes(16). Do we still need an additional e2e test?