Conversation

cyclinder (Contributor)

  • One-line PR description: topologyManager policy: max-allowable-numa-nodes to GA (1.33)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 12, 2025
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cyclinder
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 12, 2025
@ffromani (Contributor) left a comment:

Initial review. We're very close to the deadline, but let's try to get this in.


[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
Contributor:
why was this removed?

@cyclinder (Contributor Author), Feb 12, 2025:

vscode removed it 😅, reverting it now

Comment on lines -237 to -240
For GA:

- degrading the node and checking the node is reported as degraded

Contributor:

why was this removed?

Contributor Author:

I don't quite remember the context here; do we need an e2e test for this?

@ffromani (Contributor), Jun 12, 2025:

That was about making the state of the feature observable to others. We already have issues reported in this area: kubernetes/kubernetes#131738

With the new sig-arch requirements effective in 1.34, if we do this for GA we need to have another beta.

[EDIT] The point is: exactly as this issue demonstrates, we have this performance problem already. So the slowdown is not caused by the maximum allowed value, but rather by how the topology manager currently handles machines with a high NUMA zone count. In hindsight, I don't think degrading the node adds value, because it is not related to this setting. The degradation, or the signal in general, seems better represented by the existing metric about admission time. The only improvement I can imagine as a possible follow-up is extending that metric.

#### GA

- Add a metrics: `kubelet_topology_manager_admission_time`.
- Add a metric: `kubelet_topology_manager_admission_time`.
Contributor:

please add a note documenting that we can use an existing metric

Contributor Author:

updated.

Contributor Author:

We already have this metric in the kubelet code.

- Feature gate name:
- `TopologyManagerPolicyBetaOptions`
- `TopologyManagerPolicyOptions`
- `TopologyManagerPolicyOptions` - going to be locked to true once GA
Contributor:

no need to mention the locking, it's the standard process for each feature

Contributor Author:

removed.

# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: "TopologyManagerPolicyBetaOptions"
components:
Contributor:

why was this removed?

Contributor Author:

I'm curious too; maybe vscode's markdown linter removed it. Reverting it now.

@mrunalp (Contributor) commented Feb 13, 2025:

I am supportive of graduating features to GA but not sure if we will get PRR in time as this came in late.

@ffromani (Contributor):

I am supportive of graduating features to GA but not sure if we will get PRR in time as this came in late.

Same. I'll still be helping here with the reviews and cleaning up the feature, but if we miss the deadline (as unfortunately seems likely) it's not too bad: we can prepare and be ready for an early 1.34 merge.

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2025
@ffromani (Contributor):

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2025
@ffromani (Contributor):

/retitle kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA

@k8s-ci-robot k8s-ci-robot changed the title kep-4622: promote topologyMnagaer policy: max-allowable-numa-nodes to GA kep-4622: promote topologyManager policy: max-allowable-numa-nodes to GA May 14, 2025
- Feature gate name:
- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name:
- `TopologyManagerPolicyBetaOptions`
Contributor:

@ffromani what should one do here for GA policies with TopologyManager?

I take it that this is no longer toggleable via beta options?

Contributor Author:

I think we should remove `TopologyManagerPolicyBetaOptions`, leaving only `TopologyManagerPolicyOptions`.
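
For reference, a minimal KubeletConfiguration sketch of how the option is wired up while still in beta (values illustrative; after GA, only `TopologyManagerPolicyOptions` would remain, per the comment above):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Both gates are needed while the option is beta.
  TopologyManagerPolicyOptions: true
  TopologyManagerPolicyBetaOptions: true
topologyManagerPolicy: single-numa-node
topologyManagerPolicyOptions:
  # Illustrative value for a machine with more than 8 NUMA nodes.
  max-allowable-numa-nodes: "16"
```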

@ffromani (Contributor) left a comment:

From the sig-node perspective the feature itself is done. I will defer observability concerns to PRR review.


#### GA

- Add a metrics: `kubelet_topology_manager_admission_time`.
- An existing metric: `kubelet_topology_manager_admission_time` can be used.
Contributor:

please check `pkg/kubelet/metrics/metrics.go`. I think it is `topology_manager_admission_duration_ms`
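
For context, a paraphrased sketch of how such a histogram is registered via `k8s.io/component-base/metrics` (the authoritative definition lives in pkg/kubelet/metrics/metrics.go; the help text and buckets here are illustrative):

```go
package metrics

import "k8s.io/component-base/metrics"

// With the "kubelet" subsystem, the name exposed to scrapers becomes
// kubelet_topology_manager_admission_duration_ms.
var TopologyManagerAdmissionDuration = metrics.NewHistogram(
	&metrics.HistogramOpts{
		Subsystem:      "kubelet",
		Name:           "topology_manager_admission_duration_ms",
		Help:           "Duration in milliseconds of the topology manager admission of a pod.",
		Buckets:        metrics.ExponentialBuckets(0.05, 2, 15), // illustrative buckets
		StabilityLevel: metrics.ALPHA,
	},
)
```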

Contributor Author:

-->

We have a metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.
We have an existing metric which records the topology manager admission time: `kubelet_topology_manager_admission_time`.
Contributor:

ditto about the metric name

Contributor Author:

Updated.

previous answers based on experience in the field.
-->
We add a metric: `kubelet_topology_manager_admission_time` for kubelet, which can be used to check if the setting is causing unacceptable performance drops.
An existing metric: `kubelet_topology_manager_admission_time` for kubelet can be used to check if the setting is causing unacceptable performance drops.
Contributor:

ditto

Contributor Author:

done.

@ffromani (Contributor) left a comment:

@cyclinder could you kindly check you are using the latest KEP template? https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template

@swatisehgal (Contributor) left a comment:

Overall looks good. Just curious about the current state of e2e test for this feature.

Do we already have tests in the codebase to ensure that this policy option works as expected and is compatible with other Topology Manager policy options, or is that planned as part of GA graduation?

@cyclinder (Contributor Author):

The only improvement I can imagine as a possible follow-up is extending that metric.

Agreed. Do you mean that we need another metric for this option? We already have an existing metric for this: `topology_manager_admission_duration_ms`.

Do we already have tests in the codebase to ensure that this policy option works as expected and is compatible with other Topology Manager policy options, or is that planned as part of GA graduation?

We already have an e2e test for this in kubernetes/kubernetes#124148; see https://github.com/kubernetes/kubernetes/blob/0154f8a222c38d5808f1aecf218379cedf15e8a7/test/e2e_node/topology_manager_test.go#L257
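
As a usage sketch for the metric mentioned above, an operator could watch tail admission latency with a PromQL query like the following (assuming a standard Prometheus scrape of the kubelet; the `instance` label comes from the scrape, and the `_bucket` suffix assumes the metric is a histogram):

```promql
# p99 topology manager admission latency per kubelet, in milliseconds
histogram_quantile(
  0.99,
  sum by (instance, le) (
    rate(kubelet_topology_manager_admission_duration_ms_bucket[5m])
  )
)
```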

- "@klueska"
- "@ffromani"
approvers:
- "@sig-node-tech-leads"
Member:

should be @klueska to match the OWNERS file in this directory

Contributor Author:

Thank you for the review. Updated.

@ffromani (Contributor) left a comment:

I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests. This should be easy, cheap enough and will give us enough coverage. I can't say if this warrants a beta2 or can be done in the context of the GA graduation.

1.34:

- promote to GA
- cannot be disabled
Contributor:

This probably means `LockToDefault: true`. OK for me, and remove in 1.35. The feature gate enables a flag that is disabled by default, i.e. opt-in.
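
To make the `LockToDefault` point concrete, here is an illustrative sketch (not the literal kube_features.go entry) using the `k8s.io/component-base/featuregate` types:

```go
package features

import "k8s.io/component-base/featuregate"

const TopologyManagerPolicyOptions featuregate.Feature = "TopologyManagerPolicyOptions"

// At GA the gate defaults to true and is locked, so it can no longer be
// disabled; the gate itself would then be removed a release later (1.35 here).
var kubeletFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	TopologyManagerPolicyOptions: {
		Default:       true,
		LockToDefault: true,
		PreRelease:    featuregate.GA,
	},
}
```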

Contributor Author:

thanks.

will rollout across nodes.
-->

A rollout or rollout failure does not impact already running workloads, only new workloads.
Contributor:

This is a general additional comment. This is subtler. Rolling back this option would mean removing the flag from the kubelet. Depending on the hardware, this will prevent the kubelet from starting. I don't think rollout concerns really apply to this specific feature.
If you have a machine with 9+ NUMA nodes AND you want to use a topology manager policy which is not `none`, THEN you need this option.

Contributor Author:

Thanks for the review. I will update the KEP to address your comments.

- Testing: Are there any tests for failure mode? If not, describe why.
-->

Setting a value lower than 8 causes the kubelet to crash.
Contributor:

maybe improve it like:

Keeping the default value will cause the kubelet to fail to start on machines with 9 or more NUMA cells if any but the `none` topology manager policy is also configured.

Contributor Author:

Updated.
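
For illustration, the failure mode discussed above boils down to a validation check along these lines (a minimal self-contained sketch with hypothetical names; the real check lives in the topology manager policy options code):

```go
package main

import (
	"fmt"
	"log"
)

// minAllowableNUMANodes mirrors the historical hard limit of 8 NUMA nodes.
const minAllowableNUMANodes = 8

// validateMaxAllowableNUMANodes is a hypothetical stand-in for the kubelet's
// option validation: values below 8 are rejected, which surfaces as the
// kubelet failing to start.
func validateMaxAllowableNUMANodes(n int) error {
	if n < minAllowableNUMANodes {
		return fmt.Errorf("max-allowable-numa-nodes must be at least %d, got %d",
			minAllowableNUMANodes, n)
	}
	return nil
}

func main() {
	if err := validateMaxAllowableNUMANodes(4); err != nil {
		log.Fatalf("kubelet would refuse to start: %v", err)
	}
}
```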

@cyclinder (Contributor Author):

I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests. This should be easy, cheap enough and will give us enough coverage. I can't say if this warrants a beta2 or can be done in the context of the GA graduation.

I am open to adding some other e2e tests. Do we need to do this in 1.34 or 1.35?

@ffromani (Contributor):

I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests. This should be easy, cheap enough and will give us enough coverage. I can't say if this warrants a beta2 or can be done in the context of the GA graduation.

I am open to adding some other e2e tests. Do we need to do this in 1.34 or 1.35?

From what I gathered, it's fair and accepted to add e2e tests in the same cycle in which we graduate to beta. But everything should be in place before we move to GA.

@ffromani (Contributor):

Looks like the lead-opt-in label was missing, so this PR didn't get PRR review?
Let's identify all the gaps from the sig-node perspective, work on them, and aim for an early, trivial GA move next cycle.

@ffromani (Contributor) left a comment:

According to #5242, my understanding is that we need another beta to complete the missing e2e tests. Other than that, LGTM. Not adding the label because the PR wants to move to GA, and it turns out we need a beta2 to address the gaps.

Comment on lines +103 to +111

<!--
This section is for explicitly listing the motivation, goals, and non-goals of
this KEP. Describe why the change is important and the benefits to users. The
motivation section can optionally provide links to [experience reports] to
demonstrate the interest in a KEP within the wider Kubernetes community.
[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
-->
Contributor:

if you resubmit, please remove comments like this from the completed sections.

Contributor Author:

Thanks for the reminder.

@cyclinder (Contributor Author):

According to #5242, my understanding is that we need another beta to complete the missing e2e tests. Other than that, LGTM. Not adding the label because the PR wants to move to GA, and it turns out we need a beta2 to address the gaps.

According to my understanding, I need to submit another PR, similar to #5242, and also add an e2e test to Kubernetes. Only after these are completed can this PR receive the required label and be merged, right?

@cyclinder (Contributor Author):

I think we can have some e2e tests which happen to set this value and then run a subset of the existing topology manager tests

We already have an e2e test for this in kubernetes/kubernetes#124148; see https://github.com/kubernetes/kubernetes/blob/0154f8a222c38d5808f1aecf218379cedf15e8a7/test/e2e_node/topology_manager_test.go#L1449. This is the most basic verification that the changed setting is not breaking things for maxNUMANodes(16). Do we still need an additional e2e test?
