
KEP-4876: Mutable CSINode Allocatable Property #4875

Open · wants to merge 1 commit into master
Conversation

@torredil (Member) commented Sep 25, 2024

  • One-line PR description: This PR adds a new KEP for Mutable CSINode Allocatable Property.
  • Other comments:

@k8s-ci-robot added labels cncf-cla: yes, kind/kep on Sep 25, 2024
@k8s-ci-robot added labels sig/storage, size/XL on Sep 25, 2024
@torredil force-pushed the kep3301 branch 2 times, most recently from 64ae3c9 to b03c3a3 (October 3, 2024 17:09)
@torredil force-pushed the kep3301 branch 3 times, most recently from 5b85c07 to cac3da7 (October 8, 2024 00:15)
@torredil changed the title from "KEP-3301: Mutable CSINode Allocatable Property" to "KEP-4876: Mutable CSINode Allocatable Property" on Oct 8, 2024
```golang
// the CSINode allocatable capacity for this driver. If not set, periodic updates
// are disabled, and updates occur only upon detecting capacity-related failures.
// +optional
AllocatableUpdateInterval *metav1.Duration
```
Member:

Would it be a fixed interval period only? I think it'd be useful to do this with exponential backoff too.

Member Author (@torredil):

Yes, a fixed interval period only. I'm not sure that it would make sense to update this value using an exponential backoff strategy, but open to the idea.
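
For illustration, a minimal sketch of what a fixed-interval refresh loop could look like on the kubelet side, assuming a hypothetical `updateAllocatable` callback that performs the NodeGetInfo call and patches the CSINode object (all names here are illustrative, not from the KEP):

```golang
package nodeupdater

import (
	"context"
	"time"

	"k8s.io/klog/v2"
)

// runAllocatableUpdater refreshes the driver's allocatable count on a fixed
// interval. With a fixed period there is no backoff: a failed refresh is
// simply retried on the next tick.
func runAllocatableUpdater(ctx context.Context, interval time.Duration, updateAllocatable func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := updateAllocatable(ctx); err != nil {
				klog.ErrorS(err, "Failed to update CSINode allocatable")
			}
		}
	}
}
```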

This updated logic allows the `Allocatable.Count` field to be modified when the feature gate is enabled, while ensuring all
other fields remain immutable. When the feature gate is disabled, it falls back to the existing validation logic for backward
compatibility.
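
A rough sketch of what this gate-aware rule could look like (hypothetical helper, not the actual validation code; it simply treats the spec as immutable except for `Allocatable.Count` when the gate is on):

```golang
package validation

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
)

// validateCSINodeUpdate rejects any spec change other than Allocatable.Count
// when the feature gate is enabled, and any spec change at all when disabled.
func validateCSINodeUpdate(newObj, oldObj *storagev1.CSINode, gateEnabled bool) error {
	if !gateEnabled {
		// Existing behavior: the whole spec is immutable.
		if !apiequality.Semantic.DeepEqual(newObj.Spec, oldObj.Spec) {
			return fmt.Errorf("CSINode.Spec is immutable")
		}
		return nil
	}
	// Gate enabled: blank out Count on both sides, then require equality,
	// so only Count is allowed to differ.
	newCopy, oldCopy := newObj.DeepCopy(), oldObj.DeepCopy()
	clearCounts := func(obj *storagev1.CSINode) {
		for i := range obj.Spec.Drivers {
			if obj.Spec.Drivers[i].Allocatable != nil {
				obj.Spec.Drivers[i].Allocatable.Count = nil
			}
		}
	}
	clearCounts(newCopy)
	clearCounts(oldCopy)
	if !apiequality.Semantic.DeepEqual(newCopy.Spec, oldCopy.Spec) {
		return fmt.Errorf("only Allocatable.Count may change on CSINode update")
	}
	return nil
}
```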

Member:

Another validation concern is which entities can update the field: only the kubelet should update it, not end users. That might mean a change in the kube-apiserver as well.

Member:

Talked with Eddie offline; AFAIK this is done in the node authorizer:

```
plugin/pkg/auth/authorizer/node/node_authorizer.go:                     return r.authorizeCSINode(nodeName, attrs)
plugin/pkg/auth/authorizer/node/node_authorizer.go:// authorizeCSINode authorizes node requests to CSINode storage.k8s.io/csinodes
plugin/pkg/auth/authorizer/node/node_authorizer.go:func (r *NodeAuthorizer) authorizeCSINode(nodeName string, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
plugin/pkg/auth/authorizer/node/node_authorizer.go:             return authorizer.DecisionNoOpinion, "can only get, create, update, patch, or delete a CSINode", nil
plugin/pkg/auth/authorizer/node/node_authorizer.go:             return authorizer.DecisionNoOpinion, "cannot authorize CSINode subresources", nil
plugin/pkg/auth/authorizer/node/node_authorizer.go:     //
```
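
For context, a simplified paraphrase of the check above (not verbatim from node_authorizer.go): the node authorizer scopes a kubelet's writes to the CSINode object whose name matches its own Node name, but it does not restrict other principals:

```golang
package node

import "k8s.io/apiserver/pkg/authorization/authorizer"

// authorizeCSINodeSketch mirrors the shape of the existing check: a node may
// only act on the CSINode that shares its own name. Other authorizers (e.g.
// RBAC) still decide what non-node principals may do.
func authorizeCSINodeSketch(nodeName string, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
	if attrs.GetName() != nodeName {
		return authorizer.DecisionNoOpinion, "can only access CSINode with the same name as the requesting node", nil
	}
	return authorizer.DecisionAllow, "", nil
}
```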

Member Author (@torredil):

Hmm, I've been thinking more about this... if I understand the ask correctly, it is to ensure that only the kubelet can update CSINode objects, and to prevent any other entity, including privileged users, from updating this object.

It seems that the node authorizer is designed to authorize API requests made by kubelets, not to restrict requests from other sources. In other words, no change to the node authorizer could address @mauriciopoppe's request (and no changes are needed on the node authorizer side anyway, because it already allows updating the CSINode object, per @aojea's finding above : )

I'm curious to know if there is precedent before we continue down this path. I don't believe we restrict CSINode creation to the kubelet either, so wouldn't it make sense to also enforce this restriction for object creation? It seems odd to allow privileged entities to create the object but not update it.

Member:

I am not aware of any APIs where we restrict which users can write them. The only distinction is cluster-scoped vs namespace-scoped APIs.

@alculquicondor (Member): /cc

@deads2k (Contributor) left a comment:

PRR is sufficient for alpha, but I have some scalability concerns and would like to see some quick back-of-the-envelope math for load and what we can do to mitigate it.
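
For a rough sense of scale (illustrative numbers, not from the KEP): with 5,000 nodes, one CSI driver per node, and a 60-second update period, unconditional writes would add roughly 5,000 / 60 ≈ 83 CSINode updates per second to the apiserver. A 10-minute period drops that to about 8 per second, and patching only when the reported count actually changed reduces steady-state write load to near zero.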

The `ResourceExhausted` error is directly reported on the `VolumeAttachment` object associated with the relevant attachment.

```golang
if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
```
Member Author (@torredil):

I spent countless hours exploring this path when writing the KEP. The main issue here is that we need to query a CSI driver's NodeGetInfo endpoint to retrieve the updated allocatable value before updating the CSINode object. That operation can't be done on the KCM side; it needs to be done on the node side :(

Besides this, it would be helpful to create an event on the pod during pod construction to let cluster operators know that the pod has failed to come up due to a volume attachment error caused by insufficient capacity. The idea is that cluster operators or https://github.com/kubernetes-sigs/descheduler could leverage this event to "remove" stateful pods that are stuck in ContainerCreating so that they may be scheduled to a different node that does have capacity.

Would love to hear your recommendation for a path forward here.
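
A hypothetical sketch of the event idea above (the reason string and the helper wiring are made up; the kubelet's real error handling is more involved):

```golang
package kubeletutil

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// recordCapacityEvent surfaces a warning event on the pod when volume attach
// fails for capacity reasons, giving cluster operators or the descheduler a
// signal to move the pod to a node with spare capacity.
func recordCapacityEvent(recorder record.EventRecorder, pod *corev1.Pod, attachErr error) {
	recorder.Eventf(pod, corev1.EventTypeWarning, "InsufficientVolumeCapacity",
		"Failed to attach volumes: %v", attachErr)
}
```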

@deads2k (Contributor) commented Oct 10, 2024

PRR looks good and the scalability section is valuable. We'll have to build it out a little more as we gain experience in alpha.

/approve

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: deads2k, torredil
Once this PR has been reviewed and has the lgtm label, please assign saad-ali for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


- Modifying the core scheduling logic of Kubernetes.
- Implementing cloud provider-specific solutions within Kubernetes core.
- Re-scheduling pods stuck in a `ContainerCreating` state.
Member:

Maybe this should be reconsidered in beta, as it sounds like a major pain point for users.

But this is more a question for SIG Node (cc @dchen1107 @SergeyKanzhelev)


- Modifying the core scheduling logic of Kubernetes.
- Implementing cloud provider-specific solutions within Kubernetes core.
- Re-scheduling pods stuck in a `ContainerCreating` state.
Member:

Is this something we can check at pod admission instead? As we do with the Device Plugin, where we allocate devices on admission and can then use them later without the risk of failing container creation.

```golang
// are disabled, and updates occur only upon detecting capacity-related failures.
// The minimum allowed value for this field is 10 seconds.
// +optional
NodeAllocatableUpdatePeriodSeconds *metav1.Duration
```
Member:

Would it be better to use a stream for the CSI driver to report back changes, the same way we do Device Health in the Device Plugin?

Member:

None of the CSI RPCs support streaming IIRC.
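
Since CSI has no watch/stream RPC, polling is the fallback. A sketch under that assumption, using the `NodeGetInfo` RPC and `MaxVolumesPerNode` field from the CSI spec (the surrounding wiring is hypothetical):

```golang
package csipoll

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// currentMaxVolumes reads the driver's current volume limit. A periodic
// caller would compare this against CSINode.Spec.Drivers[].Allocatable.Count
// and patch the object only when the value changed.
func currentMaxVolumes(ctx context.Context, client csi.NodeClient) (int64, error) {
	resp, err := client.NodeGetInfo(ctx, &csi.NodeGetInfoRequest{})
	if err != nil {
		return 0, err
	}
	return resp.GetMaxVolumesPerNode(), nil
}
```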

Signed-off-by: torredil <[email protected]>