KEP-4876: Mutable CSINode Allocatable Property #4875
base: master
Conversation
torredil commented Sep 25, 2024 (edited)
- One-line PR description: This PR adds a new KEP for Mutable CSINode Allocatable Property.
- Issue link: Mutable CSINode Allocatable Property #4876
- Other comments:
```golang
// the CSINode allocatable capacity for this driver. If not set, periodic updates
// are disabled, and updates occur only upon detecting capacity-related failures.
// +optional
AllocatableUpdateInterval *metav1.Duration
```
Would it be a fixed interval only? I think it'd be useful to support exponential backoff here too.
Yes, a fixed interval period only. I'm not sure that it would make sense to update this value using an exponential backoff strategy, but open to the idea.
This updated logic allows the `Allocatable.Count` field to be modified when the feature gate is enabled, while ensuring all other fields remain immutable. When the feature gate is disabled, it falls back to the existing validation logic for backward compatibility.
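The gated validation described above can be sketched as follows. The types are pared-down stand-ins for the real API types, and `validateDriverUpdate` is a hypothetical helper, not the actual apiserver code.

```golang
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// Minimal stand-ins for the real storage.k8s.io types, for illustration only.
type VolumeNodeResources struct{ Count *int32 }

type CSINodeDriver struct {
	Name        string
	NodeID      string
	Allocatable *VolumeNodeResources
}

// validateDriverUpdate sketches the feature-gated check: with the gate on,
// only the allocatable count may change; with it off, the driver entry is
// fully immutable. oldD and newD are copies, so mutating them is safe.
func validateDriverUpdate(oldD, newD CSINodeDriver, gateEnabled bool) error {
	if gateEnabled {
		// Normalize the mutable field (only Count in this sketch) before
		// comparing everything else for equality.
		oldD.Allocatable = newD.Allocatable
	}
	if !reflect.DeepEqual(oldD, newD) {
		return errors.New("CSINodeDriver fields other than allocatable.count are immutable")
	}
	return nil
}

func main() {
	c1, c2 := int32(8), int32(4)
	oldD := CSINodeDriver{Name: "ebs.csi.aws.com", NodeID: "node-1", Allocatable: &VolumeNodeResources{Count: &c1}}
	newD := oldD
	newD.Allocatable = &VolumeNodeResources{Count: &c2}
	fmt.Println(validateDriverUpdate(oldD, newD, true))  // count change allowed with gate on
	fmt.Println(validateDriverUpdate(oldD, newD, false)) // rejected with gate off
}
```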
Another validation concerns which entities can update the field: only the kubelet should update it, not end users, so that might mean a change in the kube-apiserver as well.
Talked with eddie offline; AFAIK this is done in the node authorizer:

```
plugin/pkg/auth/authorizer/node/node_authorizer.go: return r.authorizeCSINode(nodeName, attrs)
plugin/pkg/auth/authorizer/node/node_authorizer.go:// authorizeCSINode authorizes node requests to CSINode storage.k8s.io/csinodes
plugin/pkg/auth/authorizer/node/node_authorizer.go:func (r *NodeAuthorizer) authorizeCSINode(nodeName string, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
plugin/pkg/auth/authorizer/node/node_authorizer.go: return authorizer.DecisionNoOpinion, "can only get, create, update, patch, or delete a CSINode", nil
plugin/pkg/auth/authorizer/node/node_authorizer.go: return authorizer.DecisionNoOpinion, "cannot authorize CSINode subresources", nil
```
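The behavior those grep hits point at can be sketched as below: the node authorizer allows a kubelet to act only on the CSINode named after its own node, and offers no opinion otherwise (leaving other subjects to RBAC, which is why it cannot restrict non-kubelet users). The types here are simplified stand-ins, and the function is an illustration, not the real implementation.

```golang
package main

import "fmt"

type Decision int

const (
	DecisionNoOpinion Decision = iota
	DecisionAllow
)

// Attributes is a pared-down stand-in for authorizer.Attributes.
type Attributes struct {
	Verb     string
	Resource string
	Name     string
}

// authorizeCSINode sketches the node-authorizer check: a kubelet may
// get/create/update/patch/delete only the CSINode matching its own node name.
// Requests it does not recognize get DecisionNoOpinion, so other authorizers
// (e.g. RBAC) still decide for non-kubelet subjects.
func authorizeCSINode(nodeName string, attrs Attributes) (Decision, string) {
	if attrs.Resource != "csinodes" {
		return DecisionNoOpinion, "not a CSINode request"
	}
	switch attrs.Verb {
	case "get", "create", "update", "patch", "delete":
	default:
		return DecisionNoOpinion, "can only get, create, update, patch, or delete a CSINode"
	}
	if attrs.Name != nodeName {
		return DecisionNoOpinion, "can only access CSINode with the same name as the requesting node"
	}
	return DecisionAllow, ""
}

func main() {
	d, _ := authorizeCSINode("node-1", Attributes{Verb: "update", Resource: "csinodes", Name: "node-1"})
	fmt.Println(d == DecisionAllow)
}
```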
Hmm, I've been thinking more about this... if I understand the ask correctly, it is to ensure that only the kubelet can update CSINode objects, and to prevent any other entity, including privileged users, from updating them.
It seems that the node authorizer is designed to authorize API requests made by kubelets, not to restrict requests from other sources. In other words, no change to the node authorizer could address @mauriciopoppe's request (and no changes are needed on the node authorizer side anyway, because it already allows updating the CSINode object, per @aojea's finding above : )
I'm curious whether there is precedent before we continue down this path. As far as I know, we don't restrict creation of CSINode objects to the kubelet either; wouldn't it have made sense to enforce this restriction for object creation as well? It seems odd to allow privileged entities to create the object but not update it.
I am not aware of any APIs where we restrict which users can write an object. The only distinction is cluster-scoped vs namespace-scoped APIs.
/cc
PRR is sufficient for alpha, but I have some scalability concerns and would like to see some quick back-of-the-envelope math for the load and what we can do to mitigate it.
The `ResourceExhausted` error is directly reported on the `VolumeAttachment` object associated with the relevant attachment.

```golang
if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
```
Any consideration for checking in https://github.com/kubernetes/kubernetes/blob/cc67c4cf34d6c6e73458835154d06d0b6012e50f/pkg/volume/csi/csi_attacher.go#L146?
The logic in syncPod is CSI-agnostic.
I spent countless hours exploring this path when writing the KEP, and the main issue is that we need to query a CSI driver's NodeGetInfo endpoint to retrieve the updated allocatable value before updating the CSINode object. That operation can't be done on the KCM side; it needs to be done on the node side :(
Besides this, it would be helpful to create an event on the pod during pod construction to let cluster operators know that the pod has failed to come up due to a volume attachment error caused by insufficient capacity. The idea is that cluster operators or https://github.com/kubernetes-sigs/descheduler could leverage this event to "remove" stateful pods that are stuck in `ContainerCreating` so that they may be scheduled to a different node that does have capacity.
Would love to hear your recommendation for a path forward here.
PRR looks good and the scalability section is valuable. We'll have to build it out a little more as we gain experience in alpha. /approve
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: deads2k, torredil. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
- Modifying the core scheduling logic of Kubernetes.
- Implementing cloud provider-specific solutions within Kubernetes core.
- Re-scheduling pods stuck in a `ContainerCreating` state.
Maybe this should be reconsidered in beta, as it sounds like a major pain point for users. But this is more a question for SIG Node (cc @dchen1107 @SergeyKanzhelev).
- Modifying the core scheduling logic of Kubernetes.
- Implementing cloud provider-specific solutions within Kubernetes core.
- Re-scheduling pods stuck in a `ContainerCreating` state.
Is it something that we can check at pod admission instead? As we do with the Device Plugin, where we allocate devices on admission and can then use them later without the risk of failing container creation.
```golang
// are disabled, and updates occur only upon detecting capacity-related failures.
// The minimum allowed value for this field is 10 seconds.
// +optional
NodeAllocatableUpdatePeriodSeconds *metav1.Duration
```
Would it be better to use a stream for CSIDriver to report back changes? Same way as we do Device Health in Device Plugin?
None of the CSI RPCs support streaming IIRC.
Signed-off-by: torredil <[email protected]>