Contributor

@gadididi gadididi commented Oct 29, 2025

Describe what this PR does

Adds QoS (Quality of Service) support to the NVMe-oF CSI driver, allowing users to set IOPS and bandwidth limits on volumes at creation time and to modify them dynamically at runtime.

Is there anything that requires special attention

New QoS parameters (set in a VolumeAttributesClass):

  • rwIosPerSecond - Read/Write IOPS limit
  • rwMbytesPerSecond - Read/Write bandwidth limit (MB/s)
  • rMbytesPerSecond - Read bandwidth limit (MB/s)
  • wMbytesPerSecond - Write bandwidth limit (MB/s)

Key decisions:

  • Uses NVMe-oF-specific parameter names (not RBD QoS keys) because the architectures are different: details here
  • Implements ControllerModifyVolume() RPC for runtime QoS changes via VolumeAttributesClass (GA in K8s 1.34+; you can read more about it here: volume-attributes-classes)
  • Adds stub ControllerExpandVolume() with EXPAND_VOLUME capability to satisfy csi-resizer requirements (csi-resizer needs this capability to start, even when only using it for modify operations)
  • Uses GetControllerPublishSecretRef() for credential retrieval in ControllerModifyVolume() as a temporary solution (TODO: implement GetControllerExpandSecretRef() with proper NVMeoF support in ceph-csi-config)
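
A minimal sketch of what a stub expand handler could look like, assuming it only echoes the requested capacity back. The request and response types are local stand-ins, not the generated CSI types, and the function name controllerExpandVolumeStub is hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// expandRequest and expandResponse are local stand-ins for the CSI
// ControllerExpandVolume messages; they are not the real generated types.
type expandRequest struct {
	VolumeID      string
	RequiredBytes int64
}

type expandResponse struct {
	CapacityBytes         int64
	NodeExpansionRequired bool
}

var errMissingVolumeID = errors.New("volume ID is required")

// controllerExpandVolumeStub performs no resize. It only echoes the requested
// capacity, so the driver can advertise EXPAND_VOLUME before the gateway
// supports real expansion.
func controllerExpandVolumeStub(req expandRequest) (expandResponse, error) {
	if req.VolumeID == "" {
		return expandResponse{}, errMissingVolumeID
	}
	return expandResponse{
		CapacityBytes:         req.RequiredBytes,
		NodeExpansionRequired: false,
	}, nil
}

func main() {
	resp, err := controllerExpandVolumeStub(expandRequest{VolumeID: "vol-1", RequiredBytes: 10 << 30})
	fmt.Println(resp.CapacityBytes, resp.NodeExpansionRequired, err)
}
```

The stub exists only so that csi-resizer finds the capability it checks for at startup; it should be replaced once real expansion is implemented.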

Credential handling note:
The StorageClass secret parameters documented in storageclass-secrets do not include a csi.storage.k8s.io/controller-modify-secret-name key, and csi-resizer does not pass secrets to ControllerModifyVolume(). The implementation therefore retrieves credentials from the ceph-csi-config ConfigMap using the existing controllerPublishSecretRef field. This follows the same pattern as other CSI operations.
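
The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the driver's code: the clusterSecrets map stands in for the ceph-csi-config ConfigMap, the lookup is keyed by cluster ID for simplicity, and resolveSecretRef is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
)

// secretRef identifies a Kubernetes Secret by name and namespace.
type secretRef struct {
	Name      string
	Namespace string
}

// clusterSecrets is a hypothetical stand-in for the controllerPublishSecretRef
// entries kept in the ceph-csi-config ConfigMap, keyed by cluster ID.
var clusterSecrets = map[string]secretRef{
	"cluster-1": {Name: "csi-rbd-secret", Namespace: "default"},
}

var errNoSecretRef = errors.New("no secret reference configured")

// resolveSecretRef reports whether the request already carries secrets.
// If it does not (the sidecar passed none), it falls back to the reference
// configured for the cluster.
func resolveSecretRef(requestSecrets map[string]string, clusterID string) (secretRef, bool, error) {
	if len(requestSecrets) > 0 {
		// Secrets arrived with the request; no lookup is needed.
		return secretRef{}, true, nil
	}
	ref, ok := clusterSecrets[clusterID]
	if !ok {
		return secretRef{}, false, fmt.Errorf("cluster %q: %w", clusterID, errNoSecretRef)
	}
	return ref, false, nil
}

func main() {
	ref, fromRequest, err := resolveSecretRef(nil, "cluster-1")
	fmt.Println(ref.Name, ref.Namespace, fromRequest, err)
}
```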

Example workflow:

  1. Create StorageClass with:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-nvmeof-qos
provisioner: nvmeof.csi.ceph.com
parameters:
  clusterID: "cluster-1"
  pool: "mypool"
  # ... other params
  2. Create a default QoS VolumeAttributesClass:
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: default
driverName: nvmeof.csi.ceph.com
parameters:
  rwIosPerSecond: "5000"
  3. Create PVC (gets 5000 IOPS from the default class):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: csi-nvmeof-qos
  volumeAttributesClassName: default  # ← Start with default (5000 IOPS)
  4. Later, define VolumeAttributesClass for runtime modification:
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: high-performance
driverName: nvmeof.csi.ceph.com
parameters:
  rwIosPerSecond: "50000"
  rwMbytesPerSecond: "500"
  5. Modify existing PVC (triggers ControllerModifyVolume):
kubectl patch pvc my-pvc -p '{"spec":{"volumeAttributesClassName":"high-performance"}}'

Result: Volume QoS updated to 50000 IOPS, 500 MB/s without recreation or downtime.

NOTE:

  • Adds csi-resizer sidecar to controller deployment according to this: resizer

Related issues

#5694

Future concerns

  • Add a proper NVMeoF struct to the ceph-csi-config-map ClusterInfo with a dedicated controllerExpandSecretRef field, OR add controllerExpandSecretRef to the existing RBD struct (because NVMeoF is just a wrapper around RBD).
  • Implement GetControllerExpandSecretRef() to replace temporary use of GetControllerPublishSecretRef()

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@gadididi gadididi self-assigned this Oct 29, 2025
@gadididi gadididi added the component/nvme-of Issues and PRs related to NVMe-oF. label Oct 29, 2025
@gadididi gadididi marked this pull request as ready for review October 30, 2025 12:35
@gadididi gadididi requested review from Madhu-1 and nixpanic October 30, 2025 12:35
// Step 3: Get NVMe-oF metadata for the volume

// Since ControllerModifyVolume doesn't receive volume context and has no way to receive secrets
// because there is no "csi.storage.k8s.io/controller-modify-secret-name" field in the SC,
Member

ControllerModifyVolume can have secrets according to the specification: https://github.com/container-storage-interface/spec/blob/master/spec.md#controllermodifyvolume

Do you mean that the resizer sidecar does not pass the secret from Kubernetes to the CSI-driver? Is there an issue for that at https://github.com/kubernetes-csi/external-resizer , or can we provide a PR for it?

Contributor Author

Yes, I saw that the spec has it, but when I ran it the secret was nil, so I thought maybe I need to add this to the StorageClass:

  csi.storage.k8s.io/controller-modify-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-modify-secret-namespace: default

like I have for ControllerPublishVolume():

  csi.storage.k8s.io/controller-publish-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-publish-secret-namespace: default

but from what I see here:
https://kubernetes-csi.github.io/docs/secrets-and-credentials-storage-class.html#storageclass-secrets
there is no such parameter for modify. Maybe I am wrong, but I could not find any place that mentions it.

So I implemented it like ControllerUnpublishVolume() here:

secretName, secretNamespace, err := util.GetControllerPublishSecretRef(req.GetVolumeId(), util.RBDType)

About the resizer, I am not sure how it should pass the secret. I will look at it and see whether it is unimplemented, a known issue, or something else.

Member

Secrets are indeed not fetched, they are always nil:

https://github.com/kubernetes-csi/external-resizer/blob/cba05cb7cc4d06d7c06ae95dc8941fdcb75bd67e/pkg/modifier/csi_modifier.go#L84-L89

I do not think it is possible to get secrets as part of the ControllerModifyVolume call, so these need to be resolved another way (like the new default credentials).

Member

maybe kubernetes-csi/external-resizer#544 is acceptable, would you be able to test with that container-image?

// For now it only updates the capacity in the response, as the NVMe-oF gateway does not support expansion yet.
// This must be added because ControllerModifyVolume requires the csi-resizer sidecar, and
// csi-resizer looks for the capability ControllerServiceCapability_RPC_EXPAND_VOLUME.
// In the future, if the NVMe-oF gateway supports volume expansion, the logic must be added here.
Member

This sounds like a shortcoming of the external-resizer sidecar. Only ModifyVolume should be allowed. But there seems to be a required check on volume resize: https://github.com/kubernetes-csi/external-resizer/blob/master/pkg/resizer/csi_resizer.go#L59

Collaborator

@nixpanic I think we should open a tracker with the external-resizer, as there could be drivers that support ModifyVolume but not ExpandVolume.

Member

created kubernetes-csi/external-resizer#545 for that, but a draft for now as I have not tested it yet

Member

This was merged. The ControllerExpandVolume procedure can be kept until the external-resizer has a release. (Or maybe by that time nvmeof implements resizing too.)

Contributor Author

@nixpanic , Aviv told me we should implement the ControllerExpandVolume() (resize nvmeof volume). So, in any case we need this function.

Member

Yes, ControllerExpandVolume needs to be implemented too, but in a separate PR 😄

@gadididi gadididi force-pushed the nvmeof/add_qos branch 4 times, most recently from 2ade41c to 57c22b1 Compare November 6, 2025 17:06
@gadididi gadididi requested a review from nixpanic November 6, 2025 17:07
@gadididi gadididi force-pushed the nvmeof/add_qos branch 2 times, most recently from c49f033 to 4635779 Compare November 10, 2025 15:25
RwIosPerSecond = "nvmeof_rw_ios_per_second"
RwMbytesPerSecond = "nvmeof_rw_mbytes_per_second"
RMbytesPerSecond = "nvmeof_r_mbytes_per_second"
WMbytesPerSecond = "nvmeof_w_mbytes_per_second"
Member

note that we do not have any parameters in the StorageClass that contain under_scores. Please use camelCase format for parameter keys. There is no need for the nvmeof prefix either, these parameters should only be documented for the NVMe-oF controller/provisioner.

Collaborator

Yes, let's use CephCSI-specific keys for user-facing configuration and translate them internally to keep the keys standard.
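
One way to read this suggestion, sketched under assumptions: user-facing camelCase keys are mapped to internal keys through a lookup table. The pairing below reuses names that appear in this thread, but the table, the function name translateParams, and the choice to skip unknown keys are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// userToInternal maps user-facing camelCase keys to internal keys.
// The pairing is an assumption made for illustration.
var userToInternal = map[string]string{
	"rwIosPerSecond":    "nvmeof_rw_ios_per_second",
	"rwMbytesPerSecond": "nvmeof_rw_mbytes_per_second",
	"rMbytesPerSecond":  "nvmeof_r_mbytes_per_second",
	"wMbytesPerSecond":  "nvmeof_w_mbytes_per_second",
}

// translateParams returns a new map keyed by internal names.
// Keys that are not QoS parameters are skipped.
func translateParams(params map[string]string) map[string]string {
	out := make(map[string]string, len(params))
	for key, value := range params {
		if internal, ok := userToInternal[key]; ok {
			out[internal] = value
		}
	}
	return out
}

func main() {
	translated := translateParams(map[string]string{
		"rwIosPerSecond":    "50000",
		"rwMbytesPerSecond": "500",
		"pool":              "mypool",
	})
	// Sort the keys so the printed order is stable.
	keys := make([]string, 0, len(translated))
	for key := range translated {
		keys = append(keys, key)
	}
	sort.Strings(keys)
	for _, key := range keys {
		fmt.Printf("%s=%s\n", key, translated[key])
	}
}
```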

Contributor Author

@gadididi gadididi Nov 12, 2025

OK, I will use camelCase format.
FYI, these params will no longer appear in the SC, only in the VAC YAML file.
I added an example in this PR here:
#5714 (comment)

Consider having 2 different types of VACs, one for rbd and one for nvmeof (once rbd implements modify).
The VAC has the field driverName: nvmeof.csi.ceph.com, which means only PVCs created by the nvmeof driver can be affected by this VAC. BUT I can still create a VAC for an "nvmeof PVC" that contains rbd QoS params.
So I must somehow distinguish between these params.
I can remove the nvmeof prefix, to be e.g. RwIosPerSecond but it is very general, isn't it?
In this VAC I need to know which params are for nvmeof and which are for rbd.
With the nvmeof prefix it will be cleaner IMO. What do you think? @nixpanic, @Madhu-1

Collaborator

@gadididi lets take a step back, do we need to support RBD QOS for the rbd images created using nvmeof csi driver? what are the advantages of it? if there is no value we should not support it, just support QOS using nvmeof only

Contributor Author

@Madhu-1, I did not see a specific requirement for that. But since QoS is already implemented in the RBD controller, and it is possible to set RBD QoS on nvmeof volumes (via the rbd CLI on the rbd image), maybe it would be good to have this option in the nvmeof controller too.

UPDATE:
I was wrong about the RBD QoS.
I asked my manager Aviv, and he said it is not recommended to have both together (as I said before), and it is also not recommended to let the user set QoS on rbd when the user uses nvmeof volumes. The purpose of nvmeof is to expose a higher level of settings.

So, what I am going to do:

1. Parse mutable_parameters
2. Check whether they contain RBD QoS params:
   - YES: Return INVALID_ARGUMENT (RBD QoS not supported for NVMe-oF volumes)
   - NO: Continue
3. Check whether they contain NVMe-oF QoS params:
   - NO: Return SUCCESS (no-op, nothing to modify)
   - YES: Continue
4. Apply NVMe-oF QoS via the gateway
5. If the gateway returns an EXIST error, RBD QoS was already set on the image (not recommended, but the user did it): return ErrorExists
6. If the gateway returns any other error: return INTERNAL
7. Otherwise: return SUCCESS

what do you think?
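
The numbered flow above could be sketched as a single function. Everything here is a stand-in: the code type mimics gRPC status names, mapping the "exists" case to ALREADY_EXISTS is an assumption, exampleRbdQosKey is a placeholder rather than a real RBD QoS key, and setQoS represents the gateway call.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a local stand-in for a gRPC status code.
type code string

const (
	codeOK              code = "OK"
	codeInvalidArgument code = "INVALID_ARGUMENT"
	codeAlreadyExists   code = "ALREADY_EXISTS" // assumed mapping for the "exists" case
	codeInternal        code = "INTERNAL"
)

// Hypothetical gateway errors used only by this sketch.
var (
	errGatewayExists      = errors.New("gateway: QoS already exists")
	errGatewayUnavailable = errors.New("gateway: unavailable")
)

var nvmeofQoSKeys = map[string]bool{
	"rwIosPerSecond":    true,
	"rwMbytesPerSecond": true,
	"rMbytesPerSecond":  true,
	"wMbytesPerSecond":  true,
}

// rbdQoSKeys holds a placeholder; the real RBD QoS key names are not listed here.
var rbdQoSKeys = map[string]bool{"exampleRbdQosKey": true}

// modifyQoS follows the seven steps: reject RBD QoS keys, treat a request
// without NVMe-oF QoS keys as a no-op, otherwise call the gateway and map
// its error to a status code.
func modifyQoS(params map[string]string, setQoS func(map[string]string) error) code {
	nvmeofParams := make(map[string]string)
	for key, value := range params {
		if rbdQoSKeys[key] {
			return codeInvalidArgument
		}
		if nvmeofQoSKeys[key] {
			nvmeofParams[key] = value
		}
	}
	if len(nvmeofParams) == 0 {
		return codeOK // nothing to modify
	}
	err := setQoS(nvmeofParams)
	switch {
	case err == nil:
		return codeOK
	case errors.Is(err, errGatewayExists):
		return codeAlreadyExists
	default:
		return codeInternal
	}
}

func main() {
	succeed := func(map[string]string) error { return nil }
	fail := func(map[string]string) error { return errGatewayUnavailable }
	fmt.Println(modifyQoS(map[string]string{"rwIosPerSecond": "50000"}, succeed))
	fmt.Println(modifyQoS(map[string]string{"exampleRbdQosKey": "1"}, succeed))
	fmt.Println(modifyQoS(map[string]string{"rwIosPerSecond": "50000"}, fail))
}
```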

Member

I can remove the nvmeof prefix, to be e.g. RwIosPerSecond but it is very general, isn't it?

Yes, the name of the parameter is general. But the StorageClass is for NVMe-oF, so it is still specific enough that way.

Collaborator

Sounds good; it also keeps things minimal and exactly what we need.

// No QoS parameters - nothing to do
log.DebugLog(ctx, "No QoS parameters provided for volume %s", volumeID)

return &csi.ControllerModifyVolumeResponse{}, nil
Collaborator

Question: if the values are unset or removed from the VAC, don't we need to remove these QoS settings from the volume as well? In other words, how can a user remove QoS?

Contributor Author

@Madhu-1, Removing volumeAttributesClassName doesn't trigger ControllerModifyVolume()
So to remove QoS (meaning set all params to unlimited) there are 2 options:

  1. create removal_qos VAC with empty params. When ControllerModifyVolume() receives it (after patching the PVC to the new removal_qos VAC) and sees no params at all, it sets everything to unlimited.

  2. create removal_qos VAC with 0 params to all 4 fields. When ControllerModifyVolume() receives it (after patching the PVC to the new removal_qos VAC), it calls the set_qos gRPC as usual.

The code currently supports option 2. If it gets an empty VAC (no params), it finishes with no error.
IMO option 2 is better. What do you think?
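
A rough sketch of option 2's parsing behaviour, assuming the four camelCase keys from the PR description: an empty parameter set yields nothing to modify, while an explicit "0" is kept and read as unlimited. The function names parseQoS and describe are hypothetical, and treating absent keys as "not specified" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// qosKeys lists the four parameter names from the PR description.
var qosKeys = []string{"rwIosPerSecond", "rwMbytesPerSecond", "rMbytesPerSecond", "wMbytesPerSecond"}

// parseQoS returns only the limits present in params. An empty result means
// there is nothing to modify; a value of 0 is kept and means "unlimited".
func parseQoS(params map[string]string) (map[string]uint64, error) {
	limits := make(map[string]uint64)
	for _, key := range qosKeys {
		raw, ok := params[key]
		if !ok {
			continue
		}
		value, err := strconv.ParseUint(raw, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid value %q for %s: %w", raw, key, err)
		}
		limits[key] = value
	}
	return limits, nil
}

// describe renders one limit the way a log line might.
func describe(limits map[string]uint64, key string) string {
	value, ok := limits[key]
	switch {
	case !ok:
		return "not specified"
	case value == 0:
		return "unlimited"
	default:
		return strconv.FormatUint(value, 10)
	}
}

func main() {
	limits, err := parseQoS(map[string]string{
		"rwIosPerSecond":    "0",
		"rwMbytesPerSecond": "0",
		"rMbytesPerSecond":  "0",
		"wMbytesPerSecond":  "0",
	})
	fmt.Println(len(limits), describe(limits, "rwIosPerSecond"), err)

	empty, err := parseQoS(map[string]string{})
	fmt.Println(len(empty), err)
}
```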

Collaborator

Removing volumeAttributesClassName doesn't trigger ControllerModifyVolume()

Is this something supported? We need to test with Kubernetes 1.34. If yes, option 2 makes sense to me. We need to document this behaviour to set the right expectations.

Member

create removal_qos VAC with 0 params to all 4 fields

Call the VAC unlimited or similar (in the documentation or examples), makes it easier for users to understand.

return nil
}

if err := parseParam(nvmeof.RwIosPerSecond, nvmeof.RwIosPerSecond, &qos.RwIosPerSecond); err != nil {
Collaborator

What will be the behavior if 0 is set for these keys? Is it considered infinite or disabled?

Contributor Author

It is considered infinite. But "disable QoS" means removing the limit, so I think the two terms mean the same thing.

@gadididi gadididi force-pushed the nvmeof/add_qos branch 3 times, most recently from 3717609 to dd1172a Compare November 12, 2025 11:35
@gadididi gadididi force-pushed the nvmeof/add_qos branch 3 times, most recently from 4f707dd to 9abf91b Compare November 12, 2025 13:58
@gadididi gadididi requested review from Madhu-1 and nixpanic November 13, 2025 08:57
Collaborator

@Madhu-1 Madhu-1 left a comment

small nits, LGTM

@mergify
Contributor

mergify bot commented Nov 16, 2025

This pull request has been removed from the queue for the following reason: checks failed.

The merge conditions cannot be satisfied due to failing checks:

You can check the last failing draft PR here: #5756.

You may have to fix your CI before adding the pull request to the queue again.
If you update this pull request, to fix the CI, it will automatically be requeued once the queue conditions match again.
If you think this was a flaky issue instead, you can requeue the pull request, without updating it, by posting a @mergifyio requeue comment.

@mergify mergify bot added dequeued and removed queued labels Nov 16, 2025
init that file. in the future add more nvmeof errors to this file.

Signed-off-by: gadi-didi <[email protected]>
make getNVMeoFMetadata() returns nvmeof error codes.

Signed-off-by: gadi-didi <[email protected]>
QoS for the nvmeof namespace is added. Allows the user to limit the
namespace capabilities.

Signed-off-by: gadi-didi <[email protected]>
added because want to reuse the private function parseQosParams()

Signed-off-by: gadi-didi <[email protected]>
if qos params are provided, call to qos grpc after ns was created.

Signed-off-by: gadi-didi <[email protected]>
the purpose of that function is to modify the QoS for a namespace on the fly.

Signed-off-by: gadi-didi <[email protected]>
modify_volume capability is added to nvmeof driver in order to call ControllerModifyVolume().

Signed-off-by: gadi-didi <[email protected]>
Add EXPAND_VOLUME capability and stub implementation to allow
csi-resizer to start and handle VolumeAttributesClass modifications.

Signed-off-by: gadi-didi <[email protected]>
add ControllerModifyVolume to request ID extraction for proper log correlation.

Signed-off-by: gadi-didi <[email protected]>
add unit tests with multiple cases.

Signed-off-by: gadi-didi <[email protected]>
@mergify mergify bot dismissed stale reviews from Madhu-1 and nixpanic November 17, 2025 09:11

Pull request has been modified.

@mergify mergify bot removed the dequeued label Nov 17, 2025
@gadididi
Contributor Author

@Mergifyio requeue

@mergify
Contributor

mergify bot commented Nov 17, 2025

requeue

✅ The queue state of this pull request has been cleaned. It can be re-embarked automatically

@gadididi gadididi requested review from Madhu-1 and nixpanic November 17, 2025 09:54
@nixpanic
Member

@Mergifyio queue

@mergify
Contributor

mergify bot commented Nov 18, 2025

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 0c9e11e

@mergify mergify bot added the queued label Nov 18, 2025
mergify bot added a commit that referenced this pull request Nov 18, 2025
@mergify mergify bot merged commit 0c9e11e into ceph:devel Nov 18, 2025
17 checks passed
@mergify mergify bot removed the queued label Nov 18, 2025