Unload nvidia_drm module for changing Mig mode#301
Unload nvidia_drm module for changing Mig mode#301JunAr7112 wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
a0c0ac6 to
e025799
Compare
cmd/nvidia-mig-parted/util/device.go
Outdated
| if err != nil { | ||
| return "", fmt.Errorf("error getting GPU pci bus IDs: %v", err) | ||
| } | ||
|
|
There was a problem hiding this comment.
Unnecessary newline.
cmd/nvidia-mig-parted/util/device.go
Outdated
|
|
||
| cmd := exec.Command("nvidia-smi", "-r", "-i", strings.Join(pciBusIDs, ",")) //nolint:gosec | ||
| // Unload nvidia_drm module and its dependencies before reset to release GPU references | ||
| modprobeCmd := exec.Command("sudo", "modprobe", "-r", "nvidia_drm") |
There was a problem hiding this comment.
Does this assume the user has sudo access?
There was a problem hiding this comment.
Yes. I will remove this. I don't think we need to handle root permissions here
e025799 to
1fc6e07
Compare
|
@JunAr7112 Can you fix the lint failure and add testing steps? |
1fc6e07 to
12b4993
Compare
Signed-off-by: Arjun <[email protected]>
12b4993 to
97806a1
Compare
|
/ok-to-test 97806a1 |
| } | ||
|
|
||
| // Unload nvidia_drm module and its dependencies before reset to release GPU references | ||
| modprobeCmd := exec.Command("modprobe", "-r", "nvidia_drm") //nolint:gosec |
There was a problem hiding this comment.
I don't believe this is the right place (or only place) for this. Note, nvmlResetAllGPUs() is only called inResetAllGPUs() which in turn is only called in ApplyMigMode(). Based on the bug description, we need to unload this module across all MIG reconfigurations, not just a mode change.
There was a problem hiding this comment.
The current implementation addresses the case where a GPU reset is performed (e.g. when changing the MIG mode). But what about when the current MIG mode is correct, but we are attempting to apply a new MIG configuration? Does the mig-parted apply still work when the nvidia_drm module is loaded, or are we required to unload it first?
|
@JunAr7112 can you add testing details to the PR description? Demonstrate that you have reproduced the original issue and that it is fixed with the changes in this PR. |
No description provided.