runc create failed: unable to start container process: exec: "gpu-operator": executable file not found in $PATH: unknown #1235

msherm2 · 2025-01-27T21:50:47Z

OS: RHEL 9.3
GPU: NVIDIA L4
GPU Operator: v24.9.1
containerd: 1.6.28
rke2: 1.26.10
nvidia-container-toolkit: 1.17.3
environment: air-gapped

I am able to deploy the helm chart with the following settings using v23.9.1:

helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
gpu-operator-v23.9.1.tgz $HELM_OPTIONS \
--set driver.enabled=true \
--set nfd.enabled=false \
--set gfd.enabled=false \
--set operator.defaultRuntime="containerd" \
--set toolkit.enabled=false \
--set devicePlugin.enabled=true \
--set toolkit.enabled=true \
--set toolkit.version=v1.17.3-ubi8 \
--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
--set-string validator.driver.env[0].value="true" \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config2.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value="true"
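
For reference, the same settings can be written as a Helm values file. This is only a sketch of the flags above: `values.yaml` is a hypothetical file name, the contents of `$HELM_OPTIONS` are not shown here and so are not reflected, and `toolkit.enabled` is written as `true` on the assumption that the later of the two `--set toolkit.enabled` flags takes precedence.

```yaml
# values.yaml (hypothetical file name): mirrors the --set flags from the command above
driver:
  enabled: true
nfd:
  enabled: false
gfd:
  enabled: false
operator:
  defaultRuntime: containerd
devicePlugin:
  enabled: true
toolkit:
  # the command sets this to false and then to true; the later flag is assumed to win
  enabled: true
  version: v1.17.3-ubi8
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config2.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"   # quoted to keep it a string, as --set-string does
validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"
```

Such a file would be passed with `-f values.yaml` in place of the individual `--set` flags.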

However, when I attempt to install v24.9.1, the gpu-operator pod fails to start (events below).
[Update] I get the same result with v24.6.2.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  32s                default-scheduler  Successfully assigned gpu-operator/gpu-operator-86c586f6df-vchlw to <node-name>
  Normal   Pulled     11s (x3 over 31s)  kubelet            Container image "nvcr.io/nvidia/gpu-operator:v24.9.1" already present on machine
  Normal   Created    11s (x3 over 31s)  kubelet            Created container gpu-operator
  Warning  Failed     11s (x3 over 31s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "gpu-operator": executable file not found in $PATH: unknown
  Warning  BackOff    1s (x5 over 29s)   kubelet            Back-off restarting failed container gpu-operator in pod gpu-operator-86c586f6df-vchlw_gpu-operator(a99e3e8d-47d8-47c9-b3fd-db6d796fa0d5)

Mon Jan 27 16:02:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:8B:00.0 Off |                    0 |
| N/A   38C    P8             17W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
