runc create failed: unable to start container process: exec: "gpu-operator": executable file not found in $PATH: unknown #1235

Open
msherm2 opened this issue Jan 27, 2025 · 0 comments

msherm2 commented Jan 27, 2025

OS: RHEL 9.3
GPU: NVIDIA L4
GPU Operator: v24.9.1
containerd: 1.6.28
rke2: 1.26.10
nvidia-container-toolkit: 1.17.3
environment: air-gapped

With v23.9.1, I can deploy the Helm chart successfully using the following settings:

helm install --wait gpu-operator \
-n gpu-operator --create-namespace \
gpu-operator-v23.9.1.tgz $HELM_OPTIONS \
--set driver.enabled=true \
--set nfd.enabled=false \
--set gfd.enabled=false \
--set operator.defaultRuntime="containerd" \
--set toolkit.enabled=false \
--set devicePlugin.enabled=true \
--set toolkit.enabled=true \
--set toolkit.version=v1.17.3-ubi8 \
--set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
--set-string validator.driver.env[0].value="true" \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config2.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value="true"
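For readability, the same overrides could also be kept in a values file instead of a long list of `--set` flags. This is a minimal sketch, not the chart's documented layout: the file name `gpu-operator-values.yaml` is hypothetical, the keys are simply the `--set` paths from the command above turned into YAML, and it assumes the later `toolkit.enabled=true` is the intended value, since Helm gives priority to the last `--set` for a repeated key.

```shell
#!/bin/sh
# Sketch only: writes the overrides from the helm command above into a values file.
# The file name is hypothetical; paths and values are copied from the command as-is.
cat > gpu-operator-values.yaml <<'EOF'
driver:
  enabled: true
nfd:
  enabled: false
gfd:
  enabled: false
operator:
  defaultRuntime: containerd
devicePlugin:
  enabled: true
toolkit:
  enabled: true            # the command sets false, then true; the last value wins
  version: v1.17.3-ubi8
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config2.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"        # quoted so it stays a string, like --set-string
validator:
  driver:
    env:
      - name: DISABLE_DEV_CHAR_SYMLINK_CREATION
        value: "true"      # quoted so it stays a string, like --set-string
EOF
```

The file would then be passed with `-f gpu-operator-values.yaml` in place of the individual `--set` and `--set-string` flags, keeping the rest of the `helm install` command unchanged.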

However, when I attempt to install v24.9.1, the gpu-operator pod fails to start and goes into back-off:
[Update] I get the same result with v24.6.2.

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  32s                default-scheduler  Successfully assigned gpu-operator/gpu-operator-86c586f6df-vchlw to <node-name>
  Normal   Pulled     11s (x3 over 31s)  kubelet            Container image "nvcr.io/nvidia/gpu-operator:v24.9.1" already present on machine
  Normal   Created    11s (x3 over 31s)  kubelet            Created container gpu-operator
  Warning  Failed     11s (x3 over 31s)  kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "gpu-operator": executable file not found in $PATH: unknown
  Warning  BackOff    1s (x5 over 29s)   kubelet            Back-off restarting failed container gpu-operator in pod gpu-operator-86c586f6df-vchlw_gpu-operator(a99e3e8d-47d8-47c9-b3fd-db6d796fa0d5)
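The error says the runtime tried to exec `gpu-operator` and did not find it on `$PATH` inside the image. One way to narrow this down is to compare what the Deployment asks the runtime to run with the image the node actually resolved for that tag. This is a sketch rather than a confirmed diagnosis: the namespace, Deployment name, and pod name are inferred from the helm command and the Events above, and the helper function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; namespace and Deployment name are
# inferred from the helm command (-n gpu-operator) and the pod name in the Events.
NAMESPACE="gpu-operator"
DEPLOYMENT="gpu-operator"

# Print the image, command, and args of the first container in the Deployment,
# i.e. what the container runtime is being asked to exec.
show_operator_command() {
  kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}{.spec.template.spec.containers[0].command}{"\n"}{.spec.template.spec.containers[0].args}{"\n"}'
}

# Print the image ID (digest) recorded in the status of the given pod,
# i.e. the image the node resolved for the tag.
show_operator_image_id() {
  kubectl -n "$NAMESPACE" get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].imageID}{"\n"}'
}
```

For example, `show_operator_command` followed by `show_operator_image_id gpu-operator-86c586f6df-vchlw`. In an air-gapped setup, comparing that image ID with the digest of the image that was mirrored for v24.9.1 shows whether the image "already present on machine" is the one expected for this chart version.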

nvidia-smi output:

Mon Jan 27 16:02:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:8B:00.0 Off |                    0 |
| N/A   38C    P8             17W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+