How to mount GPU devices correctly in nsjail? #237

radkris-git · 2024-08-12T22:51:10Z

Hi, I'm trying to run a simple PyTorch tensor add on the GPU under nsjail on a GCP nvidia-tesla-t4 node, and I'm getting the error shown below.

nsjail_pytorch.cfg

mount {
  src: "/home/current_user_ldap/pytorch_env"
  dst: "/home/current_user_ldap/pytorch_env"
  is_bind: true
}
mount {
  src: "/dev/nvidia0"
  dst: "/dev/nvidia0"
  is_bind: true
  rw: true
}
mount {
  src: "/dev/nvidiactl"
  dst: "/dev/nvidiactl"
  is_bind: true
  rw: true
}
mount {
  src: "/dev/nvidia-uvm"
  dst: "/dev/nvidia-uvm"
  is_bind: true
  rw: true
}
mount {
  src: "/usr"
  dst: "/usr"
  is_bind: true
  rw: true
}
# for libs
mount {
  src: "/lib64"
  dst: "/lib64"
  is_bind: true
}
mount {
  src: "/lib"
  dst: "/lib"
  is_bind: true
  rw: true
}
cwd: "/home/current_user_ldap/pytorch_env/"
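The three device stanzas above differ only in their path, so they can be generated rather than written by hand. This is a minimal sketch, assuming the stanza fields are exactly the ones used in the config above (src, dst, is_bind, rw). The helper names mount_stanza and device_stanzas are hypothetical and not part of nsjail. Globbing /dev/nvidia* may pick up nodes beyond the three listed; whether any of those are needed here is not established in this thread.

```python
import glob


def mount_stanza(path, rw=True):
    """Render one bind-mount stanza in the same shape as nsjail_pytorch.cfg."""
    lines = [
        "mount {",
        f'  src: "{path}"',
        f'  dst: "{path}"',
        "  is_bind: true",
    ]
    if rw:
        lines.append("  rw: true")
    lines.append("}")
    return "\n".join(lines)


def device_stanzas(pattern="/dev/nvidia*"):
    """Render a stanza for every host path matching the glob pattern."""
    return "\n".join(mount_stanza(p) for p in sorted(glob.glob(pattern)))


if __name__ == "__main__":
    # Prints nothing on a host without matching device nodes.
    print(device_stanzas())
```

The output can be pasted into the config in place of the hand-written device stanzas.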

Running simple PyTorch Tensor Add on CPU works.

nsjail -Mo --chroot /   --rlimit_nproc 6553   --rlimit_fsize inf --rlimit_as inf   -- /usr/bin/python3 -c "import torch; a = torch.tensor([1.0, 2.0], device='cpu') + torch.tensor([3.0, 4.0], device='cpu'); print(a)"

This prints the expected tensor output of [4, 6].

Running simple PyTorch Tensor Add on GPU fails

nsjail -Mo --config nsjail_pytorch.cfg  --chroot /  --rlimit_nproc 6553   --rlimit_fsize inf --rlimit_as inf    -- /usr/bin/python3 -c "import torch; print(torch.cuda.is_available());"

[I][2024-08-10T02:03:04+0000] Mode: STANDALONE_ONCE
[I][2024-08-10T02:03:04+0000] Jail parameters: hostname:'NSJAIL', chroot:'/', process:'/usr/bin/python3', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:600, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2024-08-10T02:03:04+0000] Mount: '/' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/home/current_user_ldap/pytorch_env' -> '/home/current_user_ldap/pytorch_env' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia0' -> '/dev/nvidia0' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidiactl' -> '/dev/nvidiactl' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/dev/nvidia-uvm' -> '/dev/nvidia-uvm' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:false
[I][2024-08-10T02:03:04+0000] Mount: '/usr' -> '/usr' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib64' -> '/lib64' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Mount: '/lib' -> '/lib' flags:MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2024-08-10T02:03:04+0000] Uid map: inside_uid:1002 outside_uid:1002 count:1 newuidmap:false
[I][2024-08-10T02:03:04+0000] Gid map: inside_gid:1003 outside_gid:1003 count:1 newgidmap:false
[I][2024-08-10T02:03:06+0000] Executing '/usr/bin/python3' for '[STANDALONE MODE]'
/home/current_user_ldap/.local/lib/python3.9/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 304: OS call failed or operation not supported on this OS (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
[I][2024-08-10T02:03:08+0000] pid=28434 ([STANDALONE MODE]) exited with status: 0, (PIDs left: 0)
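Error 304 in the warning above is CUDA's "OS call failed or operation not supported" code, which suggests that something the driver asks of the operating system during initialization is failing inside the jail. To take PyTorch out of the picture, the driver API can be called directly. This is a minimal sketch, assuming libcuda.so.1 is resolvable by the dynamic loader inside the jail; cu_init_status is a hypothetical helper name.

```python
import ctypes


def cu_init_status(libname="libcuda.so.1"):
    """Return the result code of cuInit(0), or None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return lib.cuInit(0)


if __name__ == "__main__":
    status = cu_init_status()
    if status is None:
        print("libcuda.so.1 could not be loaded")
    else:
        # 0 is CUDA_SUCCESS; 304 is CUDA_ERROR_OPERATING_SYSTEM.
        print("cuInit returned", status)
```

Running this script with the same nsjail command line, in place of the python3 -c one-liner, would show whether the failure already occurs at driver initialization.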

NVIDIA-SMI runs fine under nsjail

nsjail -Mo --config nsjail_pytorch.cfg  --chroot /  --rlimit_nproc 6553 --rlimit_as inf   -- /bin/nvidia-smi

The above prints the actual nvidia-smi output successfully.

Notes

PyTorch on CPU works fine under nsjail (no issues)
nvidia-smi works under nsjail
Running PyTorch without nsjail on GPU succeeds.

This doesn't look like a PyTorch or host issue, given that PyTorch works on the GPU without nsjail. Any help is appreciated.
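One way to narrow this down is to check each mounted device node from inside the jail. nvidia-smi succeeding does not necessarily show that /dev/nvidia-uvm is usable, since, as far as I know, CUDA compute relies on that node while nvidia-smi does not. This is a minimal sketch using only the Python standard library; check_device is a hypothetical helper name, and the three paths are the ones from the config above.

```python
import os
import stat


def check_device(path):
    """Report whether path exists, is a character device, and opens read-write."""
    result = {
        "path": path,
        "exists": False,
        "is_char_device": False,
        "open_rw": False,
        "error": None,
    }
    try:
        st = os.stat(path)
    except OSError as exc:
        result["error"] = exc.strerror
        return result
    result["exists"] = True
    result["is_char_device"] = stat.S_ISCHR(st.st_mode)
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        result["error"] = exc.strerror
        return result
    os.close(fd)
    result["open_rw"] = True
    return result


if __name__ == "__main__":
    for dev in ("/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"):
        print(check_device(dev))
```

Comparing the output inside and outside nsjail would show whether a node is missing or cannot be opened in the jail.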
