Support GPUs #3164
It looks hard to generalize https://docs.docker.com/config/containers/resource_constraints/#access-an-nvidia-gpu; it seems you need some prerequisites and also have to run some NVIDIA-specific code.
Yes, but setting out a path for even making it possible (with some elbow grease) would be appreciated.
How is this supposed to be used? A kind cluster with GPUs? Asking out of ignorance.
I think we're still lacking a portable standard for this, and a use case that can't be solved with extraMounts / the existing options.
Could you consider allowing this to be added for now? There is a lot of value to end users who don't want to fork kind just to get GPU support. I was going through this blog post and feel like lacking a portable standard isn't a strong enough reason to have to fork. A fork is even less portable.
Judging by https://github.com/jacobtomlinson/kind/pull/1/files, it does not seem that hard to maintain.
Is passing through all GPUs expected to remain desirable? We do need some limited abstraction that we can retarget to podman and maybe nerdctl in the future. A bool that simply passes all GPUs is one possible abstraction; it's not clear if it's a good one, or if podman can support it. As per our contributing guide, a specific proposal must be discussed, and we should be looking at podman as well; that PR doesn't appear to.
If we have to investigate podman etc. ourselves, this will remain on the backlog below pressing bug fixes like #3223. If anyone is interested in helping move this along, please outline the approach to portability: not just between runtimes, but of the configurations across hosts.

The --gpus all mode in docker looks reasonable enough. It's better than hardcoding devices into config, but it seems somewhat unlikely to satisfy all users, and at a glance it doesn't look like podman has an equivalent, so at the very least we need to address that with an error in the podman driver and a config doc comment ... happy to consider a specific plan that looks into these. Otherwise we'll write one later; tricky fixes haven't left much time to research this particular low-demand feature.
Got it. Makes sense. Thanks for sharing the details again of what's holding it back. I will try to make some time for this and come up with a proposal that also covers how other runtimes should be handled. |
I am also being affected by this: using GPUs is a particularly common use case when developing AI/ML workloads, so this is a limitation for using kind. FWIW: I would be happy with the intermediate state where we can enable this.
podman and nerdctl seem to support the same; sources:
@BenTheElder what's the way to submit a proposal? Do you have an example of a specific format that should be used? Are you fine with the proposal of only supporting the "all" mode initially for both podman and docker? When users ask for support for specific GPUs, that can be added later. In the implementation I will ensure it's using a
This implements kubernetes-sigs#3164
I mean, this was not feasible earlier in the project; the underlying support in docker/containerd was not there in a way that would work reasonably with kind.
Do you all use this with a single node/cluster and just share all GPUs with it? What does usage look like?
Thanks, but what we actually need is to identify which options are supportable. If I get a bit more time I'll go search/read, but I'd appreciate it if someone else interested in this feature outlined that.
I get that; I'd just like someone to look at what is potentially doable with podman before we settle on the API. Adding fields is cheap; changing them requires a revision, and I'd rather not do that because nobody was willing to look at podman. See also:
KIND separates configuration (exporting ports, enabling Kubernetes feature gates, ...) from environment (HTTP_PROXY, ...) as much as possible. Cluster configs are meant to be controlled by something like

If we have to go that route we should make it fail with a helpful error. But we should evaluate whether we even have to do that. I took a quick look at nerdctl for other reasons recently and it seems to have a compatible
Just an issue discussion outlining how this will work.
Given feedback that "all" is fine for current interested users so far, we can probably just do a bool, and we can change it in the next API revision. But nobody has answered whether this field is reasonably implementable with podman and what that would look like.
I commented earlier on how passing all GPUs can be done with podman; see my comment here: #3164 (comment). Was that what you were looking for, @BenTheElder?
Understood; my point is just that right now, since there is no support, people that need this might skip over kind. We would not be hearing how often this use case is leading to that outcome, since they are simply not using kind.
Correct, the common use case my developers have is local development with their own GPU, using a single-node or two-node (one control plane and one worker on the same machine) cluster. I think @samos123 has a better assessment of podman than I do; I am not too familiar with that ecosystem.
Sounds good.
Yes, thank you! Sorry, Kubernetes CI is hosed this morning, we had a vuln report for something else in SIG Testing plus meetings, and I skimmed the updates here too quickly.

Let's add the podman flags as well. At first glance it seems unfortunate that docker has a generic "gpu" while the podman flags assume "nvidia", but in practice docker only supports nvidia currently, so we can cross that bridge later ... docker/cli#2063

I think the remaining question is whether we should toggle "all" only, or if there's a common way to specify devices. It looks like all 3 are now based on CDI (or at least we could focus on CDI and not older ways of injecting GPUs, which also aligns nicely with https://kind.sigs.k8s.io/docs/design/principles/#target-cri-functionality).

It looks like the formats for values other than "all" may differ between them:

podman: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Unfortunately the specifications look different for all of these, but it looks possible to translate at least "all", "0,1,2", and "GPU-$uuid,GPU-$uuid" format values to all of them. Something like "0,1,2" => parse and emit flags like:
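To make the translation idea above concrete, here is a hypothetical sketch (not code from kind or any linked PR; the function name and the exact flag shapes are assumptions) of parsing a spec value like "all" or "0,1,2" and emitting per-runtime flags, using docker's `--gpus` and the CDI-style `--device nvidia.com/gpu=...` that podman and nerdctl accept:

```go
package main

import (
	"fmt"
	"strings"
)

// gpuFlags is a hypothetical translation from a kind-style GPU spec
// ("all", "0,1,2", or "GPU-<uuid>,...") to the flags a given runtime
// expects. docker takes a single --gpus value; podman/nerdctl inject
// CDI devices one --device flag at a time.
func gpuFlags(runtime, spec string) ([]string, error) {
	switch runtime {
	case "docker":
		if spec == "all" {
			return []string{"--gpus", "all"}, nil
		}
		// docker needs the value quoted when it contains commas,
		// e.g. --gpus '"device=0,1"'.
		return []string{"--gpus", fmt.Sprintf("%q", "device="+spec)}, nil
	case "podman", "nerdctl":
		if spec == "all" {
			return []string{"--device", "nvidia.com/gpu=all"}, nil
		}
		var flags []string
		for _, dev := range strings.Split(spec, ",") {
			flags = append(flags, "--device", "nvidia.com/gpu="+dev)
		}
		return flags, nil
	}
	return nil, fmt.Errorf("unsupported runtime %q", runtime)
}

func main() {
	for _, rt := range []string{"docker", "podman"} {
		f, _ := gpuFlags(rt, "0,1")
		fmt.Println(rt, f)
	}
}
```

The point is only that a small shim per runtime suffices; nothing here needs to plumb the raw string through unchecked.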
But starting with a string field that only supports "all", with validation, sounds like a very reasonable start, and we can expand to those later. It does look like to expand beyond "all" we'll have to shim the device format, but not heavily. We should definitely NOT plumb the string straight through; we should validate that the input format is one we can handle, so we can expand it later without conflict.
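A minimal sketch of that up-front validation (hypothetical; the function name and the accepted grammar are assumptions, not kind's actual implementation) could restrict the field to exactly the formats a later translation layer knows how to handle:

```go
package main

import (
	"fmt"
	"regexp"
)

// Accept only the formats we know how to translate later: "all",
// a comma-separated index list, or a comma-separated GPU UUID list.
// Rejecting everything else now means the field can grow without
// conflicting with values users already rely on.
var (
	indexList = regexp.MustCompile(`^\d+(,\d+)*$`)
	uuidList  = regexp.MustCompile(`^GPU-[0-9a-fA-F-]+(,GPU-[0-9a-fA-F-]+)*$`)
)

func validateGPUSpec(spec string) error {
	if spec == "all" || indexList.MatchString(spec) || uuidList.MatchString(spec) {
		return nil
	}
	return fmt.Errorf("unsupported gpus value %q: expected \"all\", \"0,1,2\", or \"GPU-<uuid>,...\"", spec)
}

func main() {
	fmt.Println(validateGPUSpec("all"))    // accepted
	fmt.Println(validateGPUSpec("0,1,2"))  // accepted
	fmt.Println(validateGPUSpec("nvidia")) // rejected with an error
}
```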
Thanks @BenTheElder. I will continue working on this based on the feedback you gave. Here is the draft PR: #3257. I will add validation and podman support and ask you for review when done.
My use case would require more than the "all" mode. This would be very useful to me:
Going to checkout PR #3257 and see if that works for me. |
Based on #3257 (comment) and the subsequent discussion, I think we want to add an API that looks like the --device mode.
Please see #3257 (comment) for a discussion on how to enable GPU support today without any patches needed to kind.
Hello, I followed the blog to add GPU support in a kind cluster. When installing

it seems the kind cluster does not mount /proc/driver/nvidia/capabilities into the container. How can I solve this problem?
You probably want to follow the directions from @klueska instead of my old blog these days. |
@jacobtomlinson thanks a lot for your post! It is sad to see your blog is not working anymore. I mixed your blog post example with @klueska's steps, but got

Do you have any suggestions?
@xihajun it sounds like things are working but you have a mismatch between the CUDA runtime version in the container you are using and the driver version on your machine. |
Hi @jacobtomlinson, thanks a lot. Yeah, I think I managed to fix it using this tutorial: https://www.substratus.ai/blog/kind-with-gpus/
For some reason only some libraries are mounted into the node; specifically, I am missing libnvidia-encode.so.
In node container:
On host:
I don't see why libnvidia-encode is missing in the node. Any ideas welcome.
I guess another option is to implement it via the ContainerCreate API call with DeviceRequests.
Alternatively, qbo community edition supports kind images with Kubeflow, the NVIDIA Kubernetes operator, and cgroups v2 (including WSL2 systems). See the comment here: NVIDIA/k8s-device-plugin#332 (comment)
I spent some time putting this together today. It provides a good intro for how to use GPUs in kind.
What would you like to be added: GPU support.
Why is this needed: Development of AI/ML models in kind.
https://docs.docker.com/config/containers/resource_constraints/
Related/previous issues: