Support GPUs #3164
It looks hard to generalize https://docs.docker.com/config/containers/resource_constraints/#access-an-nvidia-gpu; it seems you need some prerequisites and also have to run some NVIDIA-specific code.
Yes, but setting out a path for even making it possible (with some elbow grease) would be appreciated.
How is this supposed to be used? A kind cluster with GPUs? Asking out of ignorance.
I think we're still lacking a portable standard for this, and a use case that can't be solved with extraMounts / the existing options.
Could you consider allowing this to be added for now? There is a lot of value to end users who don't want to fork kind just to get GPU support. I was going through this blog post and feel like lacking a portable standard isn't a strong enough reason to have to fork. A fork is even less portable.
Judging by https://github.com/jacobtomlinson/kind/pull/1/files, it does not seem that hard to maintain.
Is passing through all GPUs expected to remain desirable? We do need some limited abstraction that we can retarget to podman and maybe nerdctl in the future. A bool that simply passes all GPUs is one possible abstraction; it's not clear if it's a good one, or if podman can support it. As per our contributing guide, a specific proposal must be discussed, and we should be looking at podman as well; that PR doesn't appear to.
If we have to investigate podman etc. ourselves, this will remain on the backlog below pressing bug fixes like #3223. If anyone is interested in helping move this along, please outline the approach to portability: not just between runtimes, but of the configurations across hosts.

The --gpus all mode in docker looks reasonable enough. It's better than hardcoding devices into config, but it seems somewhat unlikely to satisfy all users, and at a glance it doesn't look like podman has an equivalent, so at the very least we need to address that with an error in the podman driver and a config doc comment ... happy to consider a specific plan that looks into these. Otherwise we'll write one later; tricky fixes haven't left much time to research this particular low-demand feature.
Got it. Makes sense. Thanks for sharing the details again of what's holding it back. I will try to make some time for this and come up with a proposal that also covers how other runtimes should be handled. |
I am also being affected by this: using GPUs is a particularly common use case when developing AI/ML workloads, so this is a limitation for using kind. FWIW: I would be happy with the intermediate state where we can enable this.
podman and nerdctl seem to support the same; sources:
@BenTheElder what's the way to submit a proposal? Do you have an example of a specific format that should be used? Are you fine with the proposal of only supporting the "all" mode initially for both podman and docker? When users ask for support for specific GPUs, that can be added later. In the implementation I will ensure it's using a
This implements kubernetes-sigs#3164
I mean, this was not feasible earlier in the project; the underlying support in docker/containerd was not there in a way that would work reasonably with kind.
Do you all use this with a single node/cluster and just share all GPUs with it? What does usage look like?
Thanks, but what we actually need is to identify which options are supportable. If I get a bit more time I'll go search/read, but I'd appreciate it if someone else interested in this feature outlined that.
I get that; I'd just like someone to look at what is potentially doable with podman before we settle on the API. Adding fields is cheap; changing them requires a revision, and I'd rather not do that because nobody was willing to look at podman. See also:
KIND separates configuration (exporting ports, enabling Kubernetes feature gates, ...) from environment (HTTP_PROXY, ...) as much as possible. Cluster configs are meant to be controlled by something like

If we have to go that route we should make it fail with a helpful error. But we should evaluate whether we even have to do that. I took a quick look at nerdctl for other reasons recently and it seems to have a compatible
Just an issue discussion outlining how this will work.
Given feedback that "all" is fine for current interested users so far, we can probably just do a bool, and we can change it in the next API revision. But nobody has answered whether this field is reasonably implementable with podman and what that would look like.
I commented earlier on how passing all GPUs can be done with podman; see my comment here: #3164 (comment). Was that what you were looking for, @BenTheElder?
Understood; my point is just that right now, since there is no support, people that need this might skip over kind. We would not be hearing how often this use case is leading to that outcome, since they are simply not using kind.
Correct, the common use case my developers have is local development with their own GPU, using a single-node or two-node (one control plane and one worker on the same machine) cluster. I think @samos123 has a better assessment of podman than I do; I am not too familiar with that ecosystem.
Sounds good.
Yes, thank you! Sorry, Kubernetes CI is hosed this morning, we had a vuln report for something else in SIG Testing plus meetings, and I skimmed the updates here too quickly.

Let's add the podman flags as well. At first glance it seems unfortunate that docker has a generic "gpu" while the podman flags assume "nvidia", but in practice docker only supports nvidia currently, so we can cross that bridge later ... docker/cli#2063

I think the remaining question is whether we should toggle "all" only, or if there's a common way to specify devices. It looks like all 3 are now based on CDI (or at least we could focus on CDI and not older ways of injecting GPUs, which also aligns nicely with https://kind.sigs.k8s.io/docs/design/principles/#target-cri-functionality).

It looks like the formats for values other than "all" may differ between them:

podman: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Unfortunately the specifications look different for all of these, but it looks possible to translate at least "all", "0,1,2", and "GPU-$uuid,GPU-$uuid" format values to all of them. Something like "0,1,2" => parse and emit flags like:
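To make the translation idea above concrete, here is a hypothetical sketch (not code from kind or any linked PR; the function name and the exact flag shapes are assumptions) of parsing a spec value like "all" or "0,1,2" and emitting per-runtime flags, using docker's `--gpus` and the CDI-style `--device nvidia.com/gpu=...` that podman and nerdctl accept:

```go
package main

import (
	"fmt"
	"strings"
)

// gpuFlags is a hypothetical translation from a kind-style GPU spec
// ("all", "0,1,2", or "GPU-<uuid>,...") to the flags a given runtime
// expects. docker takes a single --gpus value; podman/nerdctl inject
// CDI devices one --device flag at a time.
func gpuFlags(runtime, spec string) ([]string, error) {
	switch runtime {
	case "docker":
		if spec == "all" {
			return []string{"--gpus", "all"}, nil
		}
		// docker needs the value quoted when it contains commas,
		// e.g. --gpus '"device=0,1"'.
		return []string{"--gpus", fmt.Sprintf("%q", "device="+spec)}, nil
	case "podman", "nerdctl":
		if spec == "all" {
			return []string{"--device", "nvidia.com/gpu=all"}, nil
		}
		var flags []string
		for _, dev := range strings.Split(spec, ",") {
			flags = append(flags, "--device", "nvidia.com/gpu="+dev)
		}
		return flags, nil
	}
	return nil, fmt.Errorf("unsupported runtime %q", runtime)
}

func main() {
	for _, rt := range []string{"docker", "podman"} {
		f, _ := gpuFlags(rt, "0,1")
		fmt.Println(rt, f)
	}
}
```

The point is only that a small shim per runtime suffices; nothing here needs to plumb the raw string through unchecked.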
But starting with a string field that only supports "all", with validation, sounds like a very reasonable start, and we can expand to those later. It does look like to expand beyond "all" we'll have to shim the device format, but not heavily. We should definitely NOT plumb the string straight through; we should validate that the input format is one we can handle, so we can expand it later without conflict.
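A minimal sketch of that up-front validation (hypothetical; the function name and the accepted grammar are assumptions, not kind's actual implementation) could restrict the field to exactly the formats a later translation layer knows how to handle:

```go
package main

import (
	"fmt"
	"regexp"
)

// Accept only the formats we know how to translate later: "all",
// a comma-separated index list, or a comma-separated GPU UUID list.
// Rejecting everything else now means the field can grow without
// conflicting with values users already rely on.
var (
	indexList = regexp.MustCompile(`^\d+(,\d+)*$`)
	uuidList  = regexp.MustCompile(`^GPU-[0-9a-fA-F-]+(,GPU-[0-9a-fA-F-]+)*$`)
)

func validateGPUSpec(spec string) error {
	if spec == "all" || indexList.MatchString(spec) || uuidList.MatchString(spec) {
		return nil
	}
	return fmt.Errorf("unsupported gpus value %q: expected \"all\", \"0,1,2\", or \"GPU-<uuid>,...\"", spec)
}

func main() {
	fmt.Println(validateGPUSpec("all"))    // accepted
	fmt.Println(validateGPUSpec("0,1,2"))  // accepted
	fmt.Println(validateGPUSpec("nvidia")) // rejected with an error
}
```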
Thanks @BenTheElder. I will continue working on this based on the feedback you gave. Here is the draft PR: #3257. I will add validation and podman support and ask you for review when done.
My use case would require more than the "all" mode. This would be very useful to me:
Going to checkout PR #3257 and see if that works for me. |
Based on #3257 (comment) and the subsequent discussion, I think we want to add an API that looks like the --device mode.
Please see #3257 (comment) for a discussion on how to enable GPU support today without any patches needed to kind.
Hello, I followed the blog to add GPU support in a kind cluster. When installing

it seems the kind cluster does not mount /proc/driver/nvidia/capabilities into the container. How can I solve this problem?
You probably want to follow the directions from @klueska instead of my old blog these days. |
@jacobtomlinson thanks a lot for your post! It is sad to see your blog is not working anymore. I mixed your blog post example with @klueska's steps, but got

Do you have any suggestions?
@xihajun it sounds like things are working but you have a mismatch between the CUDA runtime version in the container you are using and the driver version on your machine. |
Hi @jacobtomlinson, thanks a lot. Yeah, I think I managed to fix it using this tutorial: https://www.substratus.ai/blog/kind-with-gpus/
For some reason only some libraries are mounted into the node; specifically, I am missing libnvidia-encode.so.
In node container:
On host:
I don't see why libnvidia-encode is missing in the node. Any ideas welcome.
I guess another option is to implement it via the ContainerCreate API call with DeviceRequests.
Alternatively, qbo community edition supports kind images with Kubeflow, the NVIDIA Kubernetes operator, and cgroups v2 (including WSL2 systems). See the comment here: NVIDIA/k8s-device-plugin#332 (comment)
I spent some time putting this together today. It provides a good intro for how to use GPUs in kind.
What would you like to be added: GPU support.
Why is this needed: Development of AI/ML models in kind.
https://docs.docker.com/config/containers/resource_constraints/
Related/previous issues: