Add cuda12 variant of tensorflow-notebook #2100
Conversation
Could you please fix the tests?
SHELL ["/bin/bash", "-o", "pipefail", "-c"] | ||
|
||
# Install CUDA libs and cuDNN with mamba | ||
RUN mamba install --yes -c nvidia/label/cuda-12.3.2 \ |
Is there a chance not to hardcode the minor.patch version here?
Also, 12.4 was released.
I used 12.3, because it is listed in the tested build configurations. There seems to be no label like cuda-12. Without a label, i.e. -c nvidia, I got the latest CUDA version, which is currently 12.4. It worked, but I thought it was risky: a new CUDA release might be incompatible (I guess not before 13.x), and it seems we do not have a unit test for this, e.g. one that checks the output of python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))". Should we add a unit test like this?
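Such a check could look roughly like the following sketch (the image tag is a placeholder, and a host with an NVIDIA GPU plus the NVIDIA container toolkit would be required, so this is illustrative only):

    # Hypothetical smoke test: fail if TensorFlow sees no GPU inside the image.
    docker run --rm --gpus all jupyter/tensorflow-notebook:cuda12-latest \
        python -c "import tensorflow as tf; assert tf.config.list_physical_devices('GPU'), 'no GPU visible'"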
Secondly, the pytorch-notebook pins the minor version as well.
> I used 12.3, because it is listed in the tested build configurations.

There is a chance that the table might be slightly outdated.

> There seems to be no label like cuda-12. Without a label, i.e. -c nvidia, I got the latest CUDA version, which is currently 12.4.

Can we use something like nvidia/label/cuda-12.*?

> Should we add a unit test like this?

I am not sure if this test will work with a regular GitHub-hosted ubuntu runner. Also, we currently don't have a way to run tests for a variant image (but it's not difficult to add something like this).
> There is a chance that the table might be slightly outdated.

I looked at the table just a few hours after the release and it was up to date. But it won't ever be tested against a newer version.

> Can we use something like nvidia/label/cuda-12.*?

No, there is no such label. Here is a list.

> I am not sure if this test will work with a regular GitHub-hosted ubuntu runner.

Right, that is always problematic. Sorry.
> Can we use something like nvidia/label/cuda-12.*?
>
> No, there is no such label. Here is a list.

I meant that maybe mamba supports label regexes (I have no idea if it does or not). In that case we will be fine with the existing labels and won't need to hardcode a particular version.
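For concreteness, the speculative invocation would look like this (whether conda/mamba expand a glob inside a channel label is exactly the open question, so this command is unverified and may simply fail):

    # Hypothetical: install the newest cuda-nvcc from any cuda-12.x label
    mamba install --yes -c 'nvidia/label/cuda-12.*' cuda-nvcc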
Apart from the labels, I noticed that all package versions are also available in the main label. The versioning in the list of labels is a bit strange: the labels are formatted as major.minor.patch, while the versions use major.minor.build. Usually, if the patch increments, the build number just continues:
- label: cuda-11.6.0, version: 11.6.55
- label: cuda-11.6.1, version: 11.6.112
- label: cuda-11.6.2, version: 11.6.124
There is one exception: labels cuda-11.4.3 and cuda-11.4.4 both have version 11.4.152.
Nevertheless, I just tried:
- mamba install -c nvidia 'cuda-nvcc<13' and got version 12.4.99 (which is also in label cuda-12.4.0)
- mamba install -c nvidia 'cuda-nvcc<12' and got version 11.8.89 (which is also in label cuda-11.8.0)
- mamba install -c nvidia 'cuda-nvcc<11.5' and got version 11.4.152 (which is also in labels cuda-11.4.3 and cuda-11.4.4)
- mamba install -c nvidia 'cuda-nvcc=12.3' and got version 12.3.107 (which is also in label cuda-12.3.2)
So, we could use 'cuda-nvcc<13' to reduce maintenance work (avoiding a version update for every TensorFlow release), but such versions are not officially tested by TensorFlow (I am not sure whether incompatibilities can occur with new minor versions). Using something like 'cuda-nvcc=12.3' is more work (while still avoiding the patch version), but officially tested.
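For reference, the observations above can be reproduced by querying the channel with the standard search subcommand (the output format varies between conda/mamba versions):

    # List all cuda-nvcc builds available on the nvidia channel
    conda search -c nvidia 'cuda-nvcc'
    # Restrict to a version range, mirroring the install experiments above
    conda search -c nvidia 'cuda-nvcc<13'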
I'm ok with cuda-nvcc=12.3; please add a comment explaining why we chose this version, though.
This did not work, since the dependencies have no version constraints. So using cuda-nvcc=12.3 with the nvidia channel resulted in a mixture of 12.3 and 12.4 packages. NVIDIA is quite sloppy in their packaging.
Then I noticed that cudnn from the nvidia channel is outdated. Apparently they dropped support 3 years ago. The cudnn from conda-forge is quite up-to-date, but there is no CUDA 12 build yet, only CUDA 11.8.
So, I would like to go with the new installation method supported by TensorFlow, which is basically just pip install tensorflow[and-cuda]. This also has the advantage that the installed CUDA version is always the officially tested one, so less maintenance for you. Usually the paths to the nvidia libs should be found automatically, but in 2.16.1 there seems to be a bug, so we have to add them. I prepared an activation script for that. Also, LD_LIBRARY_PATH is not polluted with this method, because the paths from the pip installation contain only the nvidia libs. Before, we added ${CONDA_DIR}/lib/, which contains quite a lot of libraries.
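For illustration, a minimal sketch of what such an activation script could do (the Python version in the path is an assumption, and the actual script in this PR may differ):

    #!/bin/bash
    # Prepend the lib directories of the pip-installed NVIDIA wheels
    # (e.g. nvidia-cudnn-cu12) to LD_LIBRARY_PATH, and nothing else.
    NVIDIA_DIR="${CONDA_DIR}/lib/python3.11/site-packages/nvidia"  # python3.11 assumed
    if [ -d "${NVIDIA_DIR}" ]; then
        for libdir in "${NVIDIA_DIR}"/*/lib; do
            [ -d "${libdir}" ] && export LD_LIBRARY_PATH="${libdir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
        done
    fi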
Ok, let’s try it
NVIDIA_DRIVER_CAPABILITIES="compute,utility"

# Install Tensorflow with pip
RUN pip install --no-cache-dir tensorflow && \
Maybe we can use tensorflow from mamba?
I think in such a case we won't even need to list the dependencies.
> The current version of tensorflow-gpu on conda-forge is 2.15.0 and has a CUDA 11.8 build. So if you want a cuda11 variant, I can try to use the conda-forge version for that, but the TensorFlow version is not up-to-date.
This is the kind of complexity I recall from getting a tensorflow gpu image working, where the conda-forge version is often outdated, and trying to install with pip for gpu support was complicated.
I think if we have to choose between install complexity and relying on something regularly outdated, the install complexity may be preferred; otherwise we introduce things we can't control. At the same time, the fact that it's outdated relates to how complicated it may be to keep installing something that works over time, which is a burden we would then be taking on.
This PR will probably demonstrate the current maturity of the tensorflow gpu stack upstream. If it's as bad as I experienced a while back, then I think it's better not to try to maintain a tensorflow gpu image, to avoid making this project too hard to maintain as a whole.
Yes, it was very complicated, and the conda-forge tensorflow-gpu package helped by providing cudatoolkit and cudnn within the same toolchain. But I just tried to install tensorflow-gpu from conda-forge into scipy-notebook and it fails due to conflicts. So maybe things have changed, and nowadays installing tensorflow from PyPI and cuda and cudnn from nvidia's conda channel is the easiest way. For maintenance, I imagine using the cuda version from the tested builds for each new TensorFlow release should work.
In my opinion, we can't tell whether the current maturity of the tensorflow packages is better until we merge this PR, have weekly builds, and see several releases of cuda/cudnn and the tensorflow package itself.
So, I am ok with how the Dockerfile currently looks, but we need to see if it's going to be ok after a few releases.
I am ready to give it a try as a maintainer (and I can always disable the build just by changing a few lines in the docker.yml config-like file).
@mathbunnyru, @consideRatio, @yuvipanda, and @manics, please vote 👍 to accept this change and 👎 not to accept it (use a reaction to this message). We can have a discussion until the deadline, so please express your opinions. As this is very similar to the
I voted 👀 for now. I'd like to see that this is reasonable to install and maintain long term, i.e. that the upstream tensorflow gpu packages make it easy enough, because historically it has been a mess in my experience, and I don't want this project to take on a maintenance function if it's too messy. If the implementation looks not-messy, I'd be 👍. But if it is messy, I think this may be a very notable commitment, and we should protect the project's limited maintenance capacity from taking on such a burden.
I cleaned the aarch64 machines, so builds should work better. Unfortunately, docker is the worst at cleaning its cache.
I don't feel qualified to give a 👍 or 👎. @consideRatio has already highlighted the main issues around long-term maintainability, so I think the decision should come from those who have ultimate responsibility for maintaining it.
I spent some time finding the best approach regarding maintainability. Now it looks quite similar to the PyTorch cuda variant, except that for TensorFlow:
Why can't you pin the version when using
The extra "and-cuda" is defined here. There the package versions of the dependencies are fixed to the ones listed in tested build configuration. There is no additional "and-cuda11" extra. That's what I meant. |
I guess there are many libraries installed alongside tensorflow. At least I think it is worth trying.
I tried to use the However, it does not work. The Full import errors with TF_CPP_MAX_VLOG_LEVEL=3
If you look at the conda-forge
Thanks. I guess in that case let's rename the variant to simple
@yuvipanda what do you think about this PR?
While I have no official vote here, I would like to express my full support for this PR. Given that TensorFlow is mainly used for GPU-accelerated applications, it makes a lot of sense to have a GPU-capable docker image available. The current non-GPU images feel like a neutered alternative: OK for some preliminary exploration of how TensorFlow works, but ineffective for any real-world application.
Sorry for the delay, @mathbunnyru. I'm +1 on this change because it's using the upstream supported way to install tensorflow, the tensorflow[and-cuda] extra. The only (non-blocking) concern I have is that it's based on the scipy-notebook image, which installs packages primarily from conda-forge. So overall, +1 from me. Thank you for this contribution, @ChristofKaufmann! And thanks for your stewardship, @mathbunnyru
We can always pin versions in some images if we need to and there is no other choice. So, let's try to merge this one 🙂
Big big +1! Thank you :)
Thank you for your helpful comments improving the code, @mathbunnyru
Describe your changes
- […] tensorflow-cpu wheel now (to reduce the size of the image).
- The current version of tensorflow-gpu on conda-forge is 2.15.0 and has a CUDA 11.8 build. So if you want a cuda11 variant, I can try to use the conda-forge version for that, but the TensorFlow version is not up-to-date.

Issue ticket if applicable
Fix: #2095, #1557.